Indexing PDF Media
Hi,
I'm trying to implement the ezSearch package so that customers on my site can search nodes as well as media. It works great, except it doesn't seem to be able to retrieve the contents of the PDF documents it indexes.
Upon further investigation of the indexes with Luke, the 'contents' field seems to contain loads of random info, including a GUID, a couple of dates and some file names.
In previous versions of Umbraco, I successfully used the indexer in the Umbraco.Examine.PDF namespace to index my documents. I think it used iTextSharp under the hood, as I had to override some of the error checking a while back and re-compile the DLL so that I could get it to index a few thousand docs on a site I was managing then.
Can anyone help me with a solution to index my PDF documents and their contents in Umbraco v7 please?
Thanks, Matt
Matt,
Are all the PDFs indexed with random info, or just some of them? I recall the disclaimer from Shannon that not all PDFs can be indexed.
Regards
Ismail
Hi Ismail,
Thanks for the quick reply.
I've tried three documents now and they're all looking the same in the index. I may have just been really unlucky with the documents I've tried...
Would you expect the ExternalIndexer to index the contents of PDF documents out of the box, or do I need to do something like include an explicit PDF indexer as I did in my 4.x projects? Do you know whether the Umbraco.Examine.PDF namespace, with its indexing methods, still exists and works?
Thanks again, Matt
Matt,
I thought you were using the PDF indexer? You will need that, as the ExternalIndexer is content only. If you set up the PDF indexer as you used to, then in theory you should get PDFs in the PDF index.
Regards
Ismail
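For reference, a PDF indexer of the kind Ismail describes is normally registered in the Examine config files rather than in code. The fragments below are a minimal sketch, assuming the UmbracoExamine.PDF assembly is present in the site's bin folder. The provider names (PDFIndexer, PDFSearcher), the set name (PDFIndexSet) and the index path are illustrative, and the type and attribute names are given from memory, so check them against the Examine documentation for your Umbraco version.

```xml
<!-- config/ExamineIndex.config: a separate index set for PDF content.
     The set name and path here are assumptions. -->
<IndexSet SetName="PDFIndexSet"
          IndexPath="~/App_Data/TEMP/ExamineIndexes/PDFs" />

<!-- config/ExamineSettings.config, inside ExamineIndexProviders/providers:
     an indexer that extracts text from media files with a .pdf extension.
     umbracoFileProperty is the media property alias holding the file path. -->
<add name="PDFIndexer"
     type="UmbracoExamine.PDF.PDFIndexer, UmbracoExamine.PDF"
     extensions=".pdf"
     umbracoFileProperty="umbracoFile" />

<!-- config/ExamineSettings.config, inside ExamineSearchProviders/providers:
     a searcher for the PDF index. The provider names are assumed to map to
     PDFIndexSet by naming convention. -->
<add name="PDFSearcher"
     type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />
```

With something like this in place, the ExternalIndexer keeps handling content nodes while PDF text goes into its own index, so a search that should return PDF hits has to query the PDF searcher as well as the content one.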