Search documents including PDFs?
Hi, I was wondering what the best way is to implement a search that includes indexed PDFs. I've seen things like Umbraco Examine and was wondering how you'd integrate PDFBox or something similar to get it to include PDF files as well?
Regards,
Tom
Have a look at umbSearch2 which has this functionality built-in!
Cheers,
/Dirk
Dirk,
I recently completed both levels of training and am now certified, but for the life of me I cannot see where I can set up a search that will ALSO search inside PDFs or Word documents. I understand that I will have to set up an iFilter for these types, but umbSearch2 appears to be a dead project and also seems not to work in 4.7.
The XSLT search obviously doesn't search within documents... so where is this capability? Wouldn't one think that this is something many of us need... or am I the only person?
Could you please point me in the right direction?
Allan
@Allan,
If you download the latest version of Examine from http://examine.codeplex.com then you'll see that it has support for indexing PDFs as well as Umbraco content. It doesn't do Word documents though. By default you have to have a separate index for your PDFs, but you can combine multiple indexes in Examine using a Lucene MultiIndexSearcher.
It's a bit fiddly to set up, but once you get it working it's pretty straightforward. I think the CodePlex site has examples of indexing PDF content. One trick that makes searching multiple indexes easier is to attach to the NodeIndexing event of the PDF index and create fields with the same names as the fields in the content index that you want to search on. For example, use the media file name as the page title and the contents of the file as the page copy. That way you can do a combined search on a single set of fields.
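To make the field-name trick a bit more concrete, here is a rough, language-neutral sketch in Python. It is not Examine or Lucene code: `TinyIndex`, `alias_pdf_fields`, `search_all` and the field names `pageTitle`, `pageCopy`, `fileName` and `fileText` are all made up for illustration. The `on_indexing` hook stands in for an indexing event handler, and it copies the PDF fields into the names the content index already uses, so one query over one set of fields covers both indexes.

```python
class TinyIndex:
    """A toy in-memory index. NOT Examine or Lucene, just enough to show the idea."""

    def __init__(self, name, on_indexing=None):
        self.name = name
        self.on_indexing = on_indexing  # hypothetical hook, called before a document is stored
        self.docs = {}

    def add(self, doc_id, fields):
        fields = dict(fields)
        if self.on_indexing:
            self.on_indexing(fields)  # the hook may add extra fields
        self.docs[doc_id] = fields

    def search(self, term, field_names):
        # Naive substring match over the requested fields.
        term = term.lower()
        hits = []
        for doc_id, fields in self.docs.items():
            if any(term in fields.get(f, "").lower() for f in field_names):
                hits.append((self.name, doc_id))
        return hits


def alias_pdf_fields(fields):
    # Copy the PDF-specific fields into the names the content index uses.
    # All four field names are assumptions for this sketch.
    fields["pageTitle"] = fields.get("fileName", "")
    fields["pageCopy"] = fields.get("fileText", "")


def search_all(indexes, term, field_names=("pageTitle", "pageCopy")):
    # One query and one set of field names, run against every index.
    results = []
    for index in indexes:
        results.extend(index.search(term, field_names))
    return results


content_index = TinyIndex("content")
pdf_index = TinyIndex("pdf", on_indexing=alias_pdf_fields)

content_index.add(1, {"pageTitle": "Membership", "pageCopy": "How to renew your licence"})
pdf_index.add(7, {"fileName": "renewal-form.pdf", "fileText": "Licence renewal application"})

hits = search_all([content_index, pdf_index], "licence")
print(hits)  # [('content', 1), ('pdf', 7)]
```

The point of the hook is that the search side never needs to know which index a hit came from, or what the PDF index originally called its fields.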
Tim, Allan,
Before cross-index searching was available, that's how I did PDF searching with the main index. I would now go down the two-index route, as the latest version of Examine can search across indexes. To index and search Word docs you would need to create your own indexer / searcher; take a look at the code for the PDF searcher. To extract the Word content you can, as you have already observed, use an iFilter.
Regards
Ismail
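As a rough illustration of the "write your own indexer" idea, the shape is usually: pick a text extractor based on the file extension, then hand the extracted text to the index as ordinary fields. The Python sketch below is only that shape, not Examine code. `register`, `build_index_fields` and the field names are hypothetical, and `fake_word_extractor` is a stand-in for wherever a real iFilter call would go.

```python
import os

# Maps a lower-case file extension to a function that turns raw bytes into text.
EXTRACTORS = {}


def register(extension):
    """Decorator that registers a text extractor for one file extension."""
    def decorator(func):
        EXTRACTORS[extension.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_plain_text(data):
    return data.decode("utf-8", errors="replace")


@register(".doc")
def fake_word_extractor(data):
    # Stand-in only: a real indexer would pass the file to an iFilter
    # (or another Word text extractor) at this point.
    return data.decode("latin-1")


def build_index_fields(file_name, data):
    """Return the fields to index for a file, or None if the type is unsupported."""
    extension = os.path.splitext(file_name)[1].lower()
    extractor = EXTRACTORS.get(extension)
    if extractor is None:
        return None  # no extractor registered: skip the file
    return {"fileName": file_name, "fileText": extractor(data)}


fields = build_index_fields("minutes.doc", b"Board minutes")
skipped = build_index_fields("photo.jpg", b"")
print(fields)
print(skipped)
```

Keeping extraction separate from indexing means supporting another document type is just one more registered extractor.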
Ismail, is the cross-index searching stuff the same as the MultiIndexSearcher, or something different? If it's different, can you provide some details?
:)
@Ismail or @Tim or @Anyone
I am desperate to have this site of mine be able to search the site AND within the contents of PDF files... I am willing to pay someone to implement what I need to get this done. My site (staging) is at http://crdha.allanlevsen.com. I guess I would need you to make the changes to the Examine config files, add any DLLs I need, and build a search control and results page.
I would REALLY appreciate this and need it quickly. My email is [email protected] and I would pay you through PayPal if that works.
Thanks,
Allan