Examine and uploaded PDFs
Hi,
Is there any way to get Examine/Lucene to index PDFs that are attached to an Upload field on a Content node? I found CogUmbracoExamineMediaIndexer but it appears to only work in the Media section (I just ended up with an empty index).
It's hard to find the right terms to search for so I haven't had any luck with figuring this one out myself. Does anyone know whether this is possible?
Chris,
You will need to create an event handler on documents for save or publish. In the event, check the type of doc, and if it is one with a PDF, get the PDF, call a method from CogUmbracoExamineMediaIndexer, and add the content to the index. I will take a look at which method you need to call. CogUmbracoExamineMediaIndexer only works with Media section items, not upload fields in content.
Regards
Ismail
Chris,
In CogUmbracoExamineMediaIndexer, in the MediaIndexer class, there is an internal class; see the method ParseMediaText. You can rip that method out and use it to extract the content of the PDF. So overall I would do the following:
1. On the doc type that has the upload field, create a new field, say extractedPdfContent.
2. Implement the Document before-publish event. In there, check whether the current document is of a type that contains the PDF upload field; if it is, get the PDF file. Pass that file to your copy of the ParseMediaText method, which will extract the PDF content via Apache Tika, then add the extracted content to the field you created in step 1.
3. After the publish completes, Examine indexing will take over and the extracted PDF content will end up in your index. So it's doable, it just needs a bit of work.
Regards
Ismail
Thanks for that. I'd never built an event handler so that was a bit of a challenge (not helped by the documentation providing a code sample but no guidance on what to do with it!) but I got it all working in the end :)
Hi, I have an additional question to this.
First of all, I am pretty new to Examine (and Umbraco 7...).
I am trying to do the following:
Create a search on a document type with:
> uploaded pdf documents
> some metadata to it like author, publish date, categories
What is the best way to let the end user search on PDF content and categorize on metadata values?
Kind regards, Ad
Ad,
Probably the easiest thing to do is to add an extra multi text field to the document type with the PDF upload, say call it PdfExtract. Then create an on-save event for the document, and in that save event test for a PDF. If it is present, get the file, extract the contents of the PDF (you can make use of the Examine PDF extractor or install CogUmbracoExamineMediaIndexer), and save the extracted contents into the document. This will then get added to your external index and you can search on it. By the way, I am assuming that when you say metadata you mean metadata added to the Umbraco document that you also want to search on, and NOT the actual metadata in the PDF.
Regards
Ismail
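A minimal sketch of the save-event approach described above, assuming Umbraco 7's ContentService.Saving event and an Upload property that stores a plain virtual path such as /media/1234/file.pdf. The class name, the property aliases pdfUpload and pdfExtract, and the ExtractPdfText helper are all hypothetical; the helper only stands in for whichever extractor you plug in (for example code lifted from CogUmbracoExamineMediaIndexer) and is not part of any library.

```csharp
using System;
using System.IO;
using System.Web.Hosting;
using Umbraco.Core;
using Umbraco.Core.Events;
using Umbraco.Core.Models;
using Umbraco.Core.Services;

// Umbraco 7 discovers ApplicationEventHandler subclasses at startup.
public class PdfExtractOnSave : ApplicationEventHandler
{
    // Hypothetical property aliases; change them to match your document type.
    private const string UploadAlias = "pdfUpload";
    private const string ExtractAlias = "pdfExtract";

    protected override void ApplicationStarted(
        UmbracoApplicationBase umbracoApplication,
        ApplicationContext applicationContext)
    {
        ContentService.Saving += OnSaving;
    }

    private static void OnSaving(IContentService sender, SaveEventArgs<IContent> e)
    {
        foreach (var content in e.SavedEntities)
        {
            // Only touch documents that have both the upload and the extract field.
            if (!content.HasProperty(UploadAlias) || !content.HasProperty(ExtractAlias))
                continue;

            // Assumption: the upload property holds a virtual path to the file.
            var virtualPath = content.GetValue<string>(UploadAlias);
            if (string.IsNullOrEmpty(virtualPath) ||
                !virtualPath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                continue;

            var physicalPath = HostingEnvironment.MapPath(virtualPath);
            if (physicalPath == null || !File.Exists(physicalPath))
                continue;

            // Store the extracted text on the document so Examine indexes it.
            content.SetValue(ExtractAlias, ExtractPdfText(physicalPath));
        }
    }

    // Hypothetical placeholder: replace the body with your PDF text extractor.
    private static string ExtractPdfText(string physicalPath)
    {
        return string.Empty;
    }
}
```

The extracted text is written in the Saving event rather than afterwards, so it is persisted with the same save and no second save call is needed.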
Hi Ismail,
Thanks for your quick reply. That was something I was also thinking about, but one question arises: won't the database get big with the extracted PDF content in the multi text field? Would it be possible to save the PDF content only in the index and not in the multi text field? Or could that be a problem when I change a PDF's content?
I was also thinking of using a multi-search construction (a separate PDF index), but I don't think I can get the desired result that way.
Kind regards,
Ad
Ad,
OK, so ignore my suggestion, and also do not go for a separate index. What you could do is implement the GatheringNodeData event and, for the document type with the PDF upload, extract the text and inject it into the index, so then it's only in the index. When you update the PDF, GatheringNodeData will fire again because it's fired on document publish, so the content will be updated in the index.
Regards
Ismail
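A rough sketch of the GatheringNodeData approach, assuming Umbraco 7 with the default index provider named ExternalIndexer in ExamineSettings.config, and assuming the upload property is included in the index set so its path shows up in e.Fields. The class name, the aliases pdfDocument and pdfUpload, the index field name pdfContent, and the ExtractPdfText helper are hypothetical; the helper stands in for your own extractor.

```csharp
using System;
using System.IO;
using System.Web.Hosting;
using Examine;
using Umbraco.Core;
using UmbracoExamine;

public class PdfGatheringNodeData : ApplicationEventHandler
{
    // Hypothetical names; change them to match your site.
    private const string DocTypeAlias = "pdfDocument";
    private const string UploadAlias = "pdfUpload";
    private const string IndexField = "pdfContent";

    protected override void ApplicationStarted(
        UmbracoApplicationBase umbracoApplication,
        ApplicationContext applicationContext)
    {
        // Assumption: "ExternalIndexer" is the provider name in ExamineSettings.config.
        var indexer = ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"];
        indexer.GatheringNodeData += OnGatheringNodeData;
    }

    private static void OnGatheringNodeData(object sender, IndexingNodeDataEventArgs e)
    {
        // Skip media and member nodes.
        if (e.IndexType != IndexTypes.Content) return;

        // Only handle the document type that carries the PDF upload.
        string nodeTypeAlias;
        if (!e.Fields.TryGetValue("nodeTypeAlias", out nodeTypeAlias) ||
            !string.Equals(nodeTypeAlias, DocTypeAlias, StringComparison.OrdinalIgnoreCase))
            return;

        // Assumption: the upload field value is a virtual path to the PDF.
        string virtualPath;
        if (!e.Fields.TryGetValue(UploadAlias, out virtualPath) ||
            string.IsNullOrEmpty(virtualPath))
            return;

        var physicalPath = HostingEnvironment.MapPath(virtualPath);
        if (physicalPath == null || !File.Exists(physicalPath)) return;

        // Inject the text into the index only; nothing is written to the database.
        e.Fields[IndexField] = ExtractPdfText(physicalPath);
    }

    // Hypothetical placeholder: replace the body with your PDF text extractor.
    private static string ExtractPdfText(string physicalPath)
    {
        return string.Empty;
    }
}
```

Because the text is added only to e.Fields, it never lands in a document property, which addresses the database-size concern; the trade-off is that the PDF is re-parsed every time the node is indexed.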
Thanks Ismail,
Could you give me some direction on implementing the GatheringNodeData event and injecting into the index (some good forum / documentation links)?
At the moment I am getting a bit confused by all the information I am gathering. I have a VS 2012 solution with Umbraco 7, the PDF indexer is working, etc.
So I think I can accomplish this, but I am missing some knowledge about it.
Thank you so far!
Kind regards,
Ad
Ad,
Take a look at http://thecogworks.co.uk/blog/posts/2012/november/examiness-hints-and-tips-from-the-trenches-part-2/ I did quite a few blog posts on Examine that are well worth going through. There is also an accompanying video: https://www.youtube.com/watch?v=6AMb0rrSrJw
Regards
Ismail
Thanks Ismail,
I had found your blog earlier today :-). I had not found your YouTube video. That might be a nice addition to understand the topic better.
Is your solution also possible on Umbraco 7? I cannot find the namespace Umbraco_Site_Extensions.Helpers. Is that a custom addition?
I will watch your YouTube video first!
Kind regards,
Ad
That namespace is a custom addition, and this will work with v7, as GatheringNodeData is in all Umbraco / Examine versions.
Hi Ismail,
Thank you for your generous help!
I got the GatheringNodeData event working. I can extract the PDF text with your library and it's added to the index!
Have a nice day
Kind regards,
Ad
Ad,
Well done mate, Examine, it's not that scary lol!!!
This post is helpful, but I could use a bit more guidance. I've added a publishing event handler so that I can update my PDFContent property on my node prior to publishing. Ismail, in a previous post you state:
"get the file extract the contents of the pdf (you can make use of examine pdf extractor or install cogexaminemediaindexer) save the extracted contents into the document."
I reviewed your ParseMediaText code in the CogUmbracoExamineMediaIndexer project and found a number of code dependencies. I'm having some trouble siphoning out the pieces I need, since Tika is also involved. It looks like the Examine PDF Extractor might be an alternative, but I'm not sure where to find that source code. Do you have a more concise example of how to get the PDF content into an object that I can save in my PDFContent property?
Janet,
Did you get my fix yesterday for CogUmbracoExamineMediaIndexer? If you want the other PDF indexer, it has been moved into the Umbraco core source and will be coming out with 7.2.0; see https://github.com/umbraco/Umbraco-CMS/tree/7.2.0/src/UmbracoExamine.PDF
Regards
Ismail