Examine and uploaded PDFs


Chris Mahoney 242 posts 454 karma points

Mar 10, 2014 @ 23:53


Hi,

Is there any way to get Examine/Lucene to index PDFs that are attached to an Upload field on a Content node? I found CogUmbracoExamineMediaIndexer but it appears to only work in the Media section (I just ended up with an empty index).

It's hard to find the right terms to search for so I haven't had any luck with figuring this one out myself. Does anyone know whether this is possible?

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Mar 11, 2014 @ 09:44

Chris,

You will need to create an event handler for the document save or publish event. In the handler, check the document type and, if it is one with a PDF upload, get the PDF and call a method in, say, CogUmbracoExamineMediaIndexer to extract the text, then add that content to the index. I will take a look at which method you need to call. CogUmbracoExamineMediaIndexer only works with Media section items, not with upload fields on content.

Regards

Ismail

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Mar 11, 2014 @ 10:24

Chris,

In CogUmbracoExamineMediaIndexer, in the MediaIndexer class, there is an internal class; see the method ParseMediaText. You can rip that method out and use it to extract the content of the PDF. So overall I would do the following:
1. On the doc type that has the upload field, create a new field and call it, say, extractedPdfContent.
2. Implement the Document before-publish event. In it, check whether the current document is of the type that contains the PDF upload field; if it is, get the PDF file and pass it to your adapted ParseMediaText method, which will extract the PDF content via Apache Tika. Add the extracted content to the field you created in step 1.
After the publish completes, Examine indexing will take over and the extracted PDF content will end up in your index. So it's doable; it just needs a bit of work.
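
For anyone wiring this up, a rough sketch of steps 1 and 2 in C# might look like the following. It assumes the Umbraco 6/7 ContentService events and hooks the save event (a publish handler should look much the same). The aliases pdfDocument and pdfFile are made-up examples, and ExtractPdfText is a hypothetical placeholder for whatever extraction code you lift from ParseMediaText; it is not part of Umbraco or Examine. It also assumes the upload field stores a site-relative path to the file.

```csharp
using System;
using System.Web.Hosting;
using Umbraco.Core;
using Umbraco.Core.Events;
using Umbraco.Core.Models;
using Umbraco.Core.Services;

public class PdfExtractEventHandler : ApplicationEventHandler
{
    // Assumed aliases: replace with the ones on your own doc type.
    private const string DocTypeAlias = "pdfDocument";
    private const string UploadAlias = "pdfFile";
    private const string ExtractAlias = "extractedPdfContent";

    protected override void ApplicationStarted(
        UmbracoApplicationBase umbracoApplication,
        ApplicationContext applicationContext)
    {
        // Runs before the content is written, so the extracted text
        // is stored together with the rest of the document.
        ContentService.Saving += OnSaving;
    }

    private static void OnSaving(IContentService sender, SaveEventArgs<IContent> e)
    {
        foreach (IContent content in e.SavedEntities)
        {
            // Only touch the doc type that carries the PDF upload.
            if (content.ContentType.Alias != DocTypeAlias) continue;

            // Assumption: the upload field holds a path like /media/1234/file.pdf.
            var relativePath = content.GetValue<string>(UploadAlias);
            if (string.IsNullOrEmpty(relativePath) ||
                !relativePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            var physicalPath = HostingEnvironment.MapPath(relativePath);
            content.SetValue(ExtractAlias, ExtractPdfText(physicalPath));
        }
    }

    // Hypothetical helper: put the Tika-based extraction code here.
    private static string ExtractPdfText(string physicalPath)
    {
        throw new NotImplementedException("Plug in your PDF text extractor here.");
    }
}
```

Umbraco picks up ApplicationEventHandler subclasses at startup, so compiling the class into the site (or dropping it in App_Code) should be enough to register it.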

Regards

Ismail

Chris Mahoney 242 posts 454 karma points

Mar 14, 2014 @ 02:17

Thanks for that. I'd never built an event handler so that was a bit of a challenge (not helped by the documentation providing a code sample but no guidance on what to do with it!) but I got it all working in the end :)

ad de Vos 25 posts 46 karma points

Aug 19, 2014 @ 22:39

Hi, I have an additional question to this.

First of all, I am pretty new to Examine (and Umbraco 7).

I am trying to do the following:

Create a search on a document type with:

> uploaded PDF documents
> some metadata, such as author, publish date and categories

What is the best way to let the end user search on PDF content and categorize by metadata values?

Kind regards , Ad

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 20, 2014 @ 12:40

Ad,

Probably the easiest thing to do is to add an extra multi-line text field to the document type with the PDF upload; call it, say, PdfExtract. Then create an on-save event handler for the document. In that handler, test for a PDF; if one is present, get the file, extract the contents of the PDF (you can use the Examine PDF extractor or install CogUmbracoExamineMediaIndexer) and save the extracted contents into the document. This will then get added to your external index and you can search on it. By the way, I am assuming that when you say metadata you mean metadata added to the Umbraco document that you also want to search on, and NOT the actual metadata in the PDF.

Regards

Ismail

ad de Vos 25 posts 46 karma points

Aug 20, 2014 @ 12:49

Hi Ismail,

Thanks for your quick reply. That was something I was also thinking about, but one question arises: won't the database get big with the extracted PDF content in the multi-text field? Would it be possible to store the PDF content only in the index and not in the multi-text field? Or could that be a problem when I change a PDF's content?

I was also thinking of using a multi-search construction (a separate PDF index), but I don't think I can get the desired result that way.

Kind regards,

Ad

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 20, 2014 @ 13:20

Ad,

OK, so ignore my suggestion, and do not go for a separate index either. What you could do is implement the GatheringNodeData event and, for the document type with the PDF upload, extract the text and inject it into the index, so that it is only in the index. When you update the PDF, GatheringNodeData will fire again because it is fired on document publish, and the index will be updated.
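
As a rough illustration, an index-only version could be sketched like this in C#. It assumes the default ExternalIndexer, that the index set includes the upload property and the nodeTypeAlias field, and that the upload property holds a site-relative path. The aliases pdfDocument, pdfFile and pdfContent are made-up examples, and ExtractPdfText is a hypothetical placeholder for your own extraction code, not an Examine API.

```csharp
using System;
using System.Web.Hosting;
using Examine;
using Umbraco.Core;
using UmbracoExamine;

public class PdfIndexEventHandler : ApplicationEventHandler
{
    // Assumed names: adjust to your own indexer and doc type.
    private const string IndexerName = "ExternalIndexer";
    private const string DocTypeAlias = "pdfDocument";
    private const string UploadAlias = "pdfFile";
    private const string IndexFieldName = "pdfContent";

    protected override void ApplicationStarted(
        UmbracoApplicationBase umbracoApplication,
        ApplicationContext applicationContext)
    {
        ExamineManager.Instance.IndexProviderCollection[IndexerName]
            .GatheringNodeData += OnGatheringNodeData;
    }

    private static void OnGatheringNodeData(object sender, IndexingNodeDataEventArgs e)
    {
        // Skip media and member items.
        if (e.IndexType != IndexTypes.Content) return;

        string typeAlias;
        if (!e.Fields.TryGetValue("nodeTypeAlias", out typeAlias) ||
            typeAlias != DocTypeAlias)
        {
            return;
        }

        string relativePath;
        if (!e.Fields.TryGetValue(UploadAlias, out relativePath) ||
            string.IsNullOrEmpty(relativePath))
        {
            return;
        }

        // The text goes into the index only; nothing is saved on the document.
        var physicalPath = HostingEnvironment.MapPath(relativePath);
        e.Fields[IndexFieldName] = ExtractPdfText(physicalPath);
    }

    // Hypothetical helper: put your PDF text extraction here.
    private static string ExtractPdfText(string physicalPath)
    {
        throw new NotImplementedException("Plug in your PDF text extractor here.");
    }
}
```

Because the text never touches the document itself, the database stays small, at the cost of re-extracting the PDF every time the node is reindexed.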

Regards

Ismail

ad de Vos 25 posts 46 karma points

Aug 20, 2014 @ 13:25

Thanks Ismail,

Could you give me some direction on implementing the GatheringNodeData event and injecting into the index (some good forum or documentation links)?
At the moment I am getting a bit confused by all the information I am gathering. I have a VS 2012 solution with Umbraco 7, the PDF indexer is working, etc.
So I think I can accomplish this, but I am missing some knowledge about it.

Thank you so far!

Kind regards,

Ad

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 20, 2014 @ 16:16

Ad,

Take a look at http://thecogworks.co.uk/blog/posts/2012/november/examiness-hints-and-tips-from-the-trenches-part-2/ I did quite a few blog posts on Examine that are well worth going through. There is also an accompanying video: https://www.youtube.com/watch?v=6AMb0rrSrJw

Regards

Ismail

ad de Vos 25 posts 46 karma points

Aug 20, 2014 @ 16:44

Thanks Ismail,

I had found your blog earlier today :-). I had not found your YouTube video; that might be a nice addition to help me understand the topic better.
Is your solution also possible on Umbraco 7? I cannot find the namespace Umbraco_Site_Extensions.Helpers. Is that a custom addition?

I will watch your YouTube video first!

Kind regards,

Ad

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 20, 2014 @ 17:00

That namespace is a custom addition, and this will work with v7, as GatheringNodeData is in all Umbraco / Examine versions.

ad de Vos 25 posts 46 karma points

Aug 21, 2014 @ 00:13

Hi Ismail,

Thank you for your generous help!

I got the GatheringNodeData event working. I can extract the PDF text with your library and it's added to the index!

Have a nice day

Kind regards,

Ad

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Aug 21, 2014 @ 09:52

Ad,

Well done mate, Examine is not that scary lol!!!

Janet Kirklen 102 posts 212 karma points

Oct 02, 2014 @ 00:27

This post is helpful, but I could use a bit more guidance. I've added a publishing event handler so that I can update the PDFContent property on my node prior to publishing. Ismail, in a previous post you state:

"get the file extract the contents of the pdf (you can make use of examine pdf extractor or install cogexaminemediaindexer) save the extracted contents into the document."

I reviewed your ParseMediaText code in the CogUmbracoExamineMediaIndexer project and found a number of code dependencies. I'm having some trouble siphoning out the pieces I need since Tika is also involved. It looks like the Examine PDF Extractor might be an alternative, but I'm not sure where to find that source code. Do you have a more concise example of how to get the PDF content into an object that I can save in my PDFContent property?

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Oct 02, 2014 @ 10:12

Janet,

Did you get my fix yesterday for CogUmbracoExamineMediaIndexer? If you want the other PDF indexer, it has been moved into the Umbraco core source and will be coming out with 7.2.0; see https://github.com/umbraco/Umbraco-CMS/tree/7.2.0/src/UmbracoExamine.PDF

Regards

Ismail
