Indexing PDF files in Content nodes rather than Media nodes

Roger Sutton 52 posts 124 karma points

Feb 17, 2014 @ 18:36

0

Indexing PDF files in Content nodes rather than Media nodes

We have a document type that supports a list of PDFs to display in a page. Ideally we would like to include these in search indexes but the default PDF indexer only looks at files that are on Media nodes. Does anyone have any experience of indexing PDFs in content nodes?

Thanks in advance

Roger

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Feb 17, 2014 @ 18:41

0

Roger,

You will need to handle the page save event. For that doctype, get each PDF, load it into a third-party library, extract the text, and save that text into another field on the document. I did something similar years ago. You could use something like https://pdfapi.codeplex.com/ or search for another open-source .NET library that can extract the content.
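A minimal sketch of that save-event flow, written in Python purely to show the logic (an Umbraco handler would be written in .NET). Everything here is an assumption for illustration: `ContentNode`, `on_saving`, the doctype alias `pdfListPage`, and the property aliases `pdfFiles` and `pdfSearchText` are hypothetical names, and `extract_text` stands in for whichever PDF library you choose.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ContentNode:
    """Hypothetical stand-in for a content node; not an Umbraco type."""
    doc_type: str
    properties: Dict[str, object] = field(default_factory=dict)


def on_saving(
    node: ContentNode,
    extract_text: Callable[[str], str],
    doc_type: str = "pdfListPage",       # assumed doctype alias
    pdf_list_alias: str = "pdfFiles",    # assumed property holding PDF paths
    text_alias: str = "pdfSearchText",   # assumed field the indexer will read
) -> None:
    """Copy the text of every listed PDF into one searchable field."""
    # Only act on the doctype that carries the PDF list.
    if node.doc_type != doc_type:
        return

    paths: List[str] = list(node.properties.get(pdf_list_alias) or [])

    # Extract each PDF, dropping files that yield no text.
    texts = [extract_text(path).strip() for path in paths]
    node.properties[text_alias] = "\n".join(t for t in texts if t)
```

The extractor is passed in as a function so the PDF library can be swapped without touching the handler, and the extracted text lives in a single field that the normal content indexer already picks up.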

Regards

Ismail

Roger Sutton 52 posts 124 karma points

Feb 18, 2014 @ 10:24

0

Thanks for the reply Ismail.

Sadly I don't think this will be a runner, as there are about 20 to 30 PDFs listed for each page and that's just too much data. Does the indexing not read the actual PDF file itself, in the same way we used to catalogue PDFs in the old MS Index Server?

Supplemental question: does it happen this way when they are in the media area, i.e. is the data recorded on the PDF node as well as in the file itself?

Or, hopefully, have I missed a point somewhere?

Rog

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Feb 18, 2014 @ 16:01

100

Roger,

It will not read the PDF. Even in the media section the contents of the PDF are not indexed; you have to index them yourself, which is why I wrote http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer to index content of all sorts. However, it is not very fast because it uses Apache Tika, so if you have loads of media it can chug a little. It will not get round your issue, though. The only way I can see is to use the document save event: get the PDF data at that point and add it to a field on the document. So long as you are not constantly saving, i.e. the pages are not updated often, performance should not be an issue.
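If saves do turn out to be frequent, one option (not something described in this thread, just a common technique) is to skip extraction for PDFs whose bytes have not changed since the last save. A rough Python sketch of that idea, where `extract_if_changed`, the in-memory `cache` dictionary, and the `extract_text` callable are all hypothetical:

```python
import hashlib
from typing import Callable, Dict, Tuple

# Maps a PDF path to (content hash, previously extracted text).
Cache = Dict[str, Tuple[str, str]]


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the file contents."""
    return hashlib.sha256(data).hexdigest()


def extract_if_changed(
    path: str,
    data: bytes,
    cache: Cache,
    extract_text: Callable[[bytes], str],
) -> str:
    """Re-run the (slow) extractor only when the file contents changed."""
    digest = fingerprint(data)
    cached = cache.get(path)

    # Same bytes as last time: reuse the stored text.
    if cached is not None and cached[0] == digest:
        return cached[1]

    # New or modified file: extract and remember the result.
    text = extract_text(data)
    cache[path] = (digest, text)
    return text
```

With 20 to 30 PDFs on a page, a save that changes only one file would then trigger a single extraction rather than one per PDF. In a real site the cache would need to persist somewhere other than process memory.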

Regards

Ismail

Roger Sutton 52 posts 124 karma points

Feb 18, 2014 @ 16:52

0

Ismail,

Thank you very much. I was under the impression the PDF indexer read the content, and have been faffing around a lot because of that. I'll look at http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer and try from there.

Regards

Roger

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Feb 18, 2014 @ 18:18

0

Roger,

I misread your question: the PDF indexer in media should read PDF content. If you have PDFs attached to content then they would be indexed as part of the content. Give my PDF indexer a whirl; it should work.

Regards

Ismail

Akshatha 1 post 71 karma points

Apr 04, 2016 @ 03:18

0

Hi Roger,

I have a similar requirement in my Umbraco project, where I need to index only those media files which are attached to the content node. Could you please share how you implemented it?

Thanks heaps!

Regards, Akshatha
