Indexing PDF files in Content nodes rather than Media nodes
We have a document type that supports a list of PDFs to display in a page. Ideally we would like to include these in search indexes but the default PDF indexer only looks at files that are on Media nodes. Does anyone have any experience of indexing PDFs in content nodes?
Thanks in advance
Roger
Roger,
You will need to handle the page save event: for that document type, get each PDF, load it into a third-party library, extract its text, and save that text into another field on the document. I did something similar years ago. You could use something like https://pdfapi.codeplex.com/, or search for another open-source .NET library to extract the content.
Regards
Ismail
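The save-event approach described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Umbraco or Examine code: `Page`, `on_page_saved`, the `pdfListPage` document type alias and the `pdfText` field name are all hypothetical, and `extract_text` stands in for whichever PDF library is chosen.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    """Hypothetical stand-in for a content node (not Umbraco's API)."""
    doc_type: str
    pdf_paths: List[str]
    fields: Dict[str, str] = field(default_factory=dict)


def on_page_saved(
    page: Page,
    extract_text: Callable[[str], str],
    doc_type: str = "pdfListPage",   # assumed document type alias
    target_field: str = "pdfText",   # assumed name of the extra field
) -> None:
    """Copy the text of every listed PDF into one field on the page."""
    if page.doc_type != doc_type:
        return  # only the document type that lists PDFs is handled

    chunks = []
    for path in page.pdf_paths:
        try:
            chunks.append(extract_text(path))
        except Exception:
            # An unreadable or missing PDF should not block the save.
            continue

    # The indexer then picks this field up like any other text property.
    page.fields[target_field] = "\n".join(chunks)
```

Because the extraction only runs when the page is saved, its cost is paid at edit time rather than at index or search time.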
Thanks for the reply Ismail.
Sadly I don't think this will be a runner, as there are about 20 to 30 PDFs listed for each page and that's just too much data. Does the indexer not read the actual PDF file itself, in the same way we used to catalogue PDFs in the old MS Index Server?
Supplemental question: does it happen this way when they are in the media area, i.e. is the data recorded on the PDF node as well as in the file itself?
Or, hopefully, have I missed a point somewhere?
Rog
Roger,
It will not read the PDF. Even in the media section the contents of the PDF are not indexed; you have to index them yourself, which is why I wrote http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer. It will index content of all sorts, but it is not very fast as it uses Apache Tika, so if you have loads of media it can chug a little. It will not get round your issue, though. The only way I can see is to handle the document save event, get the PDF data at that point and add it to a field on the document. So long as you are not constantly saving, i.e. the PDFs are not updated often, performance should not be an issue.
Regards
Ismail
Ismail,
Thank you very much. I was under the impression the PDF indexer read the content, and have been faffing around a lot because of that. I'll look at http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer and try from there.
Regards
Roger
Roger,
I misread your question: the PDF indexer in media should read PDF content. If you have PDFs attached to content, then they would be indexed as part of the content. Give my PDF indexer a whirl; it should work.
Regards
Ismail
Hi Roger,
I have a similar requirement in my Umbraco project, where I need to index only those media files that are attached to a content node. Could you please share how you implemented it?
Thanks heaps!
Regards, Akshatha