  • Roger Sutton 52 posts 124 karma points
    Feb 17, 2014 @ 18:36
    Roger Sutton
    0

    Indexing PDF files in Content nodes rather than Media nodes

    We have a document type that supports a list of PDFs to display on a page. Ideally we would like to include these in the search indexes, but the default PDF indexer only looks at files that are on Media nodes. Does anyone have any experience of indexing PDFs in content nodes?

    Thanks in advance

    Roger

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Feb 17, 2014 @ 18:41
    Ismail Mayat
    0

    Roger,

    You will need to implement a page save event handler and, for that doctype, get each PDF, load it into a third-party library, extract its text content, then save that into another field on the document. I did something similar years ago. You could use something like https://pdfapi.codeplex.com/ or search for another open-source .NET library to extract the content.
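
    A minimal sketch of that save-time approach, written in Python purely as a language-neutral illustration (in Umbraco this logic would live in a .NET save event handler). The names `pdfFiles`, `pdfSearchText`, `build_pdf_search_text` and `on_document_saving` are hypothetical, and `extract_text` stands in for whichever PDF library you choose; none of them are Umbraco or Examine APIs.

```python
from typing import Callable, Iterable


def build_pdf_search_text(
    pdf_paths: Iterable[str],
    extract_text: Callable[[str], str],
) -> str:
    """Concatenate the extracted text of every PDF attached to one page."""
    parts = []
    for path in pdf_paths:
        try:
            text = extract_text(path)  # delegate to the chosen PDF library
        except Exception:
            # Assumption: one unreadable PDF should not block the save.
            continue
        text = " ".join(text.split())  # collapse runs of whitespace
        if text:
            parts.append(text)
    return "\n".join(parts)


def on_document_saving(document: dict, extract_text: Callable[[str], str]) -> None:
    """Hypothetical save hook: store the PDF text in an extra field."""
    # "pdfFiles" and "pdfSearchText" are made-up property aliases.
    document["pdfSearchText"] = build_pdf_search_text(
        document.get("pdfFiles", []), extract_text
    )
```

    The extra field is then indexed like any other text property on the page, so the search index never has to open the PDF files itself.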

    Regards

    Ismail

  • Roger Sutton 52 posts 124 karma points
    Feb 18, 2014 @ 10:24
    Roger Sutton
    0

    Thanks for the reply Ismail.

    Sadly I don't think this will be a runner, as there are about 20 to 30 PDFs listed for each page and that's just too much data. Does the indexer not read the actual PDF file itself, in the same way we used to catalogue PDFs with the old MS Index Server?

    Supplemental question: does it happen this way when they are in the media area, i.e. is the data recorded onto the PDF node as well as in the file itself?

    Or, hopefully, have I missed a point somewhere?

    Rog

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Feb 18, 2014 @ 16:01
    Ismail Mayat
    100

    Roger,

    It will not read the PDF. Even in the media section the contents of a PDF are not indexed; you have to index them yourself, which is why I wrote http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer to do that. It will index content of all sorts, however it's not very fast as it uses Apache Tika, so if you have loads of media it can chug a little. It will not get round your issue, though. The only way I can see is to use the document save event to get the PDF data and add it to a field on the document. So long as you are not constantly saving, i.e. the pages are not updated often, performance should not be an issue.

    Regards

    Ismail

  • Roger Sutton 52 posts 124 karma points
    Feb 18, 2014 @ 16:52
    Roger Sutton
    0

    Ismail,

    Thank you very much. I was under the impression that the PDF indexer read the content, and have been faffing around a lot because of that. I'll look at http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer and try from there.

    Regards

    Roger

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Feb 18, 2014 @ 18:18
    Ismail Mayat
    0

    Roger,

    I misread your question: the PDF indexer in media should read PDF content. If you have PDFs attached to content then they would be indexed as part of the content. Give my PDF indexer a whirl; it should work.

    Regards

    Ismail

  • Akshatha 1 post 71 karma points
    Apr 04, 2016 @ 03:18
    Akshatha
    0

    Hi Roger,

    I have a similar requirement in my Umbraco project, where I need to index only those media files that are attached to a content node. Could you please share how you implemented it?

    Thanks heaps!

    Regards, Akshatha
