
  • Ranjit J. Vaity 66 posts 109 karma points
    Apr 25, 2010 @ 22:19

    Indexing contents from PDF files.

    Hi,

    Currently I have a requirement in our project where editors can upload PDF documents through the Umbraco admin section.

    Later, if a user searches for a keyword, the search should cover the content of the PDF docs as well as the content on the Umbraco pages.

    So I need to index the contents of the PDF documents in the media folder.

    After a couple of hours of research, I have seen that we can index using Lucene, but as far as I can tell this only works on text files. So I first need to extract the text from the PDF files and then index those text files.

    I am currently looking into iTextSharp, but I have not been able to find a method to extract text from PDF pages.
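
The two-step idea described above (get plain text out of each document, then index that text so keyword searches can find it) can be sketched roughly as follows. This is only a minimal Python illustration, not Umbraco, Lucene, or iTextSharp code: the input is assumed to be text that has already been extracted from each PDF by some tool, and the small in-memory inverted index is a hypothetical stand-in for what a real search library would do. The names `tokenize`, `build_index`, and `search` are made up for this sketch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> set of document paths containing it.

    `docs` maps a document path to text that was ALREADY extracted from
    the PDF (the extraction step itself is assumed, not shown here).
    """
    index = defaultdict(set)
    for path, text in docs.items():
        for term in tokenize(text):
            index[term].add(path)
    return index


def search(index, query):
    """Return the paths of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Example usage with made-up file names and text.
extracted = {
    "media/guide.pdf": "Umbraco search with Lucene",
    "media/notes.pdf": "Lucene indexes plain text",
}
idx = build_index(extracted)
print(sorted(search(idx, "lucene")))
```

A real setup would replace the dictionary of extracted text with the output of an actual PDF text extractor and would hand the text to the search engine's own indexer rather than a Python dictionary.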

    Any suggestions, advice, or help will be appreciated.

    Thank you in advance.

    /Ranjit J. Vaity

     

     

  • Dirk De Grave 4541 posts 6021 karma points MVP 3x admin c-trib
    Apr 26, 2010 @ 09:32

    Hi Ranjit,

    Check this project and see if that fits your needs.

     

    Cheers,

    /Dirk

  • Darren Ferguson 1022 posts 3259 karma points MVP c-trib
    Apr 26, 2010 @ 10:21

    Also consider generating PDFs from within Umbraco: http://our.umbraco.org/projects/xsl-pdf-creator

    The content nodes should then be indexed out of the box by Examine in Umbraco 4.1.

     

     

     

  • Ranjit J. Vaity 66 posts 109 karma points
    May 02, 2010 @ 17:48

    Hi Dirk / Darren,

    The link you provided has excellent stuff that I can really use.

    Thank you for your replies, guys.

    Cheers,

    /Ranjit J. Vaity

     

     

     
