  • Ian Robinson 79 posts 143 karma points
    Oct 19, 2011 @ 13:30

    Reading contents of indexed PDF

    Hi,

    I've set up Examine with a PDF indexer and searcher and it's returning the results I need.  The indexing is using iText to read the PDF.

    When I search on content nodes, I can return a short part of the content, e.g. 300 characters, by reading from the bodyText property. If I read 300 characters of the FileTextContent property of the PDF document, I get a run of characters that isn't very readable, e.g.:

    "Thisisatestdocumentfortestingthesearchingfacility.OliverTwistisbornintoalifeofpovertyand"

    Does anyone know how to read this in and tidy it up, or should I be reading a different property?

    The other thing I've considered is opening each PDF with iText and reading the first 300 characters, but I'd like to avoid that overhead if I can.

    I'm sure someone else must have come across this issue in the past and solved it?

  • Rodion Novoselov 694 posts 859 karma points
    Oct 19, 2011 @ 13:40

    Hi. If performance is your concern, you could simply store a 'preview' of the PDF document alongside the document itself and show it later. The preview can be generated while the document is being uploaded, or in the background by a scheduler, depending on your scenario.
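
As a rough illustration of the preview idea, here is a minimal Python sketch (not Umbraco or Examine code). It assumes the upload step already has the PDF text with its whitespace intact; `make_preview`, `on_document_uploaded` and the in-memory `previews` store are hypothetical names used only for this example.

```python
def make_preview(text: str, limit: int = 300) -> str:
    """Collapse runs of whitespace and cut at the last word boundary within `limit`."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    cut = collapsed[:limit]
    # Back up to the last space so the preview does not end mid-word.
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut + "..."


# Hypothetical store: in a real site this would be a document property or a table.
previews: dict[str, str] = {}


def on_document_uploaded(doc_id: str, extracted_text: str) -> None:
    """Compute the preview once, at upload time, so searches only read it back."""
    previews[doc_id] = make_preview(extracted_text)
```

The search results page would then read the stored preview instead of opening the PDF again.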

  • Ian Robinson 79 posts 143 karma points
    Oct 19, 2011 @ 13:45

    Thanks Rodion, that's a good suggestion, I hadn't thought of that.

    If I could get the information from the FileTextContent field in a nicer format, that would save me creating the preview. I think it would also be more relevant: a preview would contain the first 300 characters of the PDF, which might not include the search term the user submitted.

    Your suggestion is a good option though, thanks.
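
To make the relevance point concrete, here is a minimal Python sketch of an excerpt centred on the search term rather than taken from the start of the document. It assumes readable text (with spaces) is available for each PDF, which is exactly what FileTextContent is not providing here; `excerpt_around` is a hypothetical helper, not part of Examine or iText.

```python
import re


def excerpt_around(text: str, term: str, width: int = 300) -> str:
    """Return about `width` characters of `text` centred on the first match of `term`."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        # Term not found: fall back to the start of the document.
        start = 0
    else:
        # Split the remaining width evenly before and after the term.
        padding = max(0, width - len(term)) // 2
        start = max(0, match.start() - padding)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

If the term does not occur in the text, the excerpt is simply the first `width` characters, which matches what a stored preview would show.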
