reading contents of indexed pdf

Ian Robinson 79 posts 143 karma points

Oct 19, 2011 @ 13:30

Reading contents of indexed PDF

Hi,

I've set up Examine with a PDF indexer and searcher and it's returning the results I need. The indexing is using iText to read the PDF.

When I search on content nodes, I can return a short part of the content, i.e. 300 characters, by reading from the bodyText property. If I read 300 characters of the FileTextContent property of the PDF document, I get a run-together string with no spaces that isn't very readable, e.g.

"Thisisatestdocumentfortestingthesearchingfacility.OliverTwistisbornintoalifeofpovertyand"

Does anyone know how to read this in and tidy it up, or should I be reading a different property?

The other thing I've considered is opening each PDF with iText and reading in the first 300 characters, but I'd like to avoid that overhead if I can.

I'm sure someone else must have come across this issue in the past and solved it?

Rodion Novoselov 694 posts 859 karma points

Oct 19, 2011 @ 13:40

Hi. If performance is your concern, you could store a 'preview' of each PDF alongside the document itself and display that later. The preview can be generated while the document is being uploaded, or in the background by a scheduler, depending on your scenario.
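
A minimal sketch of the preview idea, written in Python purely for illustration (the thread itself concerns Umbraco and Examine on .NET). It assumes the PDF text has already been extracted with its spaces intact; `make_preview` is a hypothetical helper name, not part of Examine or iText.

```python
import re


def make_preview(text: str, limit: int = 300) -> str:
    """Collapse whitespace and cut at a word boundary within `limit` characters.

    Hypothetical helper: assumes `text` was extracted with spaces preserved.
    """
    # Turn runs of spaces, tabs and newlines into single spaces.
    normalized = re.sub(r"\s+", " ", text).strip()
    if len(normalized) <= limit:
        return normalized
    # Look one character past the limit so a word ending exactly at the
    # limit is kept whole.
    window = normalized[: limit + 1]
    last_space = window.rfind(" ")
    # If there is no space to break on, fall back to a hard cut.
    cut = window[:last_space] if last_space > 0 else normalized[:limit]
    return cut + "..."
```

The result could be saved to its own property when the file is uploaded, so search results can display it without reopening the PDF.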

Ian Robinson 79 posts 143 karma points

Oct 19, 2011 @ 13:45

Thanks Rodion, that's a good suggestion, I hadn't thought of that.

If I could get the information from the FileTextContent field in a nicer format, that would save me creating the preview. I think it would also be more relevant, because a preview would contain only the first 300 characters of the PDF, which might not include the search term the user submitted.

Your suggestion is a good option though, thanks.
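
For the relevance point above, here is a rough sketch of an excerpt centred on the search term instead of the start of the document. It is in Python for illustration only, and `excerpt_around` is a hypothetical name. It assumes readable text with spaces is available, which the FileTextContent value quoted earlier in the thread was not.

```python
import re


def excerpt_around(text: str, term: str, width: int = 300) -> str:
    """Return up to `width` characters of `text` around the first match of `term`.

    Hypothetical helper: falls back to the start of the text if the term
    is not found. It cuts on character positions, not word boundaries.
    """
    # Case-insensitive search; re.escape treats the term as literal text.
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        start = 0
    else:
        # Put roughly equal context on each side of the match.
        pad = max(0, (width - len(term)) // 2)
        start = max(0, match.start() - pad)
    end = min(len(text), start + width)
    # If the window ran into the end of the text, pull it back.
    start = max(0, end - width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

A search-term excerpt like this needs the full text at query time, whereas a stored preview only needs the first few hundred characters.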
