Reading contents of indexed PDF
Hi,
I've set up Examine with a PDF indexer and searcher and it's returning the results I need. The indexing is using iText to read the PDF.
When I search on content nodes, I can return a short part of the content, e.g. 300 characters, by reading from the bodyText property. If I read 300 characters of the FileTextContent property of the PDF document, I get a run of text with the spaces stripped out, which isn't very readable, e.g.:
"Thisisatestdocumentfortestingthesearchingfacility.OliverTwistisbornintoalifeofpovertyand"
Does anyone know how to read this in and tidy it up, or should I be reading a different property?
The other thing I've considered is opening each PDF with iText and reading in the first 300 characters, but I'd like to avoid the overhead of doing this if I can.
I'm sure someone else must have come across this issue in the past and solved it?
Hi. If performance is your concern, you could store a 'preview' of each PDF document alongside it and show that later. The preview can be generated while the document is being uploaded, or in the background by a scheduler, depending on your scenario.
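As a rough illustration of the preview idea, here is a minimal, language-neutral sketch in Python (the same logic ports directly to C#). It assumes the PDF text has already been extracted with its spaces intact; `make_preview` is a hypothetical helper name, not part of Examine or iText.

```python
def make_preview(text: str, max_chars: int = 300) -> str:
    """Build a short preview to store alongside the document (hypothetical helper)."""
    # Collapse line breaks and repeated whitespace left over from extraction.
    cleaned = " ".join(text.split())
    if len(cleaned) <= max_chars:
        return cleaned
    # Look one character past the limit so a word ending exactly at the
    # limit is kept whole, then cut back to the last space.
    cut = cleaned[:max_chars + 1]
    head = cut.rsplit(" ", 1)[0] if " " in cut else cleaned[:max_chars]
    # The ellipsis is appended on top of max_chars.
    return head.rstrip() + "..."
```

The result could be saved to a separate property when the file is uploaded, so nothing has to reopen the PDF at search time.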
Thanks Rodion, that's a good suggestion, I hadn't thought of that.
If I could get the text from the FileTextContent field in a more readable format, that would save me creating the preview. It would also be more relevant, I think, because a preview would only contain the first 300 characters of the PDF, which might not include the search term the user submitted.
Your suggestion is a good option though, thanks.
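For reference, the term-centred excerpt I have in mind could look roughly like the sketch below. It is written in Python only to show the logic, and it assumes I can get readable text (with spaces) out of the index or the PDF in the first place, which is exactly what FileTextContent isn't giving me at the moment. `excerpt_around` is a made-up name, not an Examine or iText call.

```python
import re


def excerpt_around(text: str, term: str, width: int = 300) -> str:
    """Return about `width` characters of text centred on the first match of term."""
    cleaned = " ".join(text.split())
    # Case-insensitive literal search; re.escape stops the term being read as a pattern.
    match = re.search(re.escape(term), cleaned, re.IGNORECASE)
    if match is None:
        start = 0  # no hit: fall back to the start of the document
    else:
        start = max(0, match.start() - (width - len(term)) // 2)
    end = min(len(cleaned), start + width)
    start = max(0, end - width)  # slide the window back if it ran off the end
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(cleaned) else ""
    return prefix + cleaned[start:end] + suffix
```

This cuts on character positions, so words at either edge can be clipped; trimming to the nearest space would be an easy refinement.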