Exclude header and footer data from PDF with ExamineFileIndexer

David Amri 214 posts 740 karma points

Dec 18, 2018 @ 14:26

Exclude header and footer data from PDF with ExamineFileIndexer

Hi, is it somehow possible to exclude the header and footer content of a PDF file with ExamineFileIndexer? Has anyone tried this, or could anyone point me in the right direction?

Best regards /David

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Dec 19, 2018 @ 09:16

David,

You will need to do this yourself. I would modify the source code and, if the file is a PDF, remove the header and footer. You will need to look at https://github.com/thecogworks/examinefileindexer/blob/master/src/Cogworks.ExamineFileIndexer/MediaParser.cs. Under the hood it uses TikaOnDotNet, which uses Apache Tika; see https://github.com/KevM/tikaondotnet/blob/master/src/TikaOnDotnet.TextExtractor/TextExtractor.cs. I'm not sure if there is anything within that lib that can help. I suspect not, as the extract calls the underlying Java lib and gets all the content.

If you are only indexing PDFs, I would use Examine PDF Indexer, which uses iTextSharp under the hood, and that should give you APIs to remove things after extraction.
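
To give an idea of what "removing things after extraction" could look like, here is a minimal sketch of one common heuristic: drop lines that recur at the top or bottom of most pages. It is library-independent and shown in Python only for brevity, so it would need porting to C# to live inside either indexer. The function name `strip_repeated_margins` and its parameters are hypothetical, and it assumes you can already get the extracted text one page at a time, which neither indexer is confirmed to provide out of the box.

```python
import re
from collections import Counter


def _key(line):
    # Collapse digits so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line)


def strip_repeated_margins(pages, edge_lines=2, min_share=0.6):
    """Drop lines that recur at the top or bottom of most pages.

    pages:      list of strings, one per PDF page (assumed already split).
    edge_lines: how many lines at each edge are header/footer candidates.
    min_share:  fraction of pages a line must appear on to be dropped.
    """
    split = [[ln.strip() for ln in p.splitlines() if ln.strip()] for p in pages]

    counts = Counter()
    for lines in split:
        edges = lines[:edge_lines] + lines[-edge_lines:]
        # Count each candidate once per page.
        counts.update({_key(ln) for ln in edges})

    # Never drop anything seen on fewer than two pages.
    threshold = max(2, min_share * len(split))
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split:
        last = len(lines) - edge_lines
        body = [
            ln for i, ln in enumerate(lines)
            if not (_key(ln) in repeated and (i < edge_lines or i >= last))
        ]
        cleaned.append("\n".join(body))
    return cleaned
```

The heuristic only looks at the first and last few lines of each page, so body text that happens to repeat in the middle of a page is left alone. It will not help with single-page documents, and it can drop real content that sits at a page edge and repeats (for example numbered section titles), so the thresholds would need tuning against your own PDFs.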

Regards

Ismail

David Amri 214 posts 740 karma points

Dec 19, 2018 @ 10:06

Hi Ismail,

I'm currently trying out both "examine pdf indexer" and "examine file indexer" in order to compare the two.

Currently I'm leaning towards your indexer, as I'm able to retrieve more data from the indexed PDF out of the box. But if I understand you correctly, it would be a better choice to modify the source code of the "examine pdf indexer" to do this. And yes, I only need to index PDF files.

Best regards /David

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Dec 19, 2018 @ 10:35

David,

My one gives you more data, but getting at the PDF so that the header and footer are not extracted takes a bit more messing around. With the PDF indexer you have the iTextSharp lib, and you could use that to remove the header / footer before extraction.

However, with either one you will need to modify the source so that you only get back what you want from the PDF.
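
As a rough illustration of the "before extraction" route, the usual idea is to keep only the text whose position falls inside the body area of the page and ignore anything in the top and bottom margins. I believe iTextSharp 5 has region-based filtering for text extraction in its parser namespace, but check that against the version the indexer ships with. The sketch below is library-independent Python, not iTextSharp code: the `Chunk` type, the `body_text` function and the 60-point margins are all hypothetical, and it assumes you can obtain each text chunk together with its vertical position.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    y: float  # baseline height above the bottom edge of the page, in points


def body_text(chunks, page_height, top_margin=60.0, bottom_margin=60.0):
    """Join the chunks whose baseline lies between the two margins."""
    lower = bottom_margin
    upper = page_height - top_margin
    kept = [c for c in chunks if lower <= c.y <= upper]
    # PDF y coordinates grow upward by default, so sort descending
    # to get top-to-bottom reading order.
    kept.sort(key=lambda c: -c.y)
    return "\n".join(c.text for c in kept)


# Example: an A4 page is roughly 842 points tall.
page = [
    Chunk("Company header", 810.0),
    Chunk("First paragraph", 700.0),
    Chunk("Second paragraph", 400.0),
    Chunk("Page 1", 30.0),
]
print(body_text(page, page_height=842.0))
```

Fixed margins only work when all your PDFs share a layout. If they come from different templates, the margin sizes would have to be configured per template or combined with a repeated-line check.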

Regards

Ismail
