Exclude header and footer data from PDF with ExamineFileIndexer
Hi, is it somehow possible to exclude the header and footer content of a PDF file with ExamineFileIndexer? Has anyone tried this, or could anyone point me in the right direction?
Best regards /David
David,
You will need to do this yourself. I would modify the source code so that, when the file is a PDF, the header and footer are removed. Have a look at https://github.com/thecogworks/examinefileindexer/blob/master/src/Cogworks.ExamineFileIndexer/MediaParser.cs. Under the hood it uses TikaOnDotnet, which in turn uses Apache Tika; see https://github.com/KevM/tikaondotnet/blob/master/src/TikaOnDotnet.TextExtractor/TextExtractor.cs. I'm not sure whether there is anything within that lib that can help. I suspect not, as the extraction calls the underlying Java lib and returns all the content.
If you are only indexing PDFs, I would use the Examine PDF indexer, which uses iTextSharp under the hood; that will give you APIs to remove things after extraction.
Regards
Ismail
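Since the Tika route returns all of the content, one option is to clean the text after extraction. The following is only a minimal sketch of that idea in Python, not part of ExamineFileIndexer or Tika: it assumes you can already obtain the extracted text one page at a time, and the function name `strip_repeated_edge_lines` and its parameters are hypothetical. It drops lines near the top or bottom of a page that recur on most pages, treating digits as interchangeable so that "Page 1" and "Page 2" count as the same line.

```python
import re
from collections import Counter


def _key(line):
    # Treat all digit runs as equal so page numbers still match across pages.
    return re.sub(r"\d+", "#", line)


def strip_repeated_edge_lines(pages, edge_lines=2, min_fraction=0.6):
    """Drop lines near the top/bottom of each page that recur on most pages.

    pages: list of strings, one per page (an assumption about the input).
    """
    # Split each page into non-empty, stripped lines.
    split_pages = [
        [ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages
    ]

    # Count on how many pages each normalized line appears in an edge position.
    counts = Counter()
    for lines in split_pages:
        edge = lines[:edge_lines] + lines[-edge_lines:]
        counts.update({_key(ln) for ln in edge})

    # A line must appear on at least two pages to count as a header/footer.
    threshold = max(2, int(min_fraction * len(split_pages)))
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        last_body_index = len(lines) - edge_lines
        kept = [
            ln
            for i, ln in enumerate(lines)
            if not (_key(ln) in repeated and (i < edge_lines or i >= last_body_index))
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

This is a heuristic: it only works for documents with several pages, and a body line that happens to repeat at the edge of most pages would be dropped as well.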
Hi Ismail,
I'm currently trying out both "examine pdf indexer" and "examine file indexer" in order to compare the two.
Currently I'm leaning towards your indexer, as I'm able to retrieve more data from the indexed PDF out of the box. But if I understand you correctly, it would be a better choice to modify the source code of the "examine pdf indexer" to do this. And yes, I only need to index PDF files.
Best regards /David
David,
My one gives you more data, but getting at the PDF and skipping the header and footer during extraction is a bit more messing around. With the PDF indexer you have the iTextSharp lib, and you could use that to remove the header and footer before extraction.
However, with either one you will need to modify the source so that you only get back what you want from the PDF.
Regards
Ismail
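The other option, excluding the header and footer before the text is joined, comes down to keeping only text that sits inside the body area of the page. As far as I know, in iTextSharp 5.x this is usually done by passing a `RegionTextRenderFilter` wrapped in a `FilteredTextRenderListener` to `PdfTextExtractor.GetTextFromPage`. The sketch below only illustrates the underlying logic in Python with hypothetical data: the `(text, y)` chunk format, the function name `body_text`, and the margin values are all assumptions, and real margins would have to be measured from your own PDFs.

```python
def body_text(chunks, page_height, header_height=60.0, footer_height=50.0):
    """Join the chunks whose baseline lies inside the body band of the page.

    chunks: iterable of (text, y) pairs, with y in points measured from the
    bottom of the page (the default PDF coordinate orientation).
    """
    lower = footer_height                # everything below this is footer
    upper = page_height - header_height  # everything above this is header

    body = [(text, y) for text, y in chunks if lower <= y <= upper]

    # Larger y means higher on the page, so sort descending for reading order.
    body.sort(key=lambda chunk: -chunk[1])
    return "\n".join(text for text, _ in body)


# Example with made-up positions on an A4 page (842 points high).
page = [
    ("Company header", 810.0),
    ("First paragraph", 700.0),
    ("Second paragraph", 650.0),
    ("Page 1 of 3", 30.0),
]
print(body_text(page, page_height=842.0))
```

Fixed margins like these only work when every page uses the same layout; a cover page or a landscape page would need its own values.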