
  • David Amri 214 posts 740 karma points
    Dec 18, 2018 @ 14:26
    0

    Exclude header and footer data from PDF with ExamineFileIndexer

    Hi, is it possible to exclude the header and footer content of a PDF file when indexing with ExamineFileIndexer? Has anyone tried this, or could anyone point me in the right direction?

    Best regards /David

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Dec 19, 2018 @ 09:16
    1

    David,

    You will need to do this yourself. I would modify the source code and, if the file is a PDF, remove the header and footer. You will need to look at https://github.com/thecogworks/examinefileindexer/blob/master/src/Cogworks.ExamineFileIndexer/MediaParser.cs. Under the hood it uses TikaOnDotNet, which in turn uses Apache Tika; see https://github.com/KevM/tikaondotnet/blob/master/src/TikaOnDotnet.TextExtractor/TextExtractor.cs. I'm not sure if there is anything within that library that can help. I suspect not, as the extract calls the underlying Java library and gets all the content.

    If you are only indexing PDFs, I would use the Examine PDF indexer, which uses iTextSharp under the hood. That should give you APIs to remove things after extraction.

    Regards

    Ismail
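A rough way to picture the "remove things after extraction" idea above: once the text of each page is available as a list of lines, lines that recur at the top or bottom of most pages can be treated as header/footer and dropped. The sketch below is plain Python and does not call Tika or iTextSharp; the function name `strip_repeated_edges` and its parameters are hypothetical, and it assumes the extractor can hand back text page by page.

```python
from collections import Counter
import re


def _normalize(line):
    """Collapse whitespace and mask digits so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_edges(pages, edge_lines=2, min_ratio=0.6):
    """Drop lines near the top/bottom of each page that recur on most pages.

    pages: list of pages, each a list of already-extracted text lines.
    edge_lines: how many lines at each end of a page are header/footer candidates.
    min_ratio: fraction of pages a line must appear on to count as boilerplate.
    """
    counts = Counter()
    for lines in pages:
        edge = lines[:edge_lines] + lines[-edge_lines:]
        # Use a set so a line is counted at most once per page.
        counts.update({_normalize(l) for l in edge if l.strip()})

    # Require at least two pages so a single-page document is left untouched.
    threshold = max(2, min_ratio * len(pages))
    boilerplate = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and _normalize(line) in boilerplate:
                continue  # repeated header/footer line
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

This is only a heuristic: it will miss headers that change on every page and could drop a body line that happens to repeat at a page edge, so the thresholds would need tuning against real documents.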

  • David Amri 214 posts 740 karma points
    Dec 19, 2018 @ 10:06
    0

    Hi Ismail,

    I'm currently trying out both "examine pdf indexer" and "examine file indexer" in order to compare the two.

    At the moment I'm leaning towards your indexer, as I'm able to retrieve more data from the indexed PDF out of the box. But if I understand you correctly, it would be a better choice to modify the source code of the "examine pdf indexer" to do this. And yes, I only need to index PDF files.

    Best regards /David

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Dec 19, 2018 @ 10:35
    100

    David,

    Mine gives you more data, but getting at the PDF so that the header and footer are not extracted takes a bit more messing around. With the PDF indexer you have the iTextSharp library, and you could use that to remove the header and footer before extraction.

    However, with either one you will need to modify the source so that you only get back what you want from the PDF.

    Regards

    Ismail
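One way to think about removing the header and footer before extraction is to filter by position on the page: text whose baseline falls inside a top or bottom band is skipped. As far as I know, iTextSharp 5 offers region-based filtering for this (`RegionTextRenderFilter` together with `FilteredTextRenderListener`), but the sketch below does not call that library. It is a small Python model of the idea, where the `TextChunk` type, the `body_text` function, and the 72-point margins are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """A piece of text plus the vertical position of its baseline (hypothetical type)."""
    text: str
    y: float  # distance from the bottom of the page, in points


def body_text(chunks, page_height, top_margin=72.0, bottom_margin=72.0):
    """Keep only chunks whose baseline lies between the footer and header bands."""
    lower = bottom_margin
    upper = page_height - top_margin
    kept = [c for c in chunks if lower <= c.y <= upper]
    # PDF coordinates grow upwards, so sort descending to get reading order.
    kept.sort(key=lambda c: -c.y)
    return "\n".join(c.text for c in kept)


# Usage: an A4 page is 842 points tall; header and footer sit in the outer bands.
page = [
    TextChunk("Footer", 30),
    TextChunk("Body line 2", 400),
    TextChunk("Header", 800),
    TextChunk("Body line 1", 600),
]
result = body_text(page, page_height=842)
```

Fixed margins only work when the header and footer occupy a predictable band on every page, so the values would have to be measured from the actual PDFs.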
