Indexing contents from PDF files

Ranjit J. Vaity 66 posts 109 karma points

Apr 25, 2010 @ 22:19

Hi,

Currently I have a requirement in our project where editors can upload PDF documents using the Umbraco admin section.

Later, if a user searches for a keyword, the search should cover the content of the PDF documents as well as the content of the Umbraco pages.

So I need to index the contents of the PDF documents in the media folder.

After a couple of hours of research, I have seen that we can index using Lucene, but as far as I can tell that only works on text files. So I first need to extract the text from the PDF files, and then index those text files.
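To make the two steps concrete (extract the text, then index it), here is a minimal sketch in Python. It is not Umbraco, Lucene, or iTextSharp code. The extractor is passed in as a function, and `fake_extract` and the `media/...` paths are hypothetical stand-ins so the example runs without a PDF library. A real setup would plug in an actual PDF text extractor and a real search index.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(
    documents: Iterable[Tuple[str, bytes]],
    extract_text: Callable[[bytes], str],
) -> Dict[str, Set[str]]:
    """Step 1: extract text from each document. Step 2: map each token to the documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, raw in documents:
        for token in tokenize(extract_text(raw)):
            index[token].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the documents that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


def fake_extract(raw: bytes) -> str:
    """Hypothetical stand-in for a real PDF text extractor (assumption: not a real PDF parser)."""
    return raw.decode("utf-8")


# Hypothetical media items; the byte strings stand in for PDF file contents.
docs = [
    ("media/report.pdf", b"Annual sales report"),
    ("media/guide.pdf", b"Installation guide for editors"),
]
index = build_index(docs, fake_extract)
```

Keeping the extractor separate from the indexer means the PDF library can be swapped without touching the search side.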

I am currently looking into iTextSharp, but I have not been able to find any method to extract text from PDF pages.

Any suggestions, advice, or help will be appreciated.

Thank you in advance.

/Ranjit J. Vaity

Dirk De Grave 4541 posts 6021 karma points MVP 3x admin c-trib

Apr 26, 2010 @ 09:32

Hi Ranjit,

Check this project and see if that fits your needs.

Cheers,

/Dirk

Darren Ferguson 1022 posts 3259 karma points MVP c-trib

Apr 26, 2010 @ 10:21

Also consider generating PDFs from within Umbraco - http://our.umbraco.org/projects/xsl-pdf-creator

The content nodes should then be indexed out of the box by Examine in Umbraco 4.1.

Ranjit J. Vaity 66 posts 109 karma points

May 02, 2010 @ 17:48

Hi Dirk / Darren,

The link you provided has excellent stuff that I can really use.

Thank you for your replies, guys.

Cheers,

/Ranjit J. Vaity
