Indexing contents from PDF files.
Hi,
We currently have a requirement in our project where editors can upload PDF documents through the Umbraco admin section.
Later, when a user searches for a keyword, the search should cover the content of the PDF documents as well as the content of the Umbraco pages.
So I need to index the contents of the PDF documents in the media folder.
After a couple of hours of research, I have seen that we can index using Lucene, but apparently only text files. So I need to extract the text from the PDF files first and then index those text files.
I am currently looking into iTextSharp, but I have not been able to find a method for extracting text from PDF pages.
Any suggestions, advice, and help will be appreciated.
Thank you in advance.
/Ranjit J. Vaity
Hi Ranjit,
Check this project and see if that fits your needs.
Cheers,
/Dirk
Also consider generating PDFs from within Umbraco - http://our.umbraco.org/projects/xsl-pdf-creator
The content nodes should then be indexed out of the box by Examine in Umbraco 4.1.
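For anyone who wants to see the two-step idea from the question (extract the text of each PDF, then index it next to the page content) in miniature, here is a rough sketch. It is not Umbraco, Examine, or Lucene code: `SearchIndex` is a made-up in-memory stand-in for the real index, `index_pdfs` is a hypothetical helper, and `extract_pages` is a placeholder for whatever PDF library does the extraction. On the .NET side, as far as I know, iTextSharp 5.x has a `PdfTextExtractor` class in `iTextSharp.text.pdf.parser` that could fill that role. The sketch also passes the extracted text straight to the index as a string, on the assumption that intermediate text files are not strictly needed.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SearchIndex:
    """Hypothetical in-memory inverted index: term -> set of document ids."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        """Index one document (a content page or an extracted PDF)."""
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query: str) -> List[str]:
        """Return the sorted ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self._postings.get(term, set())
        return sorted(hits)


def index_pdfs(
    index: SearchIndex,
    pdf_paths: Iterable[str],
    extract_pages: Callable[[str], List[str]],
) -> None:
    """Extract the text of each PDF page by page and add it to the index.

    extract_pages is an assumption: any callable that takes a file path
    and returns one string of text per page.
    """
    for path in pdf_paths:
        pages = extract_pages(path)
        index.add(path, "\n".join(pages))


if __name__ == "__main__":
    # Fake extractor so the sketch runs without a PDF library installed.
    fake_pdfs = {"media/brochure.pdf": ["Product brochure", "Warranty terms apply"]}

    index = SearchIndex()
    index.add("page:home", "Welcome to our product pages")
    index_pdfs(index, fake_pdfs.keys(), lambda path: fake_pdfs[path])

    print(index.search("product"))
    print(index.search("warranty"))
```

The point of keeping extraction behind a callable is that the indexing side does not care where the text came from, so page content and PDF content end up in the same index and one keyword search covers both.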
Hi Dirk / Darren,
The link you provided has excellent material that I can really use.
Thank you for your replies, guys.
Cheers,
/Ranjit J. Vaity