Umbraco Examine: Indexing PDF and Open XML Office documents gives incomplete index
Hello all,
On our web site, we need to include media files in search results. "Media files" include: PDF, xlsx, docx, pptx.
We're using a third-party component which performs indexing. But details about the third-party component seem irrelevant, so I'll leave them out for now.
We’ve encountered a strange issue: indexing media files (PDF, xlsx, docx and pptx) seems fragile and very sensitive to errors during indexing. To be precise, it seems that some errors during indexing of files (for example, a corrupt PDF document) are causing either an incomplete Lucene index, or no index at all. This happens on application startup, as well as on manual re-indexing of files. As a consequence, search on our web site doesn’t include most of the media files.
When I manually remove all corrupted (or potentially “dangerous”) files on the server, at a certain point the indexing goes well and the Lucene index output is generated correctly. But this is obviously not possible in the production environment, because the client is updating media content without our intervention.
It seems that, for some reason, Lucene segments get lost during optimization, or don't get created at all.
There are about 500 MB of these document types on the server.
When an incomplete index is created, the index folder contains three files:
a .cfs file (for example, _2.cfs);
segments.gen;
segments_x
As already mentioned, the .cfs file contains just a small subset (say, five or ten) of the total documents on the web site.
Questions:
Does this situation sound familiar?
Can I control the behavior during indexing, and basically tell the indexer "don't break on error, regardless of the error severity, but continue with indexing"? I haven't found any such setting for Examine's BaseIndexProvider, IndexWriter or other classes I've looked into...
I've tried to handle different indexing events at application startup, but none of them seem to give me what I need:
    using Umbraco.Core;

    public class InitializationEvents : ApplicationEventHandler
    {
        #region ApplicationStarted
        protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            base.ApplicationStarted(umbracoApplication, applicationContext);
            // ...
        }
        #endregion
    }
I won't add any entries from ExamineSettings.config and ExamineIndex.config: as I already said, after removing corrupt PDF files the index gets created correctly, so I'm pretty confident everything is configured correctly. But I'll supply this info in a follow-up post if requested.
Any help would be appreciated!
Hi Dragan,
Did you read these docs:
https://our.umbraco.org/Documentation/Reference/Searching/Examine/full-configuration
http://24days.in/umbraco/2013/getting-started-with-examine/
There is some info about PDF indexing:
    Used to index PDF content in Umbraco's media section.
    **** NOTE: Not all PDFs can have text read from them!!! ****
    This shows the PDF specific configuration and the default values applied when they are not specified.
Thanks, Alex
Hi Alex,
thanks for the response.
I've already read through the material you referenced, and I'm convinced everything is configured correctly regarding the Examine indexers.
Below I'm submitting an excerpt from ExamineSettings.config (there are other indexers/searchers, I've left only the most relevant). But a small explanation: in the config below you'll see that we're using a third-party component, XfsSearch. This component internally uses iTextSharp and IFilters for parsing PDFs and Office documents.
Unfortunately, I'm not sure whether the problem is connected to the component or whether it is a problem with Examine/iTextSharp/IFilters.
Another interesting fact: I've enabled detailed logging, and the log files reveal a few different exceptions. It seems that some exceptions are "non-fatal" (the index is still created) while others are "fatal" (the index is incomplete, or completely missing).
Exception message and stack trace, for a "non-fatal" exception:
    Could not read PDF '>' not expected at file pointer 3343
       at iTextSharp.text.pdf.PRTokeniser.ThrowError(String error)
       at iTextSharp.text.pdf.PRTokeniser.NextToken()
       at Xuntos.Xfs.helpers.PdfIndexHelper.PDFParser.ParsePdfText(String sourcePDF, Action`1 onError)
Exception message and stack trace, for the "fatal" exception (this is caused by the corrupt PDF):
    Error indexing queue items Rebuild failed: trailer not found.; Original message: PDF startxref not found.
       at iTextSharp.text.pdf.PdfReader.ReadPdf()
       at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
       at Xuntos.Xfs.helpers.PdfIndexHelper.PDFParser.ParsePdfText(String sourcePDF, Action`1 onError)
       at Xuntos.Xfs.MediaIndexer.ExtractTextFromPdfFile(FileInfo file)
       at Xuntos.Xfs.MediaIndexer.GetDataToIndex(XElement node, String type)
       at Examine.LuceneEngine.Providers.LuceneIndexer.ProcessIndexQueueItem(IndexOperation op, IndexWriter writer)
       at Examine.LuceneEngine.Providers.LuceneIndexer.ForceProcessQueueItems()
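The split between "non-fatal" and "fatal" exceptions described in this post suggests that a rebuild survives when a parsing error is handled close to the file and breaks when the error escapes into the indexing queue. As a rough sketch of that idea only, the helper below wraps any text-extraction call so that a failure yields empty text instead of an exception. `SafeExtraction` and `ExtractOrEmpty` are hypothetical names and are not part of Examine, XfsSearch or iTextSharp; the `extractor` delegate stands in for whatever call actually reads the document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not an Examine or XfsSearch API.
public static class SafeExtraction
{
    // Runs the supplied extractor for one file. Any exception is reported
    // through onError and converted into an empty string, so one unreadable
    // file cannot abort the loop that calls this method.
    public static string ExtractOrEmpty(
        string path,
        Func<string, string> extractor,
        Action<string, Exception> onError)
    {
        try
        {
            return extractor(path) ?? string.Empty;
        }
        catch (Exception ex)
        {
            if (onError != null)
            {
                onError(path, ex);
            }
            return string.Empty;
        }
    }
}

public static class SafeExtractionDemo
{
    // Simulates a run over three files where the middle one is corrupt.
    // Returns the list of failures that were recorded instead of thrown.
    public static List<string> Run()
    {
        var failures = new List<string>();

        // Stand-in extractor: throws for the "corrupt" file, as a real parser might.
        Func<string, string> fakeExtractor = p =>
        {
            if (p.EndsWith("corrupt.pdf"))
            {
                throw new InvalidOperationException("PDF startxref not found.");
            }
            return "text of " + p;
        };

        foreach (var file in new[] { "a.pdf", "corrupt.pdf", "b.pdf" })
        {
            string text = SafeExtraction.ExtractOrEmpty(
                file, fakeExtractor, (p, ex) => failures.Add(p + ": " + ex.Message));
            Console.WriteLine(file + " -> " + text.Length + " chars");
        }

        return failures;
    }
}
```

The trade-off is that a file which fails this way is silently absent from search results, so the `onError` callback should at least log the path. Whether a wrapper like this can be placed around the third-party component's extraction call depends on that component and is not confirmed anywhere in this thread.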
Dragan,
I have in the past indexed PDFs using the Examine PDF indexer and found that I got errors with some PDFs. I then wrote my own media indexer, which uses Tika under the hood (see https://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer), and that fixed my issue. Please note that it was written for v6 but should work for v7; if not, you can download the source and update it. One issue with it is speed when you have a lot of media content, as it uses IKVM (a Java runtime for .NET), which wraps around the Tika libraries.
Regards
Ismail
@Ismail: Thanks for the input, and thank you so much for reminding me of your component!
After initial testing, it seems that it suits our needs perfectly, so we'll probably use CogUmbracoExamineMediaIndexer for indexing files, and XfsSearch for everything else related to the search functionality. I've marked your comment as the solution, because it gave us a way to circumvent the original issue. But my initial question is still relevant, and it would be nice if someone else from the Umbraco community could provide more information regarding the problem and a possible solution...
Just a small reminder for myself and any future visitor: it's not enough to just do "right-click > Save As" on the "tikka-app-1.2.dll" link on the product page (that saves a file with the correct name, but completely wrong content)... Instead, click the link and then click the appropriate button on the Dropbox page. The "tikka" dll should be about 28MB in size. ;)
Regards, Dragan.
PS Sorry for the very late response, but for some reason the Umbraco forum didn't accept my comment earlier.
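On the still-open question of keeping corrupt files out of a rebuild: the "fatal" error reported earlier in this thread was "PDF startxref not found". One possible mitigation is a cheap structural pre-check before a file is handed to the parser. The sketch below is a heuristic only, based on the general PDF layout (a `%PDF-` header at the start and a `startxref` keyword near the end). `PdfPreCheck` and `LooksStructurallyComplete` are hypothetical names, the 1024-byte tail window is an assumption, and a file that passes can still fail to parse.

```csharp
using System;
using System.Text;

// Hypothetical pre-check for illustration; not part of iTextSharp or Examine.
public static class PdfPreCheck
{
    // Returns true when the content starts with the "%PDF-" header and the
    // "startxref" keyword appears within the last 1024 bytes. This detects
    // truncated uploads and mislabeled files, not every kind of corruption.
    public static bool LooksStructurallyComplete(byte[] content)
    {
        if (content == null || content.Length < 8)
        {
            return false;
        }

        // Non-ASCII bytes decode to '?', which is harmless for this comparison.
        string head = Encoding.ASCII.GetString(content, 0, 5);
        if (head != "%PDF-")
        {
            return false;
        }

        int tailLength = Math.Min(content.Length, 1024);
        string tail = Encoding.ASCII.GetString(
            content, content.Length - tailLength, tailLength);

        return tail.Contains("startxref");
    }
}
```

A caller could read each media file with `System.IO.File.ReadAllBytes` and skip (and log) any PDF for which this returns false. Some PDF readers tolerate files that this check would reject, so it errs on the side of excluding files.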
I've got a similar situation with one of my clients, on 7.2.2. Every now and again, the PDF index will clear itself completely, and the only way to get the index to rebuild is to restart the App Pool. When the index rebuilds, it doesn't add all of the files back in; you have to manually re-publish them all to get them into the index correctly.
I'm still looking into this, but I'll let you know if I get to the bottom of the issue.
Hiya,
I've found that this issue goes away if you limit the index to just contain items with the "File" node type. Here's an example config:
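The config from this post is not reproduced here. As a rough sketch only, restricting an index set to the "File" type in ExamineIndex.config might look like the fragment below. The `SetName` and `IndexPath` values are placeholders and would need to match the indexer actually configured on the site.

```xml
<!-- ExamineIndex.config: SetName and IndexPath are hypothetical placeholders -->
<IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Media/">
  <IncludeNodeTypes>
    <!-- Only items of the "File" type are added to this index -->
    <add Name="File" />
  </IncludeNodeTypes>
</IndexSet>
```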
That has stopped the incomplete indexes for me, and the indexes rebuild correctly now.