Error while indexing PDF docs
Hello,
I've got PDF indexing working fine on localhost, but now on the dev server I get the following error. I'm guessing this has something to do with an invalid or corrupt PDF and will investigate with that in mind, but if anyone's seen this before and could provide some context, I'd appreciate it.
InvalidPdfException
Error loading IApplication: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> iTextSharp.text.exceptions.InvalidPdfException: Error reading string at file pointer 4698
at iTextSharp.text.pdf.PRTokeniser.ThrowError(String error)
at iTextSharp.text.pdf.PRTokeniser.NextToken()
at UmbracoExamine.PDF.PDFIndexer.PDFParser.ParsePdfText(String sourcePDF) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 193
at UmbracoExamine.PDF.PDFIndexer.ExtractTextFromFile(FileInfo file) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 89
at UmbracoExamine.PDF.PDFIndexer.GetDataToIndex(XElement node, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 122
at Examine.LuceneEngine.Providers.LuceneIndexer.AddNodesToIndex(IEnumerable`1 nodes, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 496
at UmbracoExamine.BaseUmbracoIndexer.AddNodesToIndex(String xPath, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 240
at UmbracoExamine.BaseUmbracoIndexer.PerformIndexAll(String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 194
at Examine.LuceneEngine.Providers.LuceneIndexer.IndexAll(String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 465
at UmbracoExamine.BaseUmbracoIndexer.PerformIndexRebuild() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 125
at Examine.LuceneEngine.Providers.LuceneIndexer.RebuildIndex() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 425
at UmbracoExamine.UmbracoEventManager.EnsureIndexesExist() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\UmbracoEventManager.cs:line 72
at UmbracoExamine.UmbracoEventManager..ctor() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\UmbracoEventManager.cs:line 39
--- End of inner exception stack trace ---
at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandle& ctor, Boolean& bNeedSecurityCheck)
at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean fillCache)
at System.RuntimeType.CreateInstanceImpl(Boolean publicOnly, Boolean skipVisibilityChecks, Boolean fillCache)
at System.Activator.CreateInstance(Type type, Boolean nonPublic)
at System.Activator.CreateInstance(Type type)
at umbraco.BusinessLogic.Application.RegisterIApplications()
Does it happen with all the PDFs you have tested, or only with certain ones?
Only certain ones. I compiled my own PDF Indexer and, inexpertly, added the following code to catch the indexing errors and report where they were happening, starting after line 201 in UmbracoExamine/PDFIndexer.cs. I couldn't figure out how to log an Umbraco error inside the ParsePdfText method, so I just tested for the error message in the outer method and logged that.

    try
    {
        while (token.NextToken())
        {
            tknType = token.TokenType;
            tknValue = token.StringValue;
            if (tknType == PRTokeniser.TokType.STRING)
            {
                foreach (var s in tknValue)
                {
                    // Strip out unsupported characters, based on Unicode tables.
                    if (!m_UnsupportedRange.Contains(s))
                    {
                        sb.Append(s);
                    }
                }
                // ...

Turns out we had some very old PDFs, and iTextSharp just couldn't handle some of their pages. This is fine in my case: it's OK to skip some documents as long as most are indexed, which they are now.
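For anyone wanting the same skip-and-log behaviour without patching the tokeniser loop itself, here is a minimal sketch of the general idea. It is not the Examine code: `TolerantExtraction`, `ExtractAll` and the `failures` dictionary are hypothetical names, and `extract` stands in for whatever method actually reads the PDF. It catches every exception for brevity; in a real indexer you would probably narrow that to the PDF parsing exception.

```csharp
using System;
using System.Collections.Generic;

public static class TolerantExtraction
{
    // Runs `extract` over every path. A file whose extraction throws is
    // recorded in `failures` and skipped, so one bad PDF cannot abort
    // the whole indexing run.
    public static Dictionary<string, string> ExtractAll(
        IEnumerable<string> paths,
        Func<string, string> extract,
        IDictionary<string, Exception> failures)
    {
        var results = new Dictionary<string, string>();
        foreach (var path in paths)
        {
            try
            {
                results[path] = extract(path);
            }
            catch (Exception ex)
            {
                // Hypothetical stand-in for a real log call: remember
                // which file failed and why, then carry on.
                failures[path] = ex;
            }
        }
        return results;
    }
}
```

The caller can log the contents of `failures` once the run has finished, which gives a list of the documents that were left out of the index.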
Yeah, PDF reading is not an easy thing to achieve. The problem is that it's somewhat vector-based as a format, so everything is technically an image.
Kind of sad that such a shoddy format made it as a standard for documents. Wish we could just move to XPS ;)
Andrew,
I have a similar issue; your code stops the error and allows indexing to continue. @slace the problem PDFs I have contain some graphs, not sure if they are causing it to barf. I may try updating the PDF indexer to test for IFilter and use that to extract the PDF text, and if that doesn't work then default to iTextSharp. In the old umbSearch I did this, and IFilter used to work a lot better.
Regards
Ismail
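The "try one extractor, fall back to another" idea could look roughly like this. It is only a sketch under assumptions: `FallbackExtraction` and `ExtractWithFallback` are hypothetical names, and the delegates stand in for an IFilter-based reader and an iTextSharp-based reader, neither of which is implemented here.

```csharp
using System;
using System.Collections.Generic;

public static class FallbackExtraction
{
    // Tries each extractor in order and returns the first non-empty result.
    // An extractor that throws, or returns no text, hands over to the next.
    public static string ExtractWithFallback(
        string path,
        IEnumerable<Func<string, string>> extractors)
    {
        foreach (var extractor in extractors)
        {
            try
            {
                var text = extractor(path);
                if (!string.IsNullOrEmpty(text))
                {
                    return text;
                }
            }
            catch (Exception)
            {
                // This extractor could not handle the file; try the next one.
            }
        }
        // Nothing worked; the caller can decide to skip the document.
        return string.Empty;
    }
}
```

The order of the list decides which extractor is preferred, so putting the IFilter delegate first and the iTextSharp one second would match the behaviour described above.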
I'm not sure that IFilter works in medium trust though, and since we want to maintain medium trust it'd be out.
slace,
ah k no worries.
Regards
Ismail
Modern PDFs are no problem; it's just a set of very, very old ones, saved from the QuarkXPress page layout program for Macintosh 6 years ago in a very old version of PDF. Still, I agree that a more structured document interchange format would be most welcome.