Error while indexing PDF docs

Andrew Waegel 126 posts 126 karma points

Nov 05, 2010 @ 18:08

Hello,

I've got PDF indexing working fine on localhost, but on the dev server I get the following error. I'm guessing it has something to do with an invalid or corrupt PDF and will investigate with that in mind, but if anyone has seen this before and could provide some context, I'd appreciate it.

InvalidPdfException

Error loading IApplication: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> iTextSharp.text.exceptions.InvalidPdfException: Error reading string at file pointer 4698

at iTextSharp.text.pdf.PRTokeniser.ThrowError(String error)

at iTextSharp.text.pdf.PRTokeniser.NextToken()

at UmbracoExamine.PDF.PDFIndexer.PDFParser.ParsePdfText(String sourcePDF) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 193

at UmbracoExamine.PDF.PDFIndexer.ExtractTextFromFile(FileInfo file) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 89

at UmbracoExamine.PDF.PDFIndexer.GetDataToIndex(XElement node, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 122

at Examine.LuceneEngine.Providers.LuceneIndexer.AddNodesToIndex(IEnumerable`1 nodes, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 496

at UmbracoExamine.BaseUmbracoIndexer.AddNodesToIndex(String xPath, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 240

at UmbracoExamine.BaseUmbracoIndexer.PerformIndexAll(String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 194

at Examine.LuceneEngine.Providers.LuceneIndexer.IndexAll(String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 465

at UmbracoExamine.BaseUmbracoIndexer.PerformIndexRebuild() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 125

at Examine.LuceneEngine.Providers.LuceneIndexer.RebuildIndex() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 425

at UmbracoExamine.UmbracoEventManager.EnsureIndexesExist() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\UmbracoEventManager.cs:line 72

at UmbracoExamine.UmbracoEventManager..ctor() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\UmbracoEventManager.cs:line 39

--- End of inner exception stack trace ---

at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandle& ctor, Boolean& bNeedSecurityCheck)

at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean fillCache)

at System.RuntimeType.CreateInstanceImpl(Boolean publicOnly, Boolean skipVisibilityChecks, Boolean fillCache)

at System.Activator.CreateInstance(Type type, Boolean nonPublic)

at System.Activator.CreateInstance(Type type)

at umbraco.BusinessLogic.Application.RegisterIApplications()

Aaron Powell 1708 posts 3046 karma points c-trib

Nov 07, 2010 @ 02:33

Does it happen with all the PDFs you have tested, or only certain ones?

Andrew Waegel 126 posts 126 karma points

Nov 08, 2010 @ 04:01

Only certain ones. I compiled my own PDF indexer and, inexpertly, added the following code to catch the indexing errors and report where they were happening, starting after line 201 in UmbracoExamine/PDFIndexer.cs. I couldn't figure out how to log an Umbraco error inside the ParsePdfText method, so I just tested for the error message in the outer method and logged that.

Turns out we had some very old PDFs, and iTextSharp just couldn't handle some of their pages. This is fine in my case: it's OK to skip some documents as long as most are indexed, which they are now.

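    // i, sourcePDF, token, sb and m_UnsupportedRange come from the surrounding ParsePdfText method.
    try
    {
        while (token.NextToken())
        {
            tknType = token.TokenType;
            tknValue = token.StringValue;
            if (tknType == PRTokeniser.TokType.STRING)
            {
                foreach (var s in tknValue)
                {
                    // Strip out unsupported characters, based on Unicode tables.
                    if (!m_UnsupportedRange.Contains(s))
                    {
                        sb.Append(s);
                    }
                }
            }
        }
    }
    catch (Exception)
    {
        // iTextSharp could not tokenise this page. Write a marker into the extracted
        // text so the outer method can detect it and log the page and file.
        sb.Append(string.Format("pdf parsing error reading page {0} in {1}", i, sourcePDF));
    }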
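
One drawback of the snippet above is that the error marker ends up in the text buffer that gets indexed. A possible variation, shown here only as a rough sketch, collects the errors in a separate list that the caller can log however it likes. The names PageTextCollector, Collect, readPage and errors are hypothetical and not part of UmbracoExamine or iTextSharp; readPage stands in for whatever reads one page (for example the token loop above).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageTextCollector
{
    // Hypothetical helper: 'readPage' returns the text of one page and may throw
    // when the page cannot be parsed. It is an assumption, not a library call.
    public static string Collect(int pageCount, Func<int, string> readPage, IList<string> errors)
    {
        var sb = new StringBuilder();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                sb.Append(readPage(page));
            }
            catch (Exception ex)
            {
                // Record the failure for the caller to log and carry on with the
                // next page, so the message never reaches the indexed text.
                errors.Add(string.Format("pdf parsing error reading page {0}: {1}", page, ex.Message));
            }
        }
        return sb.ToString();
    }
}
```

The outer method would then index the returned text and write each entry in the errors list to whatever log is available, instead of searching the text for the marker string.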

Aaron Powell 1708 posts 3046 karma points c-trib

Nov 08, 2010 @ 07:41

Yeah, PDF reading is not an easy thing to achieve. The problem is that the format is somewhat vector based, so text is stored as drawing instructions on a page rather than as structured content.

Kind of sad that such a shoddy format made it as a standard for documents. Wish we could just move to XPS ;)

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Nov 08, 2010 @ 10:30

Andrew,

I have a similar issue; your code stops the error and allows indexing to continue. @slace, the problem PDFs I have contain some graphs, and I'm not sure if they are what is causing it to barf. I may try updating the PDF indexer to test for IFilter and use that to extract the PDF text, and if that doesn't work, fall back to iTextSharp. I did this in the old umbSearch, and IFilter used to work a lot better.
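
A rough sketch of that "try one extractor, then fall back" idea, with the extractors passed in as plain delegates. FallbackTextExtractor and Extract are hypothetical names for illustration only; the sketch does not call IFilter or iTextSharp itself, so the first delegate would have to wrap IFilter and the second iTextSharp.

```csharp
using System;
using System.Collections.Generic;

public static class FallbackTextExtractor
{
    // Tries each extractor in order and returns the first non-empty result.
    // Each delegate takes a file path and returns the extracted text, or throws.
    public static string Extract(string path, IEnumerable<Func<string, string>> extractors)
    {
        foreach (var extractor in extractors)
        {
            try
            {
                string text = extractor(path);
                if (!string.IsNullOrEmpty(text))
                {
                    return text;
                }
            }
            catch (Exception)
            {
                // This extractor could not handle the file; try the next one.
            }
        }

        // Nothing worked, so the caller can skip the document.
        return string.Empty;
    }
}
```

Keeping the extractors as delegates means the ordering is easy to change, for example leaving the IFilter one out entirely on hosts where it is not allowed.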

Regards

Ismail

Aaron Powell 1708 posts 3046 karma points c-trib

Nov 08, 2010 @ 10:55

I'm not sure that IFilter works under medium trust though, and since we want to keep supporting medium trust, it would be out.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Nov 08, 2010 @ 11:28

slace,

ah k no worries.

Regards

Ismail

Andrew Waegel 126 posts 126 karma points

Nov 08, 2010 @ 17:59

Modern PDFs are no problem; it's just a set of very old ones, saved from the QuarkXPress page layout program for Macintosh six years ago in a very old version of PDF. Still, I agree that a more structured document interchange format would be most welcome.
