
  • Andrew Waegel 126 posts 126 karma points
    Nov 05, 2010 @ 18:08

    Error while indexing PDF docs

    Hello,

    I've got PDF indexing working fine on localhost, but on the dev server I get the following error. I'm guessing it has something to do with an invalid or corrupt PDF and will investigate with that in mind, but if anyone has seen this before and could provide some context, I'd appreciate it.

    InvalidPdfException

    Error loading IApplication: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> iTextSharp.text.exceptions.InvalidPdfException: Error reading string at file pointer 4698

    at iTextSharp.text.pdf.PRTokeniser.ThrowError(String error)

    at iTextSharp.text.pdf.PRTokeniser.NextToken()

    at UmbracoExamine.PDF.PDFIndexer.PDFParser.ParsePdfText(String sourcePDF) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 193

    at UmbracoExamine.PDF.PDFIndexer.ExtractTextFromFile(FileInfo file) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 89

    at UmbracoExamine.PDF.PDFIndexer.GetDataToIndex(XElement node, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine.PDF\PDFIndexer.cs:line 122

    at Examine.LuceneEngine.Providers.LuceneIndexer.AddNodesToIndex(IEnumerable`1 nodes, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 496

    at UmbracoExamine.BaseUmbracoIndexer.AddNodesToIndex(String xPath, String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 240

    at UmbracoExamine.BaseUmbracoIndexer.PerformIndexAll(String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 194

    at Examine.LuceneEngine.Providers.LuceneIndexer.IndexAll(String type) in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 465

    at UmbracoExamine.BaseUmbracoIndexer.PerformIndexRebuild() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\BaseUmbracoIndexer.cs:line 125

    at Examine.LuceneEngine.Providers.LuceneIndexer.RebuildIndex() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\Examine\LuceneEngine\Providers\LuceneIndexer.cs:line 425

    at UmbracoExamine.UmbracoEventManager.EnsureIndexesExist() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\UmbracoEventManager.cs:line 72

    at UmbracoExamine.UmbracoEventManager..ctor() in C:\Users\Shannon\Documents\Visual Studio 2008\Projects\Examine\UmbracoExamine\UmbracoEventManager.cs:line 39

    --- End of inner exception stack trace ---

    at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandle& ctor, Boolean& bNeedSecurityCheck)

    at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean fillCache)

    at System.RuntimeType.CreateInstanceImpl(Boolean publicOnly, Boolean skipVisibilityChecks, Boolean fillCache)

    at System.Activator.CreateInstance(Type type, Boolean nonPublic)

    at System.Activator.CreateInstance(Type type)

    at umbraco.BusinessLogic.Application.RegisterIApplications()

  • Aaron Powell 1708 posts 3046 karma points c-trib
    Nov 07, 2010 @ 02:33

    Does it happen with all the PDFs you have tested, or only with certain ones?

  • Andrew Waegel 126 posts 126 karma points
    Nov 08, 2010 @ 04:01

    Only certain ones. I compiled my own PDF indexer and, inexpertly, added the following code to catch the indexing errors and report where they were happening, starting after line 201 in UmbracoExamine/PDFIndexer.cs. I couldn't figure out how to log an Umbraco error inside the ParsePdfText method, so I just tested for the error message in the outer method and logged that.

    It turns out we had some very old PDFs, and iTextSharp just couldn't handle some of their pages. That is fine in my case: it's OK to skip some documents as long as most are indexed, which they are now.

     

    try
    {
        while (token.NextToken())
        {
            tknType = token.TokenType;
            tknValue = token.StringValue;
            if (tknType == PRTokeniser.TokType.STRING)
            {
                foreach (var s in tknValue)
                {
                    // strip out unsupported characters, based on unicode tables.
                    if (!m_UnsupportedRange.Contains(s))
                    {
                        sb.Append(s);
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        // record which page failed; the outer method tests for this message and logs it
        sb.Append(string.Format("pdf parsing error reading page {0} in {1}", i, sourcePDF));
    }

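    For anyone wanting the same skip-and-continue behaviour without writing the error text into the indexed content, here is a minimal, self-contained sketch. It is not the Examine or iTextSharp code: PageSafeExtractor, ExtractAll, and the extractPage delegate are hypothetical names standing in for the per-page tokenizer loop above, and the errors list is an assumed way of handing failures back to the caller for logging.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of Examine or iTextSharp.
public static class PageSafeExtractor
{
    // extractPage stands in for the per-page parsing loop; it may throw on a bad page.
    // Failures are collected in 'errors' so the caller can log them however it likes.
    public static string ExtractAll(int pageCount, Func<int, string> extractPage, IList<string> errors)
    {
        var sb = new StringBuilder();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                sb.Append(extractPage(page));
            }
            catch (Exception ex)
            {
                // Skip the unreadable page and keep going with the rest of the document.
                errors.Add(string.Format("pdf parsing error reading page {0}: {1}", page, ex.Message));
            }
        }
        return sb.ToString();
    }
}
```

    The caller can then index whatever text came back and log each entry in errors separately, so one unreadable page does not abort the whole rebuild.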
  • Aaron Powell 1708 posts 3046 karma points c-trib
    Nov 08, 2010 @ 07:41

    Yeah, PDF reading is not an easy thing to achieve. The problem is that it's a somewhat vector-based format, so everything is technically an image.

    Kind of sad that such a shoddy format made it as a standard for documents. Wish we could just move to XPS ;)

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Nov 08, 2010 @ 10:30

    Andrew,

    I have a similar issue; your code stops the error and allows indexing to continue. @slace, the problem PDFs I have contain some graphs, and I'm not sure whether they are causing it to barf. I may try updating the PDFIndexer to test for an IFilter and use that to extract the PDF text, and if that doesn't work, fall back to iTextSharp. I did this in the old umbSearch, and IFilter used to work a lot better.
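
    A rough sketch of that try-one-then-fall-back idea, under stated assumptions: FallbackTextExtractor and Extract are hypothetical names, and each extractor is just a delegate from a file path to text. The real IFilter and iTextSharp calls are deliberately left out and would be wrapped in those delegates.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the actual IFilter and iTextSharp calls would live inside the delegates.
public static class FallbackTextExtractor
{
    // Tries each extractor in order and returns the first non-empty result.
    // An extractor that throws or returns nothing simply passes control to the next one.
    public static string Extract(string path, IEnumerable<Func<string, string>> extractors)
    {
        foreach (var extractor in extractors)
        {
            try
            {
                var text = extractor(path);
                if (!string.IsNullOrEmpty(text))
                {
                    return text;
                }
            }
            catch (Exception)
            {
                // This extractor could not handle the file; try the next one.
            }
        }
        return string.Empty; // nothing could read the file
    }
}
```

    Passing the extractors as an ordered list keeps the preference (IFilter first, iTextSharp second) in one place and makes it easy to drop one of them if it turns out not to be usable on a given host.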

    Regards

    Ismail

  • Aaron Powell 1708 posts 3046 karma points c-trib
    Nov 08, 2010 @ 10:55

    I'm not sure that IFilter works under medium trust, though, and since we want to maintain medium trust, it'd be out.

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Nov 08, 2010 @ 11:28

    slace,

    Ah, OK, no worries.

    Regards

    Ismail

  • Andrew Waegel 126 posts 126 karma points
    Nov 08, 2010 @ 17:59

    Modern PDFs are no problem; it's just a set of very old ones, saved from the QuarkXPress page layout program for Macintosh 6 years ago in a very old version of PDF. Still, I agree that a more structured document interchange format would be most welcome.
