
  • Nicholas Westby 2054 posts 7103 karma points c-trib
    Apr 11, 2020 @ 00:37
    Nicholas Westby
    0

    Errors: Provider SAXParserFactoryImpl / TransformerFactoryImpl not found

    I'm trying to modify the code so I can work with PDFs outside of the media folder, so I've copied your code into my own project to try to get it working.

    However, I get an exception during the index process around this line:

    [Image: Error in Indexing Code]

    I say "around this line" rather than just "this line" because I've actually tried two different approaches. The approaches I tried were:

    • Copy the external DLLs from the repo to my bin folder (aside from the ones that were already there, such as those for Examine/UmbracoExamine/Lucene).
    • Reference the "TikaOnDotnet.TextExtractor" NuGet package.

    With your DLLs, I got this error:

    Provider com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl not found

    With the NuGet package code, I got this error:

    Provider com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl not found

    Are there additional steps to installing this that I may be missing (e.g., installing Java if I don't have it, or using a specific version of Java, etc.)?

    I'm on Umbraco 7.15.1 in case that is relevant.

  • Nicholas Westby 2054 posts 7103 karma points c-trib
    Apr 11, 2020 @ 00:41
    Nicholas Westby
    0

    Here's one post I came across that seems to imply I may need an older version of Java: https://www.ibm.com/support/pages/javaxxmltransformtransformerfactoryconfigurationerror-provider-orgapachexalanxsltctraxtransformerfactoryimpl-not-found

    It mentions something was removed in Java 6 and I seem to have Java 10:

    [Image: Java About Screen Showing Java 10]

    What version of Java has worked for you?

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Apr 11, 2020 @ 09:10
    Ismail Mayat
    0

    Nicholas,

    Not sure, to be honest. However, I recently used this on a V8 site hosted on an Azure web app and it works fine. I'm not sure which version of Java is on there.

    On NuGet, can you look at the version of TikaOnDotnet.TextExtractor I used for the original package and check whether it has a dependency on Java? I don't think you need Java installed. I need to check my web app to see if Java is present.

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Apr 11, 2020 @ 09:58
    Ismail Mayat
    0

    Nicholas,

    I did a quick check on my web app: it's using 1.8.0, and locally I have the same version.

    Regards

    Ismail

  • Nicholas Westby 2054 posts 7103 karma points c-trib
    Apr 11, 2020 @ 22:19
    Nicholas Westby
    0

    Thanks for getting back to me so quickly.

    FYI, I tried installing CogUmbracoExamineMediaIndexer into a fresh install of Umbraco 7.15.1 and it seems to be working as expected, so I must be doing something wrong in my custom code.

    I'll dig in and post back here with my findings.

    As a side note, I just wanted to mention that the package seems to have installed a "MediaSearcher" search provider in ExamineSettings.config that didn't work (it caused a 500 error). I imagine this has to do with specific Umbraco versions. I commented it out and replaced it with one that did work. Both are shown here for your reference (e.g., in case you want to update the documentation or code):

    [Image: Examine MediaSearcher Screenshot]

    Here's the code version:

    <!-- Searcher installed by the package; it caused a 500 error on Umbraco 7.15.1.
    <add
      name="MediaSearcher"
      type="UmbracoExamine.LuceneExamineSearcher, UmbracoExamine"
      indexSet="MediaIndexSet"
      analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
    -->
    <!-- Replacement searcher that worked. -->
    <add
      name="MediaSearcher"
      type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
      indexSet="MediaIndexSet"
      analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
    

    (For easier copying/pasting and Google indexing.)

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Apr 12, 2020 @ 16:32
    Ismail Mayat
    0

    Nicholas,

    In your custom code, am I right in assuming you are writing your own custom indexer that reads a bunch of files in a folder? If so, step through it and check that the path to the PDF you are trying to extract is correct.

    Also, the indexer I have written is not a custom indexer but an Examine indexer, which works the same way the original PDF indexer does.

    Regards

    Ismail

  • Nicholas Westby 2054 posts 7103 karma points c-trib
    Apr 14, 2020 @ 21:41
    Nicholas Westby
    100

    Turns out I was missing this DLL: IKVM.OpenJDK.XML.Transform.dll

    It was referenced from NuGet in my app project, but I guess it didn't get copied over to my web project during the build process. I added the Tika NuGet packages to my web project and problem solved.

    I still have a bit more work to do, but at least I'm seeing the text extracted during the indexing phase.
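
For anyone hitting the same "Provider ... not found" errors, here is a minimal sketch (not from the thread) of a check that the web project's bin folder actually contains the assemblies you expect after a build. The function name `missing_assemblies` and the `REQUIRED_ASSEMBLIES` list are hypothetical; the only DLL name taken from this thread is IKVM.OpenJDK.XML.Transform.dll, so extend the list with whatever your Tika/IKVM packages ship.

```python
from pathlib import Path

# Assumption: this list is illustrative. Only this one DLL is named in the thread;
# add the other assemblies your NuGet packages are supposed to copy to bin.
REQUIRED_ASSEMBLIES = ["IKVM.OpenJDK.XML.Transform.dll"]


def missing_assemblies(bin_dir, required=REQUIRED_ASSEMBLIES):
    """Return the names in `required` that are not files in `bin_dir`.

    The comparison is case-insensitive, since Windows file names are.
    A bin_dir that does not exist is treated as empty.
    """
    bin_path = Path(bin_dir)
    present = set()
    if bin_path.is_dir():
        present = {p.name.lower() for p in bin_path.iterdir() if p.is_file()}
    return [name for name in required if name.lower() not in present]
```

Calling `missing_assemblies` with the path to the web project's bin folder returns the expected DLLs that were not copied there. An empty list means every listed assembly is present; it does not prove that the list itself is complete.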
