Errors: Provider SAXParserFactoryImpl / TransformerFactoryImpl not found
I'm trying to modify the code so I can work with PDFs outside of the media folder, so I've copied your code to my own project to try and get it to work.
However, I get an exception during the index process around this line:
I say "around this line" rather than just "this line" because I've actually tried two different approaches. The approaches I tried were:
Copy the external DLLs from the repo to my bin folder (aside from the ones that were already there, such as those for Examine/UmbracoExamine/Lucene).
Reference the "TikaOnDotnet.TextExtractor" NuGet package.
With your DLLs, I got this error:
Provider com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl not found
With the NuGet package code, I got this error:
Provider com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl not found
Are there additional steps to installing this that I may be missing (e.g., installing Java if I don't have it, or using a specific version of Java, etc.)?
I'm on Umbraco 7.15.1 in case that is relevant.
Here's one post I came across that seems to imply I may need an older version of Java: https://www.ibm.com/support/pages/javaxxmltransformtransformerfactoryconfigurationerror-provider-orgapachexalanxsltctraxtransformerfactoryimpl-not-found
It mentions something was removed in Java 6 and I seem to have Java 10:
What version of Java has worked for you?
Nicholas,
Not sure, to be honest. However, I recently used this on a V8 site hosted on an Azure Web App and it works fine. I'm not sure what version of Java is there.
On NuGet you can look at the version of TikaOnDotnet.TextExtractor I used for the original package. Does that have a dependency on Java? I don't think you need Java installed. I need to check my web app to see if Java is present.
Nicholas,
Did a quick check on my web app: it's using 1.8.0, and locally I also have the same version.
Regards
Ismail
Thanks for getting back to me so quickly.
FYI, I tried installing CogUmbracoExamineMediaIndexer into a fresh install of Umbraco 7.15.1 and it seems to be working as expected, so I must be doing something wrong in my custom code.
I'll dig in and post back here with my findings.
As a side note, just wanted to mention that the package seems to have installed a "MediaSearcher" search provider in ExamineSettings.config that didn't seem to work (caused a 500 error). I imagine this has to do with specific Umbraco versions. I commented out and replaced it with one that did work. Both are shown here for your reference (e.g., in case you want to update the documentation or code or something of that sort):
Here's the code version:
(For easier copying/pasting and Google indexing.)
Nicholas,
In your custom code, am I right in assuming you are writing your own custom indexer which reads a bunch of files in a folder? If so, step through and check that the path to the PDF you are trying to extract is correct.
Also, the indexer I have written is not a custom indexer but an Examine indexer, which works in the same way the original PDF indexer works.
Regards
Ismail
Turns out I was missing this DLL: IKVM.OpenJDK.XML.Transform.dll
It was referenced from NuGet in my app project, but I guess it didn't get copied over to my web project during the build process. I added the Tika NuGet packages to my web project and problem solved.
I still have a bit more work to do, but at least I'm seeing the text extracted during the indexing phase.
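For anyone hitting the same "Provider ... not found" errors, a quick way to spot an assembly that was referenced in one project but never copied to the web project's bin folder is to compare the two output folders. The following is only a rough sketch, not part of the package: the function name `missing_assemblies` and the `IKVM.` prefix filter are my own assumptions, and the folder paths are whatever your build produces.

```python
from pathlib import Path


def missing_assemblies(source_bin, target_bin, prefix="IKVM."):
    """Return DLL names present in source_bin but absent from target_bin.

    Hypothetical helper: only file names starting with `prefix` are
    compared, and the comparison ignores case because Windows file
    names are case-insensitive.
    """
    def dll_names(folder):
        # Map lower-cased name -> original name for every matching DLL.
        return {
            p.name.lower(): p.name
            for p in Path(folder).iterdir()
            if p.is_file()
            and p.suffix.lower() == ".dll"
            and p.name.lower().startswith(prefix.lower())
        }

    source = dll_names(source_bin)
    target = dll_names(target_bin)
    return sorted(source[key] for key in source if key not in target)


if __name__ == "__main__":
    import sys

    # Usage: python check_dlls.py <app project bin> <web project bin>
    if len(sys.argv) == 3:
        for name in missing_assemblies(sys.argv[1], sys.argv[2]):
            print("missing from web bin:", name)
```

In my case a check like this would have listed IKVM.OpenJDK.XML.Transform.dll, since it was in the app project's output but not in the web project's bin folder.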