How-to
Hi Ismail,
I was wondering if there is some kind of guide on how to use this indexer? e.g. what file types it understands, is there anything to be done in umbraco itself, what gets indexed exactly, etc?
Also, I had quite a few errors come up for missing dlls. I went on IKVM's page to get them, but it might be worth including them in the package for others.
Cheers,
Steven
Steven,
There is a link on the project home page listing the supported formats; see http://tika.apache.org/1.2/formats.html#Audio_formats. In ExamineSettings.config you can pass in a comma-separated list of the file extensions you want to index:
<add name="MediaIndexer"
type="CogUmbracoExamineMediaIndexer.MediaIndexer, CogUmbracoExamineMediaIndexer"
extensions=".pdf,.docx"
umbracoFileProperty="umbracoFile"/>
Set the extensions attribute to the list of extensions you need.
With regard to the missing DLLs, that is also mentioned on the home page: I could not upload all the DLLs because of the 10 MB file size restriction on our.umbraco. Which DLLs were missing? I can then update the installation instructions and include links to them.
Regards
Ismail
Hi Ismail,
Thanks for the reply and apologies for taking so long to reply!
I can't remember which ones were missing, but the ones I have in my bin folder are the following (assume they are all prefixed by IKVM.OpenJDK except for Runtime, which is just prefixed by IKVM):
Beans, Charsets, Corba, Core, Jdbc, Management, Media, Naming, Remoting, Security, SwingAWT, Text, Util, XML.API, XML.Parse, XML.Transform, XML.XPath and Runtime.
As for getting the index working, I've done all of that. My question was more about what file types can be indexed beyond docx and pdf. Also, what is indexed from the file content, and how?
Cheers,
Steven
Steven,
With regard to file types, it is whatever file type extensions are listed on the Apache Tika page, so .csv, .ppt, .xls, .xlsx, etc. For files other than music, video and images, both the file contents and any metadata attributes are indexed. So for a PDF, as well as the contents, any associated metadata such as title, author and date created will also end up in the index in separate fields. With music, video and images, only the metadata ends up in the index, so for an image any EXIF metadata embedded in the image will end up in the index.
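As a rough sketch only, assuming the extensions attribute takes any extension from the Tika formats page in the same comma-separated form as the MediaIndexer entry earlier in this thread, the provider entry might be widened like this. The extra extensions are illustrative and this exact list is not a tested configuration:

```xml
<!-- Hypothetical example: same MediaIndexer entry as above, with more
     extensions added to the comma-separated list. Adjust to the formats
     you actually need from the Tika formats page. -->
<add name="MediaIndexer"
     type="CogUmbracoExamineMediaIndexer.MediaIndexer, CogUmbracoExamineMediaIndexer"
     extensions=".pdf,.docx,.ppt,.xlsx,.csv"
     umbracoFileProperty="umbracoFile"/>
```

Only the extensions value differs from the original entry; the name, type and umbracoFile property are unchanged.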
Regards
Ismail
Hi Ismail,
Thanks for your response. Sorry, I hadn't realised it's just a port of Tika. I'll have a look at the project page for more info.
Thanks for the package and all the info!
Cheers,
Steven