File content never indexed
Hi. I'm facing a weird problem.
I'm working on an Umbraco 6.0.5 website. I've successfully installed the CogUmbracoExamineMediaIndexer package with all the assemblies it needs. I get no runtime errors, but when I inspect the Examine indexes with Luke or Examine Inspector, the FileTextContent field is empty. I've tried with a .pdf and a .docx file, with the same result.
Here are my config files:
ExamineSettings.config :
<Examine>
<ExamineIndexProviders>
<providers>
<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
<add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
<!-- default external indexer, which excludes protected and unpublished pages-->
<add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
<add name="BootstrapENIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="30" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
<add name="BootstrapESIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="30" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
<add name="BootstrapHEIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="30" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
<add name="MediaIndexer" type="CogUmbracoExamineMediaIndexer.MediaIndexer, CogUmbracoExamineMediaIndexer" extensions=".pdf,.docx" umbracoFileProperty="umbracoFile" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
</providers>
</ExamineIndexProviders>
<ExamineSearchProviders defaultProvider="ExternalSearcher">
<providers>
<add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
<add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
<add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
<add name="BootstrapENSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
<add name="BootstrapESSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
<add name="BootstrapHESearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
<add name="MediaSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" indexSet="MediaIndexSet" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
</providers>
</ExamineSearchProviders>
</Examine>
ExamineIndex.config :
<?xml version="1.0"?>
<!--
Umbraco examine is an extensible indexer and search engine.
This configuration file can be extended to create your own index sets.
Index/Search providers can be defined in the UmbracoSettings.config
More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
-->
<ExamineLuceneIndexSets>
<!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
<IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/" />
<!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
<IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
<IndexAttributeFields>
<add Name="id" />
<add Name="nodeName" />
<add Name="updateDate" />
<add Name="writerName" />
<add Name="loginName" />
<add Name="email" />
<add Name="nodeTypeAlias" />
</IndexAttributeFields>
</IndexSet>
<!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
<IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/" />
<IndexSet SetName="BootstrapENIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/BootstrapENIndexSet/" IndexParentId="1076">
<IndexAttributeFields>
<add Name="id" />
<add Name="nodeName" />
<add Name="updateDate" />
<add Name="writerName" />
<add Name="path" />
<add Name="nodeTypeAlias" />
<add Name="parentID" />
</IndexAttributeFields>
<IndexUserFields>
<add Name="headerText" />
<add Name="bodyText" />
</IndexUserFields>
<IncludeNodeTypes>
<add Name="Homepage" />
<add Name="Textpage" />
<add Name="Newspage" />
</IncludeNodeTypes>
<ExcludeNodeTypes />
</IndexSet>
<IndexSet SetName="BootstrapESIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/BootstrapESIndexSet/" IndexParentId="1106">
<IndexAttributeFields>
<add Name="id" />
<add Name="nodeName" />
<add Name="updateDate" />
<add Name="writerName" />
<add Name="path" />
<add Name="nodeTypeAlias" />
<add Name="parentID" />
</IndexAttributeFields>
<IndexUserFields>
<add Name="headerText" />
<add Name="bodyText" />
</IndexUserFields>
<IncludeNodeTypes>
<add Name="Homepage" />
<add Name="Textpage" />
<add Name="Newspage" />
</IncludeNodeTypes>
<ExcludeNodeTypes />
</IndexSet>
<IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/MediaIndexSet">
<IndexAttributeFields>
<add Name="id" />
<add Name="nodeName" />
<add Name="updateDate" />
<add Name="writerName" />
<add Name="path" />
<add Name="nodeTypeAlias" />
<add Name="parentID" />
</IndexAttributeFields>
<IncludeNodeTypes>
<add Name="File" />
</IncludeNodeTypes>
</IndexSet>
<IndexSet SetName="BootstrapHEIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/BootstrapHEIndexSet/" IndexParentId="1135">
<IndexAttributeFields>
<add Name="id" />
<add Name="nodeName" />
<add Name="updateDate" />
<add Name="writerName" />
<add Name="path" />
<add Name="nodeTypeAlias" />
<add Name="parentID" />
</IndexAttributeFields>
<IndexUserFields>
<add Name="headerText" />
<add Name="bodyText" />
</IndexUserFields>
<IncludeNodeTypes>
<add Name="Homepage" />
<add Name="Textpage" />
<add Name="Newspage" />
</IncludeNodeTypes>
<ExcludeNodeTypes />
</IndexSet>
<IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/MediaIndexSet" >
<IndexAttributeFields>
<add Name="id" />
<add Name="nodeName" />
<add Name="updateDate" />
<add Name="writerName" />
<add Name="path" />
<add Name="nodeTypeAlias" />
<add Name="parentID" />
</IndexAttributeFields>
<IncludeNodeTypes>
<add Name="File" />
</IncludeNodeTypes>
</IndexSet>
</ExamineLuceneIndexSets>
BOUTEBEL,
Can you take a look at the Umbraco log file? It's in App_Data. Alternatively, you could install the Live Logger package (http://our.umbraco.org/projects/backoffice-extensions/live-logger), then update a media file and see if that generates any errors.
One thing: which index are you looking at with Luke or Examine Inspector? The media indexer only adds to the MediaIndexSet.
Regards
Ismail
Well, I've opened the folder /App_Data/MediaIndexSet with Luke.
Likewise, in Examine Inspector I select the MediaIndexSet.
What am I supposed to look for in the logs?
I've seen in the logs that the MediaIndexSet was defined twice. That's fixed now, but FileTextContent is still empty :/
Hi. I'm still stuck with the same issue. Any ideas?
I have the same issue in Umbraco 6.1.2.
Hi Yarik. I'm still stuck with this issue. Did you find an alternative?
Boutebel,
You say:
"I've seen in the logs that the MediaIndexSet was defined twice. That's fixed now, but FileTextContent is still empty :/"
Did you originally have two entries for this in your Examine config file and then remove one? Also, are you currently trying to index only one PDF, or do you get the issue with all PDFs?
Regards
Ismail
No.
I think the problem is in Umbraco.
The standard UmbracoExamine.UmbracoContentIndexer does not index the media tree either.
So I've downloaded the source code of CogUmbracoExamineMediaIndexer and will check what's going on.
I'll be in touch.
Yarik
Boutebel,
Just upgrade Umbraco to version 6.1.3.
There is a bug in older versions that was fixed in 6.1.3:
Examine Management Dashboard: Rebuilding index removes all media items from the index
Yarik,
I've just updated my Umbraco website to version 6.1.3. Examine Management in the Developer section now finds the files in the media tree and tells me that they are indexed. But when I search for a word contained in the PDF files, it returns nothing.
Ismail,
I've fixed the config files, but it's still not working. I've tried with several PDF files.
Currently, my config files look like this:
ExamineSettings.config :
<?xml version="1.0"?>
<!--
Umbraco examine is an extensible indexer and search engine.
This configuration file can be extended to add your own search/index providers.
Index sets can be defined in the ExamineIndex.config if you're using the standard provider model.
More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
-->
<Examine>
<ExamineIndexProviders>
<providers>
<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
supportUnpublished="true"
supportProtected="true"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>
<add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
supportUnpublished="true"
supportProtected="true"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>
<!-- default external indexer, which excludes protected and unpublished pages-->
<add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>
<add name="MediaIndexer" type="CogUmbracoExamineMediaIndexer.MediaIndexer, CogUmbracoExamineMediaIndexer" extensions=".pdf,.docx" umbracoFileProperty="umbracoFile" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
</providers>
</ExamineIndexProviders>
<ExamineSearchProviders defaultProvider="ExternalSearcher">
<providers>
<add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>
<add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />
<add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>
<add name="MediaSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" indexSet="MediaIndexSet" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
</providers>
</ExamineSearchProviders>
</Examine>
The only thing I can suggest is to download the source and step through it to see if it errors.
Regards
Ismail
OK. Thanks for the help.
I can't step through the source to find the solution because I have no time :(
Is there another way to index files like PDF and DOC in Umbraco?
Hi BOUTEBEL,
Can you open your MediaIndexSet index with Luke and verify it? (You can download Luke here: http://www.getopt.org/luke/)
Hi Yarik. Yes, I'm already using Luke to verify my index. When I add a file in the Media section, the file is correctly indexed, but FileTextContent still remains empty (count=0). I tried with several PDF and Word files...
Can you send me the assemblies you are using: CogUmbracoExamineMediaIndexer.dll, tika-app.dll and the IKVM.*.dll files?
Perhaps it's an installation issue...
I also have problems getting the content of PDF files indexed. I even tried with a TXT file, just to rule out that the PDF files were created in a way that Tika cannot index.
First I tried this on my Umbraco 6.1.4 (nightly build), and as I read through this post, I suspected that the nightly build was causing the problems.
So instead I tried a completely fresh Umbraco CMS 4.7.2 install, although the CogUmbracoExamineMediaIndexer project page doesn't state which versions of Umbraco the project is compatible with: http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer. Still no dice. FileContent is still empty, unfortunately.
Besides analyzing with the Java version of Luke, I also tried the .NET port (http://luke.codeplex.com/), which also shows that FileContent is empty.
I know Windows blocks certain content in ZIP files, but I made sure to right-click the CogUmbracoExamineMediaIndexer ZIP --> Properties --> click the "Unblock" button. Could some of the DLLs still be blocked, perhaps? I did the same thing for tika-app-1.2.dll, but still no luck.
I am running the site on localhost using WebMatrix, but I doubt that is the issue?
If anyone has a step-by-step guide on how to get the package indexing files, I would highly appreciate getting a copy of it :)
Kind regards,
Kenneth
Hi. The assembly is not blocked by Windows. Something else is wrong, but I can't figure out what.
Hi there.
I've just made a little test:
a Media_AfterSave event handler that exercises the assemblies directly.
using Umbraco.Core;
using umbraco.BusinessLogic;
using umbraco.cms.businesslogic;
using umbraco.cms.businesslogic.web;
using umbraco.presentation.nodeFactory;
using umbraco.cms.businesslogic.member;
using System.Web.Security;
using System.Linq;
using umbraco.MacroEngines;
using Umbraco.Core.Models;
using Umbraco.Core.Services;
using System.Collections.Generic;
using UmbracoExamine;
using System.Xml.XPath;
using System.Web;

namespace Umbraco.Extensions.EventHandlers
{
    public class PA_events : ApplicationEventHandler
    {
        protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
        {
            // Subscribe to the media Saved event when the application starts.
            MediaService.Saved += Media_AfterSave;
            //Document.AfterPublish += Document_AfterPublish;
        }

        private void Media_AfterSave(IMediaService sender, Umbraco.Core.Events.SaveEventArgs<IMedia> e)
        {
            // Test only: call the Tika text extractor directly on a fixed PDF,
            // regardless of which media item was saved, to see whether the assemblies load.
            CogUmbracoExamineMediaIndexer.Tika.TextExtractor extractor = new CogUmbracoExamineMediaIndexer.Tika.TextExtractor();
            var result = extractor.Extract(HttpContext.Current.Server.MapPath("~/media/1004/c********10.pdf"));
        }
    }
}
Surprise! I get the following BadImageFormatException:
Could not load file or assembly 'tika-app-1.2, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The module was expected to contain an assembly manifest.
Can anyone explain what happened?
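One way to narrow this down is to check each DLL in the site's bin folder and see which ones the .NET runtime accepts as managed assemblies. The sketch below is only a diagnostic idea, not something taken from the package. It uses the standard AssemblyName.GetAssemblyName call, which reads the manifest without loading the assembly. The class name AssemblyManifestCheck, the file patterns and the bin path argument are assumptions made for illustration. A "NOT a valid managed assembly" result for tika-app-1.2.dll would suggest a corrupt, truncated or otherwise non-.NET file, though it does not say why.

```csharp
using System;
using System.IO;
using System.Reflection;

// Hypothetical helper (not part of CogUmbracoExamineMediaIndexer).
public static class AssemblyManifestCheck
{
    // Describes whether the file at 'path' has a readable assembly manifest.
    public static string Describe(string path)
    {
        try
        {
            // Reads only the manifest; the assembly is not loaded into the AppDomain.
            AssemblyName name = AssemblyName.GetAssemblyName(path);
            return "OK: " + name.FullName;
        }
        catch (BadImageFormatException)
        {
            return "NOT a valid managed assembly";
        }
        catch (FileNotFoundException)
        {
            return "missing";
        }
    }

    public static void Main(string[] args)
    {
        // Assumption: the first argument is the site's bin folder; defaults to the current folder.
        string binDir = args.Length > 0 ? args[0] : ".";
        string[] patterns = { "tika-app*.dll", "IKVM.*.dll", "CogUmbracoExamineMediaIndexer.dll" };

        foreach (string pattern in patterns)
        {
            foreach (string file in Directory.GetFiles(binDir, pattern))
            {
                Console.WriteLine(Path.GetFileName(file) + ": " + Describe(file));
            }
        }
    }
}
```

Compiled as a small console application and pointed at the bin folder, this prints one line per matching DLL. Any file that is not reported as OK is a candidate for re-downloading or replacing.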