  • Mohammed BOUTEBEL 64 posts 103 karma points
    Jul 01, 2013 @ 17:15

    File content never indexed

    Hi. I'm facing a weird problem.

    I'm working on an Umbraco 6.0.5 website. I have successfully installed the CogUmbracoExamineMediaIndexer package with all the required assemblies, and I get no runtime errors. But when I inspect the Examine indexes with Luke or Examine Inspector, the FileTextContent field is empty. I've tried with a .pdf and a .docx file, with the same result.

    Here are my config files:

    ExamineSettings.config:

    <Examine>
      <ExamineIndexProviders>
        <providers>
          <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
          <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine" supportUnpublished="true" supportProtected="true" interval="10" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <!-- default external indexer, which excludes protected and unpublished pages-->
          <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="10" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
          <add name="BootstrapENIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="30" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <add name="BootstrapESIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="30" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <add name="BootstrapHEIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" supportUnpublished="false" supportProtected="false" interval="30" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <add name="MediaIndexer" type="CogUmbracoExamineMediaIndexer.MediaIndexer, CogUmbracoExamineMediaIndexer" extensions=".pdf,.docx" umbracoFileProperty="umbracoFile" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
        </providers>
      </ExamineIndexProviders>
      <ExamineSearchProviders defaultProvider="ExternalSearcher">
        <providers>
          <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" />
          <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
          <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true" />
          <add name="BootstrapENSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <add name="BootstrapESSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <add name="BootstrapHESearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
          <add name="MediaSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" indexSet="MediaIndexSet" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
        </providers>
      </ExamineSearchProviders>
    </Examine>

     

    ExamineIndex.config:

     

    <?xml version="1.0"?>
    <!-- 
    Umbraco examine is an extensible indexer and search engine.
    This configuration file can be extended to create your own index sets.
    Index/Search providers can be defined in the UmbracoSettings.config
    More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
    -->
    <ExamineLuceneIndexSets>
      <!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
      <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/" />
      <!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
      <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="loginName" />
          <add Name="email" />
          <add Name="nodeTypeAlias" />
        </IndexAttributeFields>
      </IndexSet>
      <!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
      <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/" />
      <IndexSet SetName="BootstrapENIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/BootstrapENIndexSet/" IndexParentId="1076">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="path" />
          <add Name="nodeTypeAlias" />
          <add Name="parentID" />
        </IndexAttributeFields>
        <IndexUserFields>
          <add Name="headerText" />
          <add Name="bodyText" />
        </IndexUserFields>
        <IncludeNodeTypes>
          <add Name="Homepage" />
          <add Name="Textpage" />
          <add Name="Newspage" />
        </IncludeNodeTypes>
        <ExcludeNodeTypes />
      </IndexSet>
      <IndexSet SetName="BootstrapESIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/BootstrapESIndexSet/" IndexParentId="1106">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="path" />
          <add Name="nodeTypeAlias" />
          <add Name="parentID" />
        </IndexAttributeFields>
        <IndexUserFields>
          <add Name="headerText" />
          <add Name="bodyText" />
        </IndexUserFields>
        <IncludeNodeTypes>
          <add Name="Homepage" />
          <add Name="Textpage" />
          <add Name="Newspage" />
        </IncludeNodeTypes>
        <ExcludeNodeTypes />
      </IndexSet>
      <IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/MediaIndexSet">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="path" />
          <add Name="nodeTypeAlias" />
          <add Name="parentID" />
        </IndexAttributeFields>
        <IncludeNodeTypes>
          <add Name="File" />
        </IncludeNodeTypes>
      </IndexSet>
      <IndexSet SetName="BootstrapHEIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/BootstrapHEIndexSet/" IndexParentId="1135">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="path" />
          <add Name="nodeTypeAlias" />
          <add Name="parentID" />
        </IndexAttributeFields>
        <IndexUserFields>
          <add Name="headerText" />
          <add Name="bodyText" />
        </IndexUserFields>
        <IncludeNodeTypes>
          <add Name="Homepage" />
          <add Name="Textpage" />
          <add Name="Newspage" />
        </IncludeNodeTypes>
        <ExcludeNodeTypes />
      </IndexSet>
      <IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/MediaIndexSet" >
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName" />
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="path" />
          <add Name="nodeTypeAlias" />
          <add Name="parentID" />
        </IndexAttributeFields>
        <IncludeNodeTypes>
          <add Name="File" />
        </IncludeNodeTypes>
      </IndexSet>
    </ExamineLuceneIndexSets>

     

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Jul 01, 2013 @ 17:23

    BOUTEBEL,

    Can you take a look at the Umbraco log file? It's in App_Data. Alternatively, you could install the Livelogger package (http://our.umbraco.org/projects/backoffice-extensions/live-logger), then update a media file and see whether that generates any errors.

    One more thing: which index are you looking at with Luke or Examine Inspector? The media indexer will only add to the MediaIndexSet.

    Regards

    Ismail

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Jul 01, 2013 @ 17:35

    Well, I've opened the folder /App_Data/MediaIndexSet with Luke.

    Likewise, in Examine Inspector I chose the MediaIndexSet.

    What am I supposed to look for in the logs?

     


  • Mohammed BOUTEBEL 64 posts 103 karma points
    Jul 01, 2013 @ 18:12

    I've seen in the logs that the MediaIndexSet was defined twice. That's fixed now, but FileTextContent is still empty :/

     

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Jul 05, 2013 @ 10:12

    Hi. I'm still stuck with the same issue. Any ideas?

  • Yarik Goldvarg 35 posts 84 karma points
    Jul 28, 2013 @ 19:42

    I have the same issue in Umbraco 6.1.2.

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Jul 29, 2013 @ 09:53

    Hi Yarik. I'm still stuck with this issue. Did you find any alternative?

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Jul 29, 2013 @ 09:59

    Boutebel,

    You say:

    "I've seen in the logs that the MediaIndexSet was defined twice. That's fixed now, but FileTextContent is still empty :/"

    Did you originally have two entries for this in your Examine config file and then remove one? Also, are you currently trying to index only one PDF, or do you get the issue with all PDFs?

    Regards

    Ismail

  • Yarik Goldvarg 35 posts 84 karma points
    Jul 29, 2013 @ 10:00

    No.

    I think the problem is in Umbraco.

    The standard UmbracoExamine.UmbracoContentIndexer does not index the media tree either.

    So I downloaded the source code of CogUmbracoExamineMediaIndexer and will check what's going on.

    I'll be in touch.

    Yarik


  • Yarik Goldvarg 35 posts 84 karma points
    Jul 29, 2013 @ 10:40

    Boutebel,

    Just upgrade Umbraco to version 6.1.3.

    There is a bug in older versions that was fixed in 6.1.3: "Examine Management Dashboard: Rebuilding index removes all media items from the index".

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Aug 14, 2013 @ 15:43

    Yarik:

    I've just upgraded my Umbraco website to version 6.1.3. The Examine Management dashboard in the Developer section now finds the files in the media tree and reports that they are indexed. But when I search for a word contained in the PDF files, nothing is returned.

    Ismail:

    I've fixed the config files, but it's still not working. I've tried with several PDF files.

    Currently, my config files look like this:

    ExamineSettings.config:

    <?xml version="1.0"?>
    <!-- 
    Umbraco examine is an extensible indexer and search engine.
    This configuration file can be extended to add your own search/index providers.
    Index sets can be defined in the ExamineIndex.config if you're using the standard provider model.
    More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
    -->
    <Examine>
      <ExamineIndexProviders>
        <providers>
          <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
               supportUnpublished="true"
               supportProtected="true"
               analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

          <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
               supportUnpublished="true"
               supportProtected="true"
               analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>

          <!-- default external indexer, which excludes protected and unpublished pages-->
          <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>
          <add name="MediaIndexer" type="CogUmbracoExamineMediaIndexer.MediaIndexer, CogUmbracoExamineMediaIndexer" extensions=".pdf,.docx" umbracoFileProperty="umbracoFile" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
        </providers>
      </ExamineIndexProviders>

      <ExamineSearchProviders defaultProvider="ExternalSearcher">
        <providers>
          <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
               analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

          <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />

          <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
               analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>
          <add name="MediaSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" indexSet="MediaIndexSet" analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
        </providers>
      </ExamineSearchProviders>
    </Examine>

    ExamineIndex.config
    <?xml version="1.0"?>
    <!-- 
    Umbraco examine is an extensible indexer and search engine.
    This configuration file can be extended to create your own index sets.
    Index/Search providers can be defined in the UmbracoSettings.config
    More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
    -->
    <ExamineLuceneIndexSets>
      <!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
      <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/"/>
      <!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
      <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName"/>
          <add Name="updateDate" />
          <add Name="writerName" />
          <add Name="loginName" />
          <add Name="email" />
          <add Name="nodeTypeAlias" />
        </IndexAttributeFields>
      </IndexSet>
        <IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/MediaIndexSet" >
            <IndexAttributeFields>
                <add Name="id" />
                <add Name="nodeName" />
                <add Name="updateDate" />
                <add Name="writerName" />
                <add Name="path" />
                <add Name="nodeTypeAlias" />
                <add Name="parentID" />
            </IndexAttributeFields>
            <IncludeNodeTypes>
                <add Name="File" />
            </IncludeNodeTypes>
        </IndexSet>
      <!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
      <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/" />
    </ExamineLuceneIndexSets>
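
On the search that returns nothing: in the ExamineSettings.config above, defaultProvider is ExternalSearcher, so a query sent to the default searcher presumably runs against the external content index rather than the MediaIndexSet. Below is a minimal C# sketch of querying the MediaSearcher provider explicitly, assuming the Examine API that ships with Umbraco 6 (ExamineManager, CreateSearchCriteria, Field, Compile, Search). The MediaSearch class and FindInFiles method are hypothetical names introduced for illustration. It will still return nothing for as long as FileTextContent is empty in the index.

```csharp
using System.Collections.Generic;
using Examine;

// Hypothetical helper, not part of Umbraco or the CogUmbracoExamineMediaIndexer package.
public static class MediaSearch
{
    // Searches the FileTextContent field of the index behind "MediaSearcher".
    public static IEnumerable<SearchResult> FindInFiles(string term)
    {
        // Pick the searcher by name instead of relying on defaultProvider.
        var searcher = ExamineManager.Instance.SearchProviderCollection["MediaSearcher"];

        // Build a single-field query against the extracted file text.
        var criteria = searcher.CreateSearchCriteria();
        var query = criteria.Field("FileTextContent", term).Compile();

        return searcher.Search(query);
    }
}
```

Calling MediaSearch.FindInFiles with a word that is known to appear in one of the uploaded PDFs separates "the text was never extracted" from "the query went to the wrong index".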

     

  • Ismail Mayat 4511 posts 10091 karma points MVP 2x admin c-trib
    Aug 14, 2013 @ 15:46

    The only thing I can suggest is to download the source and step through it to see whether it throws an error.

    Regards

    Ismail

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Aug 14, 2013 @ 16:33

    OK. Thanks for the help.

    I can't step through the source to find the solution because I have no time :(

    Is there another way to index files like PDF and Word documents in Umbraco?

  • Yarik Goldvarg 35 posts 84 karma points
    Aug 15, 2013 @ 08:19

    Hi BOUTEBEL,

    Can you open the index produced by your MediaIndexer with Luke and verify it? (You can download Luke here: http://www.getopt.org/luke/)

     

     

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Aug 19, 2013 @ 10:25

    Hi Yarik. Yes, I'm already using Luke to verify my index. When I add a file in the Media section, the file is correctly indexed, but FileTextContent remains empty (count=0). I've tried with several PDF and Word files...

    Can you send me the assemblies you are using: CogUmbracoExamineMediaIndexer.dll, tika-app.dll and the IKVM.*.dll files?

    Perhaps it's an installation issue...

  • khm1985 8 posts 54 karma points
    Aug 27, 2013 @ 14:34

    I also have problems getting the content of PDF files indexed. I even tried with a TXT file, just to rule out that the PDF files were created in a way that Tika cannot index.

    First I tried this on my Umbraco 6.1.4 (nightly build), and as I read through this thread, I suspected that the nightly build was causing the problem.

    So instead I tried a completely fresh Umbraco CMS 4.7.2 install, although the CogUmbracoExamineMediaIndexer project page doesn't state which versions of Umbraco the package is compatible with: http://our.umbraco.org/projects/website-utilities/cogumbracoexaminemediaindexer. Still no dice: FileContent is still empty, unfortunately.

    Besides analyzing with the Java version of Luke, I also tried the .NET port (http://luke.codeplex.com/), which also shows that FileContent is empty.

    I know Windows blocks certain content in ZIP files, but I made sure to right-click the CogUmbracoExamineMediaIndexer ZIP --> Properties --> click the "Unblock" button. Could some of the DLLs still be blocked, perhaps? I did the same thing for tika-app-1.2.dll, but still no luck.

    I am running the site on localhost using WebMatrix, but I doubt that is an issue?

    If anyone has a step-by-step guide on how to get the package to index files, I would highly appreciate a copy of it :)

    Kind regards,
    Kenneth

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Sep 16, 2013 @ 09:15

    Hi. The assembly is not blocked by Windows. There is something else going on, but I can't figure it out.

  • Mohammed BOUTEBEL 64 posts 103 karma points
    Oct 07, 2013 @ 16:07

    Hi there.

    I've just made a little test:

    A Media_AfterSave event handler to test the assemblies.


    using Umbraco.Core;
    using umbraco.BusinessLogic;
    using umbraco.cms.businesslogic;
    using umbraco.cms.businesslogic.web;
    using umbraco.presentation.nodeFactory;
    using umbraco.cms.businesslogic.member;
    using System.Web.Security;
    using System.Linq;
    using umbraco.MacroEngines;
    using Umbraco.Core.Models;
    using Umbraco.Core.Services;
    using System.Collections.Generic;

    using UmbracoExamine;
    using System.Xml.XPath;
    using System.Web;

    namespace Umbraco.Extensions.EventHandlers
    {
        public class PA_events : ApplicationEventHandler
        {
            protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication, ApplicationContext applicationContext)
            {
                MediaService.Saved += Media_AfterSave;
                //Document.AfterPublish += Document_AfterPublish; 
            }

            private void Media_AfterSave(IMediaService sender, Umbraco.Core.Events.SaveEventArgs<IMedia> e)
            {
                CogUmbracoExamineMediaIndexer.Tika.TextExtractor extractor = new CogUmbracoExamineMediaIndexer.Tika.TextExtractor();
                var result = extractor.Extract(HttpContext.Current.Server.MapPath("~/media/1004/c********10.pdf"));
            }
        }
    }

     

    Surprise! I get the following BadImageFormatException:

    Could not load file or assembly 'tika-app-1.2, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The module was expected to contain an assembly manifest.

    Can anyone explain what happened?
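
A note on the exception above: the message says the runtime located tika-app-1.2.dll but could not read it as a .NET assembly. One plausible cause, which is an assumption and not confirmed anywhere in this thread, is a damaged or incomplete copy of the DLL in the site's bin folder. The C# sketch below uses AssemblyName.GetAssemblyName from the .NET base class library, which opens a file and reads its assembly manifest without loading the assembly into the application. The AssemblyCheck class and Describe method are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;
using System.Reflection;

// Hypothetical diagnostic helper, not part of Umbraco or the package.
public static class AssemblyCheck
{
    // Returns a one-line verdict on whether the file at 'path'
    // can be read as a managed (.NET) assembly.
    public static string Describe(string path)
    {
        try
        {
            // Reads the manifest only; the assembly is not loaded for execution.
            AssemblyName name = AssemblyName.GetAssemblyName(path);
            return "OK: " + name.FullName;
        }
        catch (BadImageFormatException)
        {
            // Same exception type as in the post: the file is not a valid assembly.
            return "Not a valid managed assembly: " + path;
        }
        catch (FileNotFoundException)
        {
            return "File not found: " + path;
        }
    }
}
```

Calling AssemblyCheck.Describe with the full path of each DLL shipped with the package (tika-app-1.2.dll, the IKVM.*.dll files, CogUmbracoExamineMediaIndexer.dll) would show which file, if any, fails to parse.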

     
