  • Simon Dingley 1474 posts 3431 karma points c-trib
    Mar 13, 2013 @ 16:37
    Simon Dingley
    1

    Examine/Lucene Searching for Multi-Lingual Site

    I am finishing up on what is probably my largest and most complex development to date. It is a multi-site, multi-language install with more to follow after delivery. One of my last remaining issues concerns the search facility on the non-English sites, in particular the French one.

    We have a tag search that is returning the tags without the original punctuation, so Oeuf d'or becomes Oeuf dor, which is obviously not the same thing. We are using the StandardAnalyzer, which I understood to support such punctuation?

    We have subscribed to the GatheringNodeData event in order to insert tags without the delimiters and to replace spaces for indexing, as follows:

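              // Look the field up safely: indexing a missing key on the Fields dictionary would throw.
              string tags;
              if (e.Fields.TryGetValue("tags", out tags) && !string.IsNullOrEmpty(tags))
              {
                // Join multi-word tags with underscores, then turn the comma delimiters into spaces.
                e.Fields["tags"] = tags.Replace(" ", "_").Replace(",", " ");
              }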

    So as you can see, we are not changing the original tags in any way other than to replace spaces with an underscore and remove the comma delimiters.
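
    A minimal sketch of what that replacement does in isolation, assuming a comma-separated tag string like the sample below. The class name, method name and sample value are hypothetical, made up for illustration. It suggests the apostrophe survives this step, so any loss would have to happen later, at analysis or query time:

```csharp
using System;

public static class TagFieldSketch
{
    // Same transformation as the GatheringNodeData handler above:
    // spaces inside a tag become underscores, comma delimiters become spaces.
    public static string ToIndexValue(string tags)
    {
        return tags.Replace(" ", "_").Replace(",", " ");
    }

    public static void Main()
    {
        // Hypothetical sample value, not taken from the real site.
        string indexed = ToIndexValue("Oeuf d'or,Chocolat");

        Console.WriteLine(indexed);                // Oeuf_d'or Chocolat
        Console.WriteLine(indexed.Contains("'"));  // True
    }
}
```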

    I should probably also mention that there is a single search index for the site and the configuration is as follows:

    ExamineIndex.config

      <ExamineLuceneIndexSets>
    
      <IndexSet SetName="SiteSearchIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/SiteSearch/">
        <IndexAttributeFields>
          <add Name="id" />
          <add Name="nodeName"/>
          <add Name="path" />
        </IndexAttributeFields>
        <IndexUserFields>
          <add Name="title"/>
          <add Name="summary"/>
          <add Name="body"/>
          <add Name="metaDescription" />
          <add Name="metaKeywords" />
          <add Name="siteId"/>
          <add Name="tags"/>
          <add Name="file" />
        </IndexUserFields>
        <IncludeNodeTypes />
        <ExcludeNodeTypes>
          <add Name="CalloutFolder" />
          <add Name="PanelDonate" />
          <add Name="PanelFeature" />
          <add Name="SiteContainer" />
          <add Name="SlideShow" />
          <add Name="SlideShowSlide" />
        </ExcludeNodeTypes>
      </IndexSet>
    
    </ExamineLuceneIndexSets>

    ExamineSettings.config

    <Examine>
      <ExamineIndexProviders>
        <providers>
    
          <add name="SiteSearchIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
               runAsync="true"
               supportUnpublished="false"
               supportProtected="false"
               interval="10"
               analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>
    
        </providers>
      </ExamineIndexProviders>
    
      <ExamineSearchProviders defaultProvider="ExternalSearcher">
        <providers>
    
          <add name="SiteSearchSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
                           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>
    
        </providers>
      </ExamineSearchProviders>
    
    </Examine>

    Any help would be much appreciated as I'm sure I'm not the first to encounter this but the documentation for searching with Examine is quite fragmented so as yet I've not found a solution.

    Thanks, Simon

  • Simon Dingley 1474 posts 3431 karma points c-trib
    Mar 13, 2013 @ 16:39
    Simon Dingley
    0

    I'm unable to edit the post, but the configuration above shows WhitespaceAnalyzer because of something I was testing; the current version is actually using Lucene.Net.Analysis.Standard.StandardAnalyzer.

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Mar 13, 2013 @ 16:44
    Ismail Mayat
    0

    Simon,

    Ideally each language in the site should have its own index. Also, what language is it? It will probably need its own analyser for that language.

    Regards

    Ismail

  • Simon Dingley 1474 posts 3431 karma points c-trib
    Mar 13, 2013 @ 16:49
    Simon Dingley
    0

    It's French in this case; however, German, Dutch and Italian will follow closely behind. The reason for having it all in one index is that the sites are all part of a "group", and the group site will end up aggregating the data from all the others. With a single index we can either use the siteId as a filter or grab all tags regardless of which site they originated from.

    What is the need for separate indexes? To be able to use different Analyzers per index?

    Thanks Ismail

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Mar 13, 2013 @ 18:33
    Ismail Mayat
    1

    Simon,

    Could it be that the replace is having encoding issues? The foreign chars in the content should also be in the index as far as I am aware. The analysers are more for ignoring stop words. Step through the code and see what your before and after is.

    Regards

    Ismail

  • Simon Dingley 1474 posts 3431 karma points c-trib
    Mar 14, 2013 @ 09:25
    Simon Dingley
    0

    Morning,

    The foreign characters are going into the index and coming back out fine; it's the punctuation which is not, in this specific case the apostrophe. I'll step through the code shortly and report back with the result.

    Cheers, Simon

  • Simon Dingley 1474 posts 3431 karma points c-trib
    Mar 18, 2013 @ 09:18
    Simon Dingley
    100

    Problem solved, perhaps indirectly, by changing the analyzer and then rebuilding the index.

    Thanks for the pointers Ismail.

  • Flavio Spezi 129 posts 315 karma points
    Sep 24, 2013 @ 17:41
    Flavio Spezi
    0

    Hi Simon Dingley, I am trying to use Examine to search documents, but they are in Italian rather than English.
    Did you find an ItalianAnalyzer or something like that?

    Thanks

  • Flavio Spezi 129 posts 315 karma points
    Sep 24, 2013 @ 18:02
    Flavio Spezi
    0

    Simon, my search results are not good. For example, I have a node with this name: "Festa dell'aquilone", which translates word for word as "festival" "of the" "kite". With WhitespaceAnalyzer, if I search for "aquilone", I get no results, whereas with "dell'aquilone" I can find the node.

    Another issue is accent marks: à é è ì ò ù. I can find the node "Identità" with the same text, but not with "identita".

    In Italian (like in any language) there are very common words: il lo la i gli le di a da in con su per tra fra (like "in for as is are where when this that the..." in English). It would be better if Lucene ignored these words when users search.

    How did you solve these issues?
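
    To make the three issues above concrete, here is a rough sketch in plain C# of the normalisation an Italian-aware analyzer would need to apply at both index and query time. It is not Lucene or Examine code, and the class and method names are hypothetical; it only assumes that splitting on apostrophes, folding accents and dropping the stop words listed above is the desired behaviour:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class ItalianNormalizationSketch
{
    // Only the stop words listed in the post above; not a complete Italian list.
    static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "il", "lo", "la", "i", "gli", "le", "di", "a", "da",
        "in", "con", "su", "per", "tra", "fra"
    };

    // Strip accents: decompose each character, then drop the combining marks.
    public static string FoldAccents(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Lowercase, fold accents, split on whitespace and apostrophes, drop stop words.
    public static List<string> Tokenize(string text)
    {
        return FoldAccents(text.ToLowerInvariant())
            .Split(new[] { ' ', '\'', '\u2019' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(t => !StopWords.Contains(t))
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Tokenize("Festa dell'aquilone"))); // festa | dell | aquilone
        Console.WriteLine(string.Join(" | ", Tokenize("Identità")));            // identita
    }
}
```

    With tokens like these, a query for "aquilone" or "identita" would have something to match. Elided articles such as "dell" are kept here because they are not in the list above; they could be added to the stop words if that is wanted.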

    Thanks very much

  • Simon Dingley 1474 posts 3431 karma points c-trib
    Sep 24, 2013 @ 18:24
    Simon Dingley
    0

    I'm no expert on this but can you try opening your index with Luke and seeing if you can achieve the desired results?

    https://code.google.com/p/luke/

  • Flavio Spezi 129 posts 315 karma points
    Sep 24, 2013 @ 18:47
    Flavio Spezi
    0

    OK, I am looking at the index with Luke.
    But I don't understand: what should I look for?
    I can see that the name field contains "dell'aquilone", and many occurrences of the terms "di", "la" and "del".
