Simon Dingley

Mar 13, 2013 @ 16:37

Examine/Lucene Searching for Multi-Lingual Site

I am finishing up on what is probably my largest and most complex development to date. It is a multi-site, multi-language install with more to follow after delivery. One of my last remaining issues is with the search facility on the non-English sites, in particular the French one.

We have a tag search that is returning the tags without the original punctuation, so "Oeuf d'or" becomes "Oeuf dor", which is obviously not the same thing. We are using the StandardAnalyzer, which I understood to support such punctuation?

We have subscribed to the GatheringNodeData event in order to insert tags without the delimiters and to replace spaces for indexing, as follows:

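          // Guard against nodes that have no "tags" field at all.
          if (e.Fields.ContainsKey("tags") && !string.IsNullOrEmpty(e.Fields["tags"]))
          {
            // Join multi-word tags with "_" and turn the comma delimiters into spaces.
            e.Fields["tags"] = e.Fields["tags"].Replace(" ", "_").Replace(",", " ");
          }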

So as you can see, we are not changing the original tags in any way other than replacing spaces with underscores and replacing the comma delimiters with spaces.

I should probably also mention that there is a single search index for the site and the configuration is as follows:

ExamineIndex.config

  <ExamineLuceneIndexSets>

  <IndexSet SetName="SiteSearchIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/SiteSearch/">
    <IndexAttributeFields>
      <add Name="id" />
      <add Name="nodeName"/>
      <add Name="path" />
    </IndexAttributeFields>
    <IndexUserFields>
      <add Name="title"/>
      <add Name="summary"/>
      <add Name="body"/>
      <add Name="metaDescription" />
      <add Name="metaKeywords" />
      <add Name="siteId"/>
      <add Name="tags"/>
      <add Name="file" />
    </IndexUserFields>
    <IncludeNodeTypes />
    <ExcludeNodeTypes>
      <add Name="CalloutFolder" />
      <add Name="PanelDonate" />
      <add Name="PanelFeature" />
      <add Name="SiteContainer" />
      <add Name="SlideShow" />
      <add Name="SlideShowSlide" />
    </ExcludeNodeTypes>
  </IndexSet>

</ExamineLuceneIndexSets>

ExamineSettings.config

<Examine>
  <ExamineIndexProviders>
    <providers>

      <add name="SiteSearchIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
           runAsync="true"
           supportUnpublished="false"
           supportProtected="false"
           interval="10"
           analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

    </providers>
  </ExamineIndexProviders>

  <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>

      <add name="SiteSearchSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
                       analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>

    </providers>
  </ExamineSearchProviders>

</Examine>

Any help would be much appreciated. I'm sure I'm not the first to encounter this, but the documentation for searching with Examine is quite fragmented, so as yet I've not found a solution.

Thanks, Simon

Simon Dingley

Mar 13, 2013 @ 16:39

I'm unable to edit the post, but the config above shows WhitespaceAnalyzer because of something I was testing. The current version is actually using Lucene.Net.Analysis.Standard.StandardAnalyzer.
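
As a sketch of that correction, ExamineSettings.config would read roughly as follows with the analyzer currently in use. Only the analyzer attribute differs from the listing in the first post; the assembly-qualified form ("..., Lucene.Net") is an assumption that follows the pattern of the WhitespaceAnalyzer entries.

```xml
<!-- ExamineSettings.config: same as the original listing, analyzer swapped.
     Assumption: StandardAnalyzer is resolved from the Lucene.Net assembly,
     like the WhitespaceAnalyzer entries it replaces. -->
<Examine>
  <ExamineIndexProviders>
    <providers>
      <add name="SiteSearchIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
           runAsync="true"
           supportUnpublished="false"
           supportProtected="false"
           interval="10"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>
    </providers>
  </ExamineIndexProviders>

  <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>
      <add name="SiteSearchSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>
    </providers>
  </ExamineSearchProviders>
</Examine>
```

The indexer and the searcher should name the same analyzer, and the index needs rebuilding after the analyzer changes so that existing documents are tokenised again.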

Ismail Mayat

Mar 13, 2013 @ 16:44

Simon,

Ideally each language in the site should have its own index. Also, what language is it? It will probably need its own analyser.
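
A minimal sketch of what one index per language could look like in ExamineIndex.config, under stated assumptions: the set names, index paths and IndexParentId values (meant to be the root node id of each language site) are hypothetical placeholders, and the installed Examine version is assumed to support the IndexParentId attribute. Treat it as a starting point to verify, not a tested configuration.

```xml
<!-- ExamineIndex.config: one index set per language site.
     All set names, paths and node ids below are hypothetical. -->
<ExamineLuceneIndexSets>

  <!-- French site: only content under node 1111 is indexed -->
  <IndexSet SetName="FrenchSearchIndexSet"
            IndexPath="~/App_Data/TEMP/ExamineIndexes/FrenchSearch/"
            IndexParentId="1111">
    <IndexUserFields>
      <add Name="title"/>
      <add Name="body"/>
      <add Name="tags"/>
    </IndexUserFields>
  </IndexSet>

  <!-- German site: only content under node 2222 is indexed -->
  <IndexSet SetName="GermanSearchIndexSet"
            IndexPath="~/App_Data/TEMP/ExamineIndexes/GermanSearch/"
            IndexParentId="2222">
    <IndexUserFields>
      <add Name="title"/>
      <add Name="body"/>
      <add Name="tags"/>
    </IndexUserFields>
  </IndexSet>

</ExamineLuceneIndexSets>
```

Each set would then get its own indexer and searcher entry in ExamineSettings.config, which is where a language-specific analyzer could be named in the analyzer attribute.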

Regards

Ismail

Simon Dingley

Mar 13, 2013 @ 16:49

It's French in this case; however, German, Dutch and Italian will follow closely behind. The reason for having it all in one index is that the sites are all part of a "group", and the group site will end up aggregating the data from all the others. With a single index we can either use the siteId as a filter or grab all tags regardless of which site they originated from.

What is the need for separate indexes? To be able to use a different analyzer per index?

Thanks Ismail

Ismail Mayat

Mar 13, 2013 @ 18:33

Simon,

It could be that the Replace call is having encoding issues? The foreign characters in the content should also be in the index, as far as I am aware. The analysers are more for ignoring stop words. Step through the code and see what your before and after values are.

Regards

Ismail

Simon Dingley

Mar 14, 2013 @ 09:25

Morning,

The foreign characters are going into the index and coming back out fine; it's the punctuation which is not, in this specific case the apostrophe. I'll step through the code shortly and confirm the result.

Cheers, Simon

Simon Dingley

Mar 18, 2013 @ 09:18

Problem solved, perhaps indirectly by changing the analyzer and then rebuilding the index again.

Thanks for the pointers Ismail.

Flavio Spezi

Sep 24, 2013 @ 17:41

Hi Simon Dingley, I am trying to use Examine to search documents, but they are in Italian rather than English.
Did you find an ItalianAnalyzer or something like that?

Thanks

Flavio Spezi

Sep 24, 2013 @ 18:02

Simon, my search results are not good. For example, I have a node named "Festa dell'aquilone", which translates word by word as "festival" "of the" "kite". With the WhitespaceAnalyzer, if I search for "aquilone" I get no results, whereas with "dell'aquilone" I can find the node.

Another issue is accented characters: à é è ì ò ù. I can find the node "Identità" by searching for the same text, but not by searching for "identita".

In Italian (as in any language) there are very common words: il lo la i gli le di a da in con su per tra fra (like "in for as is are where when this that the..." in English). It would be better if Lucene ignored these words when users search.

How did you solve these issues?

Thanks very much

Simon Dingley

Sep 24, 2013 @ 18:24

I'm no expert on this but can you try opening your index with Luke and seeing if you can achieve the desired results?

https://code.google.com/p/luke/

Flavio Spezi

Sep 24, 2013 @ 18:47

OK, I am looking at the index with Luke.
But I don't understand: what should I look for?
I can see that the name field contains "dell'aquilone", and that the terms "di", "la" and "del" occur many times.
