
  • Berto 105 posts 177 karma points
    Jan 18, 2011 @ 13:27
    Berto
    1

    Examine and accents (for portuguese language)

    Hi!

    So, I'm using Examine for Umbraco for my website search. So far so good, all results are showing OK.

    The problem is when the search terms have accents, for example: logística (it's "logistics", if you're wondering).

    If I search with the í, no problem, but if I try with i (logistica) it doesn't find anything. How can I make the search accent insensitive? I've been searching for hours here and on Google, but I didn't find a solution (maybe it's because I've only had 3 hours of sleep).

    Thx

     

  • Berto 105 posts 177 karma points
    Jan 18, 2011 @ 20:19
    Berto
    4

    So, after much digging, Reflector came to the rescue (and a good friend who had all the ideas).

    The Lucene documentation is nonexistent or very, very hard to find. If you know where it is, please post it.

    I'll just leave my solution to the problem here, which turned out to be very easy. If I'm doing it wrong or you know a better way, please share it.

    I created a Class Library project, added a reference to Lucene.Net, and created the following class (copy-paste of my .cs file):

     

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    namespace MassiveLuceneAnalyser
    {
        // CIAI = Case Insensitive, Accent Insensitive
        public class CIAIAnalyser : Analyzer
        {
            public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
            {
                // Split the text into tokens with the standard tokenizer
                StandardTokenizer tokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
                tokenizer.SetMaxTokenLength(255);

                TokenStream stream = new StandardFilter(tokenizer);
                // Lowercase every token (case insensitive)
                stream = new LowerCaseFilter(stream);
                // Fold accented characters to their ASCII equivalents, e.g. í -> i (accent insensitive)
                return new ASCIIFoldingFilter(stream);
            }
        }
    }

     

    It's mostly a copy of the StandardAnalyzer in the Lucene.Net assembly, but using a different filter. CIAI stands for Case Insensitive, Accent Insensitive.

    Happy coding!

    Berto

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jan 19, 2011 @ 11:17
    Ismail Mayat
    0

    Berto,

    After creating this library and copying it over to the bin folder, did you have to do anything else? Or does the Lucene indexer know what to do by loading up the analyzer because it inherits from Analyzer?

    Many thanks

     

    Ismail

  • Berto 105 posts 177 karma points
    Jan 19, 2011 @ 11:37
    Berto
    3

    Hi Ismail,

    I forgot to post the change to the configuration file (config/ExamineSettings.config) above.

    After you place the assembly in bin, you have to change the analyzer in the Examine settings (I'm only posting my providers; don't delete the Umbraco providers):

    Before

     

    <Examine>
      <ExamineIndexProviders>
        <providers>
          <!-- ... snip: the default Umbraco providers ... -->

          <add name="MySiteIndexer" type="UmbracoExamine.LuceneExamineIndexer, UmbracoExamine"
               runAsync="true"
               supportUnpublished="false"
               supportProtected="true"
               interval="10"
               analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>
        </providers>
      </ExamineIndexProviders>
      <ExamineSearchProviders defaultProvider="InternalSearcher">
        <providers>
          <!-- ... snip: the default Umbraco providers ... -->

          <add name="MySiteSearcher" type="UmbracoExamine.LuceneExamineSearcher, UmbracoExamine"
               analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>
        </providers>
      </ExamineSearchProviders>
    </Examine>
     

     

     

    After

    <Examine>
      <ExamineIndexProviders>
        <providers>
          <!-- ... snip: the default Umbraco providers ... -->

          <add name="MySiteIndexer" type="UmbracoExamine.LuceneExamineIndexer, UmbracoExamine"
               runAsync="true"
               supportUnpublished="false"
               supportProtected="true"
               interval="10"
               analyzer="MassiveLuceneAnalyser.CIAIAnalyser, MassiveLuceneAnalyser"/>
        </providers>
      </ExamineIndexProviders>
      <ExamineSearchProviders defaultProvider="InternalSearcher">
        <providers>
          <!-- ... snip: the default Umbraco providers ... -->

          <add name="MySiteSearcher" type="UmbracoExamine.LuceneExamineSearcher, UmbracoExamine"
               analyzer="MassiveLuceneAnalyser.CIAIAnalyser, MassiveLuceneAnalyser" enableLeadingWildcards="true"/>
        </providers>
      </ExamineSearchProviders>
    </Examine>

     

    The only change is the analyzer attribute, which you point at your own assembly:

    analyzer="[Namespace].[Class], [AssemblyWithoutDotDll]"

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jan 19, 2011 @ 11:50
    Ismail Mayat
    0

    Berto,

    Brilliant post. No doubt it will prove very useful for people doing non-English searching.

    Regards

    Ismail

  • Aaron Powell 1708 posts 3046 karma points c-trib
    Jan 19, 2011 @ 12:10
    Aaron Powell
    0

    Refer to the Lucene Java docs; they're just as useful because the APIs are identical.

    Also, the StandardAnalyzer (and other default Lucene analyzers) are all for the English language, so yes you do need to write your own.

    On an interesting note I believe Manning has a sale on today for the Lucene in Action Second Edition book (which I have a copy of and it's awesome)

  • Berto 105 posts 177 karma points
    Jan 19, 2011 @ 12:43
    Berto
    0

    Hi Slace, 

    Can you post the links to the Lucene Java docs? Yesterday I went there, but I didn't find anything useful for a noob like me. I think I'm losing my Google skills...

    Nice blog, by the way ;) It helped me understand what I had to do for this analyzer.

  • Aaron Powell 1708 posts 3046 karma points c-trib
    Jan 19, 2011 @ 12:44
    Aaron Powell
    0

    Here's the 2.9.2 docs - http://lucene.apache.org/java/2_9_2/api/all/index.html

    That's the latest version ported to Lucene.Net

  • Nuno Lourenço 4 posts 30 karma points
    May 27, 2011 @ 19:38
    Nuno Lourenço
    1

    Hi Berto.
    Excellent post about Examine/Lucene.

    I've tried your solution and it worked great :)
    In my case there was a minor issue. Although I wanted to do exactly as you've done, I had to add extra functionality to be able to search for the same word with and without the accent.
    For instance, logística and logistica did not return the same results, because the index does not contain the accent.
    To get the same results for both words, in my case I added:

            // Requires: using System.Globalization; using System.Text;
            public string RemoveDiacritics(string input)
            {
                // Normalize to Form D (full canonical decomposition) so each accent becomes a separate combining mark.
                string inputInFormD = input.Normalize(NormalizationForm.FormD);
                var sb = new StringBuilder();

                for (int idx = 0; idx < inputInFormD.Length; idx++)
                {
                    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(inputInFormD[idx]);
                    if (uc != UnicodeCategory.NonSpacingMark)
                    {
                        sb.Append(inputInFormD[idx]);
                    }
                }

                return (sb.ToString().Normalize(NormalizationForm.FormC));
            }

    This helper function is applied to the keyword being searched to remove all accents from it, so that both words return the same results.

    Hope that everyone has understood the issue, and that this helps :)
    Cheers!

  • Berto 105 posts 177 karma points
    May 27, 2011 @ 20:02
    Berto
    0

    Hi Nuno! A Portuguese speaker on this forum!!!!! I was starting to think I wouldn't find any ;)

    I don't think I had that problem (I have to check it...), but either way, here is my version of remove diacritics (it's an extension method):

     

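    public static string RemoveAccent(this string txt)
    {
        // Encoding to the Cyrillic code page makes .NET's best-fit fallback replace
        // accented Latin letters with their unaccented equivalents.
        byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt);
        return System.Text.Encoding.ASCII.GetString(bytes);
    }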

    I'm going to check my search function to see if I had to use it...

    Big hug!

    Berto

  • Bendik Engebretsen 105 posts 202 karma points
    Feb 02, 2017 @ 12:42
    Bendik Engebretsen
    0

    Just wanted to say: Thank you guys! This solved my case for a site where I had to implement searching for Greek places;-) I had to use both Berto's analyzer mod and Nuno's RemoveDiacritics. Saved my day!

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Feb 02, 2017 @ 12:46
    Ismail Mayat
    0

    Bendik,

    Are you using the Greek analyser? Also, are you doing wildcard searches? The reason I ask is that if you are using the analyser, the text should get ASCII folded at index time. When you query without a wildcard, the search term is also ASCII folded, so the search should work.

    I found that if I was doing a wildcard search for, say, German, then any word with an umlaut was not working. This is because in the index it's ASCII folded, but the wildcard query term was not.
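
    A minimal sketch of one way to work around that, assuming the index was built with an ASCII-folding analyser like the one earlier in this thread: fold and lowercase the user's input yourself before adding the wildcard. The SearchTermFolding class and its method names are made up for illustration and are not part of Examine or Lucene.Net. It only handles accents that decompose into a base letter plus a combining mark (so not ß, for example), and it does not escape Lucene query syntax characters.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of Examine or Lucene.Net.
public static class SearchTermFolding
{
    // Strips combining marks, so "logística" becomes "logistica".
    public static string Fold(string term)
    {
        // Form D splits each accented letter into base letter + combining mark.
        string decomposed = term.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Lowercases and folds each word, then appends '*' for a prefix (wildcard) match.
    public static string ToWildcardQuery(string userInput)
    {
        var words = userInput
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Fold(w.ToLowerInvariant()) + "*");
        return string.Join(" ", words);
    }
}
```

    With this, SearchTermFolding.ToWildcardQuery("Logística München") gives "logistica* munchen*", which can then be used as the wildcard search term.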

    Regards

    Ismail

  • Harsheet 71 posts 302 karma points
    Mar 17, 2017 @ 04:59
    Harsheet
    0

    Hi,

    I am getting this error

    Unable to cast object of type 'MassiveLuceneAnalyser.CiaiAnalyser' to type 'Lucene.Net.Analysis.Analyzer'.

    Thanks

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 17, 2017 @ 08:37
    Ismail Mayat
    0

    Harsheet,

    What is MassiveLuceneAnalyser.CiaiAnalyser? Is it something custom?

    Regards

    Ismail

  • Harsheet 71 posts 302 karma points
    Mar 19, 2017 @ 22:11
    Harsheet
    0

    Hi,

    It's the class library I created.

    namespace MassiveLuceneAnalyser
    {
        public class CIAIAnalyser : Analyzer
         {}
    }
    
  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 20, 2017 @ 10:37
    Ismail Mayat
    0

    Harsheet,

    Can you paste your ExamineSettings.config file? It looks like you may have something incorrect there.

    Also, at what point do you get the error? When the site loads?

    The error states

    Unable to cast object of type 'MassiveLuceneAnalyser.CiaiAnalyser' to type 'Lucene.Net.Analysis.Analyzer'.

    But you have

    CIAIAnalyser

    Maybe it's a case issue?

    Regards

    Ismail

  • Harsheet 71 posts 302 karma points
    Mar 21, 2017 @ 01:52
    Harsheet
    0

    Hey, it's a typo actually. It's CiaiAnalyser everywhere in my code. But I am still getting the error.

  • Harsheet 71 posts 302 karma points
    Mar 21, 2017 @ 03:38
    Harsheet
    0

    Another problem is that I am not able to do this.

    tokenizer.SetMaxTokenLength(255);

    One more error is coming up. See the screenshot attached.

    Thanks

    [screenshot]

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 21, 2017 @ 08:45
    Ismail Mayat
    0

    Harsheet,

    That error can be misleading. You have another issue somewhere in your index and settings config files. I would double check those. Also, try commenting out the CiaiAnalyser one: does the site load then? If so, you have some issue with that part of the config.

    Regards

    Ismail

  • Marco Teodoro 72 posts 147 karma points c-trib
    Apr 28, 2017 @ 17:53
    Marco Teodoro
    0

    Hi Berto, I know this is a very old post, but I'm trying to implement the solution that you and Nuno showed and I get the following exception:

    Provider must implement the class 'Examine.Providers.BaseSearchProvider'.

    [screenshot]

    My custom provider:

    namespace DoublePT.UmbracoExamineSearch
    {
        public class CIAIAnalyser : StandardAnalyzer
        {
            public CIAIAnalyser() : base(Lucene.Net.Util.Version.LUCENE_24, StopAnalyzer.ENGLISH_STOP_WORDS_SET) { }

            public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
            {
                StandardTokenizer tokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);

                tokenizer.SetMaxTokenLength(255);
                TokenStream stream = new StandardFilter(tokenizer);
                stream = new LowerCaseFilter(stream);
                return new ASCIIFoldingFilter(stream);
            }
        }
    }

    And finally, the Examine settings:

    [screenshot]

  • Victor 25 posts 146 karma points
    Jul 05, 2017 @ 19:11
    Victor
    0

    I'm having a problem with these search terms as well. Does it work with .TypedSearch("string"), or do I need to use SearchCriteria?

    EDIT: Made a post about it: https://our.umbraco.org/forum/extending-umbraco-and-using-the-api/86765-umbracotypedsearch-using-searchterms-with-accents-other-languages#comment-274995

  • Güray 1 post 71 karma points
    Mar 29, 2018 @ 06:54
    Güray
    0

    Hello all,

    Does Umbraco use Lucene for the backoffice search? The same problem exists in the backoffice content search. I've tried the CIAIAnalyser solution, but it made no difference.

    [screenshot]

    The problem occurs with the Turkish I character. When I search content, it sends an AJAX request like this:

    GET /umbraco/backoffice/UmbracoApi/Content/GetChildren?id=1164&pageNumber=1&pageSize=10&orderBy=sortOrder&orderDirection=Ascending&orderBySystemField=true&filter=seç

    So I downloaded the source code and examined the controller. It probably does not use Examine; it brings results from the DB.

    Any other solution or content searcher plugin you can offer would be great.

    Edit: Stackoverflow Link of my question
