Examine and accents (for Portuguese language)

Berto 105 posts 177 karma points

Jan 18, 2011 @ 13:27

Hi!

So, I'm using Examine for Umbraco for my website search. So far so good, all results are showing up OK.

The problem is when the search terms have accents, for example: logística (it means logistics, if you're wondering).

If I search with the í, no problem, but if I try with i (logistica) it doesn't find anything. How can I make the search accent-insensitive? I've been searching for hours here and on Google, but I didn't find a solution (maybe it's because I've only had 3 hours of sleep).

Thx

Berto 105 posts 177 karma points

Jan 18, 2011 @ 20:19

So, after much digging, Reflector came to the rescue (and a good friend who had all the ideas).

The Lucene documentation is nonexistent or very, very hard to find; if you know where it is, please post it.

I'll just leave my solution to the problem, which turned out to be very easy. If I'm doing it wrong or you know a better way, please share it.

I created a Class Library project, added a reference to Lucene.Net, and created the following class (copy-paste of my .cs file):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace MassiveLuceneAnalyser
{
    // CIAI = Case Insensitive, Accent Insensitive
    public class CIAIAnalyser : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
        {
            // Split the input into tokens with the standard tokenizer
            StandardTokenizer tokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
            tokenizer.SetMaxTokenLength(255);

            TokenStream stream = new StandardFilter(tokenizer);
            // Lowercase every token (case insensitive)
            stream = new LowerCaseFilter(stream);
            // Fold accented characters to their ASCII equivalents (accent insensitive)
            return new ASCIIFoldingFilter(stream);
        }
    }
}

It's mostly a copy of the StandardAnalyzer in the Lucene.Net assembly, but using a different filter (ASCIIFoldingFilter). CIAI stands for Case Insensitive, Accent Insensitive.

Happy coding!

Berto

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 19, 2011 @ 11:17

Berto,

After creating this library and copying it over to the bin folder, did you have to do anything else? Or does the Lucene indexer know what to do by loading the analyzer because it inherits from Analyzer?

Many thanks

Ismail

Berto 105 posts 177 karma points

Jan 19, 2011 @ 11:37

Hi Ismail,

I forgot to post the change to the configuration file (config/ExamineSettings.config) above.

After you place the assembly in bin, you have to change the analyzer in the Examine settings (I'm only posting my providers; don't delete the Umbraco providers):

Before

<Examine>
  <ExamineIndexProviders>
    <providers>
      <!-- ... default Umbraco providers snipped ... -->

      <add name="MySiteIndexer" type="UmbracoExamine.LuceneExamineIndexer, UmbracoExamine"
           runAsync="true"
           supportUnpublished="false"
           supportProtected="true"
           interval="10"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>

    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="InternalSearcher">
    <providers>
      <!-- ... default Umbraco providers snipped ... -->

      <add name="MySiteSearcher" type="UmbracoExamine.LuceneExamineSearcher, UmbracoExamine"
           analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcards="true"/>
    </providers>
  </ExamineSearchProviders>
</Examine>

After

<Examine>
  <ExamineIndexProviders>
    <providers>
      <!-- ... default Umbraco providers snipped ... -->

      <add name="MySiteIndexer" type="UmbracoExamine.LuceneExamineIndexer, UmbracoExamine"
           runAsync="true"
           supportUnpublished="false"
           supportProtected="true"
           interval="10"
           analyzer="MassiveLuceneAnalyser.CIAIAnalyser, MassiveLuceneAnalyser"/>

    </providers>
  </ExamineIndexProviders>
  <ExamineSearchProviders defaultProvider="InternalSearcher">
    <providers>
      <!-- ... default Umbraco providers snipped ... -->

      <add name="MySiteSearcher" type="UmbracoExamine.LuceneExamineSearcher, UmbracoExamine"
           analyzer="MassiveLuceneAnalyser.CIAIAnalyser, MassiveLuceneAnalyser" enableLeadingWildcards="true"/>
    </providers>
  </ExamineSearchProviders>
</Examine>

The only change is the analyzer attribute, which you point at the class in your assembly:

analyzer="[Namespace].[Class], [AssemblyWithoutDotDll]"
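
If the site fails to start after changing the analyzer attribute, one quick sanity check is whether the string actually resolves to a type that derives from Analyzer. What follows is a minimal sketch, not Examine's own loading code (how Examine resolves the attribute internally is not shown in this thread). `AnalyzerTypeCheck` and `Describe` are hypothetical names. The check relies only on `Type.GetType`, which is case-sensitive by default, so `CIAIAnalyser` and `CiaiAnalyser` count as different names.

```csharp
using System;

public static class AnalyzerTypeCheck
{
    // Hypothetical helper: returns null when the name resolves to a type
    // assignable to expectedBase, otherwise a short description of the problem.
    public static string Describe(string assemblyQualifiedName, Type expectedBase)
    {
        // throwOnError: false, so an unknown type or assembly yields null
        Type resolved = Type.GetType(assemblyQualifiedName, false);
        if (resolved == null)
        {
            return "Type not found: check namespace, class name (case matters) and assembly name.";
        }

        if (!expectedBase.IsAssignableFrom(resolved))
        {
            return resolved.FullName + " is not assignable to " + expectedBase.FullName + " as loaded in this process.";
        }

        return null;
    }
}
```

For the configuration above, the call would be `AnalyzerTypeCheck.Describe("MassiveLuceneAnalyser.CIAIAnalyser, MassiveLuceneAnalyser", typeof(Lucene.Net.Analysis.Analyzer))`, run from inside the web project. If it reports "not assignable" even though the class inherits from Analyzer, one possibility worth checking (an assumption, not a confirmed diagnosis) is that the class library was compiled against a different Lucene.Net assembly than the one in the site's bin folder.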

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 19, 2011 @ 11:50

Berto,

Brilliant post, no doubt it will prove very useful for people doing non-English searching.

Regards

Ismail

Aaron Powell 1708 posts 3046 karma points c-trib

Jan 19, 2011 @ 12:10

Refer to the Lucene Java docs; they're just as useful because the APIs are identical.

Also, the StandardAnalyzer (and other default Lucene analyzers) are all for the English language, so yes you do need to write your own.

On an interesting note I believe Manning has a sale on today for the Lucene in Action Second Edition book (which I have a copy of and it's awesome)

Berto 105 posts 177 karma points

Jan 19, 2011 @ 12:43

Hi Slace,

Can you post the links to the Lucene Java docs? Yesterday I went there, but I didn't find anything useful for a noob like me. I think I'm losing my Google skills...

Nice blog by the way ;) It helped me understand what I had to do for this analyzer.

Aaron Powell 1708 posts 3046 karma points c-trib

Jan 19, 2011 @ 12:44

Here's the 2.9.2 docs - http://lucene.apache.org/java/2_9_2/api/all/index.html

That's the latest version ported to Lucene.Net

Nuno Lourenço 4 posts 30 karma points

May 27, 2011 @ 19:38

Hi Berto.
Excellent post about Examine/Lucene.

I've tried your solution and it worked great :)
In my case there was a minor issue. Although I wanted to do exactly as you did, I had to add extra functionality to be able to search for the same word with and without the accent.
For instance, logística and logistica did not return the same results, because the index does not contain the accent.
To get the same results for both words, in my case I added:

        public string RemoveDiacritics(string input)
        {
            // Indicates that a Unicode string is normalized using full canonical decomposition.
            string inputInFormD = input.Normalize(NormalizationForm.FormD);
            var sb = new StringBuilder();

            for (int idx = 0; idx < inputInFormD.Length; idx++)
            {
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(inputInFormD[idx]);
                if (uc != UnicodeCategory.NonSpacingMark)
                {
                    sb.Append(inputInFormD[idx]);
                }
            }

            return (sb.ToString().Normalize(NormalizationForm.FormC));
        }

This helper function is applied to the keyword being searched to remove all accents from it, so that both words give the same results.
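
As a usage sketch, assuming the search term comes straight from user input, the helper can be wrapped so that every term is trimmed, stripped of accents and lowercased before it reaches the searcher. `SearchTermFolding` and `PrepareSearchTerm` are hypothetical names; `RemoveDiacritics` uses the same decomposition technique as the function above.

```csharp
using System.Globalization;
using System.Text;

public static class SearchTermFolding
{
    // Decompose (FormD), drop the combining marks, then recompose (FormC).
    public static string RemoveDiacritics(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Hypothetical wrapper: normalizes raw user input into the form
    // that an accent-folded, lowercased index is expected to contain.
    public static string PrepareSearchTerm(string userInput)
    {
        return RemoveDiacritics(userInput.Trim()).ToLowerInvariant();
    }
}
```

With this, `PrepareSearchTerm("Logística")` and `PrepareSearchTerm("logistica")` produce the same term, which can then be passed to whatever search call the site already uses.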

I hope the issue is clear, and that this helps :)
Cheers!

Berto 105 posts 177 karma points

May 27, 2011 @ 20:02
Hi Nuno! A Portuguese person on this forum!!!!! I was starting to think I wouldn't find any ;)

I don't think I had that problem (I'll have to check...), but either way, here is my remove-diacritics helper (it's an extension method):
```
public static string RemoveAccent(this string txt)
{
    byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt);
    return System.Text.Encoding.ASCII.GetString(bytes);
}
```
I'm going to check my search function to see if I had to use it...

A big hug!

Berto

Bendik Engebretsen 105 posts 202 karma points

Feb 02, 2017 @ 12:42

Just wanted to say: Thank you guys! This solved my case for a site where I had to implement searching for Greek places;-) I had to use both Berto's analyzer mod and Nuno's RemoveDiacritics. Saved my day!

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Feb 02, 2017 @ 12:46

Bendik,

Are you using a Greek analyser? Also, are you doing wildcard searches? The reason I ask is that if you are using the analyser, the indexed text should get ASCII-folded. When you query without wildcards, the query also gets ASCII-folded and the search should work.

I found that when I was doing wildcard searches on, say, German content, any word with an umlaut was not working. This is because in the index it's ASCII-folded, but when querying with a wildcard the term was not ASCII-folded.
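
A minimal sketch of one workaround, assuming the wildcard term is built by hand before it is passed to the searcher: fold and lowercase the user's input yourself, then append the wildcard. `WildcardTermBuilder`, `Fold` and `ToPrefixWildcard` are hypothetical names. The folding here uses the Unicode-decomposition approach from Nuno's post rather than Lucene's ASCIIFoldingFilter, so characters with no decomposition (ß, ø, æ) are left unchanged, whereas ASCIIFoldingFilter maps them to ASCII. Escaping of Lucene query syntax characters is not handled.

```csharp
using System.Globalization;
using System.Text;

public static class WildcardTermBuilder
{
    // Strips combining marks after canonical decomposition (ü -> u, é -> e).
    public static string Fold(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Hypothetical helper: folds and lowercases the input, then appends "*".
    // Empty or whitespace-only input yields an empty string (no bare "*").
    public static string ToPrefixWildcard(string userInput)
    {
        string folded = Fold(userInput.Trim()).ToLowerInvariant();
        return folded.Length == 0 ? folded : folded + "*";
    }
}
```

The point of doing this by hand is that the wildcard term then matches the folded form stored in the index, which is the mismatch described above.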

Regards

Ismail

Harsheet 71 posts 302 karma points

Mar 17, 2017 @ 04:59

Hi,

I am getting this error

Unable to cast object of type 'MassiveLuceneAnalyser.CiaiAnalyser' to type 'Lucene.Net.Analysis.Analyzer'.

Thanks

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Mar 17, 2017 @ 08:37

Harsheet,

What is MassiveLuceneAnalyser.CiaiAnalyser? Something custom?

Regards

Ismail

Harsheet 71 posts 302 karma points

Mar 19, 2017 @ 22:11
Hi,

It's the class library I created.
```
namespace MassiveLuceneAnalyser
{
    public class CIAIAnalyser : Analyzer
     {}
}
```
Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Mar 20, 2017 @ 10:37

Harsheet,

Can you paste your ExamineSettings.config file? It looks like you may have something incorrect there.

Also, at what point do you get the error? When the site loads?

The error states

Unable to cast object of type 'MassiveLuceneAnalyser.CiaiAnalyser' to type 'Lucene.Net.Analysis.Analyzer'.

But you have

CIAIAnalyser

Maybe a case issue?

Regards

Ismail

Harsheet 71 posts 302 karma points

Mar 21, 2017 @ 01:52

Hey, it's a typo actually. It's CiaiAnalyser everywhere in my code. But I am still getting an error.

Harsheet 71 posts 302 karma points

Mar 21, 2017 @ 03:38

Another problem is that I am not able to do this.

tokenizer.SetMaxTokenLength(255);

One more error is coming. See the screenshot attached.

Thanks

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Mar 21, 2017 @ 08:45

Harsheet,

That error can be misleading. You have another issue somewhere with your index and settings config files. I would double-check those. Also try commenting out the CiaiAnalyser one; does the site load then? If so, you have some issue with that part of the config.

Regards

Ismail

Marco Teodoro 74 posts 149 karma points c-trib

Apr 28, 2017 @ 17:53
Hi Berto, I know this is a very old post, but I'm trying to implement the solution that you and Nuno showed, and I get the following exception:

Provider must implement the class 'Examine.Providers.BaseSearchProvider'.

My custom provider:
```
namespace DoublePT.UmbracoExamineSearch
{
    public class CIAIAnalyser : StandardAnalyzer
    {
        public CIAIAnalyser() : base(Lucene.Net.Util.Version.LUCENE_24, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        {
        }

        public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
        {
            StandardTokenizer tokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);

            tokenizer.SetMaxTokenLength(255);
            TokenStream stream = new StandardFilter(tokenizer);
            stream = new LowerCaseFilter(stream);
            return new ASCIIFoldingFilter(stream);
        }
    }
}
```

And finally, the Examine settings.

Victor 25 posts 146 karma points

Jul 05, 2017 @ 19:11

I'm having a problem with these search terms as well. Does it work with .TypedSearch("string") or do I need to use SearchCriteria?

EDIT: Made a post about it: https://our.umbraco.org/forum/extending-umbraco-and-using-the-api/86765-umbracotypedsearch-using-searchterms-with-accents-other-languages#comment-274995

Güray 1 post 71 karma points

Mar 29, 2018 @ 06:54

Hello all,

Does Umbraco use Lucene for the backoffice search? The same problem exists in the backoffice content search. I've tried the CIAIAnalyser solution, but it made no difference. The problem occurs with the Turkish I character. When I search content, it sends an AJAX request like this:

GET /umbraco/backoffice/UmbracoApi/Content/GetChildren?id=1164&pageNumber=1&pageSize=10&orderBy=sortOrder&orderDirection=Ascending&orderBySystemField=true&filter=seç

So I downloaded the source code and examined the controller. It probably does not use Examine; it brings results from the database.

Any other solution or content searcher plugin you can offer would be great.

Edit: Stackoverflow Link of my question
