Examine with Swedish characters

Daniel Gustafsson 13 posts 93 karma points

Aug 16, 2019 @ 09:39

Hi,

I'm trying to build a search function for a site that is in Swedish. I am able to search, but when I search with Swedish characters (Å, Ä, Ö) it does not work.

For example, if I search for Göteborg I get 0 hits, but if I instead use the term Goteborg it works.

Anybody got a solution? Do I need to configure the index for multiple languages?

Thanks in advance!

ErikC 2 posts 72 karma points

Sep 02, 2019 @ 15:08

Hello Daniel! I'm having the same issue with Swedish characters. Were you able to solve this?

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 02, 2019 @ 16:53

Are you doing a wildcard search? During indexing the text runs through the standard analyser (that's if you have not changed it to another analyser), which ASCII-flattens characters, so (Å Ä Ö) will go in as (a a o). During searching it does the same thing, so it should all work.

If you are doing wildcard searching then, if I remember rightly, it won't ASCII-flatten the query, so it searches literally on those characters, while the Examine / Lucene index holds the flattened characters.

I recall covering this in the notes for the Examine course, so the code I have is:

public class AsciiFoldingFilter
{
    private readonly Analyzer _analyzer;
    // We are analyzing the query before adding the wildcards.
    // This way the words containing diacritics (characters specific to a language)
    // will be folded to the ASCII character set,
    // e.g. "weiß Glückwunsch" will be flattened into "weiss gluckwunsch".
    //
    // When the wildcards are added before analyzing, the text will not be analyzed.
    // https://issues.apache.org/jira/browse/LUCENENET-486 
    // http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
    public AsciiFoldingFilter(BaseSearchProvider baseSearchProvider)
    {
        var luceneSearch = (BaseLuceneSearcher)baseSearchProvider;
        _analyzer = luceneSearch.IndexingAnalyzer;
    }
    public AsciiFoldingFilter(Analyzer analyzer)
    {
        _analyzer = analyzer;
    }
    public string FlattenToAscii(string stringToFold)
    {
        var parser = new QueryParser(
            Lucene.Net.Util.Version.LUCENE_29,
            string.Empty,
            _analyzer);


        var query = parser.Parse(stringToFold.Trim());
        return query.ToString();
    }
}

On the query side, before you wildcard it, run the query through this AsciiFoldingFilter, then add the wildcard, and it should work.

Regards

Ismail
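
As a rough illustration of the ordering described above (fold first, then add the wildcard), here is a minimal sketch that does not use Lucene at all. `QueryFolding`, `FoldDiacritics` and `ToWildcardTerm` are hypothetical names, and the folding uses .NET's Unicode normalization as a stand-in for whatever the index analyzer does, so it only approximates it (for example, it leaves ø and æ untouched).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class QueryFolding
{
    // Hypothetical helper: decomposes each character and drops the combining
    // marks, so "ö" becomes "o" and "å" becomes "a". This is a stand-in for
    // the index analyzer, not Lucene's own folding filter.
    public static string FoldDiacritics(string input)
    {
        var decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (var c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Fold the user's input first, then append the trailing wildcard,
    // so the wildcard term matches the flattened terms in the index.
    public static string ToWildcardTerm(string userInput)
    {
        return FoldDiacritics(userInput.Trim().ToLowerInvariant()) + "*";
    }
}
```

With this sketch, `QueryFolding.ToWildcardTerm("Göteborg")` yields `goteborg*`, which is the form a wildcard query would need if the index holds `goteborg`.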

Daniel Gustafsson 13 posts 93 karma points

Sep 05, 2019 @ 08:31

Hi,

Thanks for the reply. After some investigation on my own I found out that it was indeed the wildcard search that did not flatten the Swedish characters. I tried it with a fuzzy search and it worked as well.

Thanks for the code. I will try that solution out.

/Daniel

Markus Johansson 1945 posts 5898 karma points MVP 2x c-trib

Mar 05, 2020 @ 14:40

Hi!

Thanks for sharing a potential solution, Ismail!

I'm facing the same issue here as well. I've set up my indexes to use the StandardAnalyzer and I need to do wildcard searches. I've tried passing my search word to your AsciiFoldingFilter example above to parse it, but it still returns the Swedish characters, like åäö.

I.e. I'm trying to search for the word "små", which gives 0 hits, but searching for "sma" works fine.

Looking at the index with Luke shows that the "Term Vector" contains the word "sma", so it was somehow flattened correctly during indexing.

So when using the QueryParser implementation above with the StandardAnalyzer, the term is not parsed correctly... What am I missing here? Shouldn't the QueryParser apply the same "processing" that the indexer does?

Markus Johansson 1945 posts 5898 karma points MVP 2x c-trib

Mar 05, 2020 @ 15:18

My indexes inherit from UmbracoContentIndex, and it seems like some of the content that is indexed uses the "CultureInvariantStandardAnalyzer", which is the analyzer that Umbraco uses for its indexes.

So it might be that my indexing process is using that analyzer while my search is using the configured StandardAnalyzer, i.e. this analyzer is hardcoded here in the FullTextType: https://github.com/Shazwazza/Examine/blob/515620ac8da1abd60404890cc0359cd53cda6079/src/Examine/LuceneEngine/Indexing/FullTextType.cs

I did not spend a lot of time figuring out exactly what's going on, but changing my indexes to use the CultureInvariantStandardAnalyzer solved the issue.

Jan A 63 posts 268 karma points

Apr 07, 2020 @ 14:11

Hi

I'm facing the same problem and trying to figure out where to find the index settings so I can test changing to CultureInvariantStandardAnalyzer as well. Can I configure it somewhere, or do I need to create a custom index?

What I find strange is that if I search for åäö in the backoffice (Settings > Examine Management > External Index) it finds the posts.

I think my search is pretty standard.

if (!String.IsNullOrEmpty(searchTerm) && ExamineManager.Instance.TryGetIndex("ExternalIndex", out var index))
{
    var searcher = index.GetSearcher();

    var criteria = searcher.CreateQuery("content", BooleanOperation.And)
        .GroupedOr(new List<string> { "combinedField" }, searchTerm.ToLower().MultipleCharacterWildcard())
        .And()
        .Field("searchablePath", Model.HomeNode.Id.ToString())
        .Not()
        .Field("umbracoNaviHide", "1");

    var searchList = criteria.Execute();
    var result = searchList.ToPublishedSearchResults(UmbracoContext.PublishedSnapshot.Content);
}

Michael Nielsen 82 posts 362 karma points

May 20, 2020 @ 07:59

I have the same problem, but using the AsciiFoldingFilter does not seem to work.

The value of the search string is the same before and after getting sent through the filter.

var s = Request.QueryString["s"];

// s value = æøå

if (!_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out IIndex index))
{
    throw new InvalidOperationException($"No index found by name {Constants.UmbracoIndexes.ExternalIndexName}");
}

var searcher = (BaseLuceneSearcher)index.GetSearcher();

var asciiFilter = new AsciiFoldingFilter(searcher);
s = asciiFilter.FlattenToAscii(s);

// s value still = æøå, not aeoeaa as expected

Jan A 63 posts 268 karma points

May 20, 2020 @ 08:19

Hi

What I ended up doing was an extension method. This is for Umbraco 8 (where Swedish åäö is replaced with a a o and not aa ae oe).

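    public static string SearchFriendlyString(this string q)
    {
        // ISO-8859-8 cannot represent åäö, so the encoder falls back to the
        // closest ASCII characters (a a o); the bytes are then read back as text.
        byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(q);
        return System.Text.Encoding.UTF8.GetString(tempBytes);
    }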

So in my search I just call

        query = query.SearchFriendlyString();

Michael Nielsen 82 posts 362 karma points

May 20, 2020 @ 08:40

OK, unfortunately Danish needs ae, oe, aa for æ, ø, å, so that solution won't work either.
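
For the Danish transliteration mentioned here, an explicit mapping is one possible approach. The sketch below is only an illustration: `DanishFolding` and `ToDanishAscii` are hypothetical names, and it only helps if the indexed values are transliterated the same way, which this thread does not confirm.

```csharp
using System.Text;

public static class DanishFolding
{
    // Hypothetical helper: maps æ, ø, å to ae, oe, aa and leaves every
    // other character unchanged. Upper-case letters map to a capitalised
    // pair (Æ -> Ae), which is an arbitrary choice for this sketch.
    public static string ToDanishAscii(string input)
    {
        var sb = new StringBuilder(input.Length + 4);
        foreach (var c in input)
        {
            switch (c)
            {
                case 'æ': sb.Append("ae"); break;
                case 'ø': sb.Append("oe"); break;
                case 'å': sb.Append("aa"); break;
                case 'Æ': sb.Append("Ae"); break;
                case 'Ø': sb.Append("Oe"); break;
                case 'Å': sb.Append("Aa"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

Calling `DanishFolding.ToDanishAscii("æøå")` gives `aeoeaa`. The same function would have to be applied to the text at indexing time for the search terms and the indexed terms to line up.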

Rasmus Pedersen 4 posts 74 karma points

Oct 28, 2020 @ 08:23

This worked for me in Danish. ø -> o, which was fine for my index. Running Umbraco 8.6.1. Thanks Jan!

Thomas 319 posts 606 karma points c-trib

Mar 11, 2021 @ 08:14

But with that, if you search for "Løn", would "Lone" also come back if wildcards are on?
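
To make the concern concrete: a trailing wildcard behaves like a prefix match, so once "Løn" is folded to "lon", every indexed term starting with "lon" is a candidate. The sketch below imitates that with a plain `StartsWith` over a hand-written term list. `PrefixMatchDemo`, `MatchTrailingWildcard` and the sample terms are all hypothetical, and the real result depends on how the index and query are built.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixMatchDemo
{
    // Hypothetical stand-in for a trailing-wildcard query ("lon*"):
    // returns the indexed terms that start with the folded search term.
    public static List<string> MatchTrailingWildcard(IEnumerable<string> indexedTerms, string foldedTerm)
    {
        return indexedTerms
            .Where(t => t.StartsWith(foldedTerm, StringComparison.Ordinal))
            .ToList();
    }
}
```

Given the terms `lon`, `lone` and `land`, `MatchTrailingWildcard(terms, "lon")` returns `lon` and `lone`, so under these assumptions "Lone" would come back as well.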

Thomas 319 posts 606 karma points c-trib

Mar 11, 2021 @ 08:19

Did you find a solution?

Thomas 319 posts 606 karma points c-trib

Sep 16, 2020 @ 09:27

Did you find any solutions? I have problems with ÆØÅ search.
