  • Daniel Gustafsson 13 posts 93 karma points
    Aug 16, 2019 @ 09:39
    Daniel Gustafsson
    0

    Examine with Swedish characters

    Hi,

    I'm trying to build a search function for a site that is in Swedish. I am able to search, but when I search with Swedish characters (Å Ä Ö) it does not work.

    For example, if I search for Göteborg I get 0 hits, but if I instead use the term Goteborg it works.

    Has anybody got a solution? Do I need to configure the index for multiple languages?

    Thanks in advance!

  • ErikC 2 posts 72 karma points
    Sep 02, 2019 @ 15:08
    ErikC
    0

    Hello Daniel! I'm having the same issue with Swedish characters. Were you able to solve this?

  • Ismail Mayat 4511 posts 10056 karma points MVP 2x admin c-trib
    Sep 02, 2019 @ 16:53
    Ismail Mayat
    0

    Are you doing a wildcard search? During indexing the text runs through the standard analyser (that is, if you have not changed it to another analyser), which ASCII-flattens characters, so (Å Ä Ö) go in as (a a o). During searching it does the same thing, so it should all work.

    If you are doing wildcard searching then, if I remember rightly, it won't ASCII-flatten the query, so it searches literally on those characters, whereas the Examine / Lucene index holds the flattened characters.

    I recall covering this, or having it in the notes, on the Examine course, so the code I have is:

    public class AsciiFoldingFilter
    {
        private readonly Analyzer _analyzer;
        // We analyze the query before adding the wildcards.
        // This way words containing diacritics (characters specific to a language)
        // will be folded to the ASCII character set,
        // e.g. "weiß Glückwunsch" will be flattened into "weiss gluckwunsch".
        //
        // When the wildcards are added before analyzing, the text will not be analyzed:
        // https://issues.apache.org/jira/browse/LUCENENET-486 
        // http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
        public AsciiFoldingFilter(BaseSearchProvider baseSearchProvider)
        {
            var luceneSearch = (BaseLuceneSearcher)baseSearchProvider;
            _analyzer = luceneSearch.IndexingAnalyzer;
        }
        public AsciiFoldingFilter(Analyzer analyzer)
        {
            _analyzer = analyzer;
        }
        public string FlattenToAscii(string stringToFold)
        {
            var parser = new QueryParser(
                Lucene.Net.Util.Version.LUCENE_29,
                string.Empty,
                _analyzer);
    
    
            var query = parser.Parse(stringToFold.Trim());
            return query.ToString();
        }
    }
    

    On the query side, before you add the wildcard, run the query through this AsciiFoldingFilter, then add the wildcard, and it should work.

    Regards

    Ismail

  • Daniel Gustafsson 13 posts 93 karma points
    Sep 05, 2019 @ 08:31
    Daniel Gustafsson
    0

    Hi,

    Thanks for the reply. After some investigation on my own I found out that it was indeed the wildcard search that was causing the problem with the Swedish characters. I also tried it with a fuzzy search and that worked as well.

    Thanks for the code. I will try that solution out.

    /Daniel

  • Markus Johansson 1657 posts 4724 karma points c-trib
    Mar 05, 2020 @ 14:40
    Markus Johansson
    0

    Hi!

    Thanks for sharing a potential solution Ismail!

    I'm facing the same issue here as well. I've set up my indexes to use the StandardAnalyzer and I need to do wildcard searches. I've tried passing my search word to your AsciiFoldingFilter example above to parse it, but it still returns the Swedish characters, like åäö.

    I.e. I'm trying to search for the word "små", which gives 0 hits, but searching for "sma" works fine.

    Looking at the index with Luke shows that the "Term Vector" contains the word "sma", so it was flattened correctly during indexing.

    So when using the QueryParser implementation above with the StandardAnalyzer, the term is not parsed correctly... What am I missing here? Shouldn't the QueryParser apply the same "processing" that the indexer does?

  • Markus Johansson 1657 posts 4724 karma points c-trib
    Mar 05, 2020 @ 15:18
    Markus Johansson
    1

    My indexes inherit from UmbracoContentIndex and it seems like some of the content that is indexed is using the "CultureInvariantStandardAnalyzer", which is the analyzer that Umbraco uses for its indexes.

    So it might be that my indexing process is using that analyzer while my search is using the configured StandardAnalyzer, i.e. this analyzer is hardcoded here in the FullTextType: https://github.com/Shazwazza/Examine/blob/515620ac8da1abd60404890cc0359cd53cda6079/src/Examine/LuceneEngine/Indexing/FullTextType.cs

    I did not spend a lot of time figuring out exactly what's going on, but changing my indexes to use the CultureInvariantStandardAnalyzer solved the issue.

  • Jan A 39 posts 243 karma points
    Apr 07, 2020 @ 14:11
    Jan A
    0

    Hi

    I'm facing the same problem and am trying to figure out where to find the index settings so I can test changing to the CultureInvariantStandardAnalyzer as well. Can I configure it somewhere, or do I need to create a custom index?

    What I find strange is that if I search for åäö in the backoffice (Settings > Examine Management > External Index) it finds the posts.

    I think my search is pretty standard.

    if (!String.IsNullOrEmpty(searchTerm) && ExamineManager.Instance.TryGetIndex("ExternalIndex", out var index))
    {
        var searcher = index.GetSearcher();
    
        var criteria = searcher.CreateQuery("content", BooleanOperation.And)
            .GroupedOr(new List<string> { "combinedField" }, searchTerm.ToLower().MultipleCharacterWildcard())
            .And()
            .Field("searchablePath", Model.HomeNode.Id.ToString())
            .Not()
            .Field("umbracoNaviHide", "1");
    
        var searchList = criteria.Execute();
        var result = searchList.ToPublishedSearchResults(UmbracoContext.PublishedSnapshot.Content);
    }
    
  • Michael Nielsen 71 posts 309 karma points
    May 20, 2020 @ 07:59
    Michael Nielsen
    0

    I have the same problem, but using the AsciiFoldingFilter does not seem to work.

    The value of the search string is the same before and after getting sent through the filter.

    var s = Request.QueryString["s"];
    
    // s value = æøå
    
    if (!_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out IIndex index))
    {
        throw new InvalidOperationException($"No index found by name {Constants.UmbracoIndexes.ExternalIndexName}");
    }
    
    var searcher = (BaseLuceneSearcher)index.GetSearcher();
    
    var asciiFilter = new AsciiFoldingFilter(searcher);
    s = asciiFilter.FlattenToAscii(s);
    
    // s value still = æøå, not aeoeaa as expected
    
  • Jan A 39 posts 243 karma points
    May 20, 2020 @ 08:19
    Jan A
    2

    Hi

    What I ended up doing was an extension method. This is for Umbraco 8 (where Swedish åäö is replaced with a a o and not aa ae oe):

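        public static string SearchFriendlyString(this string q)
        {
            // ISO-8859-8 has no accented Latin letters, so encoding to it
            // makes .NET substitute the closest plain character (å -> a, ö -> o).
            byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(q);
            return System.Text.Encoding.UTF8.GetString(tempBytes);
        }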
    

    So on my search I just call

            query = query.SearchFriendlyString();
    
  • Michael Nielsen 71 posts 309 karma points
    May 20, 2020 @ 08:40
    Michael Nielsen
    0

    OK, unfortunately Danish needs ae, oe, aa for æ, ø, å, so that solution won't work either.
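
    One possible way around this, as a minimal sketch under assumptions: fold the search term yourself with an explicit per-language map before adding the wildcard. The names SearchTermFolding, FoldToAscii and DanishMap are hypothetical and not part of Examine or Umbraco; only standard .NET string normalization is used. Whether å should become "aa" or "a" depends on what the analyzer actually wrote to the index, so it is worth checking the stored terms with Luke before choosing the map.

    ```csharp
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;

    public static class SearchTermFolding
    {
        // Hypothetical Danish-style map; adjust it to match the terms in your index.
        public static readonly Dictionary<char, string> DanishMap = new Dictionary<char, string>
        {
            ['æ'] = "ae", ['ø'] = "oe", ['å'] = "aa",
            ['Æ'] = "AE", ['Ø'] = "OE", ['Å'] = "AA"
        };

        public static string FoldToAscii(this string input, IDictionary<char, string> overrides = null)
        {
            if (string.IsNullOrEmpty(input)) return input;

            // Step 1: apply the explicit replacements on the composed form,
            // so that 'å' is still a single character when it is looked up.
            var mapped = new StringBuilder(input.Length);
            foreach (var ch in input.Normalize(NormalizationForm.FormC))
            {
                if (overrides != null && overrides.TryGetValue(ch, out var replacement))
                    mapped.Append(replacement);
                else
                    mapped.Append(ch);
            }

            // Step 2: decompose what is left and drop the combining marks,
            // which turns 'ö' into 'o' and 'ä' into 'a'.
            var folded = new StringBuilder(mapped.Length);
            foreach (var ch in mapped.ToString().Normalize(NormalizationForm.FormD))
            {
                if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                    folded.Append(ch);
            }

            return folded.ToString().Normalize(NormalizationForm.FormC);
        }
    }
    ```

    Called without a map, "Göteborg".FoldToAscii() gives "Goteborg"; called as s.FoldToAscii(SearchTermFolding.DanishMap), "æøå" gives "aeoeaa". Characters such as æ and ø have no Unicode decomposition, so they are left unchanged unless the map covers them. The folded string would then be passed to MultipleCharacterWildcard() in place of the raw search term.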

  • Thomas Hansen 171 posts 398 karma points
    5 days ago
    Thomas Hansen
    0

    Did you find any solution? I have problems with ÆØÅ search.
