I'm trying to build a search function for a site that is in Swedish. I am able to search, but when I search with Swedish characters (Å, Ä, Ö) it does not work.
For example, if I search for Göteborg I get 0 hits, but if I use the term Goteborg instead, it works.
Does anybody have a solution? Do I need to configure the index for multiple languages?
Are you doing a wildcard search? During indexing the text runs through the standard analyzer (that is, if you have not changed it to another analyzer), which ASCII-flattens characters, so (Å, Ä, Ö) go in as (a, a, o). During searching it does the same thing, so it should all work.
If you are doing a wildcard search then, if I remember rightly, the query is not ASCII-flattened, so it searches literally for those characters while the Examine/Lucene index holds the flattened ones.
I recall covering this, or having it in the notes, on the Examine course, so the code I have is:
// Namespaces as used by Examine on Lucene.Net 2.9; adjust them to your version.
using Examine.LuceneEngine.Providers;
using Examine.Providers;
using Lucene.Net.Analysis;
using Lucene.Net.QueryParsers;

public class AsciiFoldingFilter
{
    private readonly Analyzer _analyzer;

    // We are analyzing the query before adding the wildcards.
    // This way the words containing diacritics (characters specific to a language)
    // will be folded to the ASCII character set,
    // e.g. "weiß Glückwunsch" will be flattened into "weiss gluckwunsch".
    // Folding only happens if the analyzer used here performs ASCII folding.
    //
    // When the wildcards are added before analyzing, the text will not be analyzed:
    // https://issues.apache.org/jira/browse/LUCENENET-486
    // http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
    public AsciiFoldingFilter(BaseSearchProvider baseSearchProvider)
    {
        // Reuse the analyzer the index was built with.
        var luceneSearch = (BaseLuceneSearcher)baseSearchProvider;
        _analyzer = luceneSearch.IndexingAnalyzer;
    }

    public AsciiFoldingFilter(Analyzer analyzer)
    {
        _analyzer = analyzer;
    }

    public string FlattenToAscii(string stringToFold)
    {
        // Parsing with an empty default field runs the text through the analyzer.
        var parser = new QueryParser(
            Lucene.Net.Util.Version.LUCENE_29,
            string.Empty,
            _analyzer);
        var query = parser.Parse(stringToFold.Trim());
        return query.ToString();
    }
}
On the query side, run the query through this AsciiFoldingFilter before you add the wildcard, and it should work.
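For anyone landing here, a minimal sketch of the "fold first, then add the wildcard" order described above. The names `SearchTermFolding`, `FoldToAscii` and `ToPrefixWildcard` are made up for illustration and are not part of Examine or Lucene.Net. It uses plain .NET string normalization instead of a Lucene analyzer, so it only approximates what the index-side folding does; the explicit cases for ø, æ and ß are there because Unicode decomposition alone leaves those characters untouched.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration only; not part of Examine or Lucene.Net.
public static class SearchTermFolding
{
    // Removes diacritics so that "Göteborg" becomes "Goteborg".
    public static string FoldToAscii(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        // FormD splits a letter such as "ö" into "o" plus a combining mark.
        var decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (var c in decomposed)
        {
            // Drop the combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // These letters have no Unicode decomposition, so map them by hand.
            switch (c)
            {
                case '\u00F8': sb.Append('o'); break;   // ø
                case '\u00D8': sb.Append('O'); break;   // Ø
                case '\u00E6': sb.Append("ae"); break;  // æ
                case '\u00C6': sb.Append("AE"); break;  // Æ
                case '\u00DF': sb.Append("ss"); break;  // ß
                default: sb.Append(c); break;
            }
        }

        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Folds and lowercases the term first, then appends the wildcard.
    public static string ToPrefixWildcard(string term)
    {
        return FoldToAscii((term ?? string.Empty).Trim()).ToLowerInvariant() + "*";
    }
}
```

The folded term is what you would then hand to the wildcard query. It can only match if the index really stores folded terms, as in the question where Goteborg gives hits.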
Thanks for the reply. After some investigation of my own I found out that it was indeed the wildcard search that did not flatten the Swedish characters. I tried it with a fuzzy search and that worked as well.
Thanks for the code. I will try that solution out.
I'm facing the same issue here as well. I've set up my indexes to use the StandardAnalyzer and I need to do wildcard searches. I've tried passing my search word to your AsciiFoldingFilter example above to parse it, but it still returns the Swedish characters, like åäö.
I.e. I'm trying to search for the word "små", which gives 0 hits, but searching for "sma" works fine.
Looking at the index with Luke shows that the "Term Vector" contains the word "sma", so it was flattened correctly during indexing.
So when using the QueryParser implementation above with the StandardAnalyzer, the term is not parsed correctly. What am I missing here? Shouldn't the QueryParser apply the same "processing" that the indexer does?
My indexes inherit from UmbracoContentIndex, and it seems that some of what is indexed goes through the "CultureInvariantStandardAnalyzer", which is the analyzer that Umbraco uses for its indexes.
I did not spend a lot of time figuring out exactly what's going on, but changing my indexes to use the CultureInvariantStandardAnalyzer solved the issue.
I'm facing the same problem and am trying to find the index settings so I can test changing to the CultureInvariantStandardAnalyzer as well. Can I configure it somewhere, or do I need to create a custom index?
What I find strange is that if I search for åäö in the backoffice (Settings > Examine Management > External Index), it finds the posts.
I think my search is pretty standard.
if (!String.IsNullOrEmpty(searchTerm) && ExamineManager.Instance.TryGetIndex("ExternalIndex", out var index))
{
    var searcher = index.GetSearcher();
    var criteria = searcher.CreateQuery("content", BooleanOperation.And)
        .GroupedOr(new List<string> { "combinedField" }, searchTerm.ToLower().MultipleCharacterWildcard())
        .And()
        .Field("searchablePath", Model.HomeNode.Id.ToString())
        .Not()
        .Field("umbracoNaviHide", "1");
    var searchList = criteria.Execute();
    var result = searchList.ToPublishedSearchResults(UmbracoContext.PublishedSnapshot.Content);
}
I have the same problem, but using the AsciiFoldingFilter does not seem to work.
The value of the search string is the same before and after getting sent through the filter.
var s = Request.QueryString["s"];
// s value = æøå
if (!_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out IIndex index))
{
    throw new InvalidOperationException($"No index found by name {Constants.UmbracoIndexes.ExternalIndexName}");
}
var searcher = (BaseLuceneSearcher)index.GetSearcher();
var asciiFilter = new AsciiFoldingFilter(searcher);
s = asciiFilter.FlattenToAscii(s);
// s value still = æøå, not aeoeaa as expected
Examine with swedish characters
Hello Daniel! I'm having the same issue with Swedish characters. Were you able to solve this?
Thanks for sharing a potential solution Ismail!
It might be that my indexing process is using the CultureInvariantStandardAnalyzer while my search is using the configured StandardAnalyzer, i.e. the CultureInvariantStandardAnalyzer is hardcoded here in the FullTextType: https://github.com/Shazwazza/Examine/blob/515620ac8da1abd60404890cc0359cd53cda6079/src/Examine/LuceneEngine/Indexing/FullTextType.cs
Hi
What I ended up doing was writing an extension method. This is for Umbraco 8 (where Swedish åäö is replaced with a, a, o and not aa, ae, oe).
So on my search I just call
OK, unfortunately Danish needs ae, oe, aa for æ, ø, å, so that solution won't work either.
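A minimal sketch of the Danish-style mapping mentioned above (æ -> ae, ø -> oe, å -> aa), in case it helps. `DanishTransliteration` and `Transliterate` are hypothetical names, not part of Umbraco or Examine, and this is plain string replacement rather than anything the index does by itself. It can only produce hits if the indexed text was transliterated the same way, for example in a custom field filled at index time; that is an assumption, not something confirmed in this thread.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of Umbraco or Examine.
public static class DanishTransliteration
{
    // Two-letter replacements for the three Danish letters, in both cases.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u00E6', "ae" }, // æ
        { '\u00F8', "oe" }, // ø
        { '\u00E5', "aa" }, // å
        { '\u00C6', "Ae" }, // Æ
        { '\u00D8', "Oe" }, // Ø
        { '\u00C5', "Aa" }, // Å
    };

    // Replaces each mapped letter and copies every other character unchanged.
    public static string Transliterate(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        var sb = new StringBuilder(input.Length + 4);
        foreach (var c in input)
        {
            if (Map.TryGetValue(c, out var replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

The same method would have to be applied both to the text that goes into the index and to the search term before the wildcard is added, otherwise the two sides will not line up.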
This worked for me in Danish: ø -> o, which was fine for my index. Running Umbraco 8.6.1. Thanks Jan!
But in that case, if you search for "Løn", would results for "Lone" also come back if wildcards are on?
Did you find a solution?
Did you find any solutions? I have problems with ÆØÅ search.