Lucene search and special characters
I have implemented umbSearch and it seems I have a problem indexing / searching for special characters such as æ, ø and å.
Any ideas how I can trick Lucene to index and search for these characters?
Kind regards
Dennis
Dennis,
Which language is your website in?
Regards
Ismail
A combination of Danish and English.
milandt,
I don't think there should be an issue with Danish chars. You do have to use special analysers when indexing Chinese etc. Is anyone else out there running a Danish site and using Lucene to search?
I also found this post on Google, but that refers to the Java version of Lucene. It may be that you need to do something similar for Lucene.Net? Might be worth a browse on the Lucene.Net home page.
Regards
Ismail
milandt,
I did some more digging around and no joy; however, I think you do need a Danish analyser. I have restored on my blog a copy of the old dotLucene site, which had a Spanish and an Italian analyser in its downloads. You can download it here; using that as a base you could knock together a Danish one. When it comes to indexing and searching, you will need to determine whether the item is Danish (you should be able to do this using the Umbraco localisation stuff, possibly by getting the hostname or something), then use the Danish analyser instead of the standard one.
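A rough sketch of what I mean (DanishAnalyzer is a hypothetical class you would knock together from the Spanish/Italian examples, and the isDanish flag stands in for whatever Umbraco localisation check you end up using):

[code]
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class LocalisedIndexer
{
    // Hypothetical analyser built from the Spanish/Italian
    // examples on the old dotLucene site.
    private readonly Analyzer danishAnalyzer = new DanishAnalyzer();
    private readonly Analyzer standardAnalyzer = new StandardAnalyzer();

    public void Add(IndexWriter writer, Document doc, bool isDanish)
    {
        // IndexWriter.AddDocument accepts a per-document analyser,
        // so Danish and English items can live in the same index.
        writer.AddDocument(doc, isDanish ? danishAnalyzer : standardAnalyzer);
    }
}
[/code]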
Hope this helps. Also, the Lucene in Action book gives a bit more information about creating analysers.
Regards
Ismail
PS: this will help to build the Danish stemmer.
This lot is advanced stuff, so it might be worth having a dig on SourceForge to see if someone has already done a Danish analyser.
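If the Snowball contrib port happens to be available for Lucene.Net (an assumption on my part; the Java contrib certainly has it), you might get away without hand-writing the stemmer at all:

[code]
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Snowball;

public class DanishAnalyserSketch
{
    public static Analyzer Create()
    {
        // "Danish" selects the Snowball-generated Danish stemmer.
        // The stopword list here is only a placeholder; substitute
        // a proper Danish stopword list.
        return new SnowballAnalyzer("Danish", new string[] { "og", "i", "at" });
    }
}
[/code]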
Regards
Ismail
Hi Ismail,
Thank you for all the resources. I have tried to look into it, and you are right about it being complicated. I don't think I would be up to building a Danish stemmer based on the algorithm.
However, my content is both Danish and English (language-specific tabs in Umbraco). I am all right with stopwords and stemming not working for either of the two languages; I just want words with the Danish special characters to be searchable.
Any idea how?
I have tried to look for code on SourceForge and Google Code Search, but can't find anything to get me on the right track.
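For reference, I am querying the index roughly like this (the field name and index path are just illustrative), so as far as I can tell the same StandardAnalyzer is used on both sides:

[code]
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public class SearchSketch
{
    public static Hits Run(string term)
    {
        // The analyser at query time must match the one used at
        // index time, otherwise terms with æ, ø and å won't line up.
        IndexSearcher searcher = new IndexSearcher("/data/index");
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.Parse(term);
        return searcher.Search(query);
    }
}
[/code]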
Dennis
Milandt,
Looking at this Google post, it mentions the .jj lexer file. I did some digging around in the Lucene source code and found where the standard analysis stuff goes on; there is a .jj file in there, and it looks as though the changes from the Google post can be made in that file and then recompiled.
This would be a long shot but possibly worth a try. In the document there is a comment:
[quote]
This should be a good tokenizer for most European-language documents
[/quote]
so I'm not sure why Danish is proving to be an issue.
You could also try emailing George Aroush via http://www.aroush.net/, who is the main contributor to Lucene.Net.
Regards
Ismail
I think I have found the reason why Lucene refused to search for words with æ, ø and å in them.
When the RTE in Umbraco saves content, special chars are escaped: æ is saved as &aelig;
I found this by debugging the AddDoc method in the Indexer.cs file in umbSearch.
I added text = HttpContext.Current.Server.HtmlDecode(text); to the method, which decodes the HTML entities so Lucene can match the search query correctly.
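In context the change looks roughly like this (the real AddDoc signature and field names in umbSearch differ; this just shows where the decode call sits):

[code]
using System.Web;
using Lucene.Net.Documents;

public class IndexerSketch
{
    public static void AddDoc(Document doc, string text)
    {
        // The RTE stores æ as &aelig; etc., so decode the entities
        // before the text goes into the Lucene document.
        text = HttpContext.Current.Server.HtmlDecode(text);
        doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
    }
}
[/code]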
So far this seems to have solved it for me.
Thank you for all your feedback. If anything, it gave me some insight in stemming and tokenizing.
"One world wide language" has now been added to my wish-list amongst one world wide currency and one world wide time format :)
Dennis
Dennis,
Brilliant. As per the comment, most European languages are covered by the standard analyzer, and analyzers for Chinese/Korean etc. are readily available, but nice to see the fix is a simple one.
Ismail
Dennis,
I am having a similar issue, and I think the default raw encoding in TinyMCE may also have something to do with this.
Regards
Ismail