  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 13:53

    Lucene search and special characters

    I have implemented umbSearch, and it seems I have a problem indexing/searching for special characters such as æ, ø and å.

    Any ideas on how I can trick Lucene into indexing and searching for these characters?

    Kind regards
    Dennis

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 14:28

    Dennis,

    Which language is your website in?

    Regards

    Ismail

  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 14:35

    A combination of Danish and English.

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 15:20

    Milandt,

    I don't think there should be an issue with Danish characters. You do have to use special analyzers when indexing Chinese and the like. Is anyone else out there running a Danish site and using Lucene to search?

    I also found this post on Google, but that refers to the Java version of Lucene. It may be that you need to do something similar for Lucene.Net? It might be worth a browse on the Lucene.Net home page.

    Regards

    Ismail

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 15:42

    Milandt,

    I did some more digging around, and no joy; however, I think you do need a Danish analyzer. I have restored on my blog a copy of the old dotLucene site, and its downloads included a Spanish and an Italian analyzer. You can download it here and, using that as a base, knock together a Danish one. When it comes to indexing and searching, you will need to determine whether an item is Danish (you should be able to do this using the Umbraco localisation features, possibly via the hostname or something) and then use the Danish analyzer instead of the standard one.
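    For illustration, a minimal sketch of what such an analyzer could look like against the Lucene.Net 2.x-era API; the class name and the three stopwords are placeholders, not a real Danish stopword set:

        using System.IO;
        using Lucene.Net.Analysis;
        using Lucene.Net.Analysis.Standard;

        public class DanishAnalyzer : Analyzer
        {
            // Placeholder list; a real implementation needs a full Danish stopword set.
            private static readonly string[] StopWords = { "og", "i", "det" };

            public override TokenStream TokenStream(string fieldName, TextReader reader)
            {
                TokenStream result = new StandardTokenizer(reader); // splits on word boundaries, keeps æ/ø/å
                result = new StandardFilter(result);                // strips apostrophes and acronym dots
                result = new LowerCaseFilter(result);               // lower-cases, including æ/ø/å
                result = new StopFilter(result, StopWords);         // drops the stopwords above
                return result;
            }
        }

    You would then pass an instance of this to both the IndexWriter and the QueryParser for Danish items, keeping the StandardAnalyzer for English ones. Note this only tokenizes and removes stopwords; the stemming is the hard part.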

    Hope this helps. The Lucene in Action book also gives a bit more information about creating analyzers.

    Regards

    Ismail

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 15:46

    PS: this will help to build the Danish stemmer.

    This lot is advanced stuff, so it might be worth having a dig on SourceForge to see if someone has already done a Danish analyzer.

    Regards

    Ismail

  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 18:02

    Hi Ismail,

    Thank you for all the resources. I have tried to look into it, and you are right about it being complicated. I don't think I would be up to building a Danish stemmer based on the algorithm.

    However, my content is both Danish and English (language-specific tabs in Umbraco). I am all right with the stopwords and stemming not working for either of the two languages; I just want words with the Danish special characters to be searchable.

    Any idea how?

    I have tried to look for code on SourceForge and Google Code Search, but can't find anything to get me on the right track.

    Dennis

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 18:13

    Milandt,

    Looking at this Google post, it mentions the .jj lexer file. I did some digging around in the Lucene source code and found where the standard analysis happens; there is a .jj file in there, and it looks as though the changes from the Google post can be made in that file and then recompiled.

    This would be a long shot, but it is possibly worth a try. In the file there is a comment:

    [quote]
    This should be a good tokenizer for most European-language documents
    [/quote]

    so I am not sure why Danish is proving to be an issue.

    You could also try emailing George Aroush via http://www.aroush.net/, who is the main contributor to Lucene.Net.

    Regards

    Ismail

  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 19:14

    I think I have found the reason why Lucene refused to search for words with æ, ø and å in them.

    When the RTE in Umbraco saves content, special characters are HTML-escaped: æ is saved as the entity &aelig;.

    I found this by debugging the AddDoc method in the Indexer.cs file in umbSearch.

    I added text = HttpContext.Current.Server.HtmlDecode(text); to the method, which decodes the HTML entities so Lucene indexes the real characters and can match the search query correctly.
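    For illustration, roughly how that looks in context; only the HtmlDecode line is from this thread, and the rest of the method is a hypothetical sketch of an AddDoc-style indexing method, not umbSearch's actual code:

        using System.Web;
        using Lucene.Net.Documents;
        using Lucene.Net.Index;

        // Hypothetical shape; umbSearch's real AddDoc signature may differ.
        public void AddDoc(IndexWriter writer, string nodeId, string text)
        {
            // The RTE stores special characters as HTML entities (æ becomes &aelig;),
            // so decode before indexing, or Lucene never sees the real word.
            text = HttpContext.Current.Server.HtmlDecode(text);

            Document doc = new Document();
            doc.Add(new Field("id", nodeId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }

    The same decode may be needed on the query side if the search terms can arrive HTML-encoded.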

    So far this seems to have solved it for me.

    Thank you for all your feedback. If anything, it gave me some insight in stemming and tokenizing.

    "One world wide language" has now been added to my wish-list amongst one world wide currency and one world wide time format :)

    Dennis

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 23:09

    Dennis,

    Brilliant. As per the comment, most European languages are covered by the standard analyzer, and analyzers for Chinese/Korean etc. are readily available, but it's nice to see the fix is a simple one.

    Ismail

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jun 24, 2009 @ 17:13

    Dennis,

    I am having a similar issue, and I think the default raw encoding setting in TinyMCE may also have something to do with this.
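    For what it's worth, TinyMCE has an entity_encoding option, and setting it to "raw" stores characters literally instead of as named entities; a sketch of how that might look in Umbraco's config/tinyMceConfig.config (the customConfig wrapper is an assumption about that file's layout):

        <customConfig>
          <!-- "raw" keeps æ, ø and å as literal characters instead of HTML entities -->
          <config key="entity_encoding">raw</config>
        </customConfig>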

    Regards

    Ismail
