  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 13:53

    Lucene search and special characters

    I have implemented umbSearch, and it seems I have a problem indexing/searching for special characters such as æ, ø and å.

    Any ideas on how I can trick Lucene into indexing and searching for these characters?

    Kind regards
    Dennis

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 14:28

    Dennis,

    Which language is your website in?

    Regards

    Ismail

  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 14:35

    A combination of Danish and English.

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 15:20

    Milandt,

    I don't think there should be an issue with Danish characters. You do have to use special analyzers when indexing Chinese and the like. Is anyone else out there running a Danish site and using Lucene to search?

    I also found this post on Google, but that refers to the Java version of Lucene. It may be that you need to do something similar for Lucene.Net? It might be worth a browse on the Lucene.Net home page.

    Regards

    Ismail

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 15:42

    Milandt,

    I did some more digging around, and no joy; however, I think you do need a Danish analyzer. I have restored on my blog a copy of the old dotLucene site, and its downloads included a Spanish and an Italian analyzer. You can download it here and, using that as a base, knock together a Danish one. When it comes to indexing and searching, you will need to determine whether an item is Danish (you should be able to do this using the Umbraco localisation features, possibly via the hostname or something) and then use the Danish analyzer instead of the standard one.
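    For illustration, a minimal sketch of what such an analyzer could look like against the Lucene.Net 2.x-era API; the class name and the three stopwords are placeholders, not a real Danish stopword set:

        using System.IO;
        using Lucene.Net.Analysis;
        using Lucene.Net.Analysis.Standard;

        public class DanishAnalyzer : Analyzer
        {
            // Placeholder list; a real implementation needs a full Danish stopword set.
            private static readonly string[] StopWords = { "og", "i", "det" };

            public override TokenStream TokenStream(string fieldName, TextReader reader)
            {
                TokenStream result = new StandardTokenizer(reader); // splits on word boundaries, keeps æ/ø/å
                result = new StandardFilter(result);                // strips apostrophes and acronym dots
                result = new LowerCaseFilter(result);               // lower-cases, including æ/ø/å
                result = new StopFilter(result, StopWords);         // drops the stopwords above
                return result;
            }
        }

    You would then pass an instance of this to both the IndexWriter and the QueryParser for Danish items, keeping the StandardAnalyzer for English ones. Note this only tokenizes and removes stopwords; the stemming is the hard part.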

    Hope this helps. The Lucene in Action book also gives a bit more information about creating analyzers.

    Regards

    Ismail

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 15:46

    PS: this will help to build the Danish stemmer.

    This lot is advanced stuff, so it might be worth having a dig on SourceForge to see if someone has already done a Danish analyzer.

    Regards

    Ismail

  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 18:02

    Hi Ismail,

    Thank you for all the resources. I have tried to look into it, and you are right about it being complicated. I don't think I would be up to building a Danish stemmer based on the algorithm.

    However, my content is both Danish and English (language-specific tabs in Umbraco). I am all right with the stopwords and stemming not working for either of the two languages; I just want words with the Danish special characters to be searchable.

    Any idea how?

    I have tried to look for code on SourceForge and Google Code Search, but can't find anything to get me on the right track.

    Dennis

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 18:13

    Milandt,

    Looking at this Google post, it mentions the .jj lexer file. I did some digging around in the Lucene source code and found where the standard analysis happens; there is a .jj file in there, and it looks as though the changes from the Google post can be made in that file and then recompiled.

    This would be a long shot, but it is possibly worth a try. In the file there is a comment:

    [quote]
    This should be a good tokenizer for most European-language documents
    [/quote]

    so I am not sure why Danish is proving to be an issue.

    You could also try emailing George Aroush via http://www.aroush.net/, who is the main contributor to Lucene.Net.

    Regards

    Ismail

  • Dennis Milandt 190 posts 517 karma points
    Mar 19, 2009 @ 19:14

    I think I have found the reason why Lucene refused to search for words with æ, ø and å in them.

    When the RTE in Umbraco saves content, special characters are HTML-escaped: æ is saved as the entity &aelig;.

    I found this by debugging the AddDoc method in the Indexer.cs file in umbSearch.

    I added text = HttpContext.Current.Server.HtmlDecode(text); to the method, which decodes the HTML entities so Lucene indexes the real characters and can match the search query correctly.
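    For illustration, roughly how that looks in context; only the HtmlDecode line is from this thread, and the rest of the method is a hypothetical sketch of an AddDoc-style indexing method, not umbSearch's actual code:

        using System.Web;
        using Lucene.Net.Documents;
        using Lucene.Net.Index;

        // Hypothetical shape; umbSearch's real AddDoc signature may differ.
        public void AddDoc(IndexWriter writer, string nodeId, string text)
        {
            // The RTE stores special characters as HTML entities (æ becomes &aelig;),
            // so decode before indexing, or Lucene never sees the real word.
            text = HttpContext.Current.Server.HtmlDecode(text);

            Document doc = new Document();
            doc.Add(new Field("id", nodeId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }

    The same decode may be needed on the query side if the search terms can arrive HTML-encoded.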

    So far this seems to have solved it for me.

    Thank you for all your feedback. If anything, it gave me some insight in stemming and tokenizing.

    "One world wide language" has now been added to my wish-list amongst one world wide currency and one world wide time format :)

    Dennis

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Mar 19, 2009 @ 23:09

    Dennis,

    Brilliant. As per the comment, most European languages are covered by the standard analyzer, and analyzers for Chinese/Korean etc. are readily available, but it's nice to see the fix is a simple one.

    Ismail

  • Ismail Mayat 4511 posts 10090 karma points MVP 2x admin c-trib
    Jun 24, 2009 @ 17:13

    Dennis,

    I am having a similar issue, and I think the default raw encoding setting in TinyMCE may also have something to do with this.
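    For what it's worth, TinyMCE has an entity_encoding option, and setting it to "raw" stores characters literally instead of as named entities; a sketch of how that might look in Umbraco's config/tinyMceConfig.config (the customConfig wrapper is an assumption about that file's layout):

        <customConfig>
          <!-- "raw" keeps æ, ø and å as literal characters instead of HTML entities -->
          <config key="entity_encoding">raw</config>
        </customConfig>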

    Regards

    Ismail
