
  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 15, 2016 @ 09:53
    Ismail Mayat
    0

    German analyser

    Guys,

    I'm working on search for German using Lucene.Net Contrib and the German analyser. When I run a query, the generated query looks like:

    +(+(contents:universal*)) +__IndexType:con
    

    Notice that it has cut off the last part: it should read __IndexType:content.

    With the standard analyser the query is generated correctly. Has anyone else seen this before?

    Regards

    Ismail

  • Tim 1193 posts 2675 karma points MVP 4x c-trib
    Sep 15, 2016 @ 11:29
    Tim
    0

    My first guess would be that the German stemmer, or some other part of the German analyser pipeline, is changing the word. Do you get the same result if you put "content" in the contents field? E.g.

    +(+(contents:con*)) +__IndexType:con
    
  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 15, 2016 @ 12:01
    Ismail Mayat
    0

    Tim,

    No, if you type the word content you get:

    { SearchIndexType: content, LuceneQuery: +(+(contents:content*)) +__IndexType:con }
    

    Something else is borking it GRRRR

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 15, 2016 @ 12:06
    Ismail Mayat
    0

    Tim,

    Actually, you are right. We took off the wildcard, ran the query again, and now we get:

    "{ SearchIndexType: content, LuceneQuery: +(+(contents:con)) +__IndexType:con }"
    

    Rats!!

    Ismail

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 15, 2016 @ 13:22
    Ismail Mayat
    0

    Tim,

    Very interesting read: http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene It could be the stemmer mashing it up. The article says that if a word is wildcarded then it is not stemmed.

    So maybe this is the issue.

    Regards

    Ismail

  • Tim 1193 posts 2675 karma points MVP 4x c-trib
    Sep 15, 2016 @ 13:22
    Tim
    1

    This Stack Overflow post might have some clues. It looks like, in the Java version at least, you can specify words that don't get stemmed. Not sure if that's the case in the .NET port though.

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 15, 2016 @ 14:07
    Ismail Mayat
    0

    Tim,

    That looks like a list of stop words, not a list of words to exclude from stemming?

    Regards

    Ismail

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 15, 2016 @ 13:30
    Ismail Mayat
    0

    Tim,

    It's definitely the stemmer. I opened Luke, ran some analysis in the plugins section, and for the word content the German analyser gives the stemmed form con.

    Regards

    Ismail

  • Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib
    Sep 16, 2016 @ 08:50
    Ismail Mayat
    0

    Tim,

    My colleague Dawid has created an issue on GitHub: https://github.com/Shazwazza/Examine/issues/54

    Regards

    Ismail

  • Dawid 26 posts 136 karma points c-trib
    Sep 16, 2016 @ 09:27
    Dawid
    0

    I've created a pull request for this issue in the Examine repo.

    A workaround is to modify the query text after the query has been compiled (and so already tokenized and stemmed by GermanAnalyzer).

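    // Requires: using System.Text.RegularExpressions;
    // `query` and `contentSearcher` come from the surrounding Examine code.

    // Compile the criteria; by this point the terms have already been stemmed.
    var rawQuery = query.Compile().ToString();

    // Pull the Lucene query text out of "{ SearchIndexType: ..., LuceneQuery: ... }".
    var queryMatch = Regex.Match(rawQuery,
        @"LuceneQuery: (?<query>.*?)\s*}$");

    var luceneQueryText = queryMatch.Groups["query"].Value;

    // Restore the index type term that the stemmer shortened to "con".
    var fixedRawQuery = Regex.Replace(
        luceneQueryText,
        "__IndexType:con$",
        "__IndexType:content*");

    // Run the corrected text as a raw query.
    var fixedCriteria = contentSearcher.CreateSearchCriteria();

    fixedCriteria.RawQuery(fixedRawQuery);

    var contentSearchResults = contentSearcher.Search(fixedCriteria);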
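
    To try the string handling on its own, without Examine or Lucene.Net, here is a minimal sketch of just the extract-and-replace step. The QueryFixer class and FixIndexType method are hypothetical names made up for illustration, and the input shape is assumed to match the compiled criteria strings quoted earlier in this thread. Unlike the snippet above, it throws if the LuceneQuery part cannot be found, rather than carrying on with an empty string.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical helper (not part of Examine): isolates the regex step of the workaround.
    public static class QueryFixer
    {
        // Takes the text of a compiled criteria, assumed to look like
        // "{ SearchIndexType: content, LuceneQuery: ... }", and returns the
        // Lucene query with the stemmed index type term restored.
        public static string FixIndexType(string compiled)
        {
            var match = Regex.Match(compiled, @"LuceneQuery: (?<query>.*?)\s*}$");
            if (!match.Success)
            {
                // The input did not have the expected shape, so refuse to guess.
                throw new FormatException("No LuceneQuery section found in: " + compiled);
            }

            var luceneQueryText = match.Groups["query"].Value;

            // Only a trailing "__IndexType:con" is rewritten; other terms are left alone.
            return Regex.Replace(
                luceneQueryText,
                "__IndexType:con$",
                "__IndexType:content*");
        }
    }
    ```

    The returned string is what would be passed to RawQuery in the snippet above. Note that the pattern only matches when the index type term is the last thing in the query, so a query with further clauses after it would come back unchanged.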
    