German analyser

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 15, 2016 @ 09:53

Guys,

I'm working on search for German using Lucene.Net contrib and the German analyser. When I run a query, the generated query looks like:
```
+(+(contents:universal*)) +__IndexType:con
```
You'll notice it has cut off the last bit; it should read __IndexType:content

When using the standard analyser, the query is generated fine. Has anyone else seen this before?

Regards

Ismail
Tim 1193 posts 2675 karma points MVP 4x c-trib

Sep 15, 2016 @ 11:29
My first guess would be that the German stemmer, or some other part of the German analyser pipeline, is changing the word. Do you get the same result if you search for "content" in the contents field? E.g.
```
+(+(contents:con*)) +__IndexType:con
```
Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 15, 2016 @ 12:01
Tim,

No, if you type the word content you get
```
{ SearchIndexType: content, LuceneQuery: +(+(contents:content*)) +__IndexType:con }
```
Something else is borking it GRRRR
Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 15, 2016 @ 12:06
Tim,

Actually, you are right. We took off the wildcard and ran the query again, and now we get
```
"{ SearchIndexType: content, LuceneQuery: +(+(contents:con)) +__IndexType:con }"
```
Rats!!

Ismail
Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 15, 2016 @ 13:22

Tim,

Very interesting read: http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene. It could be the stemmer mashing it up. It says that if a word is wildcarded then it's not stemmed.

So maybe this is the issue.

Regards

Ismail

Tim 1193 posts 2675 karma points MVP 4x c-trib

Sep 15, 2016 @ 13:22

This Stack Overflow post might have some clues. It looks like, in the Java version at least, you can specify words that don't get stemmed. I'm not sure if that's the case in the .NET port though.

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 15, 2016 @ 14:07

Tim,

That looks like a list of stop words, not words to exclude from stemming?

Regards

Ismail

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 15, 2016 @ 13:30

Tim,

It's definitely the stemmer. I opened Luke and ran some analysis in the plugins section: for the word content, the German analyser gives the stemmed form con.

Regards

Ismail

Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Sep 16, 2016 @ 08:50

Tim,

My colleague Dawid has created an issue on GitHub: https://github.com/Shazwazza/Examine/issues/54

Regards

Ismail

Dawid 26 posts 136 karma points c-trib

Sep 16, 2016 @ 09:27

I've created a pull request for this issue in the Examine repo.

A workaround is to modify the query text after the query has been compiled (and so already tokenized and stemmed by GermanAnalyzer):

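```
// Requires: using System.Text.RegularExpressions;

// Compile the criteria; at this point GermanAnalyzer has already stemmed "content" to "con".
var rawQuery = query.Compile().ToString();

// Extract the Lucene query text from the compiled criteria string.
var queryMatch = Regex.Match(rawQuery, @"LuceneQuery: (?<query>.*?)\s*}$");
var luceneQueryText = queryMatch.Groups["query"].Value;

// Put back the full index type that the stemmer truncated.
var fixedRawQuery = Regex.Replace(luceneQueryText, "__IndexType:con$", "__IndexType:content*");

// Run the corrected text as a raw query.
var fixedCriteria = contentSearcher.CreateSearchCriteria();
fixedCriteria.RawQuery(fixedRawQuery);
var contentSearchResults = contentSearcher.Search(fixedCriteria);
```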
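
For anyone who wants to check the string handling on its own, without an index or Examine, here is a minimal sketch of just the regex part of the workaround. The class and method names (`IndexTypeQueryFixer`, `ExtractLuceneQuery`, `RestoreIndexType`) are made up for illustration and are not part of Examine or Lucene.Net. The input string is the compiled-criteria format quoted earlier in this thread. The trailing `*` on the replacement is presumably there so the term is not stemmed again when the raw query is parsed, in line with the article linked above saying wildcarded terms are not stemmed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: isolates the string manipulation from the workaround.
public static class IndexTypeQueryFixer
{
    // Pulls the Lucene query text out of a compiled criteria string such as
    // "{ SearchIndexType: content, LuceneQuery: +(...) +__IndexType:con }".
    // Returns null when the string does not have that shape.
    public static string ExtractLuceneQuery(string compiledCriteria)
    {
        var match = Regex.Match(compiledCriteria, @"LuceneQuery: (?<query>.*?)\s*\}$");
        return match.Success ? match.Groups["query"].Value : null;
    }

    // Replaces a trailing, stemmed "__IndexType:con" clause with the full value.
    // A query that already ends in "__IndexType:content" is left unchanged,
    // because the pattern only matches "con" at the very end of the text.
    public static string RestoreIndexType(string luceneQuery)
    {
        return Regex.Replace(luceneQuery, @"__IndexType:con$", "__IndexType:content*");
    }

    public static void Main()
    {
        var compiled =
            "{ SearchIndexType: content, LuceneQuery: +(+(contents:universal*)) +__IndexType:con }";

        var query = ExtractLuceneQuery(compiled);
        Console.WriteLine(RestoreIndexType(query));
    }
}
```

Note that this only handles the `__IndexType:con` clause when it sits at the end of the query, as it does in the queries shown in this thread; a clause in another position would need a different pattern.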
