examine case insensitive keyword search

Barry Fogarty 493 posts 1129 karma points

Nov 05, 2011 @ 16:15

Examine case insensitive keyword search

I am using the keyword analyser as I need to include stop words in the query. However I also need it to be case-insensitive.

As a bonus I would like to have partial word match e.g. 'IT dev' would return a result titled 'IT Development'

Playing with Luke, neither of these seems possible with the keyword analyser, e.g.

(jobRole:"IT dev*")
(jobRole:"IT development")
(jobRole:"IT Dev*")

- none of these return the desired result

(jobRole:IT*)
(jobRole:IT Development)

I am also quite confused about 2 word searches in general - in luke, if I try to search

(jobRole:IT Development)

it parses the search as jobRole:IT __IndexType:Developer

(where __IndexType is the default search field in luke). I can wrap the query in quotes in luke but this does not happen when compiling the filter in code.


Barry Fogarty 493 posts 1129 karma points

Nov 05, 2011 @ 16:21

Still can't edit my posts! Meant to add:

I seem to have better success in general with the Standard Analyser - would it be easier to manually include 'IT' in a list of overriding 'include' words somehow?


Tim 1193 posts 2675 karma points MVP 4x c-trib

Nov 07, 2011 @ 12:51

Hiya,

I have a feeling that the keyword analyser is case sensitive. You could maybe write an indexing event handler to convert all the values to lower case as it indexes them, and then convert the search term to lower case as well.

I've only really just started out using Lucene properly, and I recommend the Lucene in Action book (latest edition). It's geared towards the Java implementation, but a lot of the examples are relevant to the .NET version too. It also does a very good job of explaining the different types of analysers etc.

Ismail Mayat is probably the best person to ask Lucene questions to, he's done some very advanced stuff with it, and he helped me to get a multi-index search working a few months ago.

:)


Barry Fogarty 493 posts 1129 karma points

Nov 07, 2011 @ 18:57

Thanks for the tips Tim. I would have thought there is an analyser available that can ignore case but include stop words. I hope someone like Ismail or Slace can advise on the correct analyser, or a way to force in a stop word like 'IT'


Shannon Deminick 1530 posts 5278 karma points MVP 3x

Nov 07, 2011 @ 23:07

You can easily create your own analyzer by overriding an existing one. IIRC you can also set stop words on an analyzer like StandardAnalyzer using a statically available property.

If you want case insensitive, use an analyzer that lower cases input like the StandardAnalyzer, then when you search just ToLower() your search terms. If you want case sensitive, then you'll need to use an analyzer like KeywordAnalyzer that doesn't change the case when it gets analyzed and then don't change the casing of your search term.

If you want the best of both worlds, then you can use a case sensitive analyzer and use Examine events to make duplicate fields that are lowercased.

Also make sure you are using the latest version of Examine.
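
A minimal sketch of the duplicate-field idea, assuming the fields gathered for a node are available as a string dictionary at indexing time. The class name, method names, and the `_lower` suffix are hypothetical, and hooking this into an actual Examine indexing event is not shown here.

```csharp
using System.Collections.Generic;

public static class LowercaseFieldCopier
{
    // Hypothetical suffix for the lowercased copy of a field.
    public const string Suffix = "_lower";

    // Adds a lowercased copy of each named field, so the original
    // (case-sensitive) value and a case-insensitive variant both get indexed.
    public static void AddLowercaseCopies(IDictionary<string, string> fields, IEnumerable<string> fieldNames)
    {
        foreach (var name in fieldNames)
        {
            string value;
            if (fields.TryGetValue(name, out value) && value != null)
            {
                fields[name + Suffix] = value.ToLowerInvariant();
            }
        }
    }

    // Search terms need the same normalisation before querying the copy.
    public static string NormalizeTerm(string term)
    {
        return (term ?? string.Empty).Trim().ToLowerInvariant();
    }
}
```

With this in place, a search against the lowercased copy would use `NormalizeTerm(userInput)`, while an exact-case search would keep using the original field.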


Barry Fogarty 493 posts 1129 karma points

Nov 08, 2011 @ 00:55

Thanks Shannon, either of your solutions (custom analyser or setting stop words) should work in my case - can you point me to any resources that might help get me started? Setting stop words sounds simpler, literally I just need to allow the term 'IT'.


Shannon Deminick 1530 posts 5278 karma points MVP 3x

Nov 08, 2011 @ 01:04

In your global.asax on app startup you can modify the StandardAnalyzer's stop word set, which is a C# Hashtable. So if you want to remove the "IT" stop word, you'll need to find it in the Hashtable and remove it. The static property you're after is:

Lucene.Net.Analysis.Standard.StandardAnalyzer.STOP_WORDS_SET

Otherwise you can override the StandardAnalyzer and pass in your own stop words to its ctor.
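
A rough sketch of building your own stop word table that leaves out 'it'. It uses only the .NET base library. The word list is what I understand Lucene's default English list to be, so check it against your Lucene.Net version. The class and method names are made up, and storing each word as both key and value is an assumption about how the table gets read. Words to keep are compared in lower case because StandardAnalyzer lowercases tokens before stop words are filtered.

```csharp
using System;
using System.Collections;

public static class MemberStopWords
{
    // Assumed copy of Lucene's default English stop words (verify for your version).
    private static readonly string[] EnglishStopWords =
    {
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
        "the", "their", "then", "there", "these", "they", "this", "to", "was",
        "will", "with"
    };

    // Builds a stop word table that omits the words you want to stay searchable.
    public static Hashtable Build(params string[] keep)
    {
        var table = new Hashtable();
        foreach (var word in EnglishStopWords)
        {
            bool keepWord = Array.Exists(keep, k => string.Equals(k, word, StringComparison.OrdinalIgnoreCase));
            if (!keepWord)
            {
                table[word] = word;
            }
        }
        return table;
    }
}
```

The result of `MemberStopWords.Build("IT")` could then be handed to the StandardAnalyzer ctor that takes a stop word set. Note that this also makes the ordinary pronoun "it" searchable.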


Barry Fogarty 493 posts 1129 karma points

Nov 08, 2011 @ 04:34

Thanks for your help with this Shannon.

1) Started out in Global.asax but Lucene.Net.Analysis.Standard.StandardAnalyzer.STOP_WORDS_SET - returns a NULL in Application_Start.

2) So I tried to create my own custom analyser extending StandardAnalyzer, but I don't know where I am going wrong and I could not locate any useful examples of this.

public class MyMemberAnalyzer : StandardAnalyzer
    {
        
        public MyMemberAnalyzer() : base(new StandardAnalyzer(Version matchVersion, TextReader stopwords)
        {
            stopSet = WordlistLoader.GetWordSet(stopwords);
            Init(matchVersion);
        }

    }

Am I on the right track?


Shannon Deminick 1530 posts 5278 karma points MVP 3x

Nov 08, 2011 @ 05:13

Here's the static ctor for the StandardAnalyzer which seems to set the STOP_WORDS_SET from the StopAnalyzer.ENGLISH_STOP_WORDS_SET:

static StandardAnalyzer()
    {
      string str = SupportClass.AppSettings.Get("Lucene.Net.Analysis.Standard.StandardAnalyzer.replaceInvalidAcronym", "true");
      StandardAnalyzer.defaultReplaceInvalidAcronym = (str == null || str.Equals("true")) && true;
      StandardAnalyzer.STOP_WORDS = StopAnalyzer.ENGLISH_STOP_WORDS;
      StandardAnalyzer.STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
    }

So that's strange that it is NULL on app startup, since the static ctor will fire before you try to access its properties/fields. I'd check to see if the StopAnalyzer.ENGLISH_STOP_WORDS_SET has values. If so, then you can just create an analyzer like:

public class MyMemberAnalyzer : StandardAnalyzer {

public MyMemberAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_29, StopAnalyzer.ENGLISH_STOP_WORDS_SET){ }

}

You'll need a parameterless ctor for Examine to instantiate it.

Strangely enough, looking at the decompiled source of Lucene, the default ctor for the StandardAnalyzer is using the older LUCENE_24 version... that's pretty strange! So you're pretty much better off doing this anyway, because at least you'll be using the most recent Lucene analyzer version.


Barry Fogarty 493 posts 1129 karma points

Nov 08, 2011 @ 13:40

Thanks again Shannon, it looks like there are values in the StopAnalyzer.ENGLISH_STOP_WORDS_SET at app start. However when I used your code and set the analyser in my ExamineSettings.config as follows:

analyzer="MyProject.Web.Classes.MyMemberAnalyzer, Lucene.Net"

I get the following error in the razor script where I perform the search:

The type initializer for 'Examine.ExamineManager' threw an exception.

Is that the right way to set the analyzer attribute? I have tried without the Lucene.Net but the result is the same.


Shannon Deminick 1530 posts 5278 karma points MVP 3x

Nov 08, 2011 @ 22:06

No, that ", Lucene.Net" is telling .NET that your class belongs in the Lucene.Net assembly; you need to put your assembly name in there.

So if your assembly is MyProject, then you'd put:

"MyProject.Web.Classes.MyMemberAnalyzer, MyProject"

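If you are unsure what the assembly part should be, a small helper like the one below can print the "Type, Assembly" string for any type. This is only a sketch using standard .NET reflection, and the helper's name is hypothetical.

```csharp
using System;

public static class AnalyzerConfigHelper
{
    // Returns "Namespace.TypeName, AssemblyName", the short form used
    // in the analyzer attribute of the config.
    public static string ShortQualifiedName(Type type)
    {
        return type.FullName + ", " + type.Assembly.GetName().Name;
    }
}
```

For example, calling `AnalyzerConfigHelper.ShortQualifiedName(typeof(MyMemberAnalyzer))` from within your project gives the value to paste into the config.
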

Barry Fogarty 493 posts 1129 karma points

Nov 08, 2011 @ 23:27

Doh! Thought that was so it could reference both assemblies. FYI, for others' reference, here is my class:

    public class MyMemberAnalyzer : StandardAnalyzer
    {
        private static TextReader stopWords = File.OpenText(@"C:\stopwords.txt");

        public MyMemberAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_29, stopWords) { }

    }

I guess it would be more performant to hardcode stop words into a hashtable, but I'm not going to worry about that right now!

Thanks again mate.. #H5YR


Lee Gunn 5 posts 26 karma points

Oct 11, 2012 @ 10:26

Hi,

I had a similar problem. I wanted to use the StandardAnalyzer but not throw away "stop words". By adding this line to Application_Start in global.asax

Lucene.Net.Analysis.StopAnalyzer.ENGLISH_STOP_WORDS_SET = new System.Collections.Hashtable();

It solved my problem.

Lee


Mike Chambers 636 posts 1253 karma points c-trib

Nov 26, 2012 @ 19:51

On the WhitespaceAnalyzer and case-insensitive search... I used fuzzy as by default it ....

var _searcher = ExamineManager.Instance.DefaultSearchProvider;

            var criteria = _searcher.CreateSearchCriteria(IndexTypes.Content, BooleanOperation.Or);

            Examine.SearchCriteria.IBooleanOperation filter = null;
            // exact phrase match - case sensitive
            filter = criteria.GroupedOr(new[] { "title", "content", "nodeName" }, searchString);
            // split on words use fuzzy to make case-insensitive
            foreach (var t in searchString.Split(' ')) { filter.Or().GroupedOr(new[] { "title", "content", "nodeName" }, t.Fuzzy(0.8f)); }

            var searchResults = _searcher.Search(filter.Compile());

My search here looks for an exact phrase match, or any document containing any of the terms (case-insensitive).

It relies on [http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F]

Are Wildcard, Prefix, and Fuzzy queries case sensitive?

No, not by default. Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query. These queries are case-insensitive anyway because QueryParser makes them lowercase. This behavior can be changed using the setLowercaseExpandedTerms(boolean) method.
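
One consequence of that FAQ entry can be illustrated without Lucene. If the index was built with an analyzer that preserves case (e.g. WhitespaceAnalyzer or KeywordAnalyzer), a prefix term that the query parser has lowercased will not match an upper-case indexed term. The code below is only a toy model of that comparison, with made-up names. It is not how Lucene implements prefix queries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixCasingDemo
{
    // Mimics a prefix query whose term is lowercased by the query parser
    // and then compared, case-sensitively, against the indexed terms.
    public static List<string> PrefixSearch(IEnumerable<string> indexedTerms, string prefix)
    {
        var lowered = prefix.ToLowerInvariant();
        return indexedTerms
            .Where(t => t.StartsWith(lowered, StringComparison.Ordinal))
            .ToList();
    }
}
```

Searching for the prefix "IK" against an index holding "IKEA" finds nothing in this model, while the same search against an index holding "ikea" does. So the lowercasing only gives case-insensitive results when the indexed terms are lower case as well.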


David Conlisk 432 posts 1008 karma points

Apr 10, 2013 @ 18:00

Using Mike's suggestion, I added a call to Fuzzy which made my search case-insensitive, even though it's using the WhitespaceAnalyzer. Using the value 0.4f also meant that it matched words with small spelling errors, worth experimenting with.

var criteria = ExamineManager.Instance.SearchProviderCollection["ContactSearcher"].CreateSearchCriteria(UmbracoExamine.IndexTypes.Content);

var filter = criteria.GroupedOr(new[] { "fullName", "email" }, SearchTerm.Fuzzy(0.4f)).Compile();

Results = ExamineManager.Instance.SearchProviderCollection["ContactSearcher"].Search(filter);


Simon Dingley 1474 posts 3451 karma points c-trib

Apr 19, 2013 @ 14:45

I'm not seeing the same result. I am looking for case-insensitive searching and have the following:

var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var criteria = searcher.CreateSearchCriteria(UmbracoExamine.IndexTypes.Content);

criteria.NodeTypeAlias("Organisation").And().NodeName(string.Format("{0}*", term).Fuzzy()).Compile();

var results = searcher.Search(criteria);

This results in the following Lucene query:

+(+__NodeTypeAlias:organisation +nodeName:ikea*~0.5) +__IndexType:content

This produces no results yet if I run this with the exact casing in Luke I get the expected result:

+(+__NodeTypeAlias:organisation +nodeName:IKEA*~0.5) +__IndexType:content

Each time I use Examine it's a fight, great when it works but usually a hard slog getting there.


Shannon Deminick 1530 posts 5278 karma points MVP 3x

Apr 19, 2013 @ 17:30

Hi @Simon, unfortunately TheFARM took down FarmCode.org which had a lot of great Examine references and 'how tos'. Luckily they gave me the source of that and I've re-posted all of those blogs posts to my site. This may (or may not:) help you:

http://shazwazza.com/post/Text-casing-and-Examine

The Examine project is now starting to get some much needed TLC, and UmbracoExamine.dll is now part of the 6.1 core, so there will be some big leaps being made in regards to using Examine. I also plan on completely upgrading the documentation on using Examine and putting it on the regular Umbraco docs on Our. The casing stuff for Examine is simply based on how Lucene deals with queries. What we may end up doing is writing our own analyzer(s) that cater for most of the things people want to do with Examine, and hopefully that could iron out many of these discrepancies. Also note that Examine will let you search using raw Lucene markup if you want to use that syntax instead.

Cheers,
Shan


MrFlo 159 posts 403 karma points

Dec 03, 2015 @ 21:48

I managed to get this custom analyser working, but I had to set it in ExamineSettings.config on the searcher AND on the indexer. This could be useful for those struggling:

  <ExamineIndexProviders>
    <providers>
      <add name="MyIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"  analyzer="MyProject.Web.Classes.MyMemberAnalyzer, MyProject" />
    </providers>
  </ExamineIndexProviders>

 <ExamineSearchProviders defaultProvider="ExternalSearcher">
    <providers>
      <add name="MySearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
       analyzer="MyProject.Web.Classes.MyMemberAnalyzer, MyProject" />
    </providers>
  </ExamineSearchProviders>

