  • Barry Fogarty 493 posts 1129 karma points
    Nov 05, 2011 @ 16:15
    Barry Fogarty
    0

    Examine case insensitive keyword search

    I am using the keyword analyser as I need to include stop words in the query.  However I also need it to be case-insensitive.

    As a bonus I would like to have partial word match e.g. 'IT dev' would return a result titled 'IT Development'

    Playing with Luke, neither of these seems possible with the keyword analyser, e.g.

    (jobRole:"IT dev*")
    (jobRole:"IT development")
    (jobRole:"IT Dev*")

    - none of these return the desired result

    (jobRole:IT*)
    (jobRole:IT Development)

     

    I am also quite confused about 2 word searches in general - in luke, if I try to search

    (jobRole:IT Development)

    it parses the search as jobRole:IT __IndexType:Developer

    (where __IndexType is the default search field in luke).  I can wrap the query in quotes in luke but this does not happen when compiling the filter in code.

     

     

  • Barry Fogarty 493 posts 1129 karma points
    Nov 05, 2011 @ 16:21
    Barry Fogarty
    0

    Still can't edit my posts!  Meant to add:

    I seem to have better success in general with the Standard Analyser - would it be easier to manually include 'IT' in a list of overriding 'include' words somehow?

  • Tim 1193 posts 2675 karma points MVP 2x c-trib
    Nov 07, 2011 @ 12:51
    Tim
    0

    Hiya,

    I have a feeling that the keyword analyser is case sensitive. You could maybe write an indexing event handler to convert all the values to lower case as it indexes them, and then convert the search term to lower case as well.

    I've only really just started out using Lucene properly, and I recommend the Lucene in Action book (latest edition). It's geared towards the Java implementation, but a lot of the examples are relevant to the .NET version too. It also does a very good job of explaining the different types of analysers etc.

    Ismail Mayat is probably the best person to ask Lucene questions to, he's done some very advanced stuff with it, and he helped me to get a multi-index search working a few months ago.

    :)

  • Barry Fogarty 493 posts 1129 karma points
    Nov 07, 2011 @ 18:57
    Barry Fogarty
    0

    Thanks for the tips Tim.  I would have thought there is an analyser available that can ignore case but include stop words.  I hope someone like Ismail or Slace can advise on the correct analyser, or a way to force in a stop word like 'IT'

  • Shannon Deminick 1523 posts 5256 karma points MVP
    Nov 07, 2011 @ 23:07
    Shannon Deminick
    0

    You can easily create your own analyzer by overriding an existing one. IIRC you can also set stop words on an analyzer like StandardAnalyzer using a statically available property.

    If you want case insensitive, use an analyzer that lower cases input like the StandardAnalyzer, then when you search just ToLower() your search terms. If you want case sensitive, then you'll need to use an analyzer like KeywordAnalyzer that doesn't change the case when it gets analyzed and then don't change the casing of your search term.

    If you want the best of both worlds, then you can use a case sensitive analyzer and use Examine events to make duplicate fields that are lowercased.

    Also make sure you are using the latest version of Examine.
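
    As a rough illustration of the duplicate lowercased field idea, here is a minimal sketch. It assumes Examine's GatheringNodeData event and the Fields dictionary on IndexingNodeDataEventArgs; the index provider name "MyIndexer" and the field names "jobRole" and "jobRoleLower" are placeholders, not anything confirmed in this thread.

    ```csharp
    using System;
    using Examine;

    public static class LowercaseFieldIndexing
    {
        // Call once at application start-up, e.g. from Application_Start.
        public static void Register()
        {
            // "MyIndexer" is a placeholder; use the indexer name from ExamineSettings.config.
            var indexer = ExamineManager.Instance.IndexProviderCollection["MyIndexer"];
            indexer.GatheringNodeData += AddLowercaseCopy;
        }

        // Adds a lowercased copy of "jobRole" so the original casing stays available.
        private static void AddLowercaseCopy(object sender, IndexingNodeDataEventArgs e)
        {
            string jobRole;
            if (e.Fields.TryGetValue("jobRole", out jobRole) && !string.IsNullOrEmpty(jobRole))
            {
                // "jobRoleLower" is a hypothetical field name used only for searching.
                e.Fields["jobRoleLower"] = jobRole.ToLowerInvariant();
            }
        }
    }
    ```

    At search time the term would then be lowercased the same way (term.ToLowerInvariant()) and matched against the lowercased field rather than the original one. The index needs rebuilding before the new field shows up.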

  • Barry Fogarty 493 posts 1129 karma points
    Nov 08, 2011 @ 00:55
    Barry Fogarty
    0

    Thanks Shannon, either of your solutions (custom analyser or setting stop words) should work in my case - can you point me to any resources that might help get me started?  Setting stop words sounds simpler; literally I just need to allow the term 'IT'.

  • Shannon Deminick 1523 posts 5256 karma points MVP
    Nov 08, 2011 @ 01:04
    Shannon Deminick
    0

    In your global.asax on app startup you can modify the StandardAnalyzer's stop word set, which is a C# Hashtable, so if you want to remove the "IT" stop word, then you'll need to find it in the Hashtable and remove it. The static property you're after is:

    Lucene.Net.Analysis.Standard.StandardAnalyzer.STOP_WORDS_SET

    Otherwise you can override the StandardAnalyzer and pass in your own stop words to its ctor.
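
    A sketch of that second option, for Lucene.Net 2.9.x as used in this thread. The class name KeepItAnalyzer is made up, and it assumes that StopFilter.MakeStopSet(string[]) returns a Hashtable and that StopAnalyzer.ENGLISH_STOP_WORDS is still exposed as a string array in your version (it may be flagged obsolete). StandardAnalyzer lowercases tokens before the stop filter runs, so the word to leave out is "it" rather than "IT".

    ```csharp
    using System;
    using System.Collections;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    // Hypothetical analyzer: the default English stop words, minus "it".
    public class KeepItAnalyzer : StandardAnalyzer
    {
        // Parameterless constructor so that it can be created from configuration.
        public KeepItAnalyzer()
            : base(Lucene.Net.Util.Version.LUCENE_29, BuildStopWords())
        {
        }

        private static Hashtable BuildStopWords()
        {
            // Copy the default list, leaving out the one word we want to keep searchable.
            string[] kept = Array.FindAll(
                StopAnalyzer.ENGLISH_STOP_WORDS,
                word => word != "it");

            // Assumption: MakeStopSet(string[]) exists with this shape in 2.9.x.
            return StopFilter.MakeStopSet(kept);
        }
    }
    ```

    Building a new set avoids mutating a static that other analyzers in the same application may share.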

     

  • Barry Fogarty 493 posts 1129 karma points
    Nov 08, 2011 @ 04:34
    Barry Fogarty
    0

    Thanks for your help with this Shannon. 

    1) Started out in Global.asax  but Lucene.Net.Analysis.Standard.StandardAnalyzer.STOP_WORDS_SET - returns a NULL in Application_Start. 

    2) So I tried to create my own custom analyser extending StandardAnalyzer, but I don't know where I am going wrong and I could not locate any useful examples of this.

    public class MyMemberAnalyzer : StandardAnalyzer
        {
           
            public MyMemberAnalyzer() : base(new StandardAnalyzer(Version matchVersion, TextReader stopwords)
            {
                stopSet = WordlistLoader.GetWordSet(stopwords);
                Init(matchVersion);
            }

        }

    Am I on the right track?

     

  • Shannon Deminick 1523 posts 5256 karma points MVP
    Nov 08, 2011 @ 05:13
    Shannon Deminick
    0

    Here's the static ctor for the StandardAnalyzer which seems to set the STOP_WORDS_SET from the StopAnalyzer.ENGLISH_STOP_WORDS_SET:

    static StandardAnalyzer()
        {
          string str = SupportClass.AppSettings.Get("Lucene.Net.Analysis.Standard.StandardAnalyzer.replaceInvalidAcronym", "true");
          StandardAnalyzer.defaultReplaceInvalidAcronym = (str == null || str.Equals("true")) && true;
          StandardAnalyzer.STOP_WORDS = StopAnalyzer.ENGLISH_STOP_WORDS;
          StandardAnalyzer.STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
        }

    So that's strange that it is NULL on app startup, since the static ctor will fire before you try to access its properties/fields, so I'd check to see if StopAnalyzer.ENGLISH_STOP_WORDS_SET has values. If so then you can just create an analyzer like:

    public class MyMemberAnalyzer : StandardAnalyzer {

    public MyMemberAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_29, StopAnalyzer.ENGLISH_STOP_WORDS_SET){ }

    }

     

    You'll need a parameterless ctor for Examine to instantiate it.

    Strangely enough, looking at the decompiled source of Lucene, the default ctor for the StandardAnalyzer is using the older LUCENE_24 version... that's pretty strange! So you're pretty much better off doing this anyways cuz at least you'll be using the most recent Lucene analyzer version.

  • Barry Fogarty 493 posts 1129 karma points
    Nov 08, 2011 @ 13:40
    Barry Fogarty
    0

    Thanks again Shannon, it looks like there are values in the StopAnalyzer.ENGLISH_STOP_WORDS_SET at app start.  However when I used your code and set the analyser in my ExamineSettings.config as follows:

    analyzer="MyProject.Web.Classes.MyMemberAnalyzer, Lucene.Net"

    I get the following error in the razor script where I perform the search:

    The type initializer for 'Examine.ExamineManager' threw an exception.

    Is that the right way to set the analyzer attribute?  I have tried without the Lucene.Net but the result is the same.

  • Shannon Deminick 1523 posts 5256 karma points MVP
    Nov 08, 2011 @ 22:06
    Shannon Deminick
    0

    No, that ", Lucene.Net" is telling .NET that your class belongs in the Lucene.Net assembly; you need to put your own assembly name in there.

    So if your assembly is MyProject, then you'd put:

    "MyProject.Web.Classes.MyMemberAnalyzer, MyProject"

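    If it is unclear what the assembly name actually is, one way to check is to let .NET build the "Type, Assembly" string. This is only a small helper sketch; the AnalyzerTypeName class is made up for illustration.

    ```csharp
    using System;

    public static class AnalyzerTypeName
    {
        // Builds the "Namespace.TypeName, AssemblyName" string for a type,
        // which is the form the analyzer attribute above uses.
        public static string For(Type type)
        {
            return type.FullName + ", " + type.Assembly.GetName().Name;
        }
    }
    ```

    Calling AnalyzerTypeName.For(typeof(MyMemberAnalyzer)) from within the project and logging the result gives a value that can be pasted into the analyzer attribute.
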
  • Barry Fogarty 493 posts 1129 karma points
    Nov 08, 2011 @ 23:27
    Barry Fogarty
    1

    Doh!  Thought that was so it could reference both assemblies.  FYI for others' reference, here is my class:

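        // Needs: using System.IO; using Lucene.Net.Analysis.Standard;
        public class MyMemberAnalyzer : StandardAnalyzer
        {
            // Open a fresh reader per instance: a shared static reader is exhausted
            // once the first analyzer has read the stop word file.
            public MyMemberAnalyzer()
                : base(Lucene.Net.Util.Version.LUCENE_29, File.OpenText(@"C:\stopwords.txt")) { }

        }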

    I guess it would be more performant to hardcode stop words into a hashtable, but I'm not going to worry about that right now!

    Thanks again mate.. #H5YR

  • Lee Gunn 5 posts 26 karma points
    Oct 11, 2012 @ 10:26
    Lee Gunn
    0

    Hi,

    I had a similar problem. I wanted to use the StandardAnalyzer but not throw away "stop words". Adding this line to Application_Start in global.asax solved my problem:

    Lucene.Net.Analysis.StopAnalyzer.ENGLISH_STOP_WORDS_SET = new System.Collections.Hashtable();

    Lee 

  • Mike Chambers 631 posts 1226 karma points c-trib
    Nov 26, 2012 @ 19:51
    Mike Chambers
    1

    On the WhitespaceAnalyzer and case-insensitive search... I used fuzzy as by default it ....

    var _searcher = ExamineManager.Instance.DefaultSearchProvider;

    var criteria = _searcher.CreateSearchCriteria(IndexTypes.Content, BooleanOperation.Or);

    Examine.SearchCriteria.IBooleanOperation filter = null;
    // exact phrase match - case sensitive
    filter = criteria.GroupedOr(new[] { "title", "content", "nodeName" }, searchString);
    // split on words, use fuzzy to make case-insensitive
    foreach (var t in searchString.Split(' '))
    {
        filter.Or().GroupedOr(new[] { "title", "content", "nodeName" }, t.Fuzzy(0.8f));
    }

    var searchResults = _searcher.Search(filter.Compile());

    My search here looks for an exact phrase match or any document containing any of the terms (case insensitive).

     

    It relies on [http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F]

    Are Wildcard, Prefix, and Fuzzy queries case sensitive?

    No, not by default. Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing. The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query. These queries are case-insensitive anyway because QueryParser makes them lowercase. This behavior can be changed using the setLowercaseExpandedTerms(boolean) method.

  • David Conlisk 432 posts 1008 karma points
    Apr 10, 2013 @ 18:00
    David Conlisk
    0

    Using Mike's suggestion, I added a call to Fuzzy which made my search case-insensitive, even though it's using the WhitespaceAnalyzer. Using the value 0.4f also meant that it matched words with small spelling errors, worth experimenting with.

     

     

    var criteria = ExamineManager.Instance.SearchProviderCollection["ContactSearcher"].CreateSearchCriteria(UmbracoExamine.IndexTypes.Content);

    var filter = criteria.GroupedOr(new[] { "fullName", "email" }, SearchTerm.Fuzzy(0.4f)).Compile();

    Results = ExamineManager.Instance.SearchProviderCollection["ContactSearcher"].Search(filter);

     

  • Simon Dingley 1457 posts 3408 karma points c-trib
    Apr 19, 2013 @ 14:45
    Simon Dingley
    0

    I'm not seeing the same result. I am looking for case-insensitive searching and have the following:

    var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
    var criteria = searcher.CreateSearchCriteria(UmbracoExamine.IndexTypes.Content);
    
    criteria.NodeTypeAlias("Organisation").And().NodeName(string.Format("{0}*", term).Fuzzy()).Compile();
    
    var results = searcher.Search(criteria);

    This results in the following Lucene query:

    +(+__NodeTypeAlias:organisation +nodeName:ikea*~0.5) +__IndexType:content

    This produces no results yet if I run this with the exact casing in Luke I get the expected result:

    +(+__NodeTypeAlias:organisation +nodeName:IKEA*~0.5) +__IndexType:content

    Each time I use Examine it's a fight, great when it works but usually a hard slog getting there.

  • Shannon Deminick 1523 posts 5256 karma points MVP
    Apr 19, 2013 @ 17:30
    Shannon Deminick
    0

    Hi @Simon, unfortunately TheFARM took down FarmCode.org, which had a lot of great Examine references and 'how tos'. Luckily they gave me the source of that and I've re-posted all of those blog posts to my site. This may (or may not :) help you:

    http://shazwazza.com/post/Text-casing-and-Examine

    The Examine project is now starting to get some much needed TLC and UmbracoExamine.dll is now part of the 6.1 core, so there will be some big leaps being made in regards to using Examine. I also plan on completely upgrading the documentation on using Examine and putting it on the regular Umbraco docs on Our. The casing stuff for Examine is simply based on how Lucene deals with queries. What we may end up doing is writing our own analyzer(s) that cater for most of the things people want to do with Examine, and hopefully that could iron out many of these discrepancies. Also note that Examine will let you search using raw Lucene markup if you want to use that syntax instead.

    Cheers,
    Shan 
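
    For the raw Lucene markup route, a minimal sketch could look like the following. It assumes the RawQuery method on Examine's ISearchCriteria and reuses the "ExternalSearcher" provider name from the post above; RawQuerySearch is a made-up helper, and whether RawQuery is available depends on the Examine version installed.

    ```csharp
    using Examine;
    using Examine.SearchCriteria;

    public static class RawQuerySearch
    {
        // Runs a hand-written Lucene query string against the named searcher.
        public static ISearchResults Run(string luceneQuery)
        {
            var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];

            // RawQuery hands the string to Lucene's query parser as-is,
            // so the usual query parser and analyzer rules still apply to it.
            ISearchCriteria criteria = searcher.CreateSearchCriteria().RawQuery(luceneQuery);

            return searcher.Search(criteria);
        }
    }
    ```

    A query that behaves as expected in Luke can be passed in unchanged, which at least makes it easier to compare what Luke and Examine each do with the same string.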

     

  • MrFlo 158 posts 402 karma points
    Dec 03, 2015 @ 21:48
    MrFlo
    0

    I managed to get this custom analyser working, but I had to set it in ExamineSettings.config on the searcher AND on the indexer. This could be useful for those struggling:

      <ExamineIndexProviders>
        <providers>
          <add name="MyIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
               analyzer="MyProject.Web.Classes.MyMemberAnalyzer, MyProject" />
        </providers>
      </ExamineIndexProviders>

      <ExamineSearchProviders defaultProvider="ExternalSearcher">
        <providers>
          <add name="MySearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
               analyzer="MyProject.Web.Classes.MyMemberAnalyzer, MyProject" />
        </providers>
      </ExamineSearchProviders>
    