
  • Dominic Resch 45 posts 115 karma points
    Oct 20, 2023 @ 11:33

    How to improve search

    Greetings,

    I followed the example below and am quite dissatisfied with the result:

    https://docs.umbraco.com/umbraco-cms/reference/searching/examine/quick-start

    The search basically works. If I search for "Laura", I find "Laura Weatherhead". If I search for "Ravi", I find "Ravi Motha".

    But if I search for "Jan", for example, I only get the entry for "Jan Skovgaard"; "Janae Cram" is ignored. A search for "J" also finds nothing. Using "*" as a wildcard, e.g. "J*", returns no results either.

    Now I had a look at the query syntax of Lucene and also older posts in the forum (from 2010, where the linked web pages to some blog posts just don't exist anymore) and found the SearchExtensions "MultipleCharacterWildcard" and "Fuzzy".

    With MultipleCharacterWildcard the whole thing looks better. If I search for "J", I now find 3 entries. But if I search for "C", I find only "Janae Cram" and not "Erica Quessenberry". So, if I understand it correctly, Lucene cannot provide a "Contains" function, correct? The search term must be the beginning of an indexed word, otherwise nothing is found, correct?

    Am I doing something wrong here? I just keep reading that the search is supposed to be so powerful, but I really can't subscribe to that with such results currently.

    Are there any alternatives? Is it simply not possible to use Lucene/Examine? What does another solution look like?

    Unfortunately, I haven't received a reply to any of my posts yet either. And unfortunately you can't see how many users have viewed a post, so I can't tell whether I need to change the text because it isn't understandable.

    So, if the text here is incomprehensible, then please let me know, then I can try to change it.

    So, if I change "SearchContentNames" as follows, I get the result I would actually expect:

        public IEnumerable<IPublishedContent> SearchContentNames(string query)
        {
            IEnumerable<string> ids = Array.Empty<string>();
            if (!string.IsNullOrWhiteSpace(query) && examineManager.TryGetIndex(UmbracoIndexes.ExternalIndexName, out var index))
            {
                ids = index
                    .Searcher
                    .CreateQuery("content")
                    .NodeTypeAlias("person")
                    // Wildcard clause disabled; it only matches from the start of a word.
                    //.And()
                    //    .Field("nodeName", SearchExtensions.MultipleCharacterWildcard(query))
                    .Execute()
                    // In-memory "contains" filter over every "person" result.
                    .Where(x => x.Values.TryGetValue("nodeName", out var name)
                        && name.Contains(query, StringComparison.OrdinalIgnoreCase))
                    .Select(x => x.Id);
            }

            foreach (var id in ids)
            {
                // Skip ids that no longer resolve to published content.
                if (umbracoHelper.Content(id) is { } content)
                {
                    yield return content;
                }
            }
        }
    

    But is that the way it is supposed to be? Is there a better solution for this?

  • Marc Goodson 2155 posts 14408 karma points MVP 9x c-trib
    Oct 21, 2023 @ 22:52

    Hi Dominic

    It depends on what you are trying to achieve with your site and search functionality as to whether Lucene is a good fit for your requirements.

    A lot of people expect a Lucene query to work a bit like a database query, where your search term is used to directly search the text of a page, and you therefore expect to easily be able to match parts of words, if they match the letters in your search term...

    But Lucene is trying to be much cleverer than this. It builds an inverted index of the words in your documents, and it is this index that is searched for 'relevance', rather than the documents being scanned for matching letter combinations.

    This is very powerful when the search terms are different words, and you are trying to find the most relevant documents from a large set of documents quickly.

    But if you are building a search to find words that contain specific letters of the alphabet, that is not what you get out of the box: the Standard Analyzer builds an inverted index of whole words, it does not index each letter. (It also leaves common English stop words such as 'it', 'and', 'a' and 'the' out of the index, because people won't be searching for 'The'.)
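
    As a rough, self-contained sketch of what "an inverted index of whole words" means (this is not Lucene's actual implementation; the class name, the tiny stop-word list and the splitting rules are assumptions made up for illustration):

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class InvertedIndexSketch
    {
        // Hypothetical stop-word list; a real analyzer's list is longer.
        private static readonly HashSet<string> StopWords =
            new HashSet<string> { "a", "and", "it", "the" };

        // Lowercase the text, split it into words, and drop stop words.
        public static IEnumerable<string> Tokenize(string text) =>
            text.ToLowerInvariant()
                .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries)
                .Where(word => !StopWords.Contains(word));

        // Map each whole term to the set of documents that contain it.
        public static Dictionary<string, SortedSet<string>> Build(IEnumerable<string> documents)
        {
            var index = new Dictionary<string, SortedSet<string>>();
            foreach (var document in documents)
            {
                foreach (var term in Tokenize(document))
                {
                    if (!index.TryGetValue(term, out var postings))
                    {
                        postings = new SortedSet<string>();
                        index[term] = postings;
                    }
                    postings.Add(document);
                }
            }
            return index;
        }

        // A plain term query is a lookup on the whole term, not on its letters.
        public static List<string> Search(Dictionary<string, SortedSet<string>> index, string term) =>
            index.TryGetValue(term.ToLowerInvariant(), out var postings)
                ? postings.ToList()
                : new List<string>();
    }
    ```

    With the names "Jan Skovgaard", "Janae Cram" and "Erica Quessenberry" indexed this way, looking up "Jan" returns only "Jan Skovgaard", and looking up "J" returns nothing, because "j" was never stored as a term.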

    There are different analyzers that index following different rules and you can write your own analyzers...

    Generally speaking, though, out of the box the Examine/Lucene search with the standard analyzer is pretty good at querying different search terms and at boosting results depending on the field in which a match occurs. For example, a match in a title field can be considered more relevant than a match in body text, a match in both even more relevant, or blog posts more relevant than text pages, etc... You can build up quite complex search criteria for a specific site...

    So I wouldn't expect if the search term was J that all documents containing words containing the letter J would be returned...

    When you add the Where after Execute, you are filtering all the search results in memory. If there are only a few results this will be fine, but if you are filtering millions of results it won't be as performant as letting Lucene do the matching.

    Lucene indexes Janae and Cram as two different words, and the same goes for Erica and Quessenberry. Your MultipleCharacterWildcard search matches from the start of each word, so the C matches Cram because that starts with C, whereas Erica starts with an E and Quessenberry with a Q... Try renaming Quessenberry so that it starts with a C, or search for Que, and you should match Erica.
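
    To make the "from the start of each word" behaviour concrete, here is a minimal sketch in plain C# (no Examine involved; the class and method names are hypothetical, and a real index works on analyzed terms rather than on the raw names):

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class PrefixMatchSketch
    {
        // Return the names in which at least one word starts with the prefix,
        // mimicking a trailing wildcard such as "c*" applied to each indexed word.
        public static List<string> StartsWithAnyWord(IEnumerable<string> names, string prefix) =>
            names
                .Where(name => name
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                    .Any(word => word.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)))
                .ToList();
    }
    ```

    Run against "Jan Skovgaard", "Janae Cram" and "Erica Quessenberry", the prefix "C" yields only "Janae Cram", while "Que" yields "Erica Quessenberry".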

    Often, you'll need to combine criteria in your Examine search to try and get the best of both worlds, eg search for 'Jan' with the MultipleCharacterWildcard Or 'Jan' without the wildcard...

    This would match both Janae And Jan, but because the 2nd OR criteria would just match Jan, the Jan result would have a higher relevance score from matching both parts of the query, than Janae which only matches one part...
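
    A toy illustration of that ordering effect, assuming a made-up score of one point per matching Or clause (Lucene's real scoring formula is considerably more involved, and the names below are hypothetical):

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class OrScoringSketch
    {
        // Count how many of the two Or clauses a name satisfies:
        // clause 1 is an exact word match, clause 2 is a prefix (wildcard) match.
        private static int CountMatchingClauses(string name, string term)
        {
            var words = name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            var exact = words.Any(w => w.Equals(term, StringComparison.OrdinalIgnoreCase));
            var prefix = words.Any(w => w.StartsWith(term, StringComparison.OrdinalIgnoreCase));
            return (exact ? 1 : 0) + (prefix ? 1 : 0);
        }

        // Keep every name that matches at least one clause, best score first.
        public static List<(string Name, int Score)> Rank(IEnumerable<string> names, string term) =>
            names
                .Select(name => (Name: name, Score: CountMatchingClauses(name, term)))
                .Where(hit => hit.Score > 0)
                .OrderByDescending(hit => hit.Score)
                .ToList();
    }
    ```

    Ranking "Janae Cram", "Jan Skovgaard" and "Erica Quessenberry" for 'Jan' puts "Jan Skovgaard" first with two matching clauses and "Janae Cram" second with one; "Erica Quessenberry" matches neither clause and is dropped.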

    So the power comes from it not being a binary 'matches' or 'not matches' a complex query, but that it scores how well it matches...

    ... But that's not an advantage if you expect or want a binary outcome...

    Most alternatives, like Elasticsearch or Azure Cognitive Search, for which Umbraco has plugins, are also wrappers around Lucene.

    It's hard to know what would be best for single letter searches or part word matches... There is talk here of a 'Shingle Filter' that might be an approach.. https://stackoverflow.com/questions/5484965/howto-perform-a-contains-search-rather-than-starts-with-using-lucene-net
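
    One commonly described way to get 'contains' behaviour out of an inverted index is to index character n-grams, ie every short run of letters inside each word. Note that this is a different technique from the Shingle Filter mentioned above, and the sketch below is plain C# rather than a Lucene analyzer; the class name and the size limits are assumptions for illustration only:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class NGramContainsSketch
    {
        // Yield every substring of exactly `size` characters in the word.
        public static IEnumerable<string> NGrams(string word, int size)
        {
            for (var start = 0; start + size <= word.Length; start++)
            {
                yield return word.Substring(start, size);
            }
        }

        // Index every n-gram of length minSize..maxSize of every word in each name.
        public static Dictionary<string, SortedSet<string>> Build(
            IEnumerable<string> names, int minSize, int maxSize)
        {
            var index = new Dictionary<string, SortedSet<string>>();
            foreach (var name in names)
            {
                var words = name.ToLowerInvariant()
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                foreach (var word in words)
                {
                    for (var size = minSize; size <= maxSize; size++)
                    {
                        foreach (var gram in NGrams(word, size))
                        {
                            if (!index.TryGetValue(gram, out var postings))
                            {
                                postings = new SortedSet<string>();
                                index[gram] = postings;
                            }
                            postings.Add(name);
                        }
                    }
                }
            }
            return index;
        }

        // A fragment matches if it was stored as an n-gram of some word.
        // Fragments longer than maxSize are never found by this simple sketch.
        public static List<string> Search(Dictionary<string, SortedSet<string>> index, string fragment) =>
            index.TryGetValue(fragment.ToLowerInvariant(), out var postings)
                ? postings.ToList()
                : new List<string>();
    }
    ```

    With n-grams of 1 to 3 characters built from "Jan Skovgaard", "Janae Cram" and "Erica Quessenberry", a search for "C" finds both "Erica Quessenberry" and "Janae Cram". The trade-off is a much larger index, since every word contributes many terms.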

    Or, for a small site with a search on names that must use 'contains', maybe an in-memory query against the Umbraco published cache would be acceptable.

    It all depends on how useful it would be in the context of your site/application.

    I've often found that it's only after a site has gone live, when I start to see the search terms real people are using, that I can grasp how best to tune the search query logic for the site's content to return the best results.

    Anyway, not sure if this has helped, with a big ramble on things, but at least you have had a reply this time!

    Regards

    Marc
