  • Thanh Pham 22 posts 133 karma points
    Feb 08, 2018 @ 01:36
    Thanh Pham
    0

    Umbraco Examine - Search result highlighting

    Hi guys,

    I'm trying to implement search result highlighting (like Google's) in an Umbraco web app. I followed this thread: https://our.umbraco.org/forum/developers/extending-umbraco/13571-Umbraco-Examine-Search-Results-Highlighting. However, it's 8 years old, and I want to target multiple fields with fuzzy search, so below is my code:

            var stdAnalyzer = new StandardAnalyzer(Version.LUCENE_29);
            var formatter = new SimpleHTMLFormatter();
            var finalQuery = new BooleanQuery();
            var tmpQuery = new BooleanQuery();
    
            var multiQueryParser = new MultiFieldQueryParser(Version.LUCENE_29, fields, stdAnalyzer);
            var externalIndexSet = Examine.LuceneEngine.Config.IndexSets.Instance.Sets["ExternalIndexSet"];
            var externalSearcher = new IndexSearcher($"{externalIndexSet.IndexDirectory.FullName}\\Index", true);
            var terms = searchTerm.RemoveStopWords().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    
            foreach (var term in terms)
            {
                tmpQuery.Add(multiQueryParser.Parse(term.Replace("~", "") + $@"~{fuzzyScore}"),
                    BooleanClause.Occur.SHOULD);
            }
            tmpQuery.Add(multiQueryParser.Parse("noIndex:1"), BooleanClause.Occur.MUST_NOT);
    
            finalQuery.Add(multiQueryParser.Parse($@"{tmpQuery}"),
                BooleanClause.Occur.MUST);
            finalQuery.Add(multiQueryParser.Parse("__IndexType:content"), BooleanClause.Occur.MUST);
    
    
            var hits = externalSearcher.Search(finalQuery, 100);
            var qs = new QueryScorer(finalQuery);
            var highlighter = new Highlighter(formatter, qs);
            var fragmenter = new SimpleFragmenter();
            highlighter.SetTextFragmenter(fragmenter);
            highlighter.SetMaxDocBytesToAnalyze(int.MaxValue);
    
            foreach (var item in hits.ScoreDocs)
            {                
                var document = externalSearcher.Doc(item.doc);
                var description = document.Get("description");
                var tokenStream = TokenSources.GetTokenStream(externalSearcher.GetIndexReader(), item.doc,
                    "description", stdAnalyzer);
                var frags = highlighter.GetBestFragments(tokenStream, description, 10);
            }
    
            externalSearcher.Dispose();
    

    Everything seems to work except that I can't get a token stream, no matter how many methods from different classes I try, so no fragments are returned. I then looked at the Lucene.Net source code at https://lucenenet.apache.org/docs/3.0.3/df/d43/tokensources8cssource.html and found that GetTokenStream throws an ArgumentException (see image below) if the field passed in (here "description") does not have a TermPositionVector. I got exactly this exception when I debugged it. How do I fix this issue?

    [Screenshot of the ArgumentException; image not preserved]

    I use the default ExternalSearcher and ExternalIndexSet provided by Umbraco (7.7.6) to index and query content within the BackOffice.

    Thanks.

    TP

  • Thanh Pham 22 posts 133 karma points
    Feb 08, 2018 @ 06:33
    Thanh Pham
    0

    Update.

    I used Luke to examine the index Umbraco created and found that the description field has the Term Vector option ticked, but not positions or offsets (see image below). That means Umbraco Examine only records the number of occurrences of each term, not the positions and offsets that are required to get the token stream I mentioned in the initial post. Reference: http://makble.com/what-is-term-vector-in-lucene

    Can anyone shed some light on how to fix this? Thanks.

    [Screenshot of the field options in Luke; image not preserved]
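
    For reference, in plain Lucene.Net a field only carries positions and offsets in its term vector when it is added with Field.TermVector.WITH_POSITIONS_OFFSETS. The sketch below shows this at the Lucene.Net level only. The class name, method name and field text are hypothetical, and this is not how Umbraco Examine builds its index, since Examine sets these options itself.

    ```csharp
    using Lucene.Net.Documents;

    // Hypothetical helper, for illustration only (not part of Umbraco or Examine).
    public static class TermVectorExample
    {
        public static Document BuildDocument(string descriptionText)
        {
            var doc = new Document();

            // Store the text, analyse it, and keep a term vector that
            // records both token positions and character offsets.
            doc.Add(new Field(
                "description",
                descriptionText,
                Field.Store.YES,
                Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

            return doc;
        }
    }
    ```

    A field indexed with only the Term Vector option ticked in Luke lacks this position data, which matches the exception described in the first post.
    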

  • Thanh Pham 22 posts 133 karma points
    Feb 12, 2018 @ 05:45
    Thanh Pham
    0

    Can anyone help, please? Our client really wants this feature when they decommission their Google search plugin.

  • Dan Diplo 1316 posts 4880 karma points MVP 2x c-trib
    Feb 12, 2018 @ 09:23
    Dan Diplo
    1

    Here's how I do search term highlighting in Lucene:

    First, add a reference to the NuGet package Lucene.Net.Contrib 2.9.4.1 (ensure it's version 2.9.4.1 and not the latest).

    Then I have the following class with various methods to generate highlighting:

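    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Highlight; // assumed namespace for the 2.9.4.1 contrib highlighter
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    public class LuceneHighlighter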
    {
        private readonly Lucene.Net.Util.Version _luceneVersion = Lucene.Net.Util.Version.LUCENE_29;
    
        /// <summary>
        /// Initialises the queryparsers with an empty dictionary
        /// </summary>
        protected Dictionary<string, QueryParser> QueryParsers = new Dictionary<string, QueryParser>();
    
        /// <summary>
        /// Get or set the separator string (default = "...")
        /// </summary>
        public string Separator { get; set; }
    
        /// <summary>
        /// Get or set the maximum number of highlights to show (default = 5)
        /// </summary>
        public int MaxNumHighlights { get; set; }
    
        /// <summary>
        /// Get or set the Formatter to use (default = SimpleHTMLFormatter)
        /// </summary>
        public Formatter HighlightFormatter { get; set; }
    
        /// <summary>
        /// Get or set the Analyzer to use (default = StandardAnalyzer)
        /// </summary>
        public Analyzer HighlightAnalyzer { get; set; }
    
        /// <summary>
        /// Get the index searcher being used
        /// </summary>
        public IndexSearcher Searcher { get; private set; }
    
        /// <summary>
        /// Get the Query to be used for highlighting
        /// </summary>
        public Query LuceneQuery { get; private set; }
    
        /// <summary>
        /// Initialise a new LuceneHighlighter instance
        /// </summary>
        /// <param name="searcher">The IndexSearcher being used</param>
        /// <param name="luceneQuery">The underlying Lucene Query being used</param>
        /// <param name="highlightCssClassName">The name of the CSS class used to wrap around highlighted words</param>
        public LuceneHighlighter(IndexSearcher searcher, Query luceneQuery, string highlightCssClassName)
        {
            this.Searcher = searcher;
            this.LuceneQuery = luceneQuery;
            this.Separator = "...";
            this.MaxNumHighlights = 5;
            this.HighlightAnalyzer = new StandardAnalyzer(_luceneVersion);
            this.HighlightFormatter = new SimpleHTMLFormatter("<span class=\"" + highlightCssClassName + "\">", "</span> ");
        }
    
        /// <summary>
        /// Get the highlighted string for a value and a field
        /// </summary>
        /// <param name="value">The field value</param>
        /// <param name="highlightField">The field name</param>
        /// <returns>A string containing the highlighted result</returns>
        public string GetHighlight(string value, string highlightField)
        {
            value = Regex.Replace(value, "content", "", RegexOptions.IgnoreCase); // weird bug in GetBestFragments always adds "content"
    
            var scorer = new QueryScorer(LuceneQuery.Rewrite(Searcher.GetIndexReader()));
    
            var highlighter = new Highlighter(HighlightFormatter, scorer);
    
            var tokenStream = HighlightAnalyzer.TokenStream(highlightField, new StringReader(value));
            return highlighter.GetBestFragments(tokenStream, value, MaxNumHighlights, Separator);
        }
    
        /// <summary>
        /// Get the highlighted field for a value and field
        /// </summary>
        /// <param name="value">The field value</param>
        /// <param name="searcher">The Examine searcher</param>
        /// <param name="highlightField">The highlight field</param>
        /// <param name="luceneQuery">The query being used</param>
        /// <returns>A string containing the highlighted result</returns>
        public string GetHighlight(string value, IndexSearcher searcher, string highlightField, Query luceneQuery)
        {
            var scorer = new QueryScorer(luceneQuery.Rewrite(searcher.GetIndexReader()));
            var highlighter = new Highlighter(HighlightFormatter, scorer);
    
            var tokenStream = HighlightAnalyzer.TokenStream(highlightField, new StringReader(value));
            return highlighter.GetBestFragments(tokenStream, value, MaxNumHighlights, Separator);
        }
    
        /// <summary>
        /// Gets a query parser for a highlight field
        /// </summary>
        /// <param name="highlightField">The field</param>
        /// <returns>A query parser</returns>
        protected QueryParser GetQueryParser(string highlightField)
        {
            if (!QueryParsers.ContainsKey(highlightField))
            {
                QueryParsers[highlightField] = new QueryParser(_luceneVersion, highlightField, HighlightAnalyzer);
            }
            return QueryParsers[highlightField];
        }
    }
    
  • Thanh Pham 22 posts 133 karma points
    Feb 12, 2018 @ 22:59
    Thanh Pham
    0

    Thanks heaps Dan, I'll try it and let you know how it goes.

  • Thanh Pham 22 posts 133 karma points
    Feb 12, 2018 @ 23:52
    Thanh Pham
    0

    Hi Dan,

    Woohoo, it's working. Thank you very much :).

    By the way, I found that my code looked pretty much the same as yours except for the argument passed to QueryScorer. Mine did not call the .Rewrite method, which turned out to be the root of the issue. Again, thank you.
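
    Applied to the code in the first post, the change would look roughly like the fragment below. It is a sketch that reuses the variables from that post (finalQuery, externalSearcher, formatter, stdAnalyzer, description) and is not the exact code that ended up in the project. Rewrite matters here because a fuzzy query is a multi-term query: it has to be expanded against the index into concrete terms before QueryScorer can match them. The token stream is taken from the analyzer, as in Dan's class, so no position-aware term vector is needed.

    ```csharp
    // Fragment: replaces the highlighter setup and the loop body in the first post.
    // Needs "using System.IO;" for StringReader.

    // Expand fuzzy (multi-term) queries into the concrete terms found in the index.
    var rewrittenQuery = finalQuery.Rewrite(externalSearcher.GetIndexReader());
    var qs = new QueryScorer(rewrittenQuery);
    var highlighter = new Highlighter(formatter, qs);
    highlighter.SetTextFragmenter(new SimpleFragmenter());

    // Re-analyse the stored field text instead of reading term vectors.
    var tokenStream = stdAnalyzer.TokenStream("description", new StringReader(description));
    var frags = highlighter.GetBestFragments(tokenStream, description, 10);
    ```
    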

  • Jonny Flanagan 14 posts 113 karma points
    Jul 04, 2018 @ 11:47
    Jonny Flanagan
    0

    Hi Dan/Thanh,

    Could you show me how you called the LuceneHighlighter class using razor on your search/search results page?

    Thanks Jonny

  • Dan Diplo 1316 posts 4880 karma points MVP 2x c-trib
    Jul 04, 2018 @ 12:44
    Dan Diplo
    0

    Hi Jonny,

    I wrote an extension for the Examine.SearchResult class that can be used to get the highlight easily:

    /// <summary>
    /// Gets the contents of a field as a summary fragment containing the keywords highlighted
    /// </summary>
    /// <param name="result">The search result</param>
    /// <param name="fieldName">The field name to use (eg. 'bodyText')</param>
    /// <param name="highlighter">A reference to an instance of a Lucene highlighter</param>
    /// <returns>A string containing the field contents with search words highlighted</returns>
    public static string GetHighlightForField(this SearchResult result, string fieldName, LuceneHighlighter highlighter)
    {
        string highlight = null;
        if (result.Fields.ContainsKey(fieldName))
        {
            string fieldContents = result.Fields[fieldName];
            if (fieldContents != null)
            {
                highlight = highlighter.GetHighlight(fieldContents, fieldName);
            }
        }
        return highlight;
    }
    

    Hope that points you in the right direction.

  • Jonny Flanagan 14 posts 113 karma points
    Jul 04, 2018 @ 16:35
    Jonny Flanagan
    0

    Thank you Dan. This is really helpful.

    The part I am stuck on is building the actual Query to pass into the highlighter. Any help would be welcome.

    LuceneIndexer indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"];

    IndexSearcher searcher = new IndexSearcher(indexer.GetLuceneDirectory(), false);

    var luceneQuery = new Query(); // how to build the query with the search keyword??

    var highlighter = new LuceneHighlighter(searcher, luceneQuery, "text-warning");

    Examine.SearchResult highlightResult = new SearchResult();

    var summary = highlightResult.GetHighlightForField("bodyText", highlighter);
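
    One possible way to build the query is to parse the keyword with a Lucene QueryParser and run the same keyword through the Examine searcher. This is only a sketch under stated assumptions and is not taken from a reply in this thread. HighlightExample and GetSummaries are hypothetical names, "bodyText" and "text-warning" come from the snippet above, and the default "ExternalSearcher" provider is assumed. It relies on the LuceneHighlighter class and the GetHighlightForField extension posted earlier.

    ```csharp
    using System.Collections.Generic;
    using Examine;
    using Examine.LuceneEngine.Providers;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    // Hypothetical helper, for illustration only.
    public static class HighlightExample
    {
        public static List<string> GetSummaries(string keyword)
        {
            var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"];
            var searcher = new IndexSearcher(indexer.GetLuceneDirectory(), true); // read-only

            try
            {
                // Build a Lucene query from the user's keyword for the field to highlight.
                var version = Lucene.Net.Util.Version.LUCENE_29;
                var parser = new QueryParser(version, "bodyText", new StandardAnalyzer(version));
                Query luceneQuery = parser.Parse(QueryParser.Escape(keyword));

                var highlighter = new LuceneHighlighter(searcher, luceneQuery, "text-warning");

                // Run the actual search through Examine and highlight each hit.
                var results = ExamineManager.Instance
                    .SearchProviderCollection["ExternalSearcher"]
                    .Search(keyword, true);

                var summaries = new List<string>();
                foreach (SearchResult result in results)
                {
                    // Null when the result has no "bodyText" field.
                    summaries.Add(result.GetHighlightForField("bodyText", highlighter));
                }
                return summaries;
            }
            finally
            {
                searcher.Dispose();
            }
        }
    }
    ```

    In a Razor view the returned strings contain HTML span tags, so they would need to be output with Html.Raw rather than encoded.
    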

  • Dan Diplo 1316 posts 4880 karma points MVP 2x c-trib
    Jul 05, 2018 @ 12:26