  • John 2 posts 92 karma points
    Aug 27, 2024 @ 16:12

    Issues with feeding Lucene.Search.Highlight content stripped of HTML tags

    I'm trying to use Lucene.Search.Highlight with Examine to show previews of the matched text in search results. The Highlighter is working: it parses the content and highlights matches. However, because my pages are written in the Rich Text Editor, the "content" field contains many HTML tags that I don't want to display to the user or highlight as matches. So far I've been using this code:

            var highlighter = new Highlighter(formatter, new QueryScorer(GetLuceneQueryObject(searchQuery, highlightFieldName)));
            var tokenStream = new StandardAnalyzer(LuceneVersion.LUCENE_48).GetTokenStream(highlightFieldName, new StringReader(IndexField));
    

    Unfortunately, the highlighter's output often looks like this:

        ...="Forecasts">Forecasts</a>&nbsp;- <span class="umbSearchHighlight">Single</span>-<span class="umbSearchHighlight">asset</span> exogenous information</p> </li> <li class="p"> <p><a href... Contrasts">Forecast Contrasts</a>&nbsp;- Multi-<span class="umbSearchHighlight">asset</span> exogenous information &nbsp;...

    My first thought was to strip the HTML tags from the text before the highlighter matches on it. The problem is that IndexField is the full text I'm highlighting from (so I can strip its tags manually), but GetTokenStream also takes the name of the field I'm searching, which refers to the ExternalIndex field ("content" in my case). If I strip the HTML tags from IndexField without changing highlightFieldName, the token stream ends up "desynced", for lack of a better word, and the returned matches are displaced onto random characters.

    Is there a way to modify the ExternalIndex to add a field that is just an HTML-stripped version of the content field? Alternatively, is there a way to have Lucene.Search.Highlight parse text directly, rather than being forced to query on the fields of a given node?

    Thanks.

  • John 2 posts 92 karma points
    Aug 30, 2024 @ 14:29

    In case someone comes across this looking for help, I managed to figure it out despite the lack of response. It turned out that I could avoid highlightFieldName entirely by passing an empty string as the field name, both when generating the Lucene query object and when creating the token stream. Once this was done, the highlighter only had access to the content string I was feeding it, so I didn't need to add a new field to the external index.

    The code ended up looking like this:

    var highlighter = new Highlighter(formatter, new QueryScorer(GetLuceneQueryObject(searchQuery, String.Empty)));
    var tokenStream = new StandardAnalyzer(LuceneVersion.LUCENE_48).GetTokenStream(String.Empty, new StringReader(strippedIndexField));
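
    For completeness, here is a rough sketch of how a value like strippedIndexField could be produced. This is not necessarily the exact code used above: the HtmlText class and its Strip method are hypothetical names made up for illustration, and a regex is only a crude approximation of HTML parsing, so treat it as a starting point under those assumptions.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Lucene, Examine or Umbraco).
public static class HtmlText
{
    // Matches anything that looks like a tag, e.g. <p>, </a>, <li class="p">.
    private static readonly Regex TagPattern = new Regex("<[^>]+>", RegexOptions.Compiled);

    // Matches runs of whitespace, including the non-breaking space from &nbsp;.
    private static readonly Regex WhitespacePattern = new Regex(@"\s+", RegexOptions.Compiled);

    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Replace each tag with a space so words in adjacent elements do not merge.
        string withoutTags = TagPattern.Replace(html, " ");

        // Decode entities such as &nbsp; and &amp; into plain characters.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse whitespace runs into single spaces and trim the ends.
        return WhitespacePattern.Replace(decoded, " ").Trim();
    }
}
```

    Something like var strippedIndexField = HtmlText.Strip(IndexField); would then feed the StringReader above. As far as I can tell, the same stripped string also has to be the text handed to the highlighter when requesting fragments, since the token offsets refer to positions in whatever text was tokenized; a mismatch there would be one plausible cause of the "desynced" matches described in the first post.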
    