How to improve performance of search results processing?
Hey all,
I've been using the following code to build preview text for search results, but it takes too long to process the content of all the files in the results.
Is there a faster way to generate the preview text?
public static string FindSnippet(string text, string query, int maxLength)
{
    Regex regex = new Regex("[ ]{2,}", RegexOptions.None);
    text = regex.Replace(text, " ").Replace("\n", ". ").Trim();
    //text = text.Substring(0, text.Length / 2);
    if (maxLength < 0)
    {
        throw new ArgumentException("maxLength");
    }
    var words = query.Split(' ')
        .Where(w => !string.IsNullOrWhiteSpace(w))
        .Select(word => word.ToLower())
        .ToLookup(s => s);
    var sentences = text.Split('.');
    var i = 0;
    var packets = sentences.Select(sentence => new Packet
    {
        Sentence = sentence,
        Density = ComputeDensity(words, sentence),
        Offset = i++
    }).OrderByDescending(packet => packet.Density);
    var list = new SortedList<int, string>();
    int length = 0;
    foreach (var packet in packets)
    {
        if (length >= maxLength || packet.Density == 0)
        {
            break;
        }
        string sentence = packet.Sentence;
        list.Add(packet.Offset, sentence.Substring(0, Math.Min(sentence.Length, maxLength - length)));
        length += packet.Sentence.Length;
    }
    var sb = new List<string>();
    int previous = -1;
    foreach (var item in list)
    {
        var offset = item.Key;
        var sentence = item.Value;
        if (previous != -1 && offset - previous != 1)
        {
            sb.Add(".");
        }
        previous = offset;
        sb.Add(Highlight(sentence, words));
    }
    return String.Join(".", sb);
}
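Independently of the replies that follow, two small costs in the method above are easy to remove: the Regex is rebuilt on every call, and the query words are stored in a lookup that is only used for membership checks. The sketch below is a minimal illustration under stated assumptions: SnippetHelpers, Normalize and ParseTerms are hypothetical names, and this ComputeDensity is a guess at a simple scoring rule, since the original ComputeDensity is not shown in the post. Whether these are the real bottleneck would need profiling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; none of these names come from the original post.
public static class SnippetHelpers
{
    // Compiled once and reused, instead of creating a new Regex on every call.
    private static readonly Regex MultipleSpaces =
        new Regex("[ ]{2,}", RegexOptions.Compiled);

    // Same clean-up as FindSnippet: collapse runs of spaces and
    // turn newlines into sentence breaks.
    public static string Normalize(string text)
    {
        return MultipleSpaces.Replace(text, " ").Replace("\n", ". ").Trim();
    }

    // Lower-cased, de-duplicated query terms with constant-time membership checks.
    public static HashSet<string> ParseTerms(string query)
    {
        return new HashSet<string>(
            query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                 .Select(w => w.ToLowerInvariant()));
    }

    // Assumed scoring rule: the fraction of space-separated words in the
    // sentence that are query terms. Words with attached punctuation
    // (for example "foo,") do not match in this simple version.
    public static double ComputeDensity(HashSet<string> terms, string sentence)
    {
        var words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0)
        {
            return 0;
        }
        int hits = words.Count(w => terms.Contains(w.ToLowerInvariant()));
        return (double)hits / words.Length;
    }
}
```

The terms only need to be parsed once per search, so ParseTerms can be called a single time and its result reused for every document in the result set.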
The Umbraco helper object has a Truncate method you can use:
@Umbraco.Truncate(string, length)
There are quite a few useful overloads. There is also a StripHtml() method on the same helper that you can use in conjunction with it.
If you are really serious about searching then I'm assuming you are using Examine? If so, Lucene.Net Contrib has lots of extensions, including a highlighter that generates summaries, which does exactly what you are asking for. Make sure to grab the 2.9.4.1 package to be compatible with Umbraco.
I've been looking online for examples of using "Lucene.Net.Highlight.Highlighter.GetBestFragments" with Examine, but all of them are in Java and search and highlight directly with Lucene. See: http://makble.com/how-to-do-lucene-search-highlight-example
Could you help with an example or more guidance on getting it to work with Examine?
I've got it working, more or less. Performance has improved significantly, but the fragments returned aren't the best.
Code:
public static string GetBestTextFragments(string content, string searchTerms, int previewLength)
{
    Analyzer analyzer = new Lucene.Net.Analysis.Snowball.SnowballAnalyzer("English");
    Lucene.Net.Search.Query query = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", analyzer).Parse(searchTerms);
    TokenStream tokenStream = analyzer.TokenStream("", new StringReader(content));
    Lucene.Net.Highlight.Highlighter highlighter = new Lucene.Net.Highlight.Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
    string[] frag = highlighter.GetBestFragments(tokenStream, content, 5);
    return String.Join(".", frag);
}
Some examples of fragments returned:
Starts with comma:
, hospitals and pharmacies text text text text text text text text text text text text text text text text text text text text text text text ..., text text text text text text ). Please refer
Starts with 'and':
and pharmacies text text text text text text text without a referral.
I've always found it to return good results. I usually use it in conjunction with the search term that was used, so it gets the relevant match. The basic implementation I've used before is something like this (I created a class that abstracts it; some of this was found online and modified):
public class LuceneHighlighter
{
    private readonly Lucene.Net.Util.Version _luceneVersion = Lucene.Net.Util.Version.LUCENE_29;

    /// <summary>
    /// Initialises the query parsers with an empty dictionary
    /// </summary>
    protected Dictionary<string, QueryParser> QueryParsers = new Dictionary<string, QueryParser>();

    /// <summary>
    /// Get or set the separator string (default = "...")
    /// </summary>
    public string Separator { get; set; }

    /// <summary>
    /// Get or set the maximum number of highlights to show (default = 5)
    /// </summary>
    public int MaxNumHighlights { get; set; }

    /// <summary>
    /// Get or set the Formatter to use (default = SimpleHTMLFormatter)
    /// </summary>
    public Formatter HighlightFormatter { get; set; }

    /// <summary>
    /// Get or set the Analyzer to use (default = StandardAnalyzer)
    /// </summary>
    public Analyzer HighlightAnalyzer { get; set; }

    /// <summary>
    /// Get the index searcher being used
    /// </summary>
    public IndexSearcher Searcher { get; private set; }

    /// <summary>
    /// Get the Query to be used for highlighting
    /// </summary>
    public Query LuceneQuery { get; private set; }

    /// <summary>
    /// Initialise a new LuceneHighlighter instance
    /// </summary>
    /// <param name="searcher">The IndexSearcher being used</param>
    /// <param name="luceneQuery">The underlying Lucene Query being used</param>
    /// <param name="highlightCssClassName">The name of the CSS class used to wrap around highlighted words</param>
    public LuceneHighlighter(IndexSearcher searcher, Query luceneQuery, string highlightCssClassName)
    {
        this.Searcher = searcher;
        this.LuceneQuery = luceneQuery;
        this.Separator = "...";
        this.MaxNumHighlights = 5;
        this.HighlightAnalyzer = new StandardAnalyzer(_luceneVersion);
        this.HighlightFormatter = new SimpleHTMLFormatter("<span class=\"" + highlightCssClassName + "\">", "</span> ");
    }

    /*
    public string GetHighlight(string value, string highlightField, IndexSearcher searcher, string luceneRawQuery)
    {
        var query = GetQueryParser(highlightField).Parse(luceneRawQuery);
        var scorer = new QueryScorer(query.Rewrite(searcher.GetIndexReader()));
        var highlighter = new Highlighter(HighlightFormatter, scorer);
        var tokenStream = HighlightAnalyzer.TokenStream(highlightField, new StringReader(value));
        return highlighter.GetBestFragments(tokenStream, value, MaxNumHighlights, Separator);
    }
    */

    /// <summary>
    /// Get the highlighted string for a value and a field
    /// </summary>
    /// <param name="value">The field value</param>
    /// <param name="highlightField">The field name</param>
    /// <returns>A string containing the highlighted result</returns>
    public string GetHighlight(string value, string highlightField)
    {
        value = Regex.Replace(value, "content", "", RegexOptions.IgnoreCase); // weird bug in GetBestFragments always adds "content"
        var scorer = new QueryScorer(LuceneQuery.Rewrite(Searcher.GetIndexReader()));
        var highlighter = new Highlighter(HighlightFormatter, scorer);
        var tokenStream = HighlightAnalyzer.TokenStream(highlightField, new StringReader(value));
        return highlighter.GetBestFragments(tokenStream, value, MaxNumHighlights, Separator);
    }

    /// <summary>
    /// Get the highlighted field for a value and field
    /// </summary>
    /// <param name="value">The field value</param>
    /// <param name="searcher">The index searcher</param>
    /// <param name="highlightField">The highlight field</param>
    /// <param name="luceneQuery">The query being used</param>
    /// <returns>A string containing the highlighted result</returns>
    public string GetHighlight(string value, IndexSearcher searcher, string highlightField, Query luceneQuery)
    {
        var scorer = new QueryScorer(luceneQuery.Rewrite(searcher.GetIndexReader()));
        var highlighter = new Highlighter(HighlightFormatter, scorer);
        var tokenStream = HighlightAnalyzer.TokenStream(highlightField, new StringReader(value));
        return highlighter.GetBestFragments(tokenStream, value, MaxNumHighlights, Separator);
    }

    /// <summary>
    /// Gets a query parser for a highlight field
    /// </summary>
    /// <param name="highlightField">The field</param>
    /// <returns>A query parser</returns>
    protected QueryParser GetQueryParser(string highlightField)
    {
        if (!QueryParsers.ContainsKey(highlightField))
        {
            QueryParsers[highlightField] = new QueryParser(_luceneVersion, highlightField, HighlightAnalyzer);
        }
        return QueryParsers[highlightField];
    }
}
Then you can use it like this (the last parameter is a CSS class that wraps matching words):
var highlighter = new LuceneHighlighter(luceneIndexSearcher, luceneQuery, "text-warning");
Then on a SearchResult you can do:
var summary = result.GetSummary(highlighter); // result is a SearchResult; GetSummary itself is not shown in this post
It's a long time since I wrote this, so can't remember all the details, but hope it helps :)
I was able to strip off special characters at the beginning of each preview text, and also found the API returning reasonable text fragments, after doing some minor changes in my code.
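The exact changes are not shown in the thread. As a rough illustration only, one way to strip stray punctuation such as a leading ", " or " ). " from a fragment could look like the sketch below. FragmentCleaner and TrimLeadingPunctuation are hypothetical names, not part of Lucene.Net or Examine, and the method deliberately stops at '<' and '&' so that a fragment beginning with a highlight tag or an HTML entity is left intact.

```csharp
using System;

// Hypothetical helper; not part of Lucene.Net, Examine or Umbraco.
public static class FragmentCleaner
{
    // Removes leading whitespace and punctuation from a highlighted fragment.
    // Scanning stops at '<' or '&' so that markup added by the formatter
    // (for example an opening highlight tag) is never cut in half.
    public static string TrimLeadingPunctuation(string fragment)
    {
        if (string.IsNullOrEmpty(fragment))
        {
            return string.Empty;
        }

        int start = 0;
        while (start < fragment.Length)
        {
            char c = fragment[start];
            bool strippable = c != '<' && c != '&'
                && (char.IsWhiteSpace(c) || char.IsPunctuation(c));
            if (!strippable)
            {
                break;
            }
            start++;
        }
        return fragment.Substring(start);
    }
}
```

Applied to each element of the array returned by GetBestFragments before joining, this removes the leading comma in the first example above. It does not address fragments that begin mid-sentence with a word such as 'and'.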
Thanks for your help!!
https://www.nuget.org/packages/Lucene.Net.Contrib/2.9.4.1
Yes, I'm using Examine for searching.
Thanks so much!
/manideep
Thanks for your help, Dan Diplo!