Hello:
I am using Umbraco 7.5.9 with three indexers defined and working fine, except that when a user searches the site, it takes 18 seconds to run through the three indexers before returning results.
So I retooled the code to use C# parallel processing, and now the search completes in half the time, 9 seconds (still too slow).
For my site, when users search on "housing" there are a total of 303 items.
131 items belong to the Content indexer,
250 items belong to the PDF indexer, and
2 items belong to the Inbox indexer (a custom back-office app plugin we wrote).
Question: I have broken my code down and found that the bottleneck is in the PDF indexer.
Does anyone have any suggestions on how to improve the PDF indexer?
Thank You
Tom
PS Here is the code. Note: I am using ConcurrentBag and Parallel processing.
// Each tuple pairs a result type with the name of the index set to search.
var indexesToSearch = new List<Tuple<SearchIndexType, string>>
{
    new Tuple<SearchIndexType, string>(SearchIndexType.Content, @"MembersOnlyIndexSet"),
    new Tuple<SearchIndexType, string>(SearchIndexType.Media, @"MembersOnlyPDFIndexSet"),
    new Tuple<SearchIndexType, string>(SearchIndexType.InboxMessage, "MembersOnlyInboxMessageIndexSet")
};
var results = new List<SearchResultVM>();
if (string.IsNullOrWhiteSpace(searchTerm))
    return results;

var analyzer = new StandardAnalyzer(Version.LUCENE_29);
var parser = new MultiFieldQueryParser(Version.LUCENE_29, QueryFields, analyzer);
var query = parser.Parse(searchTerm);

// Build the highlighter
var formatter = new SimpleHTMLFormatter("<span class=\"lucene-highlight\">", "</span>");
var scorer = new QueryScorer(query);
var highlighter = new Highlighter(formatter, scorer);
highlighter.SetTextFragmenter(new SimpleFragmenter(FragementLength));

var sets = IndexSets.Instance.Sets;
// Capture the context up front; UmbracoContext.Current is not available on worker threads.
UmbracoContext context = UmbracoContext.Current;
ConcurrentBag<SearchResultVM> bag = new ConcurrentBag<SearchResultVM>();

// Search each index on its own thread and collect the results in the bag.
Parallel.ForEach(indexesToSearch, (index) =>
{
    var set = sets[index.Item2];
    var dirInfo = new DirectoryInfo(Path.Combine(set.IndexDirectory.FullName, @"Index"));
    using (var indexDir = FSDirectory.Open(dirInfo))
    {
        using (var indexSearcher = new IndexSearcher(indexDir, true))
        {
            var collect = TopScoreDocCollector.create(3000, true);
            indexSearcher.Search(query, collect);
            var docs = collect.TopDocs();
            // Loop over the hits actually returned; the total hit count can exceed the 3000 cap.
            for (int i = 0; i < docs.ScoreDocs.Length; i++)
            {
                var rec = docs.ScoreDocs[i];
                var doc = indexSearcher.Doc(rec.doc);
                SearchResultVM item;
                switch (index.Item1)
                {
                    case SearchIndexType.Content:
                        item = BuildSearchContentItem(rec, analyzer, highlighter, doc);
                        break;
                    case SearchIndexType.Media:
                        item = BuildSearchMediaItem(rec, analyzer, highlighter, doc, context);
                        break;
                    case SearchIndexType.InboxMessage:
                        item = BuildSearchInboxMessageItem(rec, analyzer, highlighter, doc);
                        break;
                    default:
                        Log.DebugFormat("Unrecognized search index type {0}", index.Item1);
                        item = null;
                        break;
                }
                if (item != null)
                {
                    bag.Add(item);
                }
            }
        }
    }
});

// Merge the per-index results and rank them by score.
results = bag.ToList();
return results.OrderByDescending(o => o.Score).ToList();
}
private static string BuildMediaItemUrl(IPublishedContent mediaItem, UmbracoContext context)
{
    var ctypeSvc = ApplicationContext.Current.Services.ContentTypeService;
    var contentSvc = ApplicationContext.Current.Services.ContentService;
    var urlHelper = new UmbracoHelper(context);
    if (string.Equals(mediaItem.DocumentTypeAlias, @"membersOnlyPDF", StringComparison.InvariantCultureIgnoreCase)
        || string.Equals(mediaItem.DocumentTypeAlias, @"membersOnlyFile", StringComparison.InvariantCultureIgnoreCase))
    {
        var cTypeMOHomePage = ctypeSvc.GetContentType("membersOnlyHomepage");
        var moHomePage = contentSvc.GetContentOfContentType(cTypeMOHomePage.Id).FirstOrDefault();
        if (moHomePage != null)
        {
            return $"{urlHelper.NiceUrlWithDomain(moHomePage.Id).TrimEnd('/')}{mediaItem.Url()}";
        }
    }
    if (string.Equals(mediaItem.DocumentTypeAlias, @"File", StringComparison.InvariantCultureIgnoreCase))
    {
        var cTypePWSHomePage = ctypeSvc.GetContentType("Homepage");
        var homePage = contentSvc.GetContentOfContentType(cTypePWSHomePage.Id).FirstOrDefault();
        if (homePage != null)
        {
            return $"{urlHelper.NiceUrlWithDomain(homePage.Id).TrimEnd('/')}{mediaItem.Url()}";
        }
    }
    return string.Empty;
}
Ways to Improve Lucene Search Engine results
Here's the routine to build the PDFs
What's in BuildMediaItemUrl()? The only other place I can see that could be taking your time (in my light experience with media in searches) would be the HighlightContent() method. It looks like you need to break it down a little further to locate the bottleneck.
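As a rough illustration of breaking the work down further, the sketch below accumulates elapsed time per named stage with `System.Diagnostics.Stopwatch`. `StageTimer` and the stage names are hypothetical, not part of Umbraco, Examine, or Lucene.Net, and the sleeps only simulate work. The timer is not thread-safe, so it assumes the result loop is run sequentially while profiling.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Hypothetical helper: accumulates elapsed time per named stage.
public sealed class StageTimer
{
    private readonly Dictionary<string, long> _ticks = new Dictionary<string, long>();

    // Runs the action, adds its elapsed time to the stage, and returns its result.
    public T Measure<T>(string stage, Func<T> action)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return action();
        }
        finally
        {
            sw.Stop();
            long total;
            _ticks.TryGetValue(stage, out total); // total stays 0 for a new stage
            _ticks[stage] = total + sw.ElapsedTicks;
        }
    }

    // Stages ordered from slowest to fastest, in milliseconds.
    public IEnumerable<KeyValuePair<string, double>> Report()
    {
        return _ticks
            .OrderByDescending(p => p.Value)
            .Select(p => new KeyValuePair<string, double>(
                p.Key, p.Value * 1000.0 / Stopwatch.Frequency));
    }
}

public static class TimingDemo
{
    public static void Main()
    {
        var timer = new StageTimer();
        for (int i = 0; i < 5; i++)
        {
            // Simulated stages; in the real loop these would wrap the actual calls.
            timer.Measure("LoadDoc", () => i);
            timer.Measure("Highlight", () => { Thread.Sleep(1); return i; });
            timer.Measure("BuildUrl", () => { Thread.Sleep(50); return i; });
        }
        foreach (var stage in timer.Report())
        {
            Console.WriteLine("{0}: {1:F1} ms", stage.Key, stage.Value);
        }
    }
}
```

Wrapping each call inside the per-result loop (document load, highlighting, URL building) in `Measure` should show which stage dominates the total.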
John:
Thanks so much for replying.
The BuildMediaItemUrl() method simply returns a friendly URL for each item.
And when I removed highlighting from the Media search index, performance improved only slightly.
Do you have any other suggestions? Note: I found Cogworks.ExamineFileIndexer online. Do you have any experience with it? Is it faster than Umbraco's 7.0 Examine v0.1.89?
I believe the bottleneck could be due to using the ContentService, which is the full CRUD API and thus not optimised for reads (and not cached).
There's more in-depth information in the Common Pitfalls part of the documentation, which can be a really good and enlightening read. Here's the section about using the Services in views: https://our.umbraco.org/documentation/Reference/Common-Pitfalls/#using-the-services-layer-in-your-views
Hope this helps!
Best,
Niels...
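One way to act on this advice, sketched under assumptions, is to resolve each site's base URL once per search request instead of once per result. In the sketch below, `ResolveBaseUrlSlow` is a hypothetical stand-in for the ContentTypeService and ContentService lookups in BuildMediaItemUrl; `BaseUrlCache`, the example.com URLs, and the media paths are illustrative only. It uses only the .NET base class library so it can be run outside Umbraco.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class BaseUrlCache
{
    // Counts how often the slow lookup actually runs (for demonstration only).
    public static int SlowLookups;

    // Hypothetical stand-in for the uncached service-layer lookups.
    public static string ResolveBaseUrlSlow(string homepageAlias)
    {
        Interlocked.Increment(ref SlowLookups);
        Thread.Sleep(20); // simulate a database round trip
        return "https://example.com/" + homepageAlias.ToLowerInvariant();
    }

    // Builds one URL per item (key = homepage alias, value = media path),
    // resolving each distinct homepage alias only once.
    public static List<string> BuildUrls(IEnumerable<KeyValuePair<string, string>> items)
    {
        // Lazy<T> ensures the slow lookup runs once per alias even if
        // several threads ask for the same alias at the same time.
        var cache = new ConcurrentDictionary<string, Lazy<string>>();
        return items
            .Select(i => cache.GetOrAdd(
                i.Key,
                alias => new Lazy<string>(() => ResolveBaseUrlSlow(alias))).Value + i.Value)
            .ToList();
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        // 250 simulated PDF results that all hang off the same homepage.
        var items = Enumerable.Range(1, 250)
            .Select(n => new KeyValuePair<string, string>(
                "membersOnlyHomepage", "/media/" + n + "/file.pdf"));
        var urls = BaseUrlCache.BuildUrls(items);
        Console.WriteLine(urls.Count + " URLs built with "
            + BaseUrlCache.SlowLookups + " slow lookup(s)");
    }
}
```

The cache is scoped to a single call, so a later search still picks up a changed domain or homepage. Whether this would be enough in a real site depends on how expensive the remaining per-item calls are.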
John:
I figured it out. If I bypass BuildMediaItemUrl and just use the out-of-the-box URL, I am back to sub-second response times for search.
var mediaItem = context.MediaCache.GetById(nodeId);
Url = mediaItem.Url,
Thanks for your help and tips.
Interesting. The method must be creating a whole bunch of temporary objects; that's the only thing I can think of that would cause a performance hit like the one you describe.
Tom,
Was there a reason for using Lucene.Net directly rather than Examine? Was it so you could get the highlighter working? Also, both Lucene and Examine have a multi-index searcher, which allows you to search over more than one index.
Regards
Ismail