Ways to improve Lucene search engine results

Tom 161 posts 322 karma points

Jan 02, 2018 @ 11:41

Hello: I am using Umbraco 7.5.9 and have three indexers defined and working fine, except that when a user searches the Umbraco site, it takes 18 seconds to run through the three indexers before returning the results. So I retooled the code to use C# parallel processing, and now the search finishes in half the time, 9 seconds (still too slow).

For my site, when users search on "housing", there are a total of 303 items: 131 items belong to the Content indexer, 250 items belong to the PDF indexer, and 2 items belong to the Inbox indexer (a custom back-office app plugin we wrote).

Question: I have broken my code down and found the bottleneck is in the PDF indexer.

Does anyone have any suggestions on how to improve the PDF indexer's performance?

Thank You

Tom

PS Here is the code. Note: I am using ConcurrentBag and Parallel processing.

        var indexesToSearch = new List<Tuple<SearchIndexType, string>>
        {
            new Tuple<SearchIndexType, string>(SearchIndexType.Content, @"MembersOnlyIndexSet"),
            new Tuple<SearchIndexType, string>(SearchIndexType.Media, @"MembersOnlyPDFIndexSet"),
            new Tuple<SearchIndexType, string>(SearchIndexType.InboxMessage, "MembersOnlyInboxMessageIndexSet")
        };


        var results = new List<SearchResultVM>();

        if (string.IsNullOrWhiteSpace(searchTerm))
            return results;

        var analyzer = new StandardAnalyzer(Version.LUCENE_29);
        // Reuse the analyzer above instead of constructing a second identical one.
        var parser = new MultiFieldQueryParser(Version.LUCENE_29,
            QueryFields,
            analyzer);
        var query = parser.Parse(searchTerm);

        // Build the highlighter
        var formatter = new SimpleHTMLFormatter("<span class=\"lucene-highlight\">", "</span>");
        var scorer = new QueryScorer(query);
        var highlighter = new Highlighter(formatter, scorer);
        highlighter.SetTextFragmenter(new SimpleFragmenter(FragementLength));

        var sets = IndexSets.Instance.Sets;
        UmbracoContext context = UmbracoContext.Current;

        ConcurrentBag<SearchResultVM> bag = new ConcurrentBag<SearchResultVM>();

        Parallel.ForEach(indexesToSearch, (index) =>
        {
            var set = sets[index.Item2];
            var dirInfo = new DirectoryInfo(Path.Combine(set.IndexDirectory.FullName, @"Index"));

            using (var indexDir = FSDirectory.Open(dirInfo))
            {
                using (var indexSearcher = new IndexSearcher(indexDir, true))
                {
                    var collect = TopScoreDocCollector.create(3000, true);
                    indexSearcher.Search(query, collect);

                    var docs = collect.TopDocs();
                    // The total hit count can exceed the 3000 docs actually collected, so bound the loop by the array.
                    for (int i = 0; i < docs.ScoreDocs.Length; i++)
                    {
                        var rec = docs.ScoreDocs[i];
                        var doc = indexSearcher.Doc(rec.doc);

                        SearchResultVM item;
                        switch (index.Item1)
                        {
                            case SearchIndexType.Content:
                                item = BuildSearchContentItem(rec, analyzer, highlighter, doc);
                                break;
                            case SearchIndexType.Media:
                                item = BuildSearchMediaItem(rec, analyzer, highlighter, doc, context);
                                break;
                            case SearchIndexType.InboxMessage:
                                item = BuildSearchInboxMessageItem(rec, analyzer, highlighter, doc);
                                break;
                            default:
                                Log.DebugFormat("Unrecognized search index type {0}", index.Item1);
                                item = null;
                                break;
                        }

                        if (item != null)
                        {
                            bag.Add(item);
                        }
                    }
                }
            }
        });

        results = bag.ToList();
        return results.OrderByDescending(o => o.Score).ToList();
    }


Tom 161 posts 322 karma points

Jan 02, 2018 @ 12:24

Here's the routine that builds the search results for the PDFs:

   private static SearchResultVM BuildSearchMediaItem(ScoreDoc rec, StandardAnalyzer analyzer, Highlighter highlighter, Document doc, UmbracoContext context)
    {
        var id = doc.GetField("__NodeId");
        int nodeId;
        int.TryParse(id.StringValue(), out nodeId);

        var mediaItem = context.MediaCache.GetById(nodeId);

        var item = new SearchResultVM
        {
            Score = rec.score,
            SearchIndexType = SearchIndexType.Media,
            Id = nodeId,
            Name = mediaItem.Name,
            Url = BuildMediaItemUrl(mediaItem, context),
            LastUpdated = mediaItem.UpdateDate.ToShortDateString(),
            HighlightedFragment = HighlightContent(analyzer, highlighter, doc, @"FileTextContent")
        };

        return item;
    }


John Bergman 483 posts 1132 karma points

Jan 02, 2018 @ 17:51

What's in BuildMediaItemUrl()? The only other place I can see that could be taking your time (in my light experience with media in searches) is the HighlightContent() method. It looks like you need to break it down a little further to locate the bottleneck.
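One way to break the timing down per step is sketched below. `StepTimer` is a hypothetical helper, not part of Umbraco, Examine, or Lucene.Net; it only uses `Stopwatch` and a `ConcurrentDictionary`, so it should be safe to call from inside the `Parallel.ForEach` loop.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

// Hypothetical helper: accumulates elapsed time per named step
// so the slowest step stands out after a search has run.
public static class StepTimer
{
    private static readonly ConcurrentDictionary<string, long> Ticks =
        new ConcurrentDictionary<string, long>();

    // Runs func, adds its elapsed Stopwatch ticks to the named bucket,
    // and returns func's result unchanged.
    public static T Measure<T>(string step, Func<T> func)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return func();
        }
        finally
        {
            sw.Stop();
            Ticks.AddOrUpdate(step, sw.ElapsedTicks, (_, total) => total + sw.ElapsedTicks);
        }
    }

    // Total milliseconds recorded so far for the named step (0 if never measured).
    public static double TotalMilliseconds(string step)
    {
        long ticks;
        return Ticks.TryGetValue(step, out ticks)
            ? ticks * 1000.0 / Stopwatch.Frequency
            : 0.0;
    }
}
```

In BuildSearchMediaItem, the two suspects could then be wrapped as `StepTimer.Measure("url", () => BuildMediaItemUrl(mediaItem, context))` and `StepTimer.Measure("highlight", () => HighlightContent(analyzer, highlighter, doc, @"FileTextContent"))`, with both totals logged once the search returns.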


Tom 161 posts 322 karma points

Jan 03, 2018 @ 12:08

John:

Thanks so much for replying.

BuildMediaItemUrl() simply returns a friendly URL for each item.

And when I removed highlighting from the Media search index, performance improved only a little.

Do you have any other suggestions? Note: I found Cogworks.ExamineFileIndexer online. Do you have any experience with it? Is it faster than Umbraco 7's Examine v0.1.89?

    private static string BuildMediaItemUrl(IPublishedContent mediaItem, UmbracoContext context)
    {
        var ctypeSvc = ApplicationContext.Current.Services.ContentTypeService;
        var contentSvc = ApplicationContext.Current.Services.ContentService;
        var urlHelper = new UmbracoHelper(context);

        if (string.Equals(mediaItem.DocumentTypeAlias, @"membersOnlyPDF", StringComparison.InvariantCultureIgnoreCase)
            || string.Equals(mediaItem.DocumentTypeAlias, @"membersOnlyFile", StringComparison.InvariantCultureIgnoreCase))
        {
            var cTypeMOHomePage = ctypeSvc.GetContentType("membersOnlyHomepage");
            var moHomePage = contentSvc.GetContentOfContentType(cTypeMOHomePage.Id).FirstOrDefault();
            if (moHomePage != null)
            {
                return $"{urlHelper.NiceUrlWithDomain(moHomePage.Id).TrimEnd('/')}{mediaItem.Url()}";
            }
        }

        if (string.Equals(mediaItem.DocumentTypeAlias, @"File", StringComparison.InvariantCultureIgnoreCase))
        {
            var cTypePWSHomePage = ctypeSvc.GetContentType("Homepage");
            var homePage = contentSvc.GetContentOfContentType(cTypePWSHomePage.Id).FirstOrDefault();
            if (homePage != null)
            {
                return $"{urlHelper.NiceUrlWithDomain(homePage.Id).TrimEnd('/')}{mediaItem.Url()}";
            }
        }
        return string.Empty;
    }


Niels Hartvig 1951 posts 2391 karma points c-trib

Jan 04, 2018 @ 09:04

I believe a bottleneck could be due to using the ContentService, which is the full CRUD API and thus not optimised for read-only access (and not cached).

There's more in-depth information in the Common Pitfalls part of the documentation, which can be a really good and enlightening read. Here's the section about using the Services in views: https://our.umbraco.org/documentation/Reference/Common-Pitfalls/#using-the-services-layer-in-your-views
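If the home page prefix is still needed, one option is to resolve it once and reuse it for every result instead of calling the services per item. The sketch below is a minimal, self-contained illustration of that idea: `HomePageUrlCache`, `ExpensiveLookup`, and the placeholder URL are all hypothetical, with `ExpensiveLookup` standing in for the ContentTypeService/ContentService calls in BuildMediaItemUrl.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: resolve the home page base URL once, then reuse it.
public static class HomePageUrlCache
{
    // Counts how often the expensive lookup actually runs.
    public static int LookupCount;

    // Stand-in for the uncached service calls; the returned URL is a placeholder.
    private static string ExpensiveLookup()
    {
        Interlocked.Increment(ref LookupCount);
        return "https://example.test/members/";
    }

    // Lazy<T> created with only a factory initializes thread-safely by default,
    // so parallel callers share a single lookup.
    private static readonly Lazy<string> BaseUrl = new Lazy<string>(ExpensiveLookup);

    // Builds the full URL for a media path such as "/media/1/a.pdf".
    public static string BuildUrl(string mediaPath)
    {
        return BaseUrl.Value.TrimEnd('/') + mediaPath;
    }
}
```

A value cached this way lives for the lifetime of the application, so it would go stale if the home page or its domain changed; resolving it once per search request and passing it into the loop is a more conservative variant of the same idea.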

Hope this helps!

Best,

Niels...


Tom 161 posts 322 karma points

Jan 03, 2018 @ 13:44

John:

I figured it out. If I bypass BuildMediaItemUrl and just use the out-of-the-box URL, I am back to sub-second response times for search.

    var mediaItem = context.MediaCache.GetById(nodeId);
    Url = mediaItem.Url,

Thanks for your help and tips.


John Bergman 483 posts 1132 karma points

Jan 04, 2018 @ 01:57

Interesting. The method must be creating a whole bunch of temporary objects; that's the only thing I can think of that would cause a performance hit like the one you are describing.


Ismail Mayat 4511 posts 10092 karma points MVP 2x admin c-trib

Jan 04, 2018 @ 10:49

Tom,

Was there a reason for using Lucene.Net directly and not Examine? Was it so you could get the highlighter working? Also, both Lucene and Examine have a multi-index searcher, which allows you to search over more than one index.

Regards

Ismail
