  • Darren Ferguson 1022 posts 3259 karma points MVP c-trib
    Jan 06, 2015 @ 19:48

    Examine and memory usage on index build for large amounts of content

    We are having issues with Out of Memory Exceptions whilst rebuilding an Examine index with large amounts of content.

    There are 145,721 documents in the content tree. We are using Umbraco 6.2.4.

    We've seen our IIS process grow to 5 GB, at which point the developer PC runs out of memory. SQL Server also grows to upwards of 5 GB.

    We stripped the Examine config back so that only the Internal Indexer was in use.

    We've worked through the code: UmbracoExamine.BaseUmbracoIndexer.PerformIndexRebuild seems to enumerate all DocTypes and call IndexAll on each type.

    This ultimately ends up in PerformIndexAll();

    We end up with an XPath query that gets passed to:

    [Obsolete("This should no longer be used, latest content will be indexed by using the IContentService directly")]
    public XDocument GetLatestContentByXPath(string xpath)
    {
        var xmlContent = XDocument.Parse("<content></content>");
        foreach (var c in _applicationContext.Services.ContentService.GetRootContent())
        {
            xmlContent.Root.Add(c.ToDeepXml(_applicationContext.Services.PackagingService));
        }
        var result = ((IEnumerable)xmlContent.XPathEvaluate(xpath)).Cast<XElement>();
        return result.ToXDocument();
    }

    If I read it right, we loop over all root nodes, serialise the entire content tree from the database and filter the resulting XML? We end up in Umbraco.Core.Services.EntityXmlSerializer which appears to recurse downwards.

    I'm writing this because I don't think we'll ever be able to rebuild our indexes unless we publish node by node via the UI (unless we add several GB more of memory).

    As I need a solution, is the following theory sound?

    • Write an alternative indexer.
    • For the internal indexer, which always needs everything, simply get everything from the cmsDocument table.
    • Pass it to Umbraco.Core.Services.Export, using the non-deep variation.
    • In a non-recursive fashion, pass each of these to Examine.LuceneEngine.Providers.AddNodesToIndex.
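
    A minimal sketch of the steps listed above, so the memory argument is concrete. It does not call any Umbraco or Examine API: nodeIds, serialize and addToIndex are hypothetical stand-ins for a flat id list (for example read from cmsDocument), a shallow (non-deep) export of one node, and whatever call pushes nodes into the index. The point is only that a flat, batched walk keeps at most one batch of serialised nodes in memory at a time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FlatBatchIndexing
{
    // Splits a lazily evaluated sequence into lists of at most batchSize items.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0) yield return batch; // final, partially filled batch
    }

    // Hypothetical driver (assumption, not Umbraco code): serialises each node
    // on its own, with no recursion into children, and indexes batch by batch.
    public static int IndexAll<TNode>(
        IEnumerable<int> nodeIds,
        Func<int, TNode> serialize,
        Action<IList<TNode>> addToIndex,
        int batchSize)
    {
        var indexed = 0;
        foreach (var ids in InBatches(nodeIds, batchSize))
        {
            var nodes = ids.Select(serialize).ToList(); // only this batch is materialised
            addToIndex(nodes);
            indexed += nodes.Count;
        }
        return indexed;
    }

    public static void Main()
    {
        var batchSizes = new List<int>();
        var total = IndexAll(
            Enumerable.Range(1, 2500),            // fake ids standing in for cmsDocument rows
            id => "<node id=\"" + id + "\" />",   // fake shallow serialisation
            batch => batchSizes.Add(batch.Count), // fake index sink
            1000);
        Console.WriteLine(total + " nodes in " + batchSizes.Count + " batches");
    }
}
```

    With 2,500 fake ids and a batch size of 1,000 this runs three batches (1,000, 1,000 and 500). The batch size is the knob that trades rebuild speed against peak memory.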

    Moving forward, is this an issue that those in the core are aware of? Where one has a large amount of content and several indexes, would it be wise to have a variation of Examine that enumerates *all content* and checks whether it belongs in an index, rather than the current method of each index running queries for itself?

    Lastly, would some refactoring be wise so that index rebuilds for published content are performed against the content cache, and for indexes that support unpublished content there is a secondary step in which all unpublished content is processed after the cached content, in the fashion mentioned above?

    Thanks for any insight.

     

     

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Jan 06, 2015 @ 23:12

    Hi Darren,

    A very long time ago, Examine only worked with published content, so XPath made sense with regards to the XML cache. Eventually indexing needed to be done on non-published content as well, and a nasty hack had to be created for that. Unfortunately it has been around for quite some time.

    There have been lots of perf enhancements in the v7.2 codebase regarding this. In fact, the code snippet you've referenced above with the Obsolete tag is from v7.2, not from v6. The fixes were made in this revision: https://github.com/umbraco/Umbraco-CMS/commit/4c0f95a93a6a8911fa892aba2c8773e4f2c23ed9

    The XPath thing is no longer used to reindex all of a certain type; we use paged queries to perform the bulk indexing operation.

    So my first piece of advice would be to upgrade to 7.2.1.

    If you can't upgrade then:

    • Would love to know why ;)
    • You will need to create a custom indexer that inherits from UmbracoContentIndexer and overrides PerformIndexAll, very similar to the revision referenced above. However, v6 doesn't have these optimized queries for getting paged content/media/member data, so you would have to either write this logic yourself (probably the best way would be to attempt to backport the entire content/media services/repositories to v6, not sure how difficult that would be) or write your own raw Database access with queries to get the data that you want to put into the indexes.
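
    For illustration, the paged approach described above boils down to a loop of the following shape. This is a sketch under stated assumptions, not the code from the referenced revision: PageFetcher is a hypothetical delegate standing in for whichever paged query is available (a backported service method or raw database paging), and indexPage stands in for the call that adds one page of items to the index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PagedReindex
{
    // Hypothetical page source (assumption): returns one page of items and
    // reports the total number of records through the out parameter.
    public delegate IList<T> PageFetcher<T>(int pageIndex, int pageSize, out int totalRecords);

    // Walks the pages in order and hands each one to the indexer, so only one
    // page of items is held in memory at a time.
    public static int IndexAllPaged<T>(PageFetcher<T> fetchPage, Action<IList<T>> indexPage, int pageSize)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException("pageSize");
        var pageIndex = 0;
        var indexed = 0;
        int total;
        do
        {
            var page = fetchPage(pageIndex, pageSize, out total);
            if (page.Count == 0) break; // stop if the source is empty or shrank mid-run
            indexPage(page);
            indexed += page.Count;
            pageIndex++;
        }
        while (pageIndex * pageSize < total);
        return indexed;
    }

    public static void Main()
    {
        // In-memory list standing in for a database table.
        var fakeTable = Enumerable.Range(1, 2500).ToList();
        PageFetcher<int> fetch = delegate(int pageIndex, int pageSize, out int totalRecords)
        {
            totalRecords = fakeTable.Count;
            return fakeTable.Skip(pageIndex * pageSize).Take(pageSize).ToList();
        };

        var pages = 0;
        var count = IndexAllPaged(fetch, page => pages++, 1000);
        Console.WriteLine(count + " items in " + pages + " pages");
    }
}
```

    In a custom indexer, a loop like this would sit inside the PerformIndexAll override, with the real paged query behind fetchPage.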
  • Darren Ferguson 1022 posts 3259 karma points MVP c-trib
    Jan 07, 2015 @ 10:13

    Thanks Shannon.

    It all makes sense.

    I'd love to embark on an embittered rant about why we can't upgrade to 7.2 - but the short answer is time and money.

    Since writing the original post we've audited the huge amount of content - and suspect that a large amount of it is unused. I'm going to speak to the customer and see if we can reduce it (to about 7,000 nodes).

     

    If not, then we'll look at backporting some of the 7.2 enhancements.

    Thanks for your help!

    D

  • Darren Ferguson 1022 posts 3259 karma points MVP c-trib
    Jan 11, 2015 @ 11:24

    Hi Shannon,

    To get things done quickly, we went with the following: https://gist.github.com/darrenferguson/7559866bb8a2800e8f36

    For us - the balance between getting the internal index built reasonably quickly and memory usage was about right.

     

    Thanks for the pointers.

    D.

  • Shannon Deminick 1526 posts 5272 karma points MVP 3x
    Jan 12, 2015 @ 04:54

    Hi Darren,

    I would recommend trying to use Database.Paged
