
  • onurar 11 posts 81 karma points
    Nov 25, 2016 @ 13:56
    onurar
    0

    How to publish but defer rebuilding of xml cache

    Hello,

    Short version: Can I programmatically and temporarily disable XML cache rebuilding, and re-enable it later?

    Long version: In my Umbraco application, a user can upload a CSV to the system, which triggers a code block that updates about 15k nodes under a parent node one by one, i.e. each node is updated programmatically and then published using the "SaveAndPublishWithStatus()" method.

    The database update takes several hours, but when I check the logs, I see tons of entries like

    2016-11-19 21:49:28,684 [P3548/D3/T64] INFO umbraco.content - Save Xml to file...
    2016-11-19 21:50:40,207 [P3548/D3/T64] INFO umbraco.content - Saved Xml to file.

    which keep appearing long after the db operation has completed. I presume that the system tries to rebuild the XML cache after each "save and publish".

    What I want to do is turn off the xml cache rebuild at the start of the operation and then turn it back on when the operation is finished or abruptly terminated.

    Is this even possible?

    Bonus: do you have any suggestions on how to handle such bulk updates?

    Many thanks!

  • Damiaan 442 posts 1301 karma points MVP 6x c-trib
    Nov 25, 2016 @ 15:23
    Damiaan
    1

    I would rather save and not publish, then "save and publish" the parent node and include the children.

    Kind regards Damiaan

  • onurar 11 posts 81 karma points
    Nov 27, 2016 @ 07:25
    onurar
    0

    Hello Damiaan. Thanks for your reply.

    Saving takes significant time, too. Have you ever tried such an approach with bulk updates and measured operation times?

    As far as I know, publishing an entire node including its children also publishes each child one by one, right? But it might serve my purpose of building the xml cache only once. I'll give it a try, thanks!

  • Steve Morgan 1346 posts 4453 karma points c-trib
    Nov 25, 2016 @ 15:50
    Steve Morgan
    2

    Hi,

    I can't answer your disable cache rebuild question (not programmatically but I'm sure you're aware it can be disabled in the config) - but have you tried only publishing / updating nodes that actually need updating? Are all 15k actually changing on each import? Publishing data is expensive in Umbraco - you're also going to be creating yourself a massive version history which is going to grow every time the job runs (hint: see ContentService.DeleteVersions, but this is likely to be expensive too!).

    If only a few nodes have changed then getting them from the cache first and checking the values before doing the expensive ContentService get, update and SaveAndPublish would massively help.

    The only other thing I could think of doing was to try only saving and then publishing the parent (including all children). In my quick test this took just as long - test code below if you're interested.

    One thing I did notice was disabling events seemed to have a marginal speed increase.

    // Approach 1: save and publish every child individually.
    var watch = System.Diagnostics.Stopwatch.StartNew();
    var cs = ApplicationContext.Current.Services.ContentService;
    
    var productsNode = cs.GetById(1116);
    
    foreach (var curNode in cs.GetChildren(1116))
    {
        // Change a property so there is something to save.
        curNode.SetValue("pageTitle", DateTime.Now.ToShortTimeString());
        cs.SaveAndPublishWithStatus(curNode, raiseEvents: true);
    }
    
    watch.Stop();
    var elapsedMs = watch.ElapsedMilliseconds;
    

    Versus publish with children...

    @{
    
        var watch = System.Diagnostics.Stopwatch.StartNew();
        var cs = ApplicationContext.Current.Services.ContentService;
    
        var productsNode = cs.GetById(1116);
    
        foreach (var curNode in cs.GetChildren(1116))
        {
            curNode.SetValue("pageTitle", DateTime.Now.ToShortTimeString());
            cs.Save(curNode);
        }
    
        // do one big save and publish of the parent including children at the end
        cs.PublishWithChildrenWithStatus(productsNode);
    
        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
    
    }
    

    Anyone else know of a better way of doing this?

    Steve

  • onurar 11 posts 81 karma points
    Nov 27, 2016 @ 07:32
    onurar
    0

    Hi Steve, thanks for your effort for the reply. Much appreciated.

    So you measured the time required for saving and publishing each node versus saving each node and publishing the parent node, and found that both take about the same time, right? Please correct me if I misunderstood.

    You mentioned disabling events -- can you guide me or give me a link about how to do it? Can the events be temporarily disabled, to be enabled later?

    But among others, I'm mostly interested in getting the content from cache. How do you get the content from cache without invoking ContentService, i.e. database?

  • Dan Diplo 1554 posts 6205 karma points MVP 5x c-trib
    Nov 27, 2016 @ 12:05
    Dan Diplo
    1

    "I can't answer your disable cache rebuild question (not programmatically but I'm sure you're aware it can be disabled in the config) - but have you tried only publishing / updating nodes that actually need updating? Are all 15k actually changing on each import?"

    I had a similar issue, with a nightly import of 6,000 products where only a few actually changed. What I did was make an MD5 hash of all the product fields and store that in a property on the product, and then in the import I make a hash of the incoming product fields and compare it to the product to be updated - if the hash is the same it doesn't need to be updated.
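    A minimal sketch of the checksum idea described above. The class name, method names and separator character are illustrative assumptions and not code from this thread; it only shows one way to hash a set of fields and compare the result against a stored value.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: hashes the incoming fields so they can be compared
// with a hash previously stored in a property on the node.
public static class ProductChecksum
{
    // Joins the field values with a separator that is unlikely to occur in
    // the data, then returns the MD5 digest as a lowercase hex string.
    public static string Compute(params string[] fields)
    {
        // Treat null fields as empty so the hash stays stable.
        var normalized = Array.ConvertAll(fields, f => f ?? string.Empty);
        var joined = string.Join("\u001F", normalized);

        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(joined));
            var sb = new StringBuilder(digest.Length * 2);
            foreach (byte b in digest)
            {
                sb.Append(b.ToString("x2"));
            }
            return sb.ToString();
        }
    }

    // True when the incoming row differs from what was stored on the node.
    public static bool HasChanged(string storedHash, params string[] incomingFields)
    {
        return !string.Equals(storedHash, Compute(incomingFields), StringComparison.Ordinal);
    }
}
```

    The import would then skip any row for which HasChanged returns false, and write the new hash back to the node whenever it does update one.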

    I would agree generally, though, that the ContentService is slow and that it does seem every time you publish an item the XML cache is updated - it would be good to defer it.

  • onurar 11 posts 81 karma points
    Nov 27, 2016 @ 17:59
    onurar
    0

    Hello Dan Diplo,

    But how did you read the MD5 field of your product object without using ContentService?

  • Richard Hamilton 79 posts 169 karma points
    Jan 23, 2017 @ 09:45
    Richard Hamilton
    0

    :)

  • Dan Diplo 1554 posts 6205 karma points MVP 5x c-trib
    Nov 27, 2016 @ 19:48
    Dan Diplo
    1

    You can just read it as IPublishedContent, like you would in any front-end query (assuming it's published, of course). So, for instance, if all your products were under a page called Products you could go to that page and get all the descendants/ children - https://our.umbraco.org/documentation/Reference/Querying/IPublishedContent/Collections

  • Steve Morgan 1346 posts 4453 karma points c-trib
    Nov 28, 2016 @ 08:57
    Steve Morgan
    1

    Hi,

    The raise events flag was included in my code but I think this is a red herring for you as it won't fix your problem. Instead focus on what Dan's saying.

    As I hinted at, and as Dan has done a much better job of explaining, you need to get each node via the IPublishedContent cache (e.g. Umbraco.TypedContent(1234)), check whether the contents have changed, and only use the ContentService if the content node needs updating. The slightly confusing thing is you'll be checking the data via IPublishedContent and then having to "get" the content node again (probably by ID) via the ContentService. This means a bit more code which will "feel" slower, but it will execute in seconds as the database is only hit for nodes that require updating rather than all 15k - and of course you won't have the expensive cache updates for every node.
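    A rough sketch of that cache-first check, assuming Umbraco 7 and the "pageTitle" alias from the test code earlier in the thread. The class name, method name and the id-to-title dictionary are hypothetical; it needs to run inside an Umbraco application, and it compares against the last published values because that is what the cache holds.

```csharp
using System.Collections.Generic;
using Umbraco.Core.Services;
using Umbraco.Web;

public static class CacheFirstUpdater
{
    // newTitles maps node id -> incoming title (hypothetical input shape).
    // Returns the number of nodes that were actually saved.
    public static int UpdateChangedTitles(
        UmbracoHelper umbraco,
        IContentService contentService,
        int parentId,
        IDictionary<int, string> newTitles)
    {
        int updated = 0;

        // Read the parent from the published cache, not the database.
        var parent = umbraco.TypedContent(parentId);
        if (parent == null)
        {
            return 0;
        }

        foreach (var cached in parent.Children)
        {
            string incoming;
            if (!newTitles.TryGetValue(cached.Id, out incoming))
            {
                continue; // no incoming data for this node
            }

            // Compare against the cached (published) value first.
            var current = cached.GetPropertyValue<string>("pageTitle");
            if (current == incoming)
            {
                continue; // unchanged, so the database is never touched
            }

            // Only now fetch the editable node through the ContentService.
            var node = contentService.GetById(cached.Id);
            node.SetValue("pageTitle", incoming);
            contentService.Save(node);
            updated++;
        }

        return updated;
    }
}
```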

    Hope I've explained myself better!

    Steve

  • onurar 11 posts 81 karma points
    Nov 30, 2016 @ 11:27
    onurar
    0

    Thanks to all the answers, this is the solution that I implemented:

    1) Implement a checksum mechanism using MD5 hashing, or check each property value for changes. If there are only a few property values, checksum generation might not be necessary. In my case, there are five fields but two of them are used for primary keys, so I only need to compare three fields, which is not a big deal.

    2) Traverse nodes on the cache, not via ContentService, i.e. use UmbracoHelper.TypedContent(id) instead of ContentService.GetChildren(id).

    (I don't use "frontend queries", I prefer to do data related tasks on controller side. Reason is that I am used to MVC architecture. If my approach is wrong in this case, please warn me.)

    3) Only change the node if the checksum of the node and the version in the external data do not match. (or if the property values and new data fields do not match)

    4) Get and save only the changed nodes via ContentService. Don't publish.

    5) Delete all nodes that do not exist in the new external dataset.

    6) Create nodes for the data that exist in the external dataset but not in the current database. Save these nodes, do not publish.

    7) If there was at least one changed/added node, publish parent with children.

    This ensures that:

    • The XML cache will only be rebuilt after the whole process has finished.
    • The database will only be hit for deletions, creations and updates, and only when necessary.
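
    The steps above can be sketched as a small planning routine that decides what to update, create and delete before anything is written. This is an illustrative sketch only: ImportPlan, its members, and the key-to-checksum dictionaries are assumptions and not the code actually used in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical result of comparing existing nodes with the incoming dataset.
public sealed class ImportPlan
{
    public List<string> ToUpdate { get; } = new List<string>();
    public List<string> ToCreate { get; } = new List<string>();
    public List<string> ToDelete { get; } = new List<string>();

    // Step 7: publish the parent only if something was changed or added.
    public bool NeedsPublish => ToUpdate.Count > 0 || ToCreate.Count > 0;

    // existing: primary key -> checksum read from the cached node
    // incoming: primary key -> checksum computed from the external row
    public static ImportPlan Build(
        IDictionary<string, string> existing,
        IDictionary<string, string> incoming)
    {
        var plan = new ImportPlan();

        foreach (var row in incoming)
        {
            string storedChecksum;
            if (!existing.TryGetValue(row.Key, out storedChecksum))
            {
                plan.ToCreate.Add(row.Key);   // step 6: new in the dataset
            }
            else if (!string.Equals(storedChecksum, row.Value, StringComparison.Ordinal))
            {
                plan.ToUpdate.Add(row.Key);   // steps 3 and 4: checksum differs
            }
        }

        // Step 5: nodes that no longer appear in the external dataset.
        plan.ToDelete.AddRange(existing.Keys.Where(k => !incoming.ContainsKey(k)));
        return plan;
    }
}
```

    Only the keys in ToUpdate, ToCreate and ToDelete would then go through the ContentService, followed by a single publish of the parent when NeedsPublish is true.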

    The only drawback I see in my tests is that "publish with children" takes a lot of time. Sometimes I get a "task cancelled" error (which I can only observe when I check the logs). But that might be related to my environment, too.

    I am currently testing this approach with the >15k dataset. I will update this post when I have my results.

    Thanks to everyone who contributed.

  • Steve Morgan 1346 posts 4453 karma points c-trib
    Nov 30, 2016 @ 11:43
    Steve Morgan
    0

    Hi,

    That sounds right to me - I would try with just a simple publish, as I believe the cache only dirties what it needs to, and you might find it's more stable than the publish-all-children method.

    Nice to hear you've hit on a solution!

    Kind regards

    Steve

  • onurar 11 posts 81 karma points
    Nov 30, 2016 @ 12:21
    onurar
    0

    Hi Steve,

    Would a simple publish be enough for the reflection of the saved (not published) child nodes?

    For example:

    Parent Node

    |

    |---child node (updated, saved)

    |---child node (new, saved)

    If I only publish the parent node, would the changes in the child nodes be reflected on the frontend, too? I don't think so, but correct me if I'm wrong.

  • Steve Morgan 1346 posts 4453 karma points c-trib
    Nov 30, 2016 @ 13:50
    Steve Morgan
    0

    Hi,

    No - it will only publish that node. What I meant is that if you're saving each child node as you go, I don't think it's that big a job, and you may as well just call SaveAndPublish on each one rather than do the just-as-expensive publish of the parent with all the children at the end.

    That was what my simple test showed but with 15k nodes you're in the best place to advise on what's quicker! :)

    Steve

  • onurar 11 posts 81 karma points
    Dec 15, 2016 @ 06:49
    onurar
    0

    Hello,

    It seems like saving changed nodes and then publishing the parent node with children is the best way to go. I didn't have the opportunity to measure times, though.

    Thank you for your help, Steve and everyone in this thread. There is no single post which is the sole solution, so I don't know which one to mark as the solution...

  • Richard Hamilton 79 posts 169 karma points
    Jan 23, 2017 @ 09:48
    Richard Hamilton
    0

    I don't see how this is any quicker - in fact it may take longer. You could also be publishing child nodes that were previously unpublished by the user and are not in the import?

  • onurar 11 posts 81 karma points
    Mar 09, 2017 @ 13:11
    onurar
    0

    Hi Richard,

    Yes, this might be the case, but in my case that node isn't being tampered with by content managers.
