Performance of bulk Content updates with ContentService
I am working on a custom integration piece that is creating and/or updating thousands of content nodes on a scheduled basis, and was wondering if there are any tricks to make it more performant.
Here is a snippet of my current code:
var uSvc = ApplicationContext.Services.ContentService;

foreach (var model in models) // THOUSANDS OF RECORDS
{
    // Check if it exists. If not, create the node
    var modelContent = parentNode.Children().FirstOrDefault(c => c.Name == model.Name);
    if (modelContent == null)
    {
        modelContent = uSvc.CreateContent(model.Name, parentNode, "BoatModel");
    }

    // Update some fields
    modelContent.SetValue("fieldAlias", model.Name);

    // Save
    uSvc.Save(modelContent);
}
I have tried using the Save() overload that accepts an IEnumerable, but it times out with large data sets and appears to just loop through the collection internally (it doesn't seem any more performant than calling the individual Save() in place).
I have also tried instantiating a new thread for each iteration, but that results in SQL deadlocks (on the parentNode.Children() line).
Are there any other recommendations?
Hi Keith,
Can you provide the full code?
Try to get the nodes via IPublishedContent and use ContentService only for saving.
Thanks
@Alex Unfortunately I cannot share all of the code; however, the code above is exactly what is taking HOURS to run.
Also, I don't think getting IPublishedContent nodes will be of much help, since every iteration will require a save (and as far as I know, I will always need an IContent and the ContentService to save).
Keith, IPublishedContent is much faster than IContent; if you only need to read data from a node, you can use it.
https://our.umbraco.org/forum/developers/api-questions/46631-Getting-Umbraco-Content-IPublishedContent-vs-IContent-vs-Node-vs-Document
IContent is managed by IContentService and is targeted at back-end, read-write access to content. It's what you want to use when you need to manage content.
IPublishedContent is managed by the "content cache" and is targeted at front-end, read-only access to content. It's what you want to use when rendering content.
Also, I can't understand why you iterate over the 'models' collection and don't use it at all inside the foreach. What is parentNode?
Hello Alex,
Yes, I understand that IPublishedContent is faster, since that's the way to pull cached content, but I need to update every node I'm iterating, and to do that I need an IContent reference anyway. I could certainly pull the content faster with IPublishedContent, but I'm not sure how that helps in my scenario.
"models" is a collection/list of objects I am retrieving from an external source. I need to take this data and update (or insert if it doesn't exist) the associated content node (i.e. "modelContent.SetValue("fieldAlias", model.Name);"). Right now this process takes MANY HOURS to complete and I am hoping there is a bulk insert/update alternative that I have overlooked.
Dear Keith, it's a hard task and I can imagine how much time you've spent on it :) But we have a few suggestions:
1) First of all, don't call parentNode.Children() inside the foreach.
2) Try the Save overload that takes a collection (Save(IEnumerable<IContent> items)): collect the nodes inside the foreach and save the whole list with a single Save call - see the rough sketch below.
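Something roughly like this (an untested sketch reusing the names from your snippet; the dictionary lookup is just one way to avoid the repeated Children() calls):

var contentService = ApplicationContext.Services.ContentService;

// Build the child lookup once, outside the loop (point 1).
var existingByName = parentNode.Children()
    .GroupBy(c => c.Name)
    .ToDictionary(g => g.Key, g => g.First());

var toSave = new List<IContent>();
foreach (var model in models)
{
    IContent modelContent;
    if (!existingByName.TryGetValue(model.Name, out modelContent))
    {
        modelContent = contentService.CreateContent(model.Name, parentNode, "BoatModel");
    }

    modelContent.SetValue("fieldAlias", model.Name);
    toSave.Add(modelContent);
}

// One call to the IEnumerable<IContent> overload instead of a Save per item (point 2).
contentService.Save(toSave);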
Thanks
Hello Alex,
1. That is a good point. I will see if I can get away with using IPublishedContent here instead; however, some of this content isn't actually published yet - won't I potentially end up with duplicates?
2. According to the source code, it simply loops through each IContent in the collection and performs the same tasks as the single-item Save(IContent) method, resulting in the same number of round trips (and therefore no performance gain). See https://github.com/umbraco/Umbraco-CMS/blob/dev-v7/src/Umbraco.Core/Services/ContentService.cs#L915 . Also, as I mentioned above, I did try that method, and with more than a few hundred IContent objects in the collection I was getting strange timeout issues. At least if I do the individual saves, I can tell exactly which record timed out and can code in some sort of "continue where we left off" functionality.
Thank you for your help, but it seems like there isn't much room for improvement from what I already have :(
Hi Keith,
Did you work out any improvements in your code?
I'm doing the exact same task as you and I'm getting random out-of-memory issues.
Are you experiencing the same problem, or just timeout errors?
For CMSImport V3 http://soetemansoftware.nl/cmsimport I had the same issue. What I did was first check whether a property has changed, and only then save the item; otherwise I skip it.
Maybe you can implement something similar?
Best,
Richard
Thanks Richard,
Yes, I've started to go down that same path. I'm using IPublishedContent to query and check the properties, and only loading the IContent if there is a change. It seems to be working much better, but I can see that I might run into duplicates if a node is unpublished, which could be a problem as my script will run weekly over a lot of nodes. I'll just have to see how it goes.
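Roughly the shape of it (simplified; how you get hold of umbracoHelper, contentService and nodeId depends on your setup, and "fieldAlias" is just the example field from Keith's snippet):

// Read the cached node first - no database round trip.
var published = umbracoHelper.TypedContent(nodeId);
var currentValue = published == null ? null : published.GetPropertyValue<string>("fieldAlias");

// Only fall back to ContentService when something actually changed (or the node is missing).
if (published == null || currentValue != model.Name)
{
    var content = contentService.GetById(nodeId); // or CreateContent(...) if it doesn't exist yet
    content.SetValue("fieldAlias", model.Name);
    contentService.Save(content);
}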
Cheers,
Dave
Hi David,
I don't know if this could be a solution, but here's a spontaneous idea:
You can use the ContentService.Unpublished event to write the unpublished node IDs to a custom table or a cache file. Conversely, you can use the ContentService.Published event to remove the node IDs from that custom table or cache file. That way you can proceed with IPublishedContent as described. At the end of your weekly script you can read the unpublished nodes from your custom table or cache file and use ContentService only for those unpublished nodes.
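In rough, untested code it could look something like this (assumes Umbraco 7's static ContentService events; TrackingTable is a hypothetical helper around your custom table or cache file, and the exact event names and args may differ per version):

// Wire this up in an ApplicationEventHandler (e.g. in ApplicationStarted).
ContentService.UnPublished += (sender, args) =>
{
    foreach (var node in args.PublishedEntities)
        TrackingTable.AddUnpublishedId(node.Id);
};

ContentService.Published += (sender, args) =>
{
    foreach (var node in args.PublishedEntities)
        TrackingTable.RemoveUnpublishedId(node.Id);
};

// In the weekly script: handle published nodes via IPublishedContent as before,
// then use ContentService only for the tracked unpublished ids.
foreach (var id in TrackingTable.GetUnpublishedIds())
{
    var content = contentService.GetById(id);
    // ...compare and update as needed...
}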
Best,
Sören
Thanks Sören,
That may be a good idea. A spontaneous addition to that: store all IDs (whether published or unpublished) together with a hash of the combined properties. That way I can do a quick lookup to see whether anything has changed.
I'm not sure if creating the hash for each node will be too slow to be worth it, though.
If I run into problems I might give it a go.
@David I didn't run into any memory issues on my end, but feel free to take a look at my forked version of the repo on Github - I was able to improve bulk insert performance about 6-fold.
Thanks Keith... Sorry if this is a silly question (I'm a noob with GitHub), but how do I find your forked code? I'd be really interested in seeing what you've done.
@David I apologize. My forked code was for performance updates to the Merchello plugin - I didn't dare touch the Umbraco base. For some reason I thought this was the Merchello topic :/
Hey everyone, I know this is an old topic now, but I stumbled across the thread and thought I'd share how my project is going after a year in production. Hopefully it helps others before they go down the same path.
Lesson learnt #1 - Don't try to store a large set of data as content nodes, especially if the data is updated frequently.
Lesson learnt #2 - Frequent updates will blow out the database, because the version history behind the rollback feature grows dramatically. Matt Brailsford's UnVersion plugin saved me here!
To speed things up, I hook into the Saved, Deleted and Trashed events and update/delete a row in a custom database table that holds the content ID, a unique identifier to match the external imported data, a calculated hash of all the editable fields, and a last-modified date.
When the weekly script runs to make the updates, I calculate a hash of the new data and compare it with the hash stored in the database.
If it has changed, I go ahead and load the content node and save the changes (which fires the Saved event and updates the hash in the DB).
This doesn't help with speed if you know that you need to update every node, but it does help when you have a smaller subset to update.
My project is still running well, but the import is still slow when there are a lot of updates.
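For anyone curious, the change check boils down to something like this (simplified; ExternalModel, ImportTracker and the hashed fields are stand-ins for my actual helpers):

private static string ComputeHash(ExternalModel model)
{
    // Hash the editable fields - same fields, same order, as when the node was last saved.
    var raw = string.Join("|", model.Name, model.Description, model.Price);
    using (var md5 = System.Security.Cryptography.MD5.Create())
    {
        var bytes = md5.ComputeHash(System.Text.Encoding.UTF8.GetBytes(raw));
        return BitConverter.ToString(bytes);
    }
}

private void RunWeeklyImport(IContentService contentService, IEnumerable<ExternalModel> models, int parentId)
{
    foreach (var model in models)
    {
        var tracked = ImportTracker.GetByExternalId(model.ExternalId); // custom table lookup
        var newHash = ComputeHash(model);

        if (tracked != null && tracked.Hash == newHash)
            continue; // nothing changed - skip the expensive load and Save

        var content = tracked == null
            ? contentService.CreateContent(model.Name, parentId, "BoatModel")
            : contentService.GetById(tracked.ContentId);

        content.SetValue("fieldAlias", model.Name);
        contentService.Save(content); // the Saved event handler then refreshes the hash row
    }
}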
So to sum up:
Don't use content nodes for large amounts of data that's regularly updated. You are better off keeping the data in a separate data table and managing the display of the data via custom routes.
Thanks David for sharing, great lessons.