Hi,
I have a solution where I'm importing a lot of products from various webshops every night. I chose Umbraco to build this, since I feel fairly comfortable using it. Now I'm not so sure that this was the right choice any more. I find myself making up weird structures for handling basic tasks.
So... importing around 5000 products every night. I'm using the ContentService for this. When inserting a product (it becomes a node) I first check whether the node exists. I found that this was not possible using the ContentService (searching?), so I needed to build a work-around: I save each product in the database with a reference to the node I then create. That means maintaining a separate registry table in my database with the product SKU and the node ID, which I can check against instead of searching. It seems like an unnecessary work-around for something as simple as a search, but fair enough...
Inserting 5000 products as nodes takes around 20 minutes. I can live with that - but 20 minutes for only 5000 nodes still seems like a long time.
Now comes the worst part... cleaning my node tree. Sometimes I need to reset everything and start over. This means that I delete the entire tree (I have a "Shop" root node with ID 1079, and I delete all descendants below it). This operation takes 2-4 hours.
This is what is killing it for me - when I look at the SQL being emitted to the database, I see that every delete executes 5-6 statements, along with updates and inserts. Fair enough. But 2-4 hours to delete 5000 products is a bit too much - it is something that would ruin us if we were running on Azure, and it would almost certainly crash any mid-range shared web host (CPU and memory are being pushed hard here).
We are at a tipping point now, where we are considering building everything from scratch using plain ASP.NET MVC, which would probably reduce the insert to 1-2 minutes and the delete to 30 seconds.
This is the code for deleting:
public string ClearData(int id)
{
    DateTime start = DateTime.Now;
    var contentService = ApplicationContext.Services.ContentService;

    // Emptying the recycle bin first makes the deletes that follow faster.
    contentService.EmptyRecycleBin();

    // Permanently delete every node below the given root, one at a time.
    IEnumerable<IContent> descendants = contentService.GetDescendants(id);
    foreach (var child in descendants)
    {
        contentService.Delete(child);
    }

    DateTime end = DateTime.Now;
    TimeSpan duration = end - start;
    return string.Format("<h1>Deleting</h1><b>Start:</b> {0}<br/><b>End:</b> {1}<br/><b>Duration:</b> {2}<br/><br/>Done...", start, end, duration);
}
Are there any recommendations as to what I could improve? I really want to make this work in Umbraco, because it is easy to manage afterwards.
Maybe I'm just trying to squeeze a square peg into a triangular hole - maybe I'm trying to twist Umbraco into something it's not designed for?
Also, I found that emptying the recycle bin before doing this increases the speed of the deletion.
Please note that this is not a rant - I love working with Umbraco - this solution, however, might just be plain wrong?
P.S. The reason all the products are nodes is that we can then use the fast caching, and it makes it easy to edit titles and descriptions for all the products. The alternative would be to build a service to serve this up in the back office, and then build our own editors. But then we are close to building a system from scratch.
In my case, I was calling MoveToRecycleBin, but same idea. One solution would be for the Umbraco core to be modified to support a bulk version of these operations. Maybe the core team would be open to a pull request?
Basically, everything is stored in a database table, and UI-O-Matic gives you a CRUD UI to view/search/edit the data.
You could also consider bypassing the content service and performing SQL statements directly. Pretty sure a member of the Umbraco core team is going to jump in right here and recommend strongly against that, but then they are also not going to have a good way of interacting with the content service quickly. One downside (there are more) of this approach is that you will then have to rebuild the XML cache and Examine indexes after you are done.
Yes, working with a large number (>100) of nodes is very slow. Bulk operations would be great. Maybe a pair of calls, BeginUpdate() before and EndUpdate() once I'm done with the tree, so that the cache is rebuilt in bulk at the end. When doing this against my local database, my CPU is almost maxed out and my memory consumption is also through the roof (for these "trivial" operations).
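To make the BeginUpdate()/EndUpdate() idea concrete, here is a rough sketch of what such a batching scope could look like. This is purely hypothetical: Umbraco's ContentService does not offer these calls, and BatchScope, MarkChanged and the rebuild callback are names made up for illustration. The only point is that the expensive step runs once per batch instead of once per node.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only: not part of Umbraco.
// Collects changed node ids and runs one expensive callback when the scope ends.
public sealed class BatchScope : IDisposable
{
    private readonly Action<IReadOnlyList<int>> _onCompleted;
    private readonly List<int> _changedIds = new List<int>();
    private bool _disposed;

    public BatchScope(Action<IReadOnlyList<int>> onCompleted)
    {
        if (onCompleted == null) throw new ArgumentNullException("onCompleted");
        _onCompleted = onCompleted;
    }

    // Record a change without triggering any cache work yet.
    public void MarkChanged(int nodeId)
    {
        _changedIds.Add(nodeId);
    }

    // "EndUpdate": fire the callback exactly once with everything that changed.
    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        _onCompleted(_changedIds);
    }
}

public static class Program
{
    public static void Main()
    {
        int rebuilds = 0;

        // "BeginUpdate": open the scope, do the work, rebuild once on exit.
        using (var batch = new BatchScope(ids => rebuilds++))
        {
            for (int id = 1; id <= 5000; id++)
            {
                batch.MarkChanged(id);
            }
        }

        Console.WriteLine("Cache rebuilds performed: " + rebuilds);
    }
}
```

Whether something like this could be retrofitted onto the real content service is a question for the core team; the sketch only shows the shape of the API being asked for.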
Right now, I really don't see any alternative other than not storing large amounts of data as nodes, but rather as custom objects, and then having an editor for them... But then I won't be able to benefit from the cache (?)
Brian,
I've created quite a few Umbraco sites that pull their content from other sources and then create nodes in Umbraco (including one that imports 6000 products every night and others that import hundreds of products every few hours). Whilst the ContentService can be slow for bulk operations, there are a few things I'm sure you can do:
"When inserting a product (it becomes a node) i first checks to see if
the node exists. I found that this was not possible using the
ContentService (searching?) so i needed to build a work-around..."
Presumably the item you are searching for is already published, right? So you can search the published content cache instead. Use UmbracoHelper to get an instance of the root page under which your products are created, and then search its descendants using LINQ. Very fast, i.e.
// Just make one instance of these and re-use
UmbracoHelper umbHelper = new UmbracoHelper(UmbracoContext);
var productRoot = umbHelper.TypedContent(1234); // 1234 is Id of your root page
var productToFind = productRoot.Descendants().FirstOrDefault(p => p.HasProperty("sku") && p.GetPropertyValue<string>("sku") == "yourCodeToCheck");
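One thing to watch with the lookup above: running FirstOrDefault over all descendants once per imported product means scanning the whole tree 5000 times. A possible refinement (my assumption, not something from the thread) is to walk the descendants once and build a SKU-to-node-id dictionary. The sketch below uses a hypothetical ProductNode class as a stand-in for the published content items so that it runs on its own; in a real import you would fill it from productRoot.Descendants().

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a published product node (not an Umbraco type).
public sealed class ProductNode
{
    public int Id { get; set; }
    public string Sku { get; set; }
}

public static class SkuIndex
{
    // Builds a SKU -> node id lookup in a single pass.
    // Nodes without a SKU are skipped; if a SKU occurs twice, the first node wins.
    public static Dictionary<string, int> Build(IEnumerable<ProductNode> nodes)
    {
        var index = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var node in nodes)
        {
            if (string.IsNullOrWhiteSpace(node.Sku)) continue;
            if (!index.ContainsKey(node.Sku))
            {
                index.Add(node.Sku, node.Id);
            }
        }
        return index;
    }
}

public static class Program
{
    public static void Main()
    {
        var nodes = new List<ProductNode>
        {
            new ProductNode { Id = 2001, Sku = "ABC-1" },
            new ProductNode { Id = 2002, Sku = "ABC-2" },
            new ProductNode { Id = 2003, Sku = null }
        };

        Dictionary<string, int> index = SkuIndex.Build(nodes);

        // Each existence check is now a dictionary lookup instead of a tree scan.
        int nodeId;
        if (index.TryGetValue("ABC-2", out nodeId))
        {
            Console.WriteLine("Existing node id: " + nodeId);
        }
        else
        {
            Console.WriteLine("Product is new");
        }
    }
}
```

This would also replace the separate SKU registry table, since the index is rebuilt from the published nodes at the start of each import run.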
Second, presumably all the products don't actually change every night, do they? I'm guessing only a few actually change. So rather than deleting everything and then importing everything from scratch, be a little smarter in your code. You can use a query to find the products that are NOT in the current set and delete those - there are probably not many. The rest MAY have been updated, but many will not have changed.
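As a rough sketch of that "what is not in the current set" query: if you have the SKUs of the existing nodes and the SKUs in tonight's feed, LINQ's Except gives you both the products to delete and the ones to create. The class and method names here (ImportDiff, FindRemoved, FindAdded) are made up for the example, and comparing SKUs case-insensitively is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ImportDiff
{
    // SKUs that exist as nodes but are missing from the feed: candidates for deletion.
    public static List<string> FindRemoved(IEnumerable<string> existingSkus, IEnumerable<string> feedSkus)
    {
        return existingSkus.Except(feedSkus, StringComparer.OrdinalIgnoreCase).ToList();
    }

    // SKUs that are in the feed but have no node yet: candidates for creation.
    public static List<string> FindAdded(IEnumerable<string> existingSkus, IEnumerable<string> feedSkus)
    {
        return feedSkus.Except(existingSkus, StringComparer.OrdinalIgnoreCase).ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var existing = new[] { "ABC-1", "ABC-2", "ABC-3" };
        var feed = new[] { "ABC-2", "ABC-3", "ABC-4" };

        Console.WriteLine("Delete: " + string.Join(", ", ImportDiff.FindRemoved(existing, feed)));
        Console.WriteLine("Create: " + string.Join(", ", ImportDiff.FindAdded(existing, feed)));
    }
}
```

Only the SKUs returned by FindRemoved need a ContentService delete, which should be a handful per night rather than the whole tree.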
So what I do in this circumstance is create an MD5 hash of all the product properties and store it against the product node. Then, when I come to import products, I generate an MD5 hash of each incoming product and compare it against the hash stored on the existing node - if they are the same, nothing has changed, so you don't need to update. That way you only update products that have actually changed, which will be much quicker.
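A minimal sketch of that hash comparison, assuming the product properties are available as simple key/value strings. ProductHash, Compute and HasChanged are illustrative names, and the way the properties are joined into one string is just one possible choice; the stored hash would live in a property on the product node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class ProductHash
{
    // Computes an MD5 fingerprint over all properties.
    // Keys are sorted first so that property order does not affect the result.
    public static string Compute(IDictionary<string, string> properties)
    {
        string canonical = string.Join("\n", properties
            .OrderBy(p => p.Key, StringComparer.Ordinal)
            .Select(p => p.Key + "=" + (p.Value ?? string.Empty)));

        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(canonical));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when the incoming product differs from what was hashed at the last import.
    public static bool HasChanged(string storedHash, IDictionary<string, string> incoming)
    {
        return !string.Equals(storedHash, Compute(incoming), StringComparison.OrdinalIgnoreCase);
    }
}

public static class Program
{
    public static void Main()
    {
        var lastNight = new Dictionary<string, string> { { "sku", "ABC-1" }, { "price", "99.00" } };
        var tonight = new Dictionary<string, string> { { "sku", "ABC-1" }, { "price", "89.00" } };

        // In a real import this value would be read from the existing node.
        string storedHash = ProductHash.Compute(lastNight);

        Console.WriteLine("Unchanged product needs update: " + ProductHash.HasChanged(storedHash, lastNight));
        Console.WriteLine("Repriced product needs update: " + ProductHash.HasChanged(storedHash, tonight));
    }
}
```

MD5 is used here only as a cheap change fingerprint, as described above, not for anything security-related.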
Thirdly, if you really want to delete all products, then presumably they all have the same doctype, right? So you can use the DeleteContentOfType() method on ContentService for this. Just pass in the ID of the doc type (which you can find in the back office) and it will delete all instances.
Thank you for your answer. Yes, it is exactly as you describe. The only times I need to delete are:
When a product was removed from the source, but is still in my tree
When I reset the database (rarely), and this is mostly for development purposes
Otherwise I do exactly as you describe. I will try to look into searching, like you described. I only update the price on my products; everything else I need to be able to keep as custom/changed properties.
Deleting using the ContentService takes forever
I have seen similar: http://issues.umbraco.org/issue/U4-6042
You might consider using UI-O-Matic as an alternative: https://our.umbraco.org/projects/developer-tools/ui-o-matic/
You can always use your own cache. For example, I tend to use this (for some scenarios, Umbraco's cache is slow): https://github.com/rhythmagency/rhythm.caching.core
UI-O-Matic has events you can hook into to update your cache.