Scalability and Performance
I'm sure Umbraco scales well, having stood the test of time, but I'm interested to know why (or whether) the node hierarchy doesn't suffer performance penalties with large amounts of content in the tree.
The database schema uses a parentId column with a non-clustered index to store the hierarchy. I've seen this method before, and it has always suffered from bad performance, particularly for deeply nested trees. Is there a trick being used in Umbraco to avoid this issue? Are content trees not typically deep enough/big enough for this to be a problem?
If users have experienced problems with this, I wouldn't mind contributing some alternative ways of storing hierarchy relations, as I've had some success dealing with this in legacy systems in the past.
When you publish content, it gets stored to an XML cache file in "App_Data\umbraco.config" (though that path is configurable, so it may differ in some installations).
Typically, that is where content comes from when rendering pages (rather than hitting the database). Also, the content service is a further abstraction that allows for content to be stored in memory, which offers a further speed improvement.
Correction, the content service hits the database. It's the UmbracoHelper class that gets data from the XML file.
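To make the distinction concrete, here's a rough sketch (Umbraco 7 APIs, with a made-up node id) of the two access paths:

```csharp
using Umbraco.Core;
using Umbraco.Web;

public class ContentAccessExample
{
    public void Compare(UmbracoHelper umbracoHelper)
    {
        // Front-end path: UmbracoHelper reads from the in-memory/XML cache, no database hit.
        var cachedNode = umbracoHelper.TypedContent(1234);
        var title = cachedNode == null ? null : cachedNode.GetPropertyValue<string>("title");

        // Service path: the ContentService goes to the database.
        var contentService = ApplicationContext.Current.Services.ContentService;
        var dbNode = contentService.GetById(1234);
    }
}
```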
So if there is a large amount of content, it's all stored in memory? I'm assuming media and large binary data isn't stored in memory, though?
I saw a couple of posts on the forum regarding too much memory being used. Is this likely due to incorrect usage of Umbraco?
Yep, all content [in the content tree] is stored in memory (that includes content like rich text and booleans). Large binary data (media) is stored on the file system (i.e., it's not loaded into memory, aside from when IIS serves it up to the client).
I have posted about "too much" memory being used myself, actually (see https://our.umbraco.org/forum/core/general/61448-Memory-Climbs-2GB-in-11-Hours-in-Umbraco-718 ). Turns out it's just because IIS intentionally avoids doing garbage collection when it doesn't absolutely have to (from what I read, this is for performance reasons). So it's not really anything to do with Umbraco.
For what it's worth, I remember hearing somebody from the Umbraco core team saying that Umbraco should be able to scale up to around 250,000 content nodes. No idea where they got that from, but it should at least be a good indicator of the order of magnitude of data Umbraco can handle.
How big will the umbraco.config be with 250,000 nodes?
From estimates I've done on the data that might get added to our website, we'll have 120,000 nodes and our umbraco.config might get to 1.7GB.
Will it survive?
Giorgio, it depends on your site. If it has good caching and good code, it's not a problem. Try using [OutputCache(Duration = 60)], and it will be faster and cheaper for the server.
http://stefantsov.com/2014/march/umbraco-7-mvc-performance#.VSqVZvmsXs4
Another great helper is @Html.CachedPartial(); we use it often, and it can be dynamic and very flexible.
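For example, something along these lines (a rough sketch; the controller name is a placeholder and the cache duration is just an illustration):

```csharp
using System.Web.Mvc;
using Umbraco.Web.Models;
using Umbraco.Web.Mvc;

// Route-hijacking controller for a document type; output is cached for 60 seconds,
// so repeat requests within that window skip the rendering pipeline entirely.
public class HomePageController : RenderMvcController
{
    [OutputCache(Duration = 60, VaryByParam = "none")]
    public override ActionResult Index(RenderModel model)
    {
        return base.Index(model);
    }
}
```

And in a view, something like @Html.CachedPartial("MainNavigation", Model, 300) caches just that partial for five minutes (the partial name here is made up).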
Thanks
I have never worked on a site with that many nodes, so I wouldn't be able to guess what problems you might run into. You could always test it if you like (e.g., by using the ContentService to create a bunch of nodes and see what happens).
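If you want to try that, a rough sketch along these lines would do it (Umbraco 7 ContentService; the "testPage" alias and "title" property are hypothetical and would need to match a real document type in your install):

```csharp
using Umbraco.Core;
using Umbraco.Core.Services;

public class NodeSeeder
{
    public void CreateTestNodes(int parentId, int count)
    {
        IContentService contentService = ApplicationContext.Current.Services.ContentService;

        for (var i = 0; i < count; i++)
        {
            // Create and publish a simple page under the given parent node.
            var node = contentService.CreateContent("Test page " + i, parentId, "testPage");
            node.SetValue("title", "Test page " + i); // assumes the doc type has a "title" property
            contentService.SaveAndPublishWithStatus(node);
        }
    }
}
```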
If you do have that many nodes, it's probably time to start thinking about storing some of that content in a database rather than in Umbraco. For example, at a past job, we stored all of our "product" information in a database (with tens of thousands of products and multiple languages and regions, products accounted for the vast majority of pages on the site).
Something to keep in mind with regard to a site with that many nodes and that large an umbraco.config file is that the umbraco.config file gets regenerated on each publish operation. Having to write 1.7GB to disk on every publish seems like it'd be a very expensive operation. One way to get around that (which I haven't tried) is to set "ContinouslyUpdateXmlDiskCache" to false in umbracoSettings.config: https://our.umbraco.org/documentation/Using-Umbraco/Config-files/umbracoSettings/
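For reference, that setting lives in the <content> section of config/umbracoSettings.config; the change looks roughly like this (note the setting name really is spelled "Continously"):

```xml
<!-- config/umbracoSettings.config - only the relevant fragment shown -->
<content>
  <!-- When false, Umbraco no longer rewrites the umbraco.config disk cache on every publish. -->
  <ContinouslyUpdateXmlDiskCache>false</ContinouslyUpdateXmlDiskCache>
</content>
```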
Alex, we already have all of those caching mechanisms in place, but I'm worried about eventually having to parse the umbraco.config.
Nicholas, one of the dilemmas we're facing is the trade-off between parsing a large umbraco.config and creating connections to the database.
From your experience, was creating the connections an expensive and time-consuming task, or a relatively swift one?
If I may ask, what kind of database were you using? And was it local or located on another server?
In my case, the database was the "cheaper" option (in terms of time to implement) because all of our data was already in a database (SQL Server). The database was on a different server in the same network. Keep in mind this product database was entirely separate from the Umbraco database. The database didn't significantly slow down page load times.
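In case it helps picture it, the lookup side of that can be plain ADO.NET against the separate database; a rough sketch (the connection string name and table are made up):

```csharp
using System.Collections.Generic;
using System.Configuration;
using System.Data.SqlClient;

public class ProductRepository
{
    public List<string> GetProductNames()
    {
        // Separate connection string for the product database, not the Umbraco one.
        var connectionString = ConfigurationManager.ConnectionStrings["productDb"].ConnectionString;
        var names = new List<string>();

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT Name FROM Products", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    names.Add(reader.GetString(0));
                }
            }
        }

        return names;
    }
}
```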
The tricky bit would be to create an interface in Umbraco to edit that data. We already had a tool to edit product data, so we didn't need to build anything in Umbraco.
With a project of the size you are talking about, however, I'd expect you would have the budget to build such an interface.
For minor custom database management, DEWD is an excellent package: https://our.umbraco.org/projects/developer-tools/dewd
Easy to create overviews + edit data.
Note: it's only for v6 right now. The developer seems to be inactive.
Is it necessary to keep umbraco.config when you consider how fast data can be delivered from a database?
Data extracted from databases (e.g. events, press releases, product databases) can be cached using .NET controls, and even caching for a small time period can improve speed.
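For instance, a short-lived cache can be as simple as this sketch using System.Runtime.Caching (the cache key, duration, and loader delegate are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Caching;

public static class ShortTermCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static List<string> GetPressReleases(Func<List<string>> loadFromDatabase)
    {
        var cached = Cache.Get("press-releases") as List<string>;
        if (cached != null)
        {
            return cached;
        }

        // Cache miss: hit the database once, then keep the result for five minutes.
        var items = loadFromDatabase();
        Cache.Set("press-releases", items, DateTimeOffset.Now.AddMinutes(5));
        return items;
    }
}
```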
Our existing CMS solution uses SQL Server and we haven't had any performance issues when getting web content data from the database. (I hope I haven't jinxed everything!)
It just seems like unnecessary duplication to me.
DEWD looks neat; thanks for that.
Umbraco has seemed really slow to me when accessing content directly from the database. However, that may have been on something like Azure, where the database was on a different machine. If the database is on the same machine, it might be fast.
Hi Nicholas,
Have you tried using Examine? If you create your own index, you won't need to get data from the database or umbraco.config. Take a look at how https://our.umbraco.org/projects/website-utilities/ezsearch works. You can do the search and traversal logic in Lucene, and then render from the XML cache or the database; it will be faster.
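A rough sketch of that kind of query against the default external index (Umbraco 7 / Examine; the searcher name and field are just the defaults, so adjust them to your own index):

```csharp
using Examine;

public class ExamineSearchExample
{
    public ISearchResults Search(string term)
    {
        // Query the Lucene index directly instead of walking the content tree.
        var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];

        var criteria = searcher.CreateSearchCriteria();
        var query = criteria.Field("nodeName", term).Compile();
        return searcher.Search(query);
    }
}
```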
Thanks, Alex
I haven't used Examine for the purpose of getting content (I've only used it for general site searches), though it might work well for scenarios that require large amounts of data and fast performance (at the cost of the extra code and configuration it requires). Good tip, thanks.
I'm in the process of migrating our old site to Umbraco.
The existing site has about 6,500 pages (the site belongs to a local authority) and after my first import, the umbraco.config file grew to about 29.5MB.
I killed the back-office after the initial import because every page was under the root.
After phase 2, i.e. moving pages back into the tree-structure that came with the old CMS, the back-office is back to normal.
It's early days, but I am already looking at ideas to improve performance.
For instance, after the import, I had to move the site homepage to the root of the site, which in turn seems to have dragged everything else along with it. This one move has taken 40 minutes and is still running, updating the umbraco.config file as it goes (although this will be a once-off move).
I'll look at changing the setting for 'ContinouslyUpdateXmlDiskCache' before my next run.
Hi MuirisOG,
Yes, I think you have to disable ContinouslyUpdateXmlDiskCache. It's a good feature for small sites, but not for one like yours. What version of Umbraco are you using? Could you share more of your settings? Do you have an SSD in the server?
Thanks
At the moment, I'm using
I'm not sure what you mean by SSD.
An SSD is a type of hard drive that is very fast: http://en.wikipedia.org/wiki/Solid-state_drive
By the way, if you are manipulating a lot of content programmatically, you may also consider disabling the Examine indexing, then re-enabling it and performing a reindex when you're all done. Keep in mind that media may not work while the index is disabled.
If I remember correctly, the umbraco.config file is only used for filling the in-memory cache; it is not used for serving up the front-end. Having it does slow down publish times on very large sites. You can disable it by setting ContinouslyUpdateXmlDiskCache to false.
I think the trade-off is slower load times after an app pool restart, since Umbraco hits the database instead of the XML file to replenish the cache.