The need for non cached content?
I've been contemplating a PR for Umbraco - and it is something I'll need to spend quite a lot of time on.
I wanted to float it here, to see if people think it is a useful core addition (especially those who'd review/accept the PR).
Umbraco starts up and loads all published content into memory. This is a good thing because the Umbraco presentation APIs work over this and it is super fast to render out site navigation, news listings etc. - without the need to write any of your own caching.
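For illustration only, this is the sort of Umbraco 7-style Razor view that is served entirely from that in-memory cache (the markup and the visibility check are just placeholders):

```cshtml
@inherits Umbraco.Web.Mvc.UmbracoTemplatePage
@using System.Linq
@using Umbraco.Web
@* Navigation rendered from the in-memory published content cache -
   no database call is needed per request for this. *@
<ul class="nav">
    @foreach (var page in Model.Content.Site().Children.Where(x => x.IsVisible()))
    {
        <li><a href="@page.Url">@page.Name</a></li>
    }
</ul>
```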
Which works really well 90% of the time, but...
We are inheriting more and more sites with tens of thousands of news items dating back a decade or more. These all get loaded into memory at startup - costing valuable seconds of boot time and some gigabytes of memory.
The stats and analytics show that some news items from back in 2003 are accessed maybe once a month, but they are still valuable content. In one specific example we have a subscription site where people pay to access years' worth of archive, and they may want to search for a very specific article from a few years back.
So what is the issue?
The cloud hosting model charges us based upon the resources we use (including memory). Our ideal application will boot fast, utilise as little resource as possible and add additional web servers only when they are required (and when these are added, they should spin up quickly and come into play).
Loading all content (and indexing with Examine) at startup can be slow - and consume a lot of memory.
What is the solution?
I think that being able to configure certain content types so that they are not placed in the cache, and are only loaded when needed, could address the issues above.
You'd cache everything you needed to build the site header, footer and navigation - but the news would just exist in the Umbraco database.
All of the hooks and events that you need to store it elsewhere are present in Umbraco already.
Issues?
Non-cached content can't be accessed by the standard Umbraco API - you'd have to query it via some form of search - but the whole pull request would be a piece of configuration - no behaviour would change by default.
The developer/architect can just select their non-cached content types - and then choose how to store and access them.
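As a hedged illustration (the searcher name is the Umbraco 7 default, the field and alias values are placeholders), the two obvious ways to reach a non-cached item would be an Examine search or a direct ContentService lookup against the database:

```csharp
using Examine;
using Umbraco.Core;
using Umbraco.Core.Models;

public class ArchiveLookup
{
    // Find archived articles via the (default) external Examine searcher.
    public ISearchResults SearchArchive(string term)
    {
        var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
        var criteria = searcher.CreateSearchCriteria();
        var query = criteria.Field("nodeName", term).Compile();
        return searcher.Search(query);
    }

    // Load a single archived item straight from the database,
    // bypassing the published content cache entirely.
    public IContent LoadFromDatabase(int id)
    {
        return ApplicationContext.Current.Services.ContentService.GetById(id);
    }
}
```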
The outcome?
Super fast startup every time - low memory footprint, spin up new instances in seconds :)
There is more!
In order for this all to work, I think I need to complete another side project that allows Examine indexes to be written to Azure Search - or to swap Examine out for another search provider - at present, a new "cloud" web server starting up will build its own index, which can take some time.
You can already exclude content types from Examine indexes, but you wouldn't want to do that for the back office index (Internal Indexer).
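For reference, one way that existing exclusion can be done programmatically in Umbraco 7 is via the indexer's GatheringNodeData event - a rough sketch, using a hypothetical "newsItem" alias and deliberately leaving the internal (back office) index alone:

```csharp
using Examine;
using Umbraco.Core;
using UmbracoExamine;

public class ExamineExclusionEvents : ApplicationEventHandler
{
    protected override void ApplicationStarted(UmbracoApplicationBase umbracoApplication,
        ApplicationContext applicationContext)
    {
        // Only touch the external index; the back office "InternalIndexer" is left alone.
        var external = ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"];
        external.GatheringNodeData += (sender, e) =>
        {
            // "newsItem" is a placeholder document type alias.
            if (e.IndexType == IndexTypes.Content
                && e.Fields.ContainsKey("nodeTypeAlias")
                && e.Fields["nodeTypeAlias"] == "newsItem")
            {
                e.Cancel = true; // keep this node out of the index
            }
        };
    }
}
```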
Feedback appreciated!
Being able to exclude some content from the cache has been discussed many times, and is probably something we do want to have.
It should not be tremendously difficult to implement by just skipping some nodes when building the XML cache. What would be trickier would be to exclude those nodes from the SQL query we run to populate the cache. And of course there is configuration... maybe a list of content types to ignore?
Unless we want to go the event way: expose an event that users could subscribe to, and which would trigger for each item, so that the user can decide whether or not to include the item in the cache.
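Purely as a sketch of what that event could look like - none of these types exist in Umbraco today - the idea is a per-item veto while the cache is built or refreshed:

```csharp
using System;

// Entirely hypothetical types - nothing like this exists in Umbraco today.
// The idea: the cache raises an event per item and a handler can veto it.
public class CachingNodeEventArgs : EventArgs
{
    public int NodeId { get; set; }
    public string ContentTypeAlias { get; set; }
    public DateTime UpdateDate { get; set; }

    // Set to true to keep this node out of the published content cache.
    public bool Cancel { get; set; }
}

public static class PublishedCacheEvents
{
    public static event EventHandler<CachingNodeEventArgs> CachingNode;

    // Returns false when a subscriber vetoed the node.
    public static bool RaiseCachingNode(CachingNodeEventArgs args)
    {
        var handler = CachingNode;
        if (handler != null) handler(null, args);
        return !args.Cancel;
    }
}

// Example rule a site could register: skip news items older than five years.
// PublishedCacheEvents.CachingNode += (sender, e) =>
//     e.Cancel = e.ContentTypeAlias == "newsItem"
//                && e.UpdateDate < DateTime.Now.AddYears(-5);
```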
In any case, definitely something we could have in Core.
As for NuCache: it does not make much of a difference - though not supported now, we could also fully exclude some nodes from the cache. Based upon an event, we could have all sorts of logic to decide whether or not to keep a node (eg if the node is older than...).
What NuCache could bring is the possibility to exclude some properties. Though one might argue that it could also be done with the XML cache - in fact, if you look closely at Core's code you will find traces of internal and dirty events that were obviously created with that kind of feature in mind (eg editing the XML of a node before it is inserted in the cache).
OK, what NuCache could bring that the XML cache cannot is some sort of lazy loading mechanism for properties that would not be in the cache. Or maybe even for content nodes: the cache could contain a "placeholder" indicating that some content exists - a lightweight object.
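To illustrate that placeholder idea (again entirely hypothetical - nothing like this exists in NuCache): the cache would keep a small stub per excluded node and only load the full content on first access.

```csharp
using System;

// Hypothetical lightweight stand-in for a node that is deliberately not cached.
// The real content (or its property values) is only loaded on first access.
public class ContentPlaceholder
{
    private readonly Lazy<object> _full;

    public ContentPlaceholder(int id, string contentTypeAlias, Func<int, object> loadFull)
    {
        Id = id;
        ContentTypeAlias = contentTypeAlias;
        _full = new Lazy<object>(() => loadFull(Id));
    }

    public int Id { get; private set; }
    public string ContentTypeAlias { get; private set; }

    // The expensive database/service lookup only happens here, the first time it is needed.
    public object FullContent { get { return _full.Value; } }
}
```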
So... definitely interested in your PR. As you mention, the excluded content cannot be accessed via the cache API, so it needs another... way... whatever.
Thanks Stephan,
I think a list of content types to exclude is sufficient configuration.
I think the event to exclude content from the cache exists during the publishing sequence, but I don't think it is raised at startup when the cache is built. I'll look into it.
In future, supporting lazy loading via the API would be really good, but I believe in small, fast increments, so maybe that is another conversation for later :)
It turns out support is already in the core to update the cache while excluding certain content types.
I've tried it out and it works reasonably easily. All I'd need to do is add a list of excluded content types somewhere in the configuration.
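For example - the appSetting key and helper class here are placeholders, and where the list actually lives would be decided in the PR - a comma-separated list of document type aliases could be read like this:

```csharp
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;

public static class CacheExclusionConfig
{
    // e.g. in web.config: <add key="cacheExcludedDocTypes" value="newsItem,pressRelease" />
    private static readonly HashSet<string> ExcludedAliases = new HashSet<string>(
        (ConfigurationManager.AppSettings["cacheExcludedDocTypes"] ?? string.Empty)
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(alias => alias.Trim()),
        StringComparer.OrdinalIgnoreCase);

    // True when a document type should be skipped while the cache is being built.
    public static bool IsExcluded(string docTypeAlias)
    {
        return ExcludedAliases.Contains(docTypeAlias);
    }
}
```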
Then of course I'd need to give some thought to Examine and the internal indexer - the whole exercise is pretty pointless if we need to wait for that index to be built by each new instance.
Nice thoughts, these. I'd love to see optimization of memory and the content cache.
While following this, I couldn't help but think that it'd be nice to have the cache "learn" or use some statistics: automatically cache often-used content, and just leave seldom-used content to be lazily loaded into the cache when it is read.
AFAIK there are no statistics built in, so I guess it's not possible OOTB, but it might be worth diving into when already extending the cache with predicates.
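As a thought experiment only (nothing like this is built in), such a predicate could be fed by a simple hit counter:

```csharp
using System;
using System.Collections.Concurrent;

// Entirely hypothetical: counts how often each node is requested so a cache
// predicate could promote "hot" content and leave rarely-read content lazy.
public class AccessStatistics
{
    private readonly ConcurrentDictionary<int, int> _hits = new ConcurrentDictionary<int, int>();
    private readonly int _threshold;

    public AccessStatistics(int threshold)
    {
        _threshold = threshold;
    }

    public void RecordHit(int nodeId)
    {
        _hits.AddOrUpdate(nodeId, 1, (id, count) => count + 1);
    }

    // The cache-inclusion predicate: only keep nodes that are read often enough.
    public bool ShouldCache(int nodeId)
    {
        int count;
        return _hits.TryGetValue(nodeId, out count) && count >= _threshold;
    }
}
```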
Update - I created a PR here: https://github.com/umbraco/Umbraco-CMS/pull/1540
And an issue here: http://issues.umbraco.org/issue/U4-9110
I find your solution very interesting Darren, but unfortunately Umbraco has changed its code quite a bit since your reply. I tried to mimic your idea in the same area, but I got confused about how the code works now. Is it possible to adapt your solution for today's code? (I've also replied to your PR)
Thanks
I've since written about the solution here:
https://www.moriyama.co.uk/about-us/news/blog-the-need-for-archived-content-in-umbraco-and-how-to-do-it/
The PR was a bit of an over-simplification!
I read your post, but the over-simplified solution would still fit me better, I believe, and would be quicker for me to test.
The archive solution in your post is somewhat complicated for me to maintain, especially because it means creating and supporting a custom version of XmlPublishedContent in case the Umbraco core changes (as it does). I feel I wouldn't be able to adapt it if necessary.
I really want the 'over-simplified' solution to work for me :) Any ideas regarding the actual code structure?
The post is really the conclusion of a good few weeks of looking into the issue and the best way to do it.
We decided that the simple pull request doesn't really work for a number of reasons!