Fixing poor performance on Azure

I'm attempting to gather information to fix a performance issue we're having at work with an Umbraco instance hosted on Azure. We are experiencing 2+ second page load times, which is over 6x slower than on our development machines, where most pages are served in around 150-200 ms.
Our setup is as follows.
Web App – Standard, Medium (2 cores, 3.5 GB Memory), 1 instance
Both the web app and the database are in the same region.
To tweak our performance a little I've set the ExamineSettings.config provider settings to use useTempStorage="Sync" and also set the ASP.Net File Change Notification (FCN) in the web.config to use a single observer. Everything else is out-of-the-box.
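For reference, those two tweaks look roughly like this (the indexer name below is illustrative — apply the attribute to every indexer/searcher in your own config):

```xml
<!-- ExamineSettings.config: keep Lucene index operations in local ASP.NET
     temp storage and sync back to the (slow, networked) Azure file system -->
<add name="ExternalIndexer"
     type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
     useTempStorage="Sync" />

<!-- web.config: use a single File Change Notification observer -->
<httpRuntime fcnMode="Single" targetFramework="4.5" />
```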
I haven't enabled output cache in any of our controllers so far as I want to determine the cause of the issue rather than mask it.
Unfortunately MiniProfiler is not working for some reason (it throws an error stating "Too much recursion") so I can't dig into what is taking the time. We have NewRelic installed but all it is telling us is that 96% of the time is being spent in the controller and that DB access is minimal.
Time is supercritical just now so any experienced advice would be most useful.
What are the minimum requirements you set your sites up with?
We recently went live with a new website in Azure and had issues after launch. The main issue we had was (we believe) with a bug in 7.2.6. Every now and again resources (CPU/RAM) would spike and leave the site unresponsive unless we killed it. We've since upgraded to 7.3.1 and that is no longer happening. That doesn't help you I know but it might help someone else...
Before upgrading to 7.3.1. we looked at any other optimisations we could make as it seemed the servers could not handle a high load of traffic, and we hadn't properly configured Load Balancing yet. Even if we had the site running on a large instance (7GB/4 CPU) the spiking still occurred. We would get alerts from Azure that the HTTP Queue had reached 100 and when this happened the site was no longer responsive. Therefore I started to investigate how we could offload as much of the web traffic as possible off the Azure web servers.
We implemented your Image Processor Azure File System Provider in order to get all of the Media off the web server. I then configured a CDN to serve all of the Media. I've no doubt you've already done all of that but I've put it here for others' reference. One difference we might have is that we injected the CDN path into our templates rather than rely on your Media virtual path provider, as this removed the initial 302 request for each and every image.
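For anyone wanting to do the same, injecting the CDN host into a template can be as simple as prefixing the relative media path (the CDN host and "heroImage" property below are placeholders):

```cshtml
@{
    // Hypothetical sketch: build the image URL against the CDN directly,
    // avoiding the 302 redirect issued by the media virtual path provider.
    var cdnHost = "https://cdn.example.com";
    var hero = Model.Content.GetPropertyValue<IPublishedContent>("heroImage");
    var imageUrl = cdnHost + hero.Url;
}
<img src="@imageUrl" alt="" />
```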
So for a general page load now the web server is only really serving the HTML content (and the Optimus generated minified CSS/JS at the moment).
I did also implement caching both at the Output and Object level. I've added Output caching via Route Hijacking for the most resource consuming pages - the home page, the News rollup page and also the News article pages. I then tried to (Object) Cache any collections that were a result of intensive Umbraco data queries (although I'm not sure this is required/beneficial as my understanding is the whole of Umbraco content is stored in memory anyway??).
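A minimal sketch of both levels, assuming Umbraco 7's route hijacking and the runtime cache (controller name, cache key and the query helper are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Web.Mvc;
using Umbraco.Core;
using Umbraco.Core.Cache;
using Umbraco.Core.Models;
using Umbraco.Web.Models;
using Umbraco.Web.Mvc;

// Route-hijacked controller for a "news" document type.
public class NewsController : RenderMvcController
{
    // Output cache: the whole rendered page is reused for 10 minutes.
    [OutputCache(Duration = 600, VaryByParam = "none")]
    public override ActionResult Index(RenderModel model)
    {
        // Object cache: an expensive traversal is computed at most once
        // per 10 minutes. GetExpensiveArticleList is a hypothetical helper.
        var articles = ApplicationContext.Current.ApplicationCache.RuntimeCache
            .GetCacheItem<IEnumerable<IPublishedContent>>(
                "newsArticles",
                () => GetExpensiveArticleList(model),
                TimeSpan.FromMinutes(10));

        ViewBag.Articles = articles;
        return base.Index(model);
    }
}
```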
All that said, I'm not sure our site serves a content page in under 2 seconds either! What are you using to measure this? Is that just the HTML loading or all resources too? That doesn't seem too bad to me!
We're running the S1 database (i.e. less than you) and that seems to have no issues at all, chugging along using 2-3% resources (DTUs) most of the time with the odd spike, but that's probably the backups running.
Now that we've configured the Media directory as such and upgraded to 7.3.1 we've also been able to test the new load balancing features by adding more Azure instances. I'm aware the guidance is to have the backoffice on a single node but we've not had any noticeable issues not doing that, possibly because there is only a small number of content editors.
Additionally, now that we have configured auto-scaling (when CPU etc. hits certain thresholds) we've been able to reduce to using the Medium instance size (the same as yourself). Admittedly the auto-scaling seems to take 20 minutes or so to kick in, which is less than ideal, but hopefully MS will improve that.
Next up I'm going to look at configuring Redis for Session and Output cache as recently blogged by Scott Hanselman. This should remove even more load from the web servers and reduce duplicated memory usage - http://www.hanselman.com/blog/UsingRedisAsAServiceInAzureToSpeedUpASPNETApplications.aspx
As a suggestion, it might be worth deploying an empty Umbraco site (or with the starter site) into the same Service Plan / DB Server and seeing how fast that is, i.e. zero or very little custom code.
I'm interested to hear how you get on and how anyone else is doing it. I also commented on this question which is related: https://our.umbraco.org/forum/umbraco-7/using-umbraco-7/72315-umbraco-7-backend-and-load-balancing-what-are-the-problems
I'd suggest that the "Too much recursion" message may be the key to this. Your specs seem adequate for much faster response times.
Is the site under serious load though? Some sites are fine in staging - but when multiple users hit them in prod - something horrible like .net session state locking up threads kills them.
Have you done the simple test of having a doctype with a blank template - and hammering that by hitting F5 - to see what the response times are like then?
If you still get slow response times I'd suggest reverting the web config to a vanilla version corresponding to the version of Umbraco you are using (in case it is an http module etc)
James S makes a good point about using Redis for session state - we've solved plenty of problems with high traffic sites by eliminating session state altogether.
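For reference, switching session state to Azure Redis is a web.config change along these lines (host and access key are placeholders; the Microsoft.Web.RedisSessionStateProvider NuGet package is assumed):

```xml
<sessionState mode="Custom" customProvider="RedisSessionStateStore">
  <providers>
    <!-- Stores session state in an Azure Redis Cache instead of in-process,
         so it survives instance recycling and works across load-balanced nodes -->
    <add name="RedisSessionStateStore"
         type="Microsoft.Web.Redis.RedisSessionStateProvider"
         host="your-cache.redis.cache.windows.net"
         accessKey="your-access-key"
         ssl="true" />
  </providers>
</sessionState>
```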
You should be able to hammer a site in Azure webapps (of your spec) using blazemeter with a couple of hundred concurrent requests and see good performance.
I'd honestly start by creating a staging environment of equivalent spec and load testing with empty templates.
I'm not on Our that often, but happy to talk through if timezones allow!
First off apologies for the slow reply. I've been struggling with some difficult deployments whilst trying to gather enough information to reply with.
James - We've now deployed a second site to Azure with which we've included some of your suggestions and we will look into implementing some others. The performance of this site is much better though that's due mostly to caching.
So far we have added DonutCaching to our pages with custom params to ensure clear-out. This is not without issue though, as we often use nodes as sub-components of a page rather than as an actual page, which means any changes to them do not get picked up by route-hijacking-based caching. You have to republish the parent page to clear the cache.
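For anyone following along, the setup described is roughly this (DevTrends.MvcDonutCaching assumed; the custom key requires a matching GetVaryByCustomString override in Global.asax):

```csharp
using System.Web.Mvc;
using DevTrends.MvcDonutCaching;
using Umbraco.Web.Models;
using Umbraco.Web.Mvc;

public class HomeController : RenderMvcController
{
    // Cache the rendered output for an hour, varied by a custom key so a
    // republish can bust the cache (the key logic is your own, implemented
    // in Global.asax GetVaryByCustomString - "publishVersion" is illustrative).
    [DonutOutputCache(Duration = 3600, VaryByCustom = "publishVersion")]
    public override ActionResult Index(RenderModel model)
    {
        return base.Index(model);
    }
}
```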
Darren, if you have any documentation or code samples you could share regarding your Redis setup, that would be most useful.
With regards to the "Too much recursion" message... As soon as we add Ditto to a page we get that message. I think this is due to the way Ditto will recursively look for media items and log each transformation for each loop. Sadly the recursion value is not configurable in MiniProfiler and I couldn't get a custom build of it to do anything.
I'm a little bit disappointed with Azure, truth be told. While it's good practice to add caching etc., I should be able to run a site on there without the extra optimisation. Azure is quite expensive for the performance it provides.
With your last statement - I don't think Azure is the issue. We've a number of successful Azure deployments. I'm still anticipating some other custom code - or maybe even something like Ditto....
Hi, we are currently having the same issue. The site runs OK with a few users, but running a Blazemeter test with 20 makes it slower (4.5 seconds), and with 100 it is almost not responding.
I did check a lot of things, and also did an upgrade to 7.3.2 but the issue remains. Is there recently maybe something changed in Umbraco which applies to Azure?
No, we do no database calls at all. But we might have found it as we speak: it looks like log4net might be slowing down the website. We are fixing the configs now and will do a new load test; I'll keep you updated.
Ok we just did a test with 100 concurrent users, with the fixed log4net settings the website is running fine and reacts instantly, so for people with Azure websites, check your log4net settings :)
Interesting to note the Log4Net issue. We noticed in NewRelic that it was using a great proportion of the overall time but I thought that was simply initialisation code.
We are seeing similar issues to James Strugnell's. When we first deploy the site, after the initial warmup the site is super fast, most requests coming back in 100-200 ms. After a while the site starts to slow down to around 500-1000 ms with some random spikes. After a random amount of time one of the instances becomes unresponsive to the point that we have to restart the app service. Sometimes it takes a couple of hours or a day or so before this happens.
Some more details:
Web App Standard, Medium (2 cores, 3.5 GB Memory), 4 instances
We initially started with 1 instance and had auto-scale setup but it seemed the site would get overwhelmed faster than we could scale and it would become unresponsive.
SQL Database: Standard database pool with 400 DTUs, 100 max for Umbraco
Other DBs are in here too, but the Umbraco database is averaging 3-5% DTU
Umbraco v7.3.1
Site was originally built on v4 and was running lightning fast on Azure VMs, ~100 ms
Upgraded to v7 but still using XSLT because of the large code base; we didn't have time to rewrite everything because moving to Azure Web Apps was the goal
9k published content nodes
~2 million records in the propertyData table
log4net is set to 'Error' since 'Info' was writing too much
Traffic
~60k requests an hour
css/js are on cdn and pulled from the site
media images are pulled from azure storage
other images are pulled from cdn
majority of hits are to homepage and 2-3 rss feeds which should all be cached
95% of the macros we use are cached for 24hrs (used to be 5-60 minutes on v4 but we bumped them up recently to help with perf issues)
We are still looking into the issue but any help would be appreciated.
I've been researching performance issues with a site we've upgraded to 7.3. It has the same issues you're describing. Higher CPU load than usual and sudden HTTP queue buildups.
Not sure we've killed the HTTP queue culprit, but at least we've identified a few new issues in Umbraco, and the site is now running with "normal" CPU usage and can handle the load it had before.
Our installation is a multilingual, multi-site/domain solution integrated with UCommerce.
In all pages, we use the services on UmbracoContext.Current.Application.Services directly or indirectly to get domains, languages, dictionary items etc. There's also quite a lot of links to misc. content, which generates URLs, which indirectly uses the domain service.
There's also a couple of cases where we use public access, which have also been refactored.
Turns out these services weren't really intended by HQ to be used heavily from the front-end. Nevertheless, they've been used all over the place since the legacy API is being phased out.
Beneath the services is a new cache similar to identity maps and L2 caches in ORMs. It should do wonders for the backoffice, but it is really CPU intensive when used from the front-end, or with multiple GetAll() calls.
So Umbraco continues to perform well with sites that don't need any business logic, and don't do very much URL building.
To limit the CPU usage on our site, we've replaced the built-in repositories for domains, languages and permissions. Here's a git repo with those.
You also need a nightly build of Umbraco for now to be able to override the domain repo factory method, but I'll PR that immediately.
(NB - I didn't add any cache invalidating from updates in the backoffice, so it shouldn't be used as is.)
I've been going through this with HQ and we agreed that the way forward is a good case for the dev. mailing list. I'll post my suggestions for changes to the current cache implementation, and a better front-end API there soon.
Hope this can aid you guys in your endeavours. :)
Would also be interesting to know what kind of data you're using in the front-end. (Or might this also be a back-end issue?)
For us the performance tends to be fairly stable on the front-end when we make use of caching. Time to first byte is still a lot slower than we would like, particularly for static resources like the CSS/JS (sometimes up to 500 ms), but I appreciate again that's an Azure issue potentially solved with a CDN - it does demonstrate how slow it can be for static files though.
In my case I'm running:
Standard S0 app instance
Standard S0 database
Umbraco 7.3.0
No plugins
Typical CPU usage reported at around 10-15%
Memory usage around 80%
My issues with performance on Azure are primarily related to the back-end, including very slow preview generation (minutes, not seconds)
Performance is very irregular - seemingly random - and back-end performance doesn't seem to be affected by differing loads on the front-end.
We have noticed an improvement changing the default file logging level to ERROR. So it does lend weight to the theory that it is caused by the slow file system storage on Azure - if nothing else the XML cache is not being invalidated and showing 'fix' several times per day since making that change.
I'll be doing much more in-depth testing on configurations and Azure for the next couple of weeks to try to find a fix for our case. I'm happy to share any additional findings - I appreciate this thread continuing to see comments, it looks like unfortunately most people have abandoned Azure when they've encountered these issues so there's not a huge amount of detail to go on - I'm hoping we can improve things and avoid that outcome.
Lars, according to how your website is developed and your post here: https://groups.google.com/forum/#!topic/umbraco-dev/6YcSISxabKs your performance issues seem to be due to using Umbraco services on your front-end. I realize this should be fixed/optimized in Umbraco, but the other mentions on this thread aren't really related to those types of performance problems.
Azure websites do have a slow file system - and more to the point it is a network-based shared file system. This is very important to realize, and even more important is to note that any excessive file writing does not play well with this setup. This would include logging and more specifically Lucene. I would recommend when running on Azure websites:
Use the LocalOnly (or Sync) settings for Examine, the documentation is currently found here: https://our.umbraco.org/Documentation/Getting-Started/Setup/Server-Setup/Load-Balancing/files-shared under the Lucene/Examine configuration heading. This will cause your Lucene indexes to operate on the local machine in ASP.Net TEMP files. We have had a few clients notice high CPU usages when Lucene is operating over a network file share and this fixes the issue.
Change your log4net settings to a minimum of WARN when operating in production websites
You could also set up a non-file-based log4net appender; I'm sure there are a few options for that
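In Umbraco 7 the log level is a one-line change in config/log4net.config (ERROR shown here; WARN is the suggested minimum - the appender name may differ in your install):

```xml
<log4net>
  <root>
    <!-- Raise the threshold so only errors hit the (slow, networked)
         Azure file system; the default is Info -->
    <priority value="ERROR" />
    <appender-ref ref="AsynchronousLog4NetAppender" />
  </root>
  <!-- appender definitions unchanged -->
</log4net>
```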
I'm not sure if this helps solve some of your issues but please let me know if it does.
I also realize we need to write up some Azure websites specific documentation/best practices
@Shannon, you're right of course. There are things to be aware of and configure "right" with Azure. But I'm sure when those are fixed, the issues I rant about are still relevant for any 7.3 site.
@Shannon - thank you for taking the time to post additional guidance on Azure, I can confirm we hadn't changed our examine indexes - I had seen this mentioned on the Umbraco issue tracker although the comments seemed to conclude it may not be necessary - so it's good to have the additional confirmation to do this.
Ironically we upgraded from 7.2.6 to 7.3.1 and that stopped our spiking CPU/memory issues. Our site is now relatively stable in Azure although does "autoscale up" randomly due to high CPU, but that might just be due to increased load. We have configured Examine indexes to run locally but we haven't changed the Log4Net level from its default. Going to do that now...
@Lars - are you willing to volunteer a code snippet or two of how you were using the services on the front end - and how they were called?
I ask because in my experience of training Umbraco there is often another way.
For example calling a service Method in a View would always be a no - but people often forget about Umbraco Macros - and how they can be flexibly cached etc etc.
Thanks @Darren, but I'm quite on top of it. The site runs quite well now.
When I say front-end, I mean anything not backoffice.
Controllers, business-objects and the like.
When we hit the services, and subsequently the cache layer, it's through IPublishedContent.Url, library.GetDictionaryItem (or localizationservice) and Access.GetAccessingMembershipRoles.
All of these now use the services. They even have obsolete attributes pointing to the new services.
For instance:
If you have, say, 6,000 items in the cache dictionary (god knows what, Umbraco does that), and you list URLs for say 50 content items in a page, you'll end up with 50 × 6,000 string comparisons of the cache keys. Accessing a URL goes through GetAssignedDomains, which does a Service.GetAll().
So one page view equals 300,000 extra comparisons, when it could've just spent 1 to return one list of domains from memory.
Add to this some permission checks, and you've got a million key searches.
For one page!
I could of course cache it better myself, and that's what I did.
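Lars's proper workaround is the replacement repositories in his repo, but the core idea - memoising the GetAll() in the runtime cache - can be sketched like this (the key and one-minute window are illustrative, and as he notes above, backoffice invalidation is missing, so don't use this as-is):

```csharp
using System;
using System.Collections.Generic;
using Umbraco.Core;
using Umbraco.Core.Cache;
using Umbraco.Core.Models;

public static class CachedDomains
{
    // Cache the full domain list briefly so URL building does one cached
    // lookup instead of a DomainService.GetAll() scan per generated URL.
    public static IEnumerable<IDomain> GetAll()
    {
        return ApplicationContext.Current.ApplicationCache.RuntimeCache
            .GetCacheItem<IEnumerable<IDomain>>(
                "allDomains",
                () => ApplicationContext.Current.Services.DomainService
                          .GetAll(includeWildcards: true),
                TimeSpan.FromMinutes(1));
    }
}
```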
The site ended up spending 40-50% less CPU, and this is an S2 instance on Azure.
Oh, and those things happen during the routing too. So before you ever hit your controllers and views, Umbraco has already spent some time thrashing cache to find the content matching the URL.
You're welcome to join the discussion about whether this should be fixed over at the dev. mailing list.
Indeed, but you won't see them until you load test, profile, or even go live.
Sites run extremely well on a dev box. Probably better than most servers.
So I thought the OP might have biased the focus of the thread by "blaming" Azure, even though this new issue exists.
I too have had issues on Azure, where on a performance vs. cost point, Azure does not compare to a VM of a similar cost.
Obviously it has other strengths and benefits (peace man!), but the underlying architecture is quite different, and changes, such as more caching is required to get decent performance.
We are currently having major performance issues with an Umbraco site running on Azure, which fits the symptoms of this thread, with the site becoming completely unresponsive on publish, for between 5 to 15 minutes.
The following steps made a big impact:
Ensuring Cached Partials are used where possible
@Html.CachedPartial("Name", Model, 1080, true, false)
Someone once told me that any caching is better than no caching, so even fairly small times can be useful. I'm not sure how Umbraco deals with this these days, but I would avoid excessively large durations.
Adding useTempStorage="LocalOnly" to Indexers and Providers
See ExamineSettings.config
Upgrading to Umbraco 7.4.0
This resolved a lot of errors in the log regarding getting Media.
Umbraco.Web.PublishedCache.XmlPublishedCache.PublishedMediaCache - Could not retrieve media 1736 from Examine index, reverting to looking up media via legacy library.GetMedia method
I had upgraded incrementally from 7.2.0, but this is the first update that has made a big impact.
Double-checking to ensure that when things aren't cached they aren't being initialized more than once. When properties/objects need to be reused, ensure these are declared as variables.
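In a Razor view that's the difference between these two patterns (the "title" property is illustrative):

```cshtml
@* Evaluated twice - each call walks the property/value-converter pipeline: *@
<h1>@Model.Content.GetPropertyValue<string>("title")</h1>
<meta name="og:title" content="@Model.Content.GetPropertyValue<string>("title")" />

@* Evaluated once, then reused: *@
@{ var title = Model.Content.GetPropertyValue<string>("title"); }
<h1>@title</h1>
<meta name="og:title" content="@title" />
```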
Sorry to spam this thread :) It's also worth noting that the performances fixes regarding the cache dictionaries as Lars mentioned have been fixed in 7.3.7+, so these fixes aren't just applied to 7.4.0
7.3.3 site on Azure web app service stopped in its tracks by an uplift from 6K sessions/day to 22K sessions/day. Read this thread and others and decided to go for an upgrade to 7.4.1. This has helped - less RAM and CPU in use, but only just. The site is still slow and the CPU is up at 90+%. The Azure instance is:-
Web App: S3 4 cores 7GB RAM
Database: SQL Server, S3 Standard (100 DTUs)
Redis Cache: C2 Standard 2.5GB Cache
All in Western Europe
Using the AzureRedisSessionStateStore. Nothing special for media.
Is there anything we can look at to get some more speed out of this set up. We can upscale if necessary.
Biggest ones that will benefit you if you are not using them is to change to fcnMode="Single" and set useTempStorage="LocalOnly" for all Examine searchers and indexers.
You can also set the log level to ERROR instead of Info/Warn.
Other than that: check your logs and if you have CPU spikes, the best way to figure out what's using all those CPU cycles is to take a memory dump and analyze it.
If you don't have additional indexes apart from the default ones in Umbraco, your ExamineSettings.config should look like this:
<?xml version="1.0"?>
<!--
Umbraco examine is an extensible indexer and search engine.
This configuration file can be extended to add your own search/index providers.
Index sets can be defined in the ExamineIndex.config if you're using the standard provider model.
More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
-->
<Examine>
<ExamineIndexProviders>
<providers>
<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
supportUnpublished="true"
supportProtected="true"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" useTempStorage="LocalOnly"/>
<add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
supportUnpublished="true"
supportProtected="true"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" useTempStorage="LocalOnly" />
<!-- default external indexer, which excludes protected and unpublished pages-->
<add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" useTempStorage="LocalOnly" />
</providers>
</ExamineIndexProviders>
<ExamineSearchProviders defaultProvider="ExternalSearcher">
<providers>
<add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" useTempStorage="LocalOnly" />
<add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" useTempStorage="LocalOnly" />
<add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcard="true" useTempStorage="LocalOnly"/>
</providers>
</ExamineSearchProviders>
</Examine>
If you have additional/different indexes, just add the useTempStorage="LocalOnly" attribute to all of them (both indexers and searchers). This will help a lot.
Make sure to read up on the useTempStorage attribute and its implications. We're currently preferring LocalOnly over Sync because we've seen some problems with syncing in the past (though not after we fixed those errors, but we're being a bit cautious).
Set the logging level to "Error" and the ExamineSettings.config now looks like this:-
<?xml version="1.0"?>
<!--
Umbraco examine is an extensible indexer and search engine.
This configuration file can be extended to add your own search/index providers.
Index sets can be defined in the ExamineIndex.config if you're using the standard provider model.
More information and documentation can be found on CodePlex: http://umbracoexamine.codeplex.com
-->
<Examine>
<ExamineIndexProviders>
<providers>
<add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
supportUnpublished="true"
supportProtected="true"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" useTempStorage="LocalOnly" />
<add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
supportUnpublished="true"
supportProtected="true"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" useTempStorage="LocalOnly" />
<!-- default external indexer, which excludes protected and unpublished pages-->
<add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine" useTempStorage="LocalOnly" />
</providers>
</ExamineIndexProviders>
<ExamineSearchProviders defaultProvider="ExternalSearcher">
<providers>
<add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net" useTempStorage="LocalOnly" />
<add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" useTempStorage="LocalOnly" />
<add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcard="true" useTempStorage="LocalOnly" />
</providers>
</ExamineSearchProviders>
</Examine>
Rebuilt indexes. Site not much faster.
"Does Umbraco HQ do paid support over a weekend?" the client asked. Big launch on Monday.
We'll def look at implementing partial caching tomorrow. In the meantime, we've gone down from 20 seconds to 2-3 seconds load time by implementing the suggestions on logging, examine indexes and also adding <httpErrors existingResponse="PassThrough" /> in web.config. That had been missed on the last upgrade. It seemed to unblock the whole site. Load on the site has gone down slightly now (11pm UK time) so just hoping the site holds up for the next USA West Coast wake up. But it's looking hopeful. It's been a stressful day :)
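For anyone hitting the same thing, that element sits under system.webServer in web.config:

```xml
<system.webServer>
  <!-- Let the application's own error responses pass through instead of
       IIS replacing them; a missing PassThrough after an upgrade can make
       error handling (and hence whole pages) unexpectedly expensive -->
  <httpErrors existingResponse="PassThrough" />
</system.webServer>
```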
My Examine config is set to Sync. I upgraded to 7.3.7 this week and everything seemed much more stable (previously on 7.3.1). However, we've just scaled up to 2 instances (in Azure) for the weekend and have noticed that the "non-master" is getting a load of "Could not retrieve media * from Examine index, reverting to looking up media via legacy library.GetMedia method" warnings in the logs. The master instance is not reporting these issues. I've tried re-indexing but that doesn't seem to affect the secondary server.
Is this a known issue with 7.3.7 (a scaled out instance doesn't index properly) and/or is this something that might be solved by changing the config setting to "LocalOnly"? How can I force a re-index on the non-master instance?
Apologies if this should be its own question, but I happened to notice the LocalOnly suggestion in this thread.
Further to my previous comment - I've tried the LocalOnly Examine setting and that made no difference to my issue. My problem, it seems, is that non-"master" instances aren't creating their own indexes automatically.
To solve the problem short-term I had to keep deleting the ARRAffinity cookie until I landed on the instance which had no index. From there I could log in and run the index manually. After that the errors mentioned above stopped occurring.
Can anyone else validate that on 7.3.7 (at least) no automatic indexing is taking place on the non-master instance?
If it helps, prior to my recent upgrade to 7.3.7 I would see this message in the logs every time I scaled up -
Umbraco.Core.Sync.DatabaseServerMessenger - No last synced Id found, this generally means this is a new server/install. The server will rebuild its caches and indexes and then adjust it's last synced id to the latest found in the database and will start maintaining cache updates based on that id
I've not seen that message since I upgraded. Maybe as part of my upgrade I've lost some previously configured settings. Starting to think this should be its own forum question...
Fixing poor performance on Azure
Hey friends,
I'm attempting to gather information to fix a performance issue we are having at work with an Umbraco instance hosted on Azure. We are experiencing 2+ second page load times which over 6x slower than on any development machines where most pages served at around 150-200 ms.
Our setup is as follows.
Both the web app and the database are in the same region.
To tweak our performance a little I've set the ExamineSettings.config provider settings to use
useTempStorage="Sync"
and also set the ASP.Net File Change Notification (FCN) in the web.config to use a single observer. Everything else is out-of-the-box.I haven't enabled output cache in any of our controllers so far as I want to determine the cause of the issue rather than mask it.
Unfortunately Miniprofiler is not working for some reason (Throws an error stating "Too much recursion") so I can't dig into what is taking the time. We have NewRelic installed but all it is telling us is that 96% of the time is being spent in the controller and that db access is minimal.
Time is supercritical just now so any experienced advice would be most useful.
Cheers
James
Hi James,
We recently went live with a new website in Azure and had issues after launch. The main issue we had was (we believe) with a bug in 7.2.6. Every now and again resources (CPU/RAM) would spike and leave the site unresponsive unless we killed it. We've since upgraded to 7.3.1 and that is no longer happening. That doesn't help you I know but it might help someone else...
Before upgrading to 7.3.1. we looked at any other optimisations we could make as it seemed the servers could not handle a high load of traffic, and we hadn't properly configured Load Balancing yet. Even if we had the site running on a large instance (7GB/4 CPU) the spiking still occurred. We would get alerts from Azure that the HTTP Queue had reached 100 and when this happened the site was no longer responsive. Therefore I started to investigate how we could offload as much of the web traffic as possible off the Azure web servers.
We implemented your Image Processor Azure File System Provider in order to get all of the Media off the web server. I then configured a CDN to server all of the Media. I've no doubt you've already done all of that but I've put it here for other's reference. One difference we might have is that we injected the CDN path into our templates rather than rely on your Media virtual path provider as this removed the initial 302 request for each and every image.
So for a general page load now the web server is only really serving the HTML content (and the Optimus generated minified CSS/JS at the moment).
I did also implement caching both at the Output and Object level. I've added Output caching via Route Hijacking for the most resource consuming pages - the home page, the News rollup page and also the News article pages. I then tried to (Object) Cache any collections that were a result of intensive Umbraco data queries (although I'm not sure this is required/beneficial as my understanding is the whole of Umbraco content is stored in memory anyway??).
All that said I'm not sure our site still serves a content page under 2 seconds!? What are you using to measure this? Is that just the HTML loading or all resources too? That doesn't seem too bad to me!
We're running the S1 database (i.e. less than you) and that seems to have no issues at all, chugging along using 2-3% resources (DTUs) most of the time with the odd spike, but that's probably the backups running.
Now that we've configured the Media directory as such and upgraded to 7.3.1 we've also been able to test the new load balancing features by adding more Azure instances. I'm aware the guidance is to have the backoffice on a single node but we've not had any noticeable issues not doing that, possibly because there is only a small number content editors.
Additionally now that we have configured auto-scaling (when CPU etc. hit's certain thresholds) we've been able to reduce to using the Medium instance size (the same as yourself). Admittedly the auto-scaling seems to take 20 minutes or so to kick in which is less than ideal but hopefully MS will improve that.
Next up I'm going to look at configuring Redis for Session and Output cache as recently blogged by Scott Hanselman. This should remove even more load from the web servers and reduce duplicated memory usage - http://www.hanselman.com/blog/UsingRedisAsAServiceInAzureToSpeedUpASPNETApplications.aspx
As a suggestion, it might be worth deploying an empty Umbraco site (or with the starter site) into the same Service Plan / DB Server and seeing how fast that is, i.e. zero or very little custom code.
I'm interested to hear how you get on and how anyone else is doing it. I also commented on this question which is related: https://our.umbraco.org/forum/umbraco-7/using-umbraco-7/72315-umbraco-7-backend-and-load-balancing-what-are-the-problems
Hey James, sorry - been offline most of today.
I'd suggest that the "Too much recursion" message may be the key to this. Your specs seem adequate for much faster response times.
Is the site under serious load though? Some sites are fine in staging - but when multiple users hit them in prod - something horrible like .NET session state locking up threads kills them.
Have you done the simple test of having a doctype with a blank template - and hammering that by hitting F5 - to see what the response times are like then?
If you still get slow response times I'd suggest reverting the web.config to a vanilla version corresponding to the version of Umbraco you are using (in case it is an HTTP module etc.)
James S makes a good point about using Redis for session state - we've solved plenty of problems with high traffic sites by eliminating session state altogether.
You should be able to hammer a site in Azure webapps (of your spec) using blazemeter with a couple of hundred concurrent requests and see good performance.
I'd honestly start by creating a staging environment of equivalent spec and load testing with empty templates.
I'm not on Our that often, but happy to talk it through if timezones allow!
Hi James, Darren,
First off apologies for the slow reply. I've been struggling with some difficult deployments whilst trying to gather enough information to reply with.
James - We've now deployed a second site to Azure with which we've included some of your suggestions, and we will look into implementing some others. The performance of this site is much better, though that's due mostly to caching.
So far we have added DonutCaching to our pages with custom params to ensure clearout. This is not without issues though, as we often use nodes as sub-components of a page rather than as actual pages, which means any changes to them do not get picked up by route-hijacking-based caching. You have to republish the parent page to clear the cache.
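For reference, the DonutCaching setup described above looks roughly like this on a hijacked controller. A sketch only - the controller name and duration are illustrative, using the DevTrends MvcDonutCaching package:

```csharp
using System.Web.Mvc;
using DevTrends.MvcDonutCaching;
using Umbraco.Web.Models;
using Umbraco.Web.Mvc;

// Sketch: donut output caching on a hijacked Umbraco controller.
// "Donut holes" (e.g. a personalised header partial) are excluded
// from the cache and re-rendered on every request.
public class NewsArticleController : RenderMvcController
{
    [DonutOutputCache(Duration = 600, VaryByParam = "none")]
    public override ActionResult Index(RenderModel model)
    {
        return base.Index(model);
    }
}
```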
Darren, if you have any documentation or code samples you could share regarding your Redis setup, that would be most useful.
With regards to the "Too much recursion" message... As soon as we add Ditto to a page we get that message. I think this is due to the way Ditto recursively looks for media items and logs each transformation on each loop. Sadly the recursion limit is not configurable in MiniProfiler and I couldn't get a custom build of it to do anything.
I'm a little bit disappointed with Azure, truth be told. While it's good practice to add caching etc., I should be able to run a site on there without the extra optimisation. Azure is quite expensive for the performance it provides.
Hi James,
Here is a good primer on the redis cache setup: http://blogs.msdn.com/b/webdev/archive/2014/05/12/announcing-asp-net-session-state-provider-for-redis-preview-release.aspx
It is basically a replacement for the default ASP.NET session state.
BTW - Session state is evil and often a cause of performance issues in itself - avoid using it whenever and wherever you can: http://stackoverflow.com/questions/3629709/i-just-discovered-why-all-asp-net-websites-are-slow-and-i-am-trying-to-work-out
With your last statement - I don't think Azure is the issue. We've a number of successful Azure deployments. I'm still anticipating some other custom code - or maybe even something like Ditto....
Hi, we are currently having the same issue. The site runs OK with a few users, but running a blazemeter test with 20 users makes it slower (4.5 seconds), and with 100 it is almost unresponsive.
I did check a lot of things, and also did an upgrade to 7.3.2 but the issue remains. Is there recently maybe something changed in Umbraco which applies to Azure?
@sjors Is your site making a lot of DB calls on the front end? If so, try to minimize this, as SQL Azure will throttle pretty quickly on smaller tiers.
No, we make no database calls at all, but we might have found it as we speak - it looks like log4net might be slowing down the website. We are fixing the configs now and will do a new load test. I'll keep you updated.
Ok, we just did a test with 100 concurrent users; with the fixed log4net settings the website is running fine and reacts instantly. So for people with Azure websites: check your log4net settings :)
What did you change?
Just changed the logging level from debug to error.
We've seen the custom log4net build that comes with Umbraco spike the CPU to 100% before, and disabling it has helped performance.
It seems the async appender blocks threads when flushing to disk, causing other threads to back up behind it...
Debug logging in production generally isn't a great idea though!
Yes, agree, we just never changed it in our config transformations ;)
Interesting to note the Log4Net issue. We noticed in NewRelic that it was using a great proportion of the overall time but I thought that was simply initialisation code.
You can completely disable it:
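For example, something along these lines in config/log4net.config - a sketch based on the default Umbraco 7 file, so check your own config for the actual appender name:

```xml
<!-- Sketch: silence the file log entirely by setting the root
     priority to OFF. The appender-ref name matches the default
     Umbraco 7 log4net.config; verify against your own file. -->
<root>
  <priority value="OFF" />
  <appender-ref ref="AsynchronousLog4NetAppender" />
</root>
```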
And re-run your tests. I always swear by the empty template test though to see whether that improves things - and gradually add stuff back.
We are seeing similar issues to James Strugnell. When we first deploy the site, after the initial warmup the site is super fast, with most requests coming back in 100-200 ms. After a while the site starts to slow down to around 500-1000 ms with some random spikes. After a random amount of time one of the instances becomes unresponsive to the point that we have to restart the app service. Sometimes it takes a couple of hours or a day or so before this happens.
Some more details:
We are still looking into the issue but any help would be appreciated.
Hi guys,
I've been researching performance issues with a site we've upgraded to 7.3. It has the same issues you're describing. Higher CPU load than usual and sudden HTTP queue buildups.
Not sure we've killed the HTTP queue culprit, but at least we've identified a few new issues in Umbraco, and the site is now running with "normal" CPU usage and can handle the load it had before.
Our installation is a multilingual, multi-site/domain solution integrated with UCommerce.
In all pages, we use the services on
UmbracoContext.Current.Application.Services
directly or indirectly to get domains, languages, dictionary items etc. There's also quite a lot of links to misc. content, which generates URLs, which indirectly uses the domain service. There's also a couple of cases where we use public access, which have also been refactored.
Turns out these services weren't really intended by HQ to be used heavily from the front-end. Nevertheless, they've been used all over the place since the legacy API is being factored out.
Beneath the services is a new cache similar to identity maps and L2 caches in ORMs. It should do wonders for the backoffice, but it is really CPU intensive when used from the front-end, or with multiple
GetAll()
calls. So Umbraco continues to perform well with sites that don't need any business logic and don't do very much URL building.
To limit the CPU usage on our site, we've replaced the built-in repositories for domains, languages and permissions.
Here's a git repo with those.
You also need a nightly build of Umbraco for now to be able to override the domain repo factory method, but I'll PR that immediately.
(NB - I didn't add any cache invalidating from updates in the backoffice, so it shouldn't be used as is.)
I've been going through this with HQ and we agreed that the way forward is a good case for the dev. mailing list. I'll post my suggestions for changes to the current cache implementation, and a better front-end API there soon.
Hope this can aid you guys in your endeavours. :)
Would also be interesting to know what kind of data you're using in the front-end. (Or might this also be a back-end issue?)
Lars-Erik
For us the performance tends to be fairly stable on the front-end when we make use of caching. Time to first byte is still a lot slower than we would like, particularly for static resources like the CSS/JS (sometimes up to 500 ms), but I appreciate again that's an Azure issue potentially solved with a CDN - it does demonstrate how slow it can be for static files though.
In my case I'm running:
Primarily my issues with performance on Azure are related to the back-end, including very slow preview generation (minutes, not seconds).
Performance is very irregular - seemingly random - and back-end performance doesn't seem to be affected by differing loads on the front-end.
We have noticed an improvement changing the default file logging level to ERROR. So it does lend weight to the theory that it is caused by the slow file system storage on Azure - if nothing else, the XML cache is no longer being invalidated and showing 'fix' several times per day since making that change.
I'll be doing much more in-depth testing on configurations and Azure for the next couple of weeks to try to find a fix for our case. I'm happy to share any additional findings - I appreciate this thread continuing to see comments, it looks like unfortunately most people have abandoned Azure when they've encountered these issues so there's not a huge amount of detail to go on - I'm hoping we can improve things and avoid that outcome.
Personally I don't think it's Azure at all. I think it's easier to spot due to some black magic, but essentially it's the changes in 7.3.
Lars, according to how your website is developed and your post here: https://groups.google.com/forum/#!topic/umbraco-dev/6YcSISxabKs your performance issues seem to be due to using Umbraco Services on your front-end. I realize this should be fixed/optimized in Umbraco, but the other mentions on this thread aren't really related to those types of performance problems.
Azure websites do have a slow file system - and more to the point, it is a network-based shared file system. This is very important to realize, and even more important is to note that any excessive file writing does not play well with this setup. This would include logging and, more specifically, Lucene. I would recommend the following when running on Azure websites:
Lucene/Examine configuration
heading. This will cause your Lucene indexes to operate on the local machine in ASP.Net TEMP files. We have had a few clients notice high CPU usage when Lucene is operating over a network file share, and this fixes the issue. I'm not sure if this helps solve some of your issues but please let me know if it does.
I also realize we need to write up some Azure-websites-specific documentation/best practices.
@Shannon, you're right of course. There are things to be aware of and configure "right" with Azure. But I'm sure when those are fixed, the issues I rant about are still relevant for any 7.3 site.
@Shannon - thank you for taking the time to post additional guidance on Azure. I can confirm we hadn't changed our Examine indexes - I had seen this mentioned on the Umbraco issue tracker, although the comments seemed to conclude it may not be necessary - so it's good to have the additional confirmation to do this.
I'll try these changes and let you know.
Ironically we upgraded from 7.2.6 to 7.3.1 and that stopped our spiking CPU/memory issues. Our site is now relatively stable in Azure although does "autoscale up" randomly due to high CPU, but that might just be due to increased load. We have configured Examine indexes to run locally but we haven't changed the Log4Net level from its default. Going to do that now...
@Lars - are you willing to volunteer a code snippet or two of how you were using the services on the front end - and how they were called?
I ask because in my experience of training Umbraco there is often another way.
For example calling a service Method in a View would always be a no - but people often forget about Umbraco Macros - and how they can be flexibly cached etc etc.
Code snippets would really help :)
We are still using xslt, do you see that as a performance issue?
Thanks @Darren, but I'm quite on top of it. The site runs quite well now.
When I say front-end, I mean anything not backoffice.
Controllers, business-objects and the like.
When we hit the services, and subsequently the cache layer, it's through IPublishedContent.Url, library.GetDictionaryItem (or localizationservice) and Access.GetAccessingMembershipRoles.
All of these now use the services. They even have obsolete attributes pointing to the new services.
For instance:
If you have, say, 6,000 items in the cache dictionary (god knows what, Umbraco does that), and you list URLs for say 50 content items on a page, you'll end up with 50 × 6,000 string comparisons of the cache keys. Accessing a URL goes through GetAssignedDomains, which does a Service.GetAll(). So one page view equals 300,000 extra cycles, when it could've just spent 1 to return one list of domains from memory. Add to this some permission checks, and you've got a million key searches. For one page!
I could of course cache it better myself, and that's what I did.
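For illustration, that kind of front-end caching can be sketched like this using Umbraco's runtime cache. Names and the timeout are made up, and as noted above it needs proper invalidation on backoffice updates before production use:

```csharp
using System;
using System.Collections.Generic;
using Umbraco.Core;
using Umbraco.Core.Cache; // for the GetCacheItem extension method
using Umbraco.Core.Models;

// Sketch: cache the result of an expensive GetAll() so repeated
// URL/domain lookups on one page view don't re-scan the cache
// dictionary. No invalidation here - a real implementation should
// clear this entry when domains change in the backoffice.
public static class CachedDomains
{
    public static IEnumerable<IDomain> GetAllDomains()
    {
        return ApplicationContext.Current.ApplicationCache.RuntimeCache
            .GetCacheItem<IEnumerable<IDomain>>(
                "MySite.AllDomains",
                () => ApplicationContext.Current.Services
                          .DomainService.GetAll(true),
                TimeSpan.FromMinutes(5));
    }
}
```

With this, the thousands of per-request key comparisons collapse into a single in-memory list lookup after the first hit.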
The site ended up spending 40-50% less CPU, and this is an S2 instance on Azure.
Oh, and those things happen during the routing too. So before you ever hit your controllers and views, Umbraco has already spent some time thrashing cache to find the content matching the URL.
You're welcome to join the discussion about whether this should be fixed over at the dev. mailing list.
Yup, these are the optimizations we'll fix soon... but really has nothing to do with Azure specifically ;)
Indeed, but you won't see them until you load test, profile, or even go live.
Sites run extremely well on a dev box. Probably better than most servers.
So I thought the OP might have biased the focus of the thread by "blaming" Azure, even though this new issue exists.
I'll quickly add this before I forget!
I too have had issues on Azure, where on a performance vs. cost point, Azure does not compare to a VM of a similar cost.
Obviously it has other strengths and benefits (peace man!), but the underlying architecture is quite different, and changes, such as more caching, are required to get decent performance.
We are currently having major performance issues with an Umbraco site running on Azure, which fits the symptoms of this thread, with the site becoming completely unresponsive on publish, for between 5 to 15 minutes.
The following steps made a big impact:
Ensuring Cached Partials are used where possible - @Html.CachedPartial("Name", Model, 1080, true, false)
Someone once told me that any caching is better than no caching, so even fairly small times can be useful. I'm not sure how Umbraco deals with this these days, but I would avoid excessively large durations.
Adding useTempStorage="LocalOnly" to Indexers and Providers - see ExamineSettings.config
Upgrading to Umbraco 7.4.0 - This resolved a lot of errors in the log regarding getting Media.
I had upgraded incrementally from 7.2.0, but this is the first update that has made a big impact.
Double-checking to ensure that when things aren't cached they aren't being initialized more than once. When properties/objects need to be reused, ensure these are declared as variables.
Setting
fcnMode="Single"
will also benefit you very much: http://shazwazza.com/post/all-about-aspnet-file-change-notification-fcn/ For a web app I think you can add the fcnMode attribute to the httpRuntime element in web.config, as you can't access machine.config.
http://imageresizing.net/docs/fcnmode
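For a web app, the resulting web.config fragment looks roughly like this (the fcnMode attribute on httpRuntime requires .NET 4.5+):

```xml
<!-- Sketch: setting FCN mode in web.config for an Azure Web App,
     where machine.config isn't accessible. -->
<system.web>
  <httpRuntime targetFramework="4.5" fcnMode="Single" />
</system.web>
```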
It's also worth noting that we ship with fcnMode="Single" by default now.
Sorry to spam this thread :) It's also worth noting that the performance issues regarding the cache dictionaries that Lars mentioned have been fixed in 7.3.7+, so those fixes aren't exclusive to 7.4.0.
fcnMode="Single" is the default for ImageProcessor now also. Thanks for all the input everyone!
7.3.3 site on Azure web app service stopped in its tracks with an uplift from 6K sessions/day to 22K sessions/day. Read this thread and others and decided to go for an upgrade to 7.4.1. This has helped - less RAM and CPU in use, but only just. The site is still slow and the CPU is up at 90+%. The Azure instance is:-
Using the AzureRedisSessionStateStore. Nothing special for media.
Is there anything we can look at to get some more speed out of this set up. We can upscale if necessary.
Any advice would be appreciated.
Which of the above optimizations did you apply?
Biggest ones that will benefit you, if you are not using them, are to change to
fcnMode="Single"
and set
useTempStorage="LocalOnly"
for all Examine searchers and indexers. You can also set the log level to
ERROR
instead of Info/Warn. Other than that: check your logs, and if you have CPU spikes, the best way to figure out what's using all those CPU cycles is to take a memory dump and analyze it.
Thanks for getting back so quickly. Really appreciated.
We're already using fcnMode="Single" in the web.config.
I don't see useTempStorage anywhere in the ExamineSettings.config. Do I add it as a separate tag? If so what's the syntax?
In log4net, is it the NHibernate Level node that should be changed?
Thanks.
If you don't have additional indexes apart from the default ones in Umbraco, your ExamineSettings.config should look like this:
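Something along these lines, with useTempStorage="LocalOnly" on each provider. This is an abbreviated sketch - your real file has more attributes (such as the analyzer) and a matching ExamineSearchProviders section that needs the same attribute:

```xml
<!-- Sketch: the default Umbraco indexers with useTempStorage="LocalOnly"
     added. Abbreviated; apply the same attribute to the searchers too. -->
<ExamineIndexProviders>
  <providers>
    <add name="InternalIndexer"
         type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
         useTempStorage="LocalOnly"
         supportUnpublished="true" supportProtected="true" />
    <add name="InternalMemberIndexer"
         type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
         useTempStorage="LocalOnly" />
    <add name="ExternalIndexer"
         type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
         useTempStorage="LocalOnly" />
  </providers>
</ExamineIndexProviders>
```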
If you have additional/different indexes, just add the
useTempStorage="LocalOnly"
attribute to all of them (both indexers and searchers). This will help a lot. For log4net this should do:
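Something like this in config/log4net.config - a sketch of the root element; keep your existing appender-ref:

```xml
<!-- Sketch: raise the file log threshold so only errors are written. -->
<root>
  <priority value="Error" />
  <appender-ref ref="AsynchronousLog4NetAppender" />
</root>
```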
Make sure to read up on the
useTempStorage
attribute and its implications. We're currently preferring
LocalOnly
over
Sync
because we've seen some problems with syncing in the past (though not after we fixed those errors, but we're being a bit cautious). http://issues.umbraco.org/issue/U4-7614
Thanks Sebastiaan.
Set the logging level to "Error" and the ExamineSettings.config now looks like this:-
Rebuilt indexes. Site not much faster.
"Does Umbraco HQ do paid support over a weekend?" the client asked. Big launch on Monday.
Have you implemented Output Caching or Partial Caching? It's a good idea for most sites but certainly for Azure Web Apps it's almost a must....
No Jeavon, is that an Azure thing or an Umbraco thing?
Hi Craig,
CachedPartial is built into Umbraco and probably the easiest thing you can do. See the "Caching" section at the bottom of https://our.umbraco.org/documentation/Reference/Templating/Mvc/partial-views
Output Caching is a ASP.NET thing, we mostly use Donut Output Caching by DevTrends (for the holes) but the default ASP.NET Output Caching is also an option. There are many others also such as Redis Output Caching
For Output Caching you will need to hook into Umbraco events to ensure it's cleared when needed or some other mechanism.
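For example, a rough sketch of hooking a publish event to flush Donut output cache - illustrative only; adapt the clearing strategy to your own caching choice:

```csharp
using DevTrends.MvcDonutCaching;
using Umbraco.Core;
using Umbraco.Core.Services;

// Sketch: clear donut output cache whenever content is published,
// via an Umbraco ApplicationEventHandler.
public class CacheClearingEvents : ApplicationEventHandler
{
    protected override void ApplicationStarted(
        UmbracoApplicationBase umbracoApplication,
        ApplicationContext applicationContext)
    {
        ContentService.Published += (sender, args) =>
        {
            // Blunt but safe: flush all cached output on any publish.
            new OutputCacheManager().RemoveItems();
        };
    }
}
```

A more targeted approach would remove only the cache entries for the published nodes' URLs, at the cost of more bookkeeping.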
Hope that's helpful?
Jeavon
p.s. The Supercharge your Umbraco session from CodeGarden 2014 goes through various different caching methods http://stream.umbraco.org/video/9949630/supercharge-your-umbraco
Thanks Jeavon,
We'll def look at implementing partial caching tomorrow. In the meantime, we've gone down from 20 seconds to 2-3 seconds load time by implementing the suggestions on logging, examine indexes and also adding
<httpErrors existingResponse="PassThrough" />
in web.config. That had been missed on the last upgrade. It seemed to unblock the whole site. Load on the site has gone down slightly now (11pm UK time) so we're just hoping the site holds up for the next USA West Coast wake-up. But it's looking hopeful. It's been a stressful day :)

My Examine config is set to Sync. I upgraded to 7.3.7 this week and everything seemed much more stable (previously on 7.3.1). However, we've just scaled up to 2 instances (in Azure) for the weekend and have noticed that the "non-master" is getting a load of "Could not retrieve media * from Examine index, reverting to looking up media via legacy library.GetMedia method" warnings in the logs. The master instance is not reporting these issues. I've tried re-indexing but that doesn't seem to affect the secondary server.
Is this a known issue with 7.3.7 (a scaled out instance doesn't index properly) and/or is this something that might be solved by changing the config setting to "LocalOnly"? How can I force a re-index on the non-master instance?
Apologies if this should be its own question, but I happened to notice the LocalOnly suggestion in this thread.
Thanks.
Further to my previous comment - I've tried the LocalOnly Examine setting and that made no difference to my issue. My problem, it seems, is that non-"Master" instances aren't creating their own indexes automatically.
To solve the problem short-term I had to keep deleting the ARRAffinity cookie until I landed on the instance which had no index. From there I could log in and run the indexing manually. After that the errors mentioned above stopped occurring.
Can anyone else validate that on 7.3.7 (at least) no automatic indexing is taking place on the non-Master instance?
If it helps, prior to my recent upgrade to 7.3.7 I would see this message in the logs every time I scaled up -
Umbraco.Core.Sync.DatabaseServerMessenger - No last synced Id found, this generally means this is a new server/install. The server will rebuild its caches and indexes and then adjust it's last synced id to the latest found in the database and will start maintaining cache updates based on that id
I've not seen that message since I upgraded. Maybe as part of my upgrade I've lost some previously configured settings. Starting to think this should be its own forum question...
Created a separate question for my issue:
https://our.umbraco.org/forum/umbraco-7/using-umbraco-7/75289-azure-website-not-initializing-examine-index-after-scaling