Help diagnosing Azure App Service outages
Hi all,
I run an Umbraco 7.5.2 site in Azure App Service, and for the last several weeks we have been getting seemingly random outages on Azure that last around 5-15 minutes.
I've been trying to diagnose what's causing the issue. The Umbraco logs don't indicate that the hosting environment is being shut down or restarted; they just start filling up with:
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Umbraco.Web.Scheduling.KeepAlive.<PerformRunAsync>d__4.MoveNext()
2017-01-20 16:37:27,082 [P1988/D3/T57] ERROR Umbraco.Web.Scheduling.ScheduledPublishing - Failed (at "https://<omitted domain>/umbraco").
Does this mean Umbraco is still running under the hood but Azure is unable to route the requests?
When I look at the failure history on Azure, it shows container health at 100% and organic health at 0% during the outage.
There's nothing interesting in the logs before or after the outage, and no exceptions that I can see that would cause the app pool to be recycled.
We're hosting on the S2 App Service plan with an S0 database. When the site is available, it's generally pretty quick.
As a side note, we did look at Umbraco Cloud as a way to minimize the operational headaches, but because we can't integrate it with Azure Active Directory we unfortunately can't go down that route.
Does anyone have any experience or pointers to help debug issues like this? Any help would be appreciated.
Hi Dan,
Do you know if your app is changing server when the outages occur? You should notice the name of the log file changing, as it contains the server name.
Also, it's worth noting that S2 is the recommended minimum SQL Azure database tier (not that I think it's having an impact on your issue).
Jeavon
Hi Jeavon, this issue is causing me problems in a distributed environment on Azure. How can we fix it? I'm currently using Umbraco 7.6.3.
Arlan
Hi Jeavon,
I've seen that happen before, but not recently with these small outages; the machine name is staying constant in the logs.
I can upscale the database, but as our DTU usage is generally very low (usually 5-10%) I'd drawn the same conclusion as you: it's unlikely to be the cause of the issue.
Dan
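With the machine name constant, another check that needs only the Umbraco log is to watch the bracketed prefix on each entry, such as [P1988/D3/T57] in the line quoted in the first post. The sketch below assumes that P is the process ID, D the AppDomain ID and T the thread ID; if that holds, a change in P or D between two entries would suggest the worker process or AppDomain was replaced at that point, even when nothing was logged about a shutdown. The function name is made up for illustration.

```python
import re

# Matches the entry prefix seen in the thread, e.g.
# "2017-01-20 16:37:27,082 [P1988/D3/T57] ERROR ..."
# Assumption: P = process ID, D = AppDomain ID, T = thread ID.
PREFIX = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"\[P(?P<pid>\d+)/D(?P<domain>\d+)/T\d+\]"
)


def find_restarts(lines):
    """Return (timestamp, old, new) wherever the (pid, domain) pair changes."""
    changes = []
    previous = None
    for line in lines:
        match = PREFIX.match(line)
        if not match:
            continue  # stack-trace lines and other continuation lines
        current = (int(match["pid"]), int(match["domain"]))
        if previous is not None and current != previous:
            changes.append((match["ts"], previous, current))
        previous = current
    return changes
```

Passing an open log file to find_restarts works because the function only iterates over lines. An empty result across an outage would fit the picture of a process that stayed alive but could not be reached, while a changed pair would point at a restart.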
Hi Dan,
Do you have "Always On" enabled in the "Application Settings"? I would recommend trying this if you don't have it on already.
Also, do you know what the app's normal boot time is? The issue with the S0 and S1 database tiers is that boot time can be prolonged by the DTU limit, even though, as you report, DTU usage is very low once the app has started (using Elastic Pools is the best solution if you have enough databases to share).
Jeavon
Hi Jeavon,
We've got Always On enabled, but good point: our usual boot time is around 5-10 minutes following a restart.
I'd kind of discounted this as I couldn't see anything in the logs that was causing it to restart, but if it is restarting on a slower DB tier, that could explain things.
If you think it's worthwhile, I can trial the S2 database tier and see if this improves things! I appreciate your help and advice so far.
Dan
Hi Dan,
I would say give it a try and see how it affects your app startup time.
I'm pretty sure that Azure Web Apps recycle every 1740 minutes (the IIS app pool default), although I can't actually find any reference to this specifically for Azure Web Apps. Perhaps this is what you are seeing?
Jeavon
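One way to test the periodic-recycle idea is to compare the gaps between outage start times with the 1740-minute (29-hour) interval mentioned above. This is only a rough sketch: the function name and the 20-minute tolerance are arbitrary choices for illustration, and the outage times would have to come from your own monitoring.

```python
from datetime import datetime, timedelta

RECYCLE_PERIOD = timedelta(minutes=1740)  # 29 hours, the IIS default quoted above


def gaps_vs_period(outage_starts, period=RECYCLE_PERIOD,
                   tolerance=timedelta(minutes=20)):
    """For each consecutive pair of outages, return (gap, fits), where fits
    is True if the gap is within `tolerance` of a whole multiple of `period`."""
    ordered = sorted(outage_starts)
    results = []
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        multiples = round(gap / period)          # nearest whole number of periods
        drift = abs(gap - multiples * period)    # distance from that multiple
        results.append((gap, multiples >= 1 and drift <= tolerance))
    return results
```

If most gaps come back as not fitting, a fixed-interval recycle is an unlikely explanation for outages that appear at random times.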
Hi Jeavon,
We've just had another restart, with Azure returning a 503 error. I can see the web worker process has stopped.
Ideally I'd like to figure out the cause of the restarts, as 'Always On' should apparently prevent the IIS recycle. It's just tricky to get the data out, given the limited view Azure Apps provides.
If anyone has any tips on diagnosing the cause of the restart, that would be appreciated.
Kind regards, Dan
Hi Dan,
Have you checked the trace logs in Kudu?
They are in
D:\home\LogFiles\kudu\trace
Jeavon
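To narrow down which trace files are worth reading, a small helper can list the files last modified close to the time of an outage. This is a minimal sketch with a hypothetical function name; it compares against the machine's local clock, so the incident time must be given in the same time zone the server uses.

```python
from datetime import datetime, timedelta
from pathlib import Path


def files_near(directory, incident, window=timedelta(minutes=15)):
    """Return (modified_time, path) pairs for files under `directory` whose
    last-modified time is within `window` of `incident`, oldest first."""
    hits = []
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        # fromtimestamp() yields local time, so `incident` must be local too.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if abs(modified - incident) <= window:
            hits.append((modified, path))
    return sorted(hits)
```

For the folder above, the call would be files_near(r"D:\home\LogFiles\kudu\trace", incident) with incident set to the outage time as a datetime.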
Hi Jeavon,
I've just had a look through all the trace logs from around the time of the incident. They all seem to relate to Azure / Kudu tasks, e.g. loading the event log, /env etc. I can't see anything in there that indicates a restart.
I looked at the health and performance monitor and, interestingly enough, the web worker process wasn't there at all during the downtime. I actually had to restart the app service to get the web worker back. I've not seen this before, so I'm not sure if it's usual, but it looks like the worker had stopped completely: no disk IO or CPU usage at all.
After I manually pressed restart it came back up pretty quickly, within a minute or so. I would imagine that if Umbraco had been starting during the outage, it would have been using the CPU and disk to rebuild indexes, so this is very strange.
Dan
Just as an update, we had the problem again, but this time I managed to grab the logs and inspect them while the service was restarting. If I press the 'restart' button in Azure it comes up pretty much straight away, so I'm not convinced it's an Umbraco issue.
I think there's something odd going on with our Azure package. It's as if the web worker goes away and eventually comes back, unless you reboot it and force it to come back sooner.
Hi Dan,
I was wondering if you found the cause of this issue?
Regards Niklas
Did you manage to find an actual solution to this issue? I have experienced the same thing, and while some messing with the indices has lessened the problem, it still happens from time to time. The restart usually only takes a couple of minutes, but it's still really annoying.
In terms of resolving the issue: we don't have it any more, and the site is larger in every sense of the word and more stable.
We made several changes and found a few issues. I'm not sure which of them was the silver bullet, or if it was a combination of things.
We had a lot of pointless requests coming in from a bad Outlook / Exchange config looking for an XML file on our domain that wasn't there. Eventually we fixed the config, but as a stopgap we put an empty .xml file there so it wasn't returning a 404. At times it almost felt like we were running a denial-of-service attack against ourselves.
We regularly update Umbraco.
We turned off index rebuilds on restart and do those manually when they get out of sync, so if the site does restart it doesn't take long to come back up. Our indexes are only used in the back office although I believe some of the Umbraco
We've moved media file hosting to blob storage.
We ran some maintenance scripts/plugins to clean out old versions of nodes, and ran some maintenance scripts on the database to reduce table fragmentation.
We optimized various template files/macros and fixed a few recurring errors that could occur on some views. I suspect this had the biggest impact, as the site may have been restarting due to error count.
A quirk I've noticed with Azure deployment slots, and the app service in general, is that sometimes you just seem to get a bad instance. We used to swap staging and live frequently, and one of the two slots seemed to have a lot of issues despite having the same code and config. We ended up deleting the slot and recreating it, so it may just be a quirk of the Azure app service (or unrelated and I'm imagining things).
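For anyone wanting to check whether they have the same kind of stray requests for a missing file, a quick tally of 404 responses per URL from the web server logs can show it. The sketch below assumes web server logging is enabled and the logs are in W3C extended format with the standard cs-uri-stem and sc-status fields; the function name is made up for illustration.

```python
from collections import Counter


def count_404s(lines):
    """Count 404 responses per requested path in W3C extended log lines.

    Column positions are read from the '#Fields:' header, so no fixed
    column order is assumed.
    """
    fields = []
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue  # other directives, blanks, or data before any header
        row = dict(zip(fields, line.split()))
        if row.get("sc-status") == "404":
            counts[row.get("cs-uri-stem", "?")] += 1
    return counts
```

Calling most_common(10) on the result lists the ten paths that return 404 most often, which is where a misconfigured client hammering a non-existent file should stand out.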