broken database in a load balanced environment after document type modification


David Vainer 15 posts 96 karma points

Sep 27, 2023 @ 08:16

"Broken" application in a load-balanced environment after document type modification
We're using Umbraco 8.2.2 with a large number of nodes (tens or even hundreds of thousands) in a load-balanced environment. Operations such as creating, modifying, or deleting document types have always been very slow, taking 15 minutes or more. They also spike CPU and memory usage on our servers.

Our setup is on AWS, with:
- Two EC2 instances: one Master for content editors and one Front for customers, both Windows servers.
- An RDS for the Umbraco database.
- An application load-balancer to manage requests.
Lately, after modifying or creating a document type, there is a temporary slowness. The Master server then functions as expected, but the Front server malfunctions: it continuously "restarts" itself and the application never finishes loading there.
Restarting IIS, the servers, and the database doesn't resolve this problem.

Our logs show:
- Boot failures with a timeout exception related to acquiring a lock.
2023-08-21 05:27:50,652 [P6360/D10/T10] ERR Umbraco.Core.Runtime.CoreRuntime - Boot failed. (90193ms) [Timing 0596be1]
Umbraco.Core.Exceptions.BootFailedException: Boot failed. ---> System.TimeoutException: Failed to enter the lock within timeout.
   at Umbraco.Core.AsyncLock.Lock(Int32 millisecondsTimeout) in D:\a\1\s\src\Umbraco.Core\AsyncLock.cs:line 114
   at Umbraco.Core.MainDom.Acquire() in D:\a\1\s\src\Umbraco.Core\MainDom.cs:line 170
   at Umbraco.Core.Runtime.CoreRuntime.AcquireMainDom(MainDom mainDom) in D:\a\1\s\src\Umbraco.Core\Runtime\CoreRuntime.cs:line 232
   at Umbraco.Core.Runtime.CoreRuntime.Boot(IRegister register, DisposableTimer timer) in D:\a\1\s\src\Umbraco.Core\Runtime\CoreRuntime.cs:line 146
- Warnings about the wait lock timing out and application shutdown.
2023-08-21 05:29:58,630 [P6360/D11/T24] WRN Umbraco.Core.Sync.DatabaseServerMessenger - The wait lock timed out, application is shutting down. The current instruction batch will be re-processed.
- A single warning about loading content from local database.
2023-09-26 05:45:15,613 [P2368/D2/T1] WRN Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService - Loading content from local db raised warnings, will reload from database.
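
For anyone who wants to triage log files with entries like the ones above, here is a minimal Python sketch that splits each line into its parts and counts warnings and errors per logger. The regular expression is inferred only from the three sample lines in this post, so it is an assumption about the log layout, and the names `parse_line` and `count_by_logger` are hypothetical helpers rather than anything Umbraco ships.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this post (an assumption, not an
# official description of the Umbraco log layout):
#   <date> <time>,<ms> [P<pid>/D<domain>/T<thread>] <LEVEL> <logger> - <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[P(?P<pid>\d+)/D(?P<domain>\d+)/T(?P<thread>\d+)\] "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+) - "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_by_logger(lines, levels=("WRN", "ERR")):
    """Count entries per (level, logger) pair, skipping stack-trace continuation lines."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] in levels:
            counts[(entry["level"], entry["logger"])] += 1
    return counts


# Header lines copied from the log excerpts in this post (stack trace omitted).
sample = [
    "2023-08-21 05:27:50,652 [P6360/D10/T10] ERR Umbraco.Core.Runtime.CoreRuntime - Boot failed. (90193ms) [Timing 0596be1]",
    "2023-08-21 05:29:58,630 [P6360/D11/T24] WRN Umbraco.Core.Sync.DatabaseServerMessenger - The wait lock timed out, application is shutting down. The current instruction batch will be re-processed.",
    "2023-09-26 05:45:15,613 [P2368/D2/T1] WRN Umbraco.Web.PublishedCache.NuCache.PublishedSnapshotService - Loading content from local db raised warnings, will reload from database.",
]

if __name__ == "__main__":
    for key, n in sorted(count_by_logger(sample).items()):
        print(key, n)
```

Grouping by process ID (`pid`) and app domain (`domain`) in the same way may help show how often the Front server recycles, since each restart should appear under a new identifier.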

AWS Performance Insights indicates that the most demanding SQL operations are:

SELECT value FROM umbracoLock WHERE id=@0
UPDATE umbracoLock SET value = (CASE WHEN (value=1) THEN -1 ELSE 1 END) WHERE id=@0
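
To make the second statement easier to reason about, here is a small Python sketch that runs the same toggle against an in-memory SQLite table. It only illustrates the arithmetic of the statement: SQLite stands in for SQL Server, the three-column schema is a simplified assumption, and the lock id `1` with the name `example` is hypothetical. As far as I understand, the point of the UPDATE is to make the transaction take a write lock on that row, so the stored value itself simply flips between 1 and -1.

```python
import sqlite3

# Same toggle as the UPDATE reported by Performance Insights, with a
# positional parameter (?) in place of @0.
TOGGLE = (
    "UPDATE umbracoLock "
    "SET value = (CASE WHEN (value=1) THEN -1 ELSE 1 END) "
    "WHERE id=?"
)


def make_demo_db():
    """Create an in-memory table with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE umbracoLock ("
        "id INTEGER PRIMARY KEY, value INTEGER NOT NULL, name TEXT NOT NULL)"
    )
    # Hypothetical lock row; real ids and names are not taken from the post.
    conn.execute(
        "INSERT INTO umbracoLock (id, value, name) VALUES (?, ?, ?)",
        (1, 1, "example"),
    )
    conn.commit()
    return conn


def toggle_lock(conn, lock_id):
    """Run the toggle and return how many rows it touched (0 if the id is missing)."""
    return conn.execute(TOGGLE, (lock_id,)).rowcount


def read_value(conn, lock_id):
    """Run the SELECT from the post and return the stored value, or None."""
    row = conn.execute(
        "SELECT value FROM umbracoLock WHERE id=?", (lock_id,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = make_demo_db()
    print(read_value(conn, 1))   # value before any toggle
    toggle_lock(conn, 1)
    print(read_value(conn, 1))   # value after one toggle
    toggle_lock(conn, 1)
    print(read_value(conn, 1))   # value after two toggles
```

If that reading is right, a value of -1 in the table is not by itself a sign of a stuck lock; the contention would come from a transaction holding the row lock for a long time, not from the value it leaves behind.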

Examining the [UmbracoLock] table, which as far as I know changes frequently, consistently shows problematic lock states:

Since this isn't the first time we've encountered this malfunction, this time we tried to address it with the following:
- We upgraded the EC2 instances to more powerful ones.
- Modifications were made directly via the Master server, bypassing the load-balancer.
- Customer traffic was reduced during the change (though not entirely stopped).
- Minimal changes were implemented to limit potential issues - 2 data types, 1 document type.
- Manual modifications we made to the [UmbracoLock] table were quickly reverted by Umbraco.
The only solution that worked was restoring a database backup, resulting in lost modifications.
Interestingly, a week before this issue, document type creation and modification were successful.
Basically, apart from the temporary slowness we experience every time, until a month ago we didn't encounter this problem at all.

This problem now hinders any modifications or additions to document types, making it a critical issue. We're seeking insights and advice.

Thanks!