publishedcontentcach locked and db lock timeout errors on load balanced umbraco 8

Gabor Ferencz 40 posts 181 karma points

May 24, 2022 @ 06:59

PublishedContentCache locked and DB lock timeout errors on load-balanced Umbraco 8

Umbraco version: 8.18.3. Multilingual site containing about 10,000 pages. NuCache DB size: 1.4 GB.

I'm in a bit of a pickle. We have a site that runs on Kubernetes, in a docker container. This is something we've done before and it has worked. The setup is as follows:

1 CMS pod, 3 front-end Umbraco pods, and 3 Nuxt applications communicating with the Umbraco front-end pods.

We were all right with lower data volumes, but since loading in 10,000 pages containing a bunch of baked HTML content, we cannot start up our sites. Each pod takes well over 10 minutes to start, and then we run into this error:

The process cannot access the file 'C:\inetpub\wwwroot\site\App_Data\TEMP\NuCache\NuCache.Content.db' because it is being used by another process.

I have these in the app settings, but since this is Kubernetes and Docker, I don't think they're too relevant.

<add xdt:Transform="Insert" key="Umbraco.Examine.LuceneDirectoryFactory" value="Examine.LuceneEngine.Directories.TempEnvDirectoryFactory, Examine" />
<add xdt:Transform="Insert" key="Umbraco.Core.LocalTempStorage" value="EnvironmentTemp" />

We don't use Examine on the front-end servers, so we've also implemented this:

public class DisabledExamineComposer : IUserComposer
{
    public void Compose(Composition composition)
    {
        // Only the front-end servers skip Examine; the CMS keeps the defaults
        if (MofConfigurationManager.Current.IsFrontEndServer)
        {
            // replace the default
            composition.RegisterUnique<IUmbracoIndexesCreator, InMemoryExamineIndexFactory>();

            // replace the default populators
            composition.Register<MemberIndexPopulator, DisabledMemberIndexPopulator>(Lifetime.Singleton);
            composition.Register<ContentIndexPopulator, DisabledContentIndexPopulator>(Lifetime.Singleton);
            composition.Register<PublishedContentIndexPopulator, DisabledPublishedContentIndexPopulator>(Lifetime.Singleton);
            composition.Register<MediaIndexPopulator, DisabledMediaIndexPopulator>(Lifetime.Singleton);
        }
    }
}

So I'm at a bit of a loss. It doesn't feel like 10,000 pages plus about 5,000 images should do this to the cache. What are we doing wrong, and what can we do to fix the startup time and reduce the lock times?

One other note to mention: On the CMS, when saving pages, we got all these errors while the other pods were starting up:

[Screenshot of the errors; image not preserved]

We have a number of composers in the system that read from the cache, and on save some also validate and update custom tables in the database, though I doubt that Entity Framework has much to do with the cache loading or the "lock period timeout exceeded" errors...

We really need to hand this site over soon, and while everything is looking great, we cannot have more than one instance of the application running, and it doesn't look great that we get all these lock timeouts...


Marc Goodson 2157 posts 14435 karma points MVP 9x c-trib

May 24, 2022 @ 23:04

Hi Gabor

When this happens, what is the activity on your SQL instance? Is it at 100%, with DTUs maxed out?

What I've seen, even with the correct config in place, is that the attempt to get the MainDom lock via SQL fails because the SQL instance is maxed out, and Umbraco keeps retrying every two seconds to get the lock. Unfortunately, this ensures the SQL instance stays maxed out and it gets into a bit of a loop...

If this is your scenario ... then scaling up the DB instance is a possible short term experiment...

There is also the possibility of moving the SqlMainDomLock to a different database: https://github.com/umbraco/Umbraco-CMS/pull/11075

and there is a setting for increasing the Timeout value for the MainDom lock:

<add key="Umbraco.Core.SqlWriteLockTimeOut" value="6000" />

https://our.umbraco.com/documentation/Reference/Config/webconfig/#umbracocoresqlwritelocktimeout

Not sure that helps, but thought I'd mention in case it turned out to be relevant in your case.

regards

Marc


Gabor Ferencz 40 posts 181 karma points

May 25, 2022 @ 14:43

Thanks Marc,

The CPU does not seem overly high. It spikes for a bit, then settles down. It's an S2 DB on Azure. I've done the timeout update, and also changed the serializer for NuCache and the packet size, but nothing seems to make a difference.

Just trying to ignore the cache file entirely, to see how that works...

Thanks for the suggestions... Incidentally, do you have a large cache? How long does it take to load?

Thanks! Gabor


Marc Goodson 2157 posts 14435 karma points MVP 9x c-trib

May 26, 2022 @ 07:59

Hi Gabor

It seems too long! I work for an agency so tend to see lots of different setups; the more content you have, the more likely it seems that the SQL instance gets maxed out, which is why it was my first thought.

I have had a V8 site with around 7,000 nodes be slow to start up, but not 10 minutes slow! More like a couple of minutes; moving to ExamineX (Azure Search-based Examine indexes) helped there.

I suspect your issue may be to do with running Umbraco 8 in Kubernetes...

E.g. if you put the CMS and front-end instances into Azure Web Apps, I reckon you'd be fine with that size of cache!

In a web app, you'd also set:

<add
    key="Umbraco.Core.MainDom.Lock"
    value="SqlMainDomLock" />
<add
    key="Umbraco.Core.LocalTempStorage"
    value="EnvironmentTemp" />

I also did a bit of a google to see if anyone else had run into this, but only found your comment on this one:

https://our.umbraco.com/forum/using-umbraco-and-getting-started/105139-is-umbraco-kubernetes-ready

Did you give the deployed instances different SiteNames? E.g. if they think they are the same, are they clashing in some way? Using EnvironmentTemp for the NuCache location would be a good experiment to rule this out.

Finally, what is going on in the logs during the 10 minutes it takes to start up? Where are the bottlenecks? It feels like something is timing out for it to take that long!

Is it the lack of speed writing to the local disk in a container? If so, then yes, disabling the local DB would be worth doing:

https://our.umbraco.com/Documentation/Reference/V9-Config/NuCacheSettings/#additional-settings
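
That page covers V9. For V8 I'm not aware of an equivalent appSettings key; as far as I know the local cache files are switched off through a composer instead. A minimal sketch, assuming the `IgnoreLocalDb` option on `PublishedSnapshotServiceOptions` (namespace `Umbraco.Web.PublishedCache.NuCache`) is available in your 8.x version; the class name `DisableNuCacheLocalDbComposer` is made up for illustration:

```csharp
using Umbraco.Core;
using Umbraco.Core.Composing;
using Umbraco.Web.PublishedCache.NuCache;

// Hypothetical composer name; verify that the option exists in your Umbraco 8 version.
public class DisableNuCacheLocalDbComposer : IUserComposer
{
    public void Compose(Composition composition)
    {
        // IgnoreLocalDb = true asks NuCache not to use the local NuCache.*.db files,
        // so the content cache is loaded from the database on startup instead.
        var options = new PublishedSnapshotServiceOptions
        {
            IgnoreLocalDb = true
        };

        // Register the options so the published snapshot service picks them up.
        composition.Register(factory => options, Lifetime.Singleton);
    }
}
```

The trade-off is that every cold start then reads the whole cache from SQL, so this may remove the file-lock error but it will not reduce the load on the database.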

Callum Whyte has written an article here: https://skrift.io/issues/umbraco-docker-and-kubernetes-should-we-care/ where he talks about how applications that rely heavily on local disk performance might not be the right choice for containerisation!

Sorry to not be more helpful!

regards

Marc


Gabor Ferencz 40 posts 181 karma points

May 26, 2022 @ 10:27

Thank you for the detailed response!

So this issue happens locally too, if I connect to the large DB.

Incidentally, if I connect with another, clean version of Umbraco, the system starts up in a couple of minutes, even when connecting to the large DB.

So there must be something in the config, or in the custom composers, that is overriding functionality. We have an Entity Framework instance in the application too.

The way I see it, Kubernetes would be acting just like any other load-balanced machine on a VM, so I doubt the issue lies there.

I have just created a clean install of Umbraco, and pointed it at the large DB. The smaller, local one works ok, but requires a SQL connection timeout of at least 120 secs. The large DB dies. I set the SQL connection timeout to 240 and after 6 minutes, the app says boot failed.

Umbraco times out and CPU is maxed out for that time.


Gabor Ferencz 40 posts 181 karma points

May 27, 2022 @ 06:45

Hi Marc,

I found the following. When the app tries to start up, it tries to run the following SQL and falls over with a database timeout:

(@0 uniqueidentifier)DELETE FROM cmsContentNu WHERE cmsContentNu.nodeId IN ( SELECT id FROM umbracoNode WHERE umbracoNode.nodeObjectType=@0 )

On the QA environment, the one that contains the large amount of data, this query takes 10 minutes to run. When multiple pods running the same code start up and run this, they don't have any way of completing, so I will look at staggering the deploy process, to pull up the admin pod first, then the front-end pods.

Another interesting thing is that when the app starts up on the admin pod, it tries to rebuild the cmsContentNu table, which takes a long time. Why would it want to? Is there a flag somewhere that says the table needs rebuilding? We expect a cold start, but not a full DB rebuild on every cold start.


Marc Goodson 2157 posts 14435 karma points MVP 9x c-trib

May 27, 2022 @ 11:22

Hi Gabor

Yes, that's closer to what I've seen, which is why I asked about the SQL instance initially: too much is being done via the SQL server during startup, it can't handle everything, gets maxed out and then doesn't recover...

I think your large amount of content is making the DELETE ... WHERE ... IN statement underperform!

It looks like it's fired here in a RebuildContentDbCache method (and the same exists for Members and Media):

https://github.com/umbraco/Umbraco-CMS/blob/0134199035cf4caf0deddc24648be5cdeb27d2f9/src/Umbraco.PublishedCache.NuCache/Persistence/NuCacheContentRepository.cs#L154

Tracing this through, it gets called from PublishedSnapshotService; there are also lots of interesting comments here:

https://github.com/umbraco/Umbraco-CMS/blob/0134199035cf4caf0deddc24648be5cdeb27d2f9/src/Umbraco.PublishedCache.NuCache/PublishedSnapshotService.cs

I wonder if it would help to have the MainDomLock in another database instance? That way, whilst that delete statement is running, Umbraco can still determine the MainDom lock:

https://our.umbraco.com/documentation/Reference/Config/webconfig/#umbracocoresqlwritelocktimeout

The instances would still be slow to come online but the lock might be ok?

With super large sites for V7, we kept the published cache size down by introducing the notion of archived content, obviously not a quick fix here, but I'm wondering if any of your content is regularly edited? - does it need to be in the 'published cache'?

https://moriyama.co.uk/about-us/news/blog-the-need-for-archived-content-in-umbraco-and-how-to-do-it/

I was about to say we should raise this with HQ, but I can see now you're chatting with PMJ, and he is probably the best person to work it all out!

regards

Marc
