Cache in load balanced environment

Vidar A. Westrum 23 posts 129 karma points

Oct 06, 2016 @ 13:24

Hi,

We have some serious problems with our load-balanced front-end servers. Once in a while, they suddenly take 25 seconds to answer a call to the API we have created. Everything gets really bad for a while, and then it speeds up again after a few minutes. Given that Umbraco serves the front pages for seven webshops, this has become rather important :)

That was the abstract; now let's describe the environment a bit:

We have five servers: one admin server, which is set as master in code, and four front-end servers (flexible load balancing) set as slaves. The front-end servers are the ones that answer the web application, which runs on about 40 servers.

The API gets a published page at root level by name, then picks the first node of a given doctype, and finally looks up a node by an ID taken from a property on the previously selected node. It then converts the result to JSON that suits our needs.
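To make the lookup concrete, here is a rough Python sketch of those steps. It is not our actual code: the `Node` class, `build_response`, and the JSON shape are made up for illustration, and I am assuming that "the first node of a given doctype" means a direct child of the root page.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a published content node."""
    id: int
    name: str
    doctype: str
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def find_by_id(node: Node, node_id) -> Optional[Node]:
    """Depth-first search for the node with the given id."""
    if node.id == node_id:
        return node
    for child in node.children:
        found = find_by_id(child, node_id)
        if found is not None:
            return found
    return None


def build_response(roots, page_name, doctype, property_alias):
    """Return the JSON for the target node, or None if any step fails."""
    # Step 1: the published page at root level, looked up by name.
    page = next((n for n in roots if n.name == page_name), None)
    if page is None:
        return None
    # Step 2: the first child of the given doctype (assumption: direct child).
    first = next((c for c in page.children if c.doctype == doctype), None)
    if first is None:
        return None
    # Step 3: follow the ID stored in a property on that node.
    target_id = first.properties.get(property_alias)
    target = None
    for root in roots:
        target = find_by_id(root, target_id)
        if target is not None:
            break
    if target is None:
        return None
    # Step 4: convert the target node to JSON (shape is illustrative only).
    return json.dumps(
        {"id": target.id, "name": target.name, "properties": target.properties},
        sort_keys=True,
    )
```

Every step here only reads from the in-memory cache, which fits with the API normally answering in a few milliseconds.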

Most of the time it answers the web app in less than 14 ms, which is fantastic. But sometimes it takes 25 seconds to answer. We have been using New Relic to look into this, and found some interesting stuff.

We can see that there is something going on with ResolveRequestCache and some application code after the umbracocacheinstructions SQL select (see the image).

This is what happens when it slows down

So I looked into the Umbraco log and saw that, yes, there had been a publish right before this happened. A deeper dive tells me that every time it got slow, there had been a publish. But not every publish makes it slow. So yeah... bah.
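For anyone who wants to do the same comparison on their own logs, this is roughly the check, sketched in Python. The function name `correlate` and the five-minute window are my own assumptions, and you have to supply the timestamps yourself (publish times from the Umbraco log, slow periods from your monitoring).

```python
from datetime import datetime, timedelta


def correlate(slow_events, publishes, window=timedelta(minutes=5)):
    """Return (explained slowdowns, unexplained slowdowns, harmless publishes).

    A slowdown counts as explained when some publish happened at most
    `window` before it. A publish is harmless when no slowdown follows
    it within `window`.
    """
    def follows(slow_at, published_at):
        return timedelta(0) <= slow_at - published_at <= window

    explained = [s for s in slow_events
                 if any(follows(s, p) for p in publishes)]
    unexplained = [s for s in slow_events if s not in explained]
    harmless = [p for p in publishes
                if not any(follows(s, p) for s in slow_events)]
    return explained, unexplained, harmless
```

In our case the second list would be empty (every slowdown had a publish before it) while the third would not (plenty of publishes went through without trouble).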

We looked further down in the report and at the server. The memory usage isn't that bad, but the CPU is another story: it jumps to 30% when it's slow, which is roughly one core of the four the server has. Looking one step further, we can see that the IO increases as well.

IO when it's slow

So, we have narrowed the theory down to:

1. A publish is made on the admin server.
2. When one of the four front-end servers gets a request, it sees that it needs to update its cache.
3. It reads from SQL, creates the XML, puts it in its cache, and then writes umbraco.config.
4. Something dodgy then happens that slows it down while writing umbraco.config, judging by the IO report.
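As a toy model of that theory (not of what Umbraco actually does internally), the Python sketch below treats each front-end server as keeping track of the last cache instruction it has processed. The class `FrontEnd`, its fields, and the millisecond costs are hypothetical; the 14 ms and 25 000 ms figures are simply the numbers we observed.

```python
class FrontEnd:
    """Toy model of one front-end server and its content cache."""

    def __init__(self, serve_ms=14, rebuild_ms=25_000):
        self.serve_ms = serve_ms        # normal response time
        self.rebuild_ms = rebuild_ms    # assumed cost of a cache refresh
        self.last_seen_instruction = 0  # newest instruction already applied

    def handle_request(self, latest_instruction_id):
        """Return the simulated latency in milliseconds for one request."""
        latency = self.serve_ms
        if latest_instruction_id > self.last_seen_instruction:
            # A publish happened on the admin server: read from SQL,
            # rebuild the XML, and write umbraco.config before answering.
            latency += self.rebuild_ms
            self.last_seen_instruction = latest_instruction_id
        return latency


def simulate(front_end, instruction_ids):
    """Latency of each request, given the newest instruction id it sees."""
    return [front_end.handle_request(i) for i in instruction_ids]
```

In this model only the first request after a publish pays the rebuild cost, and every publish causes exactly one slow request per server. That does not match what we see (not every publish makes it slow), so the model is missing whatever makes some refreshes expensive and others cheap.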

What we HAVE tried is setting ContinouslyUpdateXmlDiskCache to false, hoping it would make the server write the config file asynchronously. But that really made a mess on the servers: all of a sudden we got pages that were a couple of days old showing instead of the current ones, and some pages were published but not in the cache. So after a lot of yelling from the marketing department, we turned it back to true. :)

Oh, yeah, we run 7.4.1 in production. I have now upgraded to 7.5.3, but that's not in production yet; it's in DEV in the hands of QA.

If you are still reading, I really want to thank you for doing so :)

Does anyone have any ideas for what we can do to fix this?

BR Vidar Aune Westrum
