  • Chris Hall 11 posts 81 karma points
    Dec 07, 2021 @ 11:40
    Chris Hall
    0

    Reverting Content, HTTP 417s and strange Azure Behaviour

    We have an installation of 8.6.4 running on an Azure App Service, in a single instance.

    The project has been live for about 18 months with no issues. Recently, however, some clients started reporting difficulties using the backoffice, complaining of content updates disappearing and of errors.

    We initially assumed this was user error, however we recently needed to put live a large amount of content, and started experiencing these issues ourselves.

    At the moment, I'm assuming this is an Azure error (see the email to Azure support below). However, I wanted to rule out any Umbraco misconfiguration.

    To me this presents itself as if we were running on load-balancing without configuring it.

    Changes will be made, published, then suddenly the session will expire prematurely, and after logging back in, parts of the content will have reverted.

    For example - I can sort a collection of nodes, and save. The changes will be reflected in the tree.

    Then, we will revisit this page and the order will have reverted BUT the order is still correct within the sort action of the contextual menu. After this point re-saving does nothing.

    After a while, there are numerous XHRs failing with HTTP 417 codes, to the point the backoffice becomes practically unusable.

    One theory I had was this may have started when I copied a live DB to stage. Possibly, I thought, having the live server listed in the umbracoServers table, was somehow causing it to think stage and live were instances of the same site?

    Whilst I experienced this I was logged into both stage and live to copy across content, but I don't think this is the cause - can this be ruled out?

    Interestingly, when we encounter these issues we also seem to get Azure issues, whereby:

    • The site doesn't respond to Restart requests within the Azure portal
    • Changes to .cshtml files don't trigger recompilation
    • Updates to web.config (via FTP direct into the Azure App Service) don't trigger an app restart (BUT some filesystem changes are there, e.g. uploading a text file, or making a change to umbracoSettings.config, IS reflected in the health checker).

    It feels like filesystem changes aren't triggering recompilation. It also feels like there could be a worker process on Azure which is hanging, effectively giving us 2 instances when there should be one.

    Has anyone ever seen anything similar?!


    Email to Azure


    We are experiencing an intermittent issue on an Umbraco website hosted within an App Service, which presents symptoms that would be consistent with running the site across multiple instances without configuring for load-balancing.

    However, our site is limited to a single instance.

    When we experience these issues we see:

    • The app doesn't respond to a Restart within the Azure portal
    • Manual changes to web.config via FTP to the App Service hosting do not trigger an app restart
    • Runtime-generated code (e.g. cshtml Razor views) doesn't appear to recompile.

    We started noticing this some months ago, but it has become more impactful recently as we started noticing app errors that only occur whilst we have the same issues listed above.

    During these app errors

    • content changes within the CMS appear to revert to earlier states
    • sessions are abandoned, triggering a login long before the session's expected expiry
    • XHR / Fetch requests error with HTTP 417 codes

    I wonder if it's possible that after a worker swap, there is a hanging worker which is still receiving traffic?

    I had previously raised a ticket for another app of ours, using app service for containers, where the issue was identified as hanging containers. I wonder if something similar can happen with app services?

    This problem is intermittent, and we are unsure of the trigger. It was present all day yesterday, but today the site appears to be working as expected.

  • Marc Goodson 1718 posts 11411 karma points MVP 7x c-trib
    Dec 11, 2021 @ 14:41
    Marc Goodson
    0

    Hi Chris

    This could be completely unrelated, but have you launched a cookie plugin on the front end of your site? That allows users to accept/reject cookies?

    The only reason I mention it is the 417 error... I have seen it quite a lot... it means that something has disrupted the backoffice cookie that keeps the editor logged in, and I've found editors lose work because of it, e.g. they work on something, then press preview (this loads the front end, which destroys the cookie), then they return to the backoffice and carry on editing, but their work isn't saved because the cookie has been invalidated by the front end cookie popup library... each time I've had this I've listed the backoffice cookies as ones to be ignored, and normality has resumed across the whole site...

    ... anyway caused enough pain for me to mention it here as an aside, in case 'that is it'...

    ... if not apologies for the distraction on your issue...

    I don't think being logged into stage and production would be the issue, or having the names of the servers in the server table in different environments; the two sites would need to share the same database instance to trigger load balancing...

    If you haven't already, it may be worth reviewing your Examine/cache settings for running Umbraco on an Azure Web App: https://our.umbraco.com/Documentation/Fundamentals/Setup/Server-Setup/azure-web-apps-v8 ... if your Web App fails over to another server, you can have issues with locks etc. if you don't have the recommended configuration... but my bet is a cookie plugin popup!

    regards

    marc

  • Chris Hall 11 posts 81 karma points
    Dec 13, 2021 @ 12:00
    Chris Hall
    0

    Hi Marc,

    Many thanks for your reply, and for your insight.

    We do run a cookie preferences manager, but it is a custom solution that I wrote personally, and all it does is wrap script tags and prevent the JS they contain from executing until consent is given. We run this across nearly all of our Umbraco sites and this is the only affected installation.

    There is no code to manipulate any of the cookies themselves directly.

    I'm still thinking this is an Azure issue.

    Just now it's actually come back and shown me as logged into another user's session! A user I don't have credentials for, which is - disturbing!

    We don't have any 3rd party caching, so no idea what is happening at the moment. Still waiting on Azure; their front-line support have referred the issue further.

    Thanks, Chris

  • Chris Hall 11 posts 81 karma points
    Dec 13, 2021 @ 12:31
    Chris Hall
    0

    [Image: screengrab of the umbracoServers table]

    I just took this screengrab from the umbracoServers table, which I recently cleared out...

    To me, this shows 2 Azure workers on the same domain, both active.

    Can anyone verify that this is what I'm seeing?
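For anyone wanting to check an export of the server table for the same pattern, here is a minimal sketch. It assumes the rows have been read into dictionaries with `address`, `computerName` and `isActive` keys; those key names, the example hostnames and the worker names are assumptions for illustration and should be matched to whatever your own table actually contains.

```python
from collections import defaultdict


def active_instances_by_address(rows):
    """Group active registrations by address and return only the addresses
    that more than one distinct machine has registered against.

    `rows` is an iterable of dicts with the (assumed) keys
    "address", "computerName" and "isActive".
    """
    groups = defaultdict(set)
    for row in rows:
        if row["isActive"]:
            # Normalise case and trailing slashes so the same URL compares equal.
            address = row["address"].lower().rstrip("/")
            groups[address].add(row["computerName"])
    return {addr: sorted(names) for addr, names in groups.items() if len(names) > 1}


# Hypothetical rows, not taken from the screengrab above.
example_rows = [
    {"address": "https://www.example.com/umbraco", "computerName": "WORKER-A", "isActive": True},
    {"address": "https://www.example.com/umbraco/", "computerName": "WORKER-B", "isActive": True},
    {"address": "https://stage.example.com/umbraco", "computerName": "WORKER-C", "isActive": True},
]
duplicates = active_instances_by_address(example_rows)
```

Any address left in the result has more than one machine actively registered against it, which is the situation described above.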

    Thanks, Chris

  • Mats Stam 60 posts 209 karma points
    Dec 13, 2021 @ 13:46
    Mats Stam
    0

    Could it be that there are other issues with preview? Like the editor changing domains, and therefore losing the cookie? Something like going from an Azure URL to the actual live URL, or vice versa? That would invalidate the authorization cookie.

  • Chris Hall 11 posts 81 karma points
    Dec 13, 2021 @ 14:34
    Chris Hall
    0

    Hi Mats,

    That's a good point - I had tested and checked the redirects.

    We redirect the Azure URL to the live URL, and the same with www subdomain so it's always forced onto the same URL.

    As far as I can tell from the addresses and statuses of the XHR/Fetch requests, there are no redirects taking requests off domain.
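A check like this can be scripted against a list of request URLs and status codes copied out of the browser's network tab. The sketch below is only illustrative: the function name, the hostnames and the sample requests are all made up, and it simply flags anything that is on a different host or has a 3xx status.

```python
from urllib.parse import urlsplit


def off_domain_or_redirected(requests, canonical_host):
    """Return the (url, status) pairs that either target a host other than
    `canonical_host` or came back with a redirect (3xx) status."""
    flagged = []
    for url, status in requests:
        host = (urlsplit(url).hostname or "").lower()
        if host != canonical_host.lower() or 300 <= status < 400:
            flagged.append((url, status))
    return flagged


# Hypothetical sample data, not real requests from the affected site.
sample_requests = [
    ("https://www.example.com/umbraco/backoffice/a", 200),
    ("https://example.azurewebsites.net/umbraco/backoffice/b", 200),
    ("https://www.example.com/umbraco/backoffice/c", 301),
    ("https://www.example.com/umbraco/backoffice/d", 417),
]
flagged = off_domain_or_redirected(sample_requests, "www.example.com")
```

Note that this only looks at hostnames and status codes, so it cannot tell which server behind a hostname actually answered.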

    I can replicate this most of the time at the moment, without saving, publishing or previewing. I just need to log in and click around for a minute or so. Then I will be logged out and, in the case of earlier this morning, logging back into someone else's active session :/

  • Marc Goodson 1718 posts 11411 karma points MVP 7x c-trib
    Dec 14, 2021 @ 07:20
    Marc Goodson
    0

    Hi Chris

    Is it scaled out at all?

    Umbraco will automatically switch to use Load Balancing (doesn't need to be configured) if more than one instance accesses the same database.

    regards

    Marc

  • Chris Hall 11 posts 81 karma points
    Dec 14, 2021 @ 10:46
    Chris Hall
    0

    Hi Marc,

    No, that's what's unexpected - it's supposed to be a single instance on Azure. The rest of the behaviour would make sense if it was running across a load balanced configuration without the proper configuration within Umbraco.

    I've had a similar issue with Azure in the past, with App Service for Containers, where they would fail to be destroyed correctly and hang around for weeks afterwards. I suspect something similar in this case...

    Thanks, Chris

  • Marc Goodson 1718 posts 11411 karma points MVP 7x c-trib
    Dec 14, 2021 @ 13:00
    Marc Goodson
    0

    Hi Chris

    Quick test... Rename the database? Update the connection string on your good app... Clear the table of server names...

    See if the rogue instance adds itself back in...

    If it does, then how did it know about the connection string change... Problem is it appears scaled out... When you know it's not...

    If it doesn't add itself back in... Shows it's another instance somewhere else... Hanging around and messing things up

    Regards

    Marc

  • Chris Hall 11 posts 81 karma points
    Dec 14, 2021 @ 14:14
    Chris Hall
    0

    Hi Marc,

    I could try that, although I'm a little hesitant about creating errors on a production site (if some users were seeing another instance and that instance was erroring because of the DB connection).

    Am I understanding you correctly that scenario 1 (if it re-registered itself despite the change) is consistent with scaling out - as in, would it be expected when apps scale out in Azure that filesystem changes are synced between the scaled out instances?

    If so, I believe that's consistent with changes I've seen. e.g. - I can add a .txt file and see that instantly, despite touching web.config not triggering a restart, and .cshtml changes not being recompiled and shown...

    Thanks, Chris

  • Marc Goodson 1718 posts 11411 karma points MVP 7x c-trib
    Dec 14, 2021 @ 16:31
    Marc Goodson
    0

    Hi Chris

    If you want to avoid errors, copy the database instead of renaming and point your good web app to the copy.

    So, if it's one instance scaled out, each scaled instance will announce itself back into the database table, after the connection string change.

    But if your theory is correct about another instance hanging onto the connection somewhere in the ether, then that will have the 'old' connection string, and won't appear back in the database table, cos it won't know about the connection string change... but if a user somehow 'hits' that site, it will still serve the site based on the 'old database'...

    after your experiment is over you can revert back to the initial database...

    but it will tell you whether you should be reporting to Azure that your instance is 'scaled out' when you have the settings in place to say don't scale... or whether something else is still out there connecting to your live site's db...
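The decision logic of this experiment can be written down as a small sketch. It assumes you have cleared the server table, pointed the good web app at the database copy, waited for it to start, and then collected the names of whatever instances registered themselves in the copy. The function name, the return labels and the worker names are hypothetical.

```python
def interpret_reregistration(reregistered_instances):
    """Interpret which instances registered in the NEW database after the
    connection string change.

    Returns one of three illustrative labels:
      "no_registration"   - nothing has registered yet, so check again later
      "scaled_out"        - more than one instance picked up the new connection
                            string, so they share the same app configuration
      "separate_instance" - only one instance registered, so any second
                            instance is still using the old connection string
                            and lives somewhere else
    """
    distinct = set(reregistered_instances)
    if not distinct:
        return "no_registration"
    if len(distinct) > 1:
        return "scaled_out"
    return "separate_instance"


# Hypothetical outcomes of the experiment.
outcome_if_both_return = interpret_reregistration(["WORKER-A", "WORKER-B"])
outcome_if_one_returns = interpret_reregistration(["WORKER-A"])
```

The first outcome is the one to take to Azure support; the second suggests looking for another deployment that still points at the original database.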

    regards

    Marc

  • Chris Hall 11 posts 81 karma points
    Dec 15, 2021 @ 18:54
    Chris Hall
    0

    It turned out that the client had added a second A record to their DNS, pointing to a Cloudflare-esque proxying and security-as-a-service provider.

    I never checked the DNS beyond pinging from CMD prompt to see the resolved IP. Won't be making that mistake again...!! The first handful of times I tried I got the Azure IP, just happened to see the second A record today after eliminating all other possibilities.
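Since `ping` only ever shows one resolved address at a time, a second A record is easy to miss. As a rough sketch, the lookup below lists every IPv4 address the local resolver returns for a hostname and reports any that are not expected. The function names, the hostname and the IP addresses in the comments are placeholders, and the operating system's resolver may cache or filter results, so a tool such as `nslookup` or `dig` remains the more direct check.

```python
import socket


def resolve_all_ipv4(hostname, getaddrinfo=socket.getaddrinfo):
    """Return every distinct IPv4 address the resolver reports for `hostname`.

    `getaddrinfo` is injectable so the function can be exercised without
    touching the network.
    """
    infos = getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def unexpected_addresses(resolved, expected):
    """Return the resolved addresses that are not in the expected set."""
    return sorted(set(resolved) - set(expected))


# Example usage (placeholder hostname and IP, performs a real DNS lookup):
#   resolved = resolve_all_ipv4("www.example.com")
#   print(unexpected_addresses(resolved, ["203.0.113.10"]))
```

If the second function returns anything, some visitors are being sent to a server you were not expecting.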

    The second instance was just another deployment slot that I'd spun up a few days ago :-|

    All is well now!

  • Marc Goodson 1718 posts 11411 karma points MVP 7x c-trib
    Dec 15, 2021 @ 19:45
    Marc Goodson
    0

    Phew Chris

    Glad you got it sorted and resolved the mystery...

    Yes, if you have multiple slots running at the same time sharing the same database then the flexible load balancing will automatically kick in...

    ... But yeah... the real issue was the additional DNS resolution...

    Wouldn't have suggested that!

    But I guess we will in the future :-)

    Regards

    Marc
