It started on a Monday, later known as “Disaster Monday”, that seemed like a quiet one at Voxel.
In the sprint we had a Tech User Story that dealt with the framework upgrade of one of our Aspx applications. Specifically, we wanted to upgrade from .NET Framework 4.6.1 to 4.8.
We started developing the User Story and managed to upgrade to .NET Framework 4.8 without much trouble.
We deployed this version of the application to a demo environment and ran the relevant tests to check that the application worked. An important fact for later: these tests were done with the Firefox web browser.
So far everything seemed fine, so in the afternoon we decided to deploy to Production.
Once in production, manual tests were performed and everything seemed fine.
We ran our automated smoke tests with Cypress (run in a Firefox web browser) and they passed.
What a great day it seemed: framework upgraded and deployed to production with no errors. #DeveloperLifeEnjoyer
Before starting “Disaster Monday”, a bit of context
To better understand what happened, it is necessary to explain a bit about how user login works in Voxel’s applications.
When a user wants to enter the platform, a login screen appears. Once the credentials are entered, they are validated by our Identity validation service, which then issues a callback that redirects the user to the website they initially wanted to access.
This Identity validation service has a document cache that stores the temporary access tokens.
Now that we have a little more context, we can start “Disaster Monday”.
After the deployment to the Production environment, several clients reported that they could not access any of the websites.
We looked at our log service and saw a lot of errors on the streams. The logs indicated that there were connection problems between the Identity validation service and the cache.
We notified the Infrastructure department and worked together on the issue. This is what we found:
- It turns out that of the 3 cache nodes we had, one of them went down and this meant that the Identity validation service could not connect to the cache.
- We didn’t have a load balancer in production to distribute requests across the cache nodes that were still active.
- The Identity validation service failed when trying to connect to the cache and this meant that no one could get past the login on the websites.
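The failure mode described above can be illustrated with a minimal sketch of client-side failover, assuming each cache node is reachable through a simple fetch function. This is not the code of our Identity validation service: `get_token`, `CacheUnavailable` and the two stand-in nodes are hypothetical names used only for illustration.

```python
from typing import Callable, Optional, Sequence


class CacheUnavailable(Exception):
    """Raised when no cache node could serve the request."""


def get_token(
    key: str,
    nodes: Sequence[Callable[[str], Optional[str]]],
) -> Optional[str]:
    """Try each cache node in order, skipping nodes that refuse connections."""
    last_error: Optional[Exception] = None
    for fetch in nodes:
        try:
            return fetch(key)
        except ConnectionError as exc:
            last_error = exc  # this node is down, try the next one
    raise CacheUnavailable("all cache nodes failed") from last_error


# Hypothetical in-memory stand-ins for a failed node and a healthy node.
def down_node(key: str) -> Optional[str]:
    raise ConnectionError("node unreachable")


def healthy_node(key: str) -> Optional[str]:
    return {"session-1": "token-abc"}.get(key)


if __name__ == "__main__":
    # One node is down, but the lookup still succeeds on the next one.
    print(get_token("session-1", [down_node, healthy_node, healthy_node]))
```

With this kind of fallback, a single failed node slows a lookup down instead of blocking every login. A load balancer in front of the nodes achieves the same effect without each client having to know the node list.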
Once the problem was identified, we removed the failed cache node and wrote down the improvements to plan them in the following sprints to prevent it from happening again.
We tested the access to the websites (via Firefox) and everything seemed to work correctly again.
Just when we thought we could close for the day with the bug resolved and go for a quiet snack, we saw that some of our colleagues reported that they could not access websites using the Google Chrome browser. From that moment on, customers reported to us that they had the same problem.
And here’s where the fun begins.
After some research, we found the problem: Starting with Chrome version 80, Google decided to change the original behaviour of a cookie attribute called SameSite.
Originally this attribute admitted 2 values:
- SameSite Lax: Indicated that the cookie should be sent within the same site, or on top-level GET navigations that arrive at your site from other sites.
- SameSite Strict: Limited the cookie to requests originating from within the same site.
What did Google do? It updated the standard with two changes:
- Added a new value, SameSite None.
- Changed the default value to SameSite Lax.
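The rules above can be summarised in a small decision function. This is a simplified sketch, not browser code: the function name and parameters are hypothetical, "safe method" is reduced to GET, and it ignores details such as the Secure requirement for None and any browser-specific exceptions.

```python
from typing import Optional


def cookie_is_sent(
    same_site: Optional[str],
    cross_site: bool,
    method: str,
    top_level_navigation: bool,
    default_lax: bool,
) -> bool:
    """Decide whether a browser attaches a cookie to a request (simplified)."""
    if same_site is None:
        # Unset attribute: the new default treats it as Lax,
        # the old behaviour applied no restriction at all.
        same_site = "Lax" if default_lax else "None"
    if not cross_site or same_site == "None":
        return True
    if same_site == "Strict":
        return False
    # Lax: only cross-site top-level GET navigations carry the cookie.
    return top_level_navigation and method.upper() == "GET"


if __name__ == "__main__":
    # A cross-site POST with an unset attribute, old default vs new default.
    print(cookie_is_sent(None, True, "POST", True, default_lax=False))
    print(cookie_is_sent(None, True, "POST", True, default_lax=True))
```

The two calls at the end differ only in the default, and that is exactly the case that matters for a cross-site POST carrying a cookie with no SameSite attribute.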
Why did Google make this change?
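The main goal is to protect users by default against cross-site request forgery (CSRF). It also pushes websites towards https urls, because Chrome only accepts a SameSite None cookie if it is also marked as Secure.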
Why does this affect us?
OpenId Connect operations (login / logout) send POST requests from an external site (Identity validation service) to the site of origin of the request (Applications).
With Lax as the new default, Chrome no longer sends the application’s cookies on those cross-site POST requests, and that “breaks” OpenId Connect logins.
Since the change in Chrome, the cookies used in these operations need to opt out of the SameSite restrictions to ensure that they are sent during their workflows. Simply not setting the property is no longer enough, because Chrome now treats a missing attribute as Lax.
As there are different web browsers and each of them can interpret SameSite in a different way (that’s why the error occurred only with Chrome and not with Firefox), .NET applications need to be modified to send this parameter explicitly as SameSite None.
Why is this happening now?
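Because of the framework upgrade. .NET changed how it handles the SameSite attribute starting with these versions:
- .NET Framework 4.7.2.
- .NET Core 2.1.
Our upgrade from 4.6.1 to 4.8 crossed that boundary, so the application picked up the new behaviour.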
How do we solve this problem?
Once we identified the problem, we decided to write a wrapper around the default cookie handling of the Owin.OpenIdConnect nuget package so we could set the SameSite parameter to None in the upgraded application.
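The actual wrapper is .NET code and is not reproduced here. As a language-neutral illustration of the idea, the following Python sketch rewrites a Set-Cookie header value so that it always ends with SameSite=None and Secure. The function name `with_same_site_none` and the sample cookie are hypothetical.

```python
def with_same_site_none(set_cookie_value: str) -> str:
    """Return a Set-Cookie header value forced to SameSite=None; Secure."""
    # The first segment is always the name=value pair, the rest are attributes.
    name_value, *attributes = [p.strip() for p in set_cookie_value.split(";")]
    kept = [
        attr
        for attr in attributes
        if attr
        and attr.lower() not in ("secure", "samesite")
        and not attr.lower().startswith("samesite=")
    ]
    # Chrome only accepts SameSite=None together with Secure, so add both.
    return "; ".join([name_value] + kept + ["SameSite=None", "Secure"])


if __name__ == "__main__":
    print(with_same_site_none("nonce=abc; Path=/; HttpOnly"))
```

Any existing SameSite or Secure attribute is dropped before the new pair is appended, so the result never contains duplicates or conflicting values.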
What lessons have we learned?
We learned and remembered several things that day.
The first one was to make sure that the load can be balanced across the different cache nodes, and that the Identity validation service does not crash if the cache becomes unavailable.
As self-criticism, we have to admit that our mistake was testing with only a single browser, even though we know that our customers use different browsers and that each of them can implement changes that the others do not.
If our demo environment had been identical to the production environment, we could have detected this bug long before it reached production and affected customers.
Having seen everything that happened, how can we prevent it from happening again? We came up with several initiatives that we turned into user stories:
- Use a cache node balancer to avoid problems if one of the nodes goes down.
- Make the Identity validation service resilient enough to work without the cache.
- Run automated tests in several browsers.
Finally, I’m going to use this post to thank all the people who stepped up to help when these different problems arose. A day like this feels very different when you are supported by people working with you towards a common goal.
So as a reflection, I encourage you to keep doing small acts for your teammates. Just sending a message and asking if there is anything you can do to help can make a difference.