Sunday, January 28, 2024

When AWS Ruins Date Night

Full disclaimer. This will probably be pretty dry, boring, and maybe overly technical. But... it provides some insight into what I do for work, and for people who work in a similar field, it might be relatable to problems that you deal with. And besides, the primary purpose is to document events for myself. If other people enjoy reading about them, great.

For the last seven years, I have worked for a company called 365 Retail Markets. We provide hardware and software for micro-markets. I have spent most of my time with the company working in the vending division. My current title is Manager of Vending Technology. I head up a small team that is responsible for three different card reader devices, a website for managing those devices, and an API that allows the devices to communicate with our backend. I primarily work from home, from my little office in our basement.

It was Friday and the work day was winding down. I had plans to go see a movie with Jeanell and a few of my kids. (We were going to see Mean Girls Musical Movie. First it was a movie. Then it was a musical. Now it's a musical movie. And Tina Fey has laughed all the way to the bank every time.) It had just passed 5:00 and I stepped away from my computer. We weren't planning to leave for the movie until 6:15 or so, so I thought I might go relax for a bit, maybe read a bit of my book.

Suddenly, I got an alert on my phone saying that our database server had gone down (our backend consists of two server instances in AWS: one that hosts our database and another that hosts our web applications). I headed back down to my computer, and when I couldn't log in to the database server, I pinged Rob, a co-worker who manages our AWS infrastructure. He is in the Eastern time zone, so it was just after 7:00 for him, but he said he was logging back into his laptop.

While we were getting logged into the AWS console to see if we could determine the problem and get the server back up, the server suddenly came back online. Rob looked in the console and there was a notification that our server instance had failed a status check, and since it was set for auto-recovery, AWS had automatically created a new server instance. It looked like we were good.
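For anyone curious, that kind of auto-recovery is typically set up either as a per-instance setting or as a CloudWatch alarm on the EC2 system status check with the recover action attached. Here is a rough sketch of the alarm flavor in boto3; the alarm name, instance ID, and region are made up, and I'm not claiming this is exactly how ours is configured.

    import boto3

    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

    # Recover the instance automatically if the EC2 system status check
    # fails for two consecutive one-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName='db-server-auto-recover',  # hypothetical name
        Namespace='AWS/EC2',
        MetricName='StatusCheckFailed_System',
        Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],
        Statistic='Maximum',
        Period=60,
        EvaluationPeriods=2,
        Threshold=1.0,
        ComparisonOperator='GreaterThanOrEqualToThreshold',
        AlarmActions=['arn:aws:automate:us-east-1:ec2:recover'],
    )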

I use software called New Relic to monitor the performance of our websites and I checked it to make sure that things were going back to normal. It quickly became apparent that while things were back up, they were most certainly not back to normal. The average web request response time was much higher than it had been, and our Apdex score (a measure of the percentage of web requests that receive a response in an acceptable amount of time) was about half of what it had been. I logged into the website and found it to be running extremely slowly. We weren't out of the woods yet.
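A quick aside on Apdex: each request gets bucketed as satisfied (response time under a target threshold T), tolerating (under 4T), or frustrated (anything slower), and the score is satisfied plus half of tolerating, divided by the total. Here is a small sketch of the math; the threshold and response times are invented just to show the shape of it.

    # Apdex = (satisfied + tolerating / 2) / total, where "satisfied" means the
    # response time was under the target T and "tolerating" means under 4 * T.
    def apdex(response_times_s, target_s=0.5):
        satisfied = sum(1 for t in response_times_s if t <= target_s)
        tolerating = sum(1 for t in response_times_s if target_s < t <= 4 * target_s)
        return (satisfied + tolerating / 2) / len(response_times_s)

    # Made-up examples: a healthy mix scores close to 1.0, while a degraded
    # one drops toward 0.5 or below.
    print(apdex([0.2, 0.3, 0.4, 0.6, 1.1]))  # 0.8
    print(apdex([0.4, 1.2, 1.8, 2.5, 3.0]))  # 0.4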

At 6:00, with problems still ongoing, I texted Jeanell and told her I probably wouldn't be able to go to the movie. I was still hopeful that maybe things would start working again in the next few minutes, but it wasn't looking likely. She came down to my office, looking gorgeous. I couldn't believe I wasn't going to be able to go.

I pinged my boss Chad and our database administrator Kris and made them aware of the situation. They are both in the Eastern time zone as well, but they willingly jumped on a call along with Rob and me. We were trying to get things going, but nothing was working. Jeanell was still waiting for me, but time was up; I wasn't going to be able to go. Jeanell and Lila left to go to the movie.

We spent the next two and a half hours looking at logs and monitors and doing everything we could think of to get the performance back to normal. We rebooted both servers. We tried to launch yet another new instance of our database server. Nothing seemed to work. Finally, we let Rob and Kris step away, and Chad and I arranged to get on a call with AWS Support.

With the AWS support engineer, we again looked at logs and ran various queries against our database, trying to figure out where the problem was. We were on this call for another hour without finding anything. Finally, he said he would send us instructions for gathering some logs that we could upload for another team to review. We ended the call.

We waited for the email with instructions on the logs they needed. By this time, it was 10:00 my time. We wanted to get the logs before we stepped away so someone could be reviewing them while we slept. Chad pinged Rob and Kris for help, and even though it was now midnight EST, Rob responded and said he could gather the logs.

That process took about a half-hour, but we got a zip file of the logs and were able to upload it to AWS for their review. At this point, we stepped away. I let Chad know that I would be up at 6:00 the next morning and if there was something for us to do, we could start working on it.

Jeanell and Lila returned home and we went to bed. I happened to wake up at 3 AM and saw that AWS had responded with some additional suggestions. They suggested initializing one of our hard drives and updating some drivers. I went back to bed and got up at 6:00. We needed Rob's help with the additional steps, and understandably he wasn't up yet. We pinged him, and by 7:30 we were back on a call to continue working on the problem.

But the suggestions didn't really make sense to Rob or any of us. We went back and forth for a bit but finally decided to request another call with AWS. This time, we got an engineer who had some insight into what was really going on, thanks to a similar case he had worked on a couple of days before.

He noted that our database server's instance type was m4 and said he had seen performance problems with some newly created m4 instances. We had, in effect, created two new m4 instances: one automatically through AWS's auto-recovery, and a second when we tried the process again manually. Both had resulted in the same performance problems.

He suggested upgrading our instance to an m5 instance type and thought that would resolve the problem. We stopped our websites, took a backup of the current server, updated some AWS drivers to later versions that were needed for the m5 instance type, and then performed an in-place upgrade of our database server. After completing the upgrade, we started the database server back up, started the websites again, and held our breath.
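As an aside, the "in-place" part of that upgrade is just EC2's standard change-instance-type flow: stop the instance, modify its instance type attribute, and start it back up, with the driver updates done beforehand in our case. Here is a rough boto3 sketch of that flow; the instance ID is made up and the target size is a placeholder, not necessarily what we actually run.

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    instance_id = 'i-0123456789abcdef0'  # hypothetical ID

    # The instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])

    # Switch from the m4 family to m5 (the size here is a placeholder).
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={'Value': 'm5.2xlarge'},
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])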

I watched New Relic and breathed a sigh of relief when the performance metrics went back to what they had been before the initial crash. Chad logged into the website and found it to again be fast and responsive. It looked like this had solved the problem. It was now 9:30 on Saturday morning.

Fortunately, this kind of thing doesn't happen too often, but it was still frustrating to miss my date with Jeanell and the kids due to something outside of my control.

Either way, it felt good to finally have the problem solved.