Our new Member Services site and mobile apps experienced a rather significant outage the week of Nov 14 – Nov 16, 2017. We had to take them down for a couple of days while we worked out some database performance issues. We want to take a few minutes to pull back the curtain and shed some light on exactly what happened. We’ll attempt to be both technical and non-technical in one post. Hopefully, there’ll be a little bit for everyone to enjoy. So, let’s dive right in.
Our tech team is responsible for keeping three different types of users happy and productive: Players, League Operator staff, and National Office staff. There are two software platforms that are critical to the daily operation of an APA league: Member Services (the new and old versions) and our back office applications. We take these three groups of users into account with every decision we make. Sometimes what makes one group happy comes at the cost of another group’s happiness.
The architecture that we’ve chosen requires that we move towards a single database to run the entire organization. This is great in terms of keeping things simple and not so great when something goes awry. In a nutshell, the new Member Services site/app had some database requests that were taking a really long time to return, and there were a lot of people making those requests. A database server can only service a certain number of requests at a time, and if those requests are slow to return, the result is a traffic jam of database calls. In our case, ALL of our users (Players, League Operator staff, and National Office staff) were negatively affected regardless of which application they were using (with the exception of the old Member Services site, which we’ll touch on later). It was a very sad day indeed.
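For the technically curious, the traffic jam can be sketched with a little arithmetic. This is a rough model, not a measurement of our system: the function name and every number below are made up for illustration. The idea is that a server with a fixed number of workers can finish at most workers / query time requests per second, and anything arriving faster than that piles up.

```python
def backlog_after(seconds, arrivals_per_sec, workers, query_seconds):
    """Requests still waiting after `seconds`, in a simple steady-flow model.

    All inputs are hypothetical; this ignores timeouts, retries and bursts.
    """
    # Each worker finishes one request every `query_seconds`.
    capacity_per_sec = workers / query_seconds
    # Only the arrivals above capacity accumulate in the queue.
    surplus_per_sec = arrivals_per_sec - capacity_per_sec
    return max(0.0, surplus_per_sec * seconds)


# Fast queries: the server keeps up and nothing queues.
healthy = backlog_after(seconds=60, arrivals_per_sec=200, workers=100, query_seconds=0.1)

# Same traffic, but each query now takes 2 seconds: the queue grows every second.
jammed = backlog_after(seconds=60, arrivals_per_sec=200, workers=100, query_seconds=2.0)

print(healthy, jammed)
```

In this toy model the amount of traffic never changes. Only the query time does, and that alone is enough to turn an idle queue into thousands of waiting requests within a minute, which then slow down every application sharing the database.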
Our initial course of action was to take new Member Services offline to clear up the traffic jam, and it worked. League Operator staff and National Office staff were able to keep up with business as usual. Thankfully, our old Member Services site (Online Member Services, aka OMS) is still running a more complex and cumbersome process that allows it to use its own database to show players all…well many of the things players want to know. Our team kept the new site offline until we got the issues resolved. When the issues were resolved, we brought it back up quietly to ensure no further slowness issues were present. In our efforts to simplify things (consolidation to one database), we exposed ourselves to an issue where one application caused performance issues for all applications, and subsequently we had to completely shut down one user group’s application.
Geek out time
The real issue was that the database indexes had been rebuilt over the weekend, something that happens every weekend and typically does not cause issues. In this case, the rebuild led the database engine to choose sub-optimal execution plans for some queries. At first, we thought it was just some queries that needed to be tuned, but after addressing several and still not getting close to our original execution times, we realized it had to be something more systemic. We restored a copy of our DB to our dev server (which is significantly smaller than our production server), and the queries ran there as fast as they should have been running in production, without any additional tuning. So, we re-rebuilt the indexes in production and just like that, all the queries started running properly again. Some were even faster thanks to our additional tuning efforts.
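If you want to see what "the path the engine chose" looks like, here is a small sketch. It is not our production setup: it uses SQLite purely because it ships with Python, and the matches table, its columns, and the index name are all hypothetical. Other database engines have their own tools for viewing plans, but the idea is the same: ask the engine how it intends to run a query and check whether it is using the index you expect.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return SQLite's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Hypothetical table standing in for real application data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (id INTEGER PRIMARY KEY, player_id INTEGER, score INTEGER)"
)
conn.executemany(
    "INSERT INTO matches (player_id, score) VALUES (?, ?)",
    [(i % 500, i % 10) for i in range(5000)],
)

query = "SELECT * FROM matches WHERE player_id = ?"

# With no index on player_id, the engine has to scan the whole table.
plan_without_index = plan_for(conn, query, (42,))

# Once an index exists, the engine can seek straight to the matching rows.
conn.execute("CREATE INDEX idx_matches_player ON matches (player_id)")
plan_with_index = plan_for(conn, query, (42,))

print(plan_without_index)
print(plan_with_index)
```

The query text is identical in both cases. Only the plan differs, which is why tuning individual queries did not get us back to our original execution times.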
Wrapping things up
To put a pin in this, I’ll address some questions you might have:
1) We are not going to perpetuate the multi-database architecture unless it comes in the form of a read-only secondary and that’s actually something we’re considering.
2) We’re reviewing the database maintenance processes that we have in place to make sure things like this are less likely to happen again. If they do, we’re poised to quickly resolve the issue because we know more now than we did last week.
3) The new Member Services site is up and functioning as fast as ever, but we’ll continue to look for ways to make it faster. (I don’t think this ever stops.)