Your legacy database is outgrowing itself

A note: this was originally posted on my good friend’s blog, at unstructed.tech.

Do you feel your database is growing too big or too old? Hard to maintain? Well, I hope I might be able to help you a bit with that. The text you’re about to read is a real-life experience of scaling a monolith database to support a top-250 website (according to alexa.com). At the moment of writing this article, alexa.com ranked chess.com at 215th place in the world. We’ve got over 4M unique daily users and over 7B queries hitting all our MySQL databases combined. We went from under 1M unique users a day a year ago, to 1.3M in March, to over 4M at the moment, with over 8M games played each day. I know it’s nowhere near the biggest players on the market, but our experience could help you “fix” your monolith database and scale it to new heights.

Disclosure: This is my first ever article, and it’s too long as it is – but I had to cut probably as much text as you see here in order to make it actually readable. So, some things might be confusing or not well explained, and I’m sorry for that. Hit me up on LinkedIn, and we can get into a deeper discussion.

Update: after reading a lot of comments, I’d like to add/clarify a few things. We do use caching, extensively. If we didn’t we wouldn’t survive a day. We do use Redis, often pushing it to its extremes. We’ve tested MongoDB and Vitess, but they didn’t do it for us.

The state we were in just a couple of years ago

Somewhere around mid-2019 we started noticing that our main DB cluster was slowly growing a bit too big. We had three smaller and less used databases on the side, but everything was always added to the main one. Surprisingly, it was in a fairly decent state for a database that started its life over 12 years ago. Not many unused/redundant indexes, and the ones that were there were mostly good. We constantly monitored and improved heavy/slow queries. A nice chunk of data was denormalized, as well. No foreign keys; many things were done in the code itself (filtering, sorting, etc., to make sure the DB only ever uses the most efficient indexes); we were running on the latest version of MySQL, and so on. It wasn’t neglected, and it evolved over time into something that did a good job for us.

Disclaimer: I’m not recommending that anyone do these sorts of micro-optimizations. They work for chess.com, at its scale. We benchmark almost anything we’ve never done before and see how it turns out before actually implementing it. We know these work for us.

The biggest problem we were facing at that point in time was that altering almost any table required taking half the hosts out of rotation, running the alter there, putting them back in rotation, altering the other half, and putting those back in. We had to do it off-peak, as taking half the hosts out during peak time would probably result in the other half crashing. As the website was growing, old features got new improvements, so we had to run alters a lot (if you’re in the business, you know how it goes). Altering would’ve been far less stressful if we could pull just a small set of tables out of rotation, instead of the entire DB. So, we created a 5-year plan for our main cluster (and boy, were we wrong about our timeline), in which we were to split the database into many smaller ones, which would make things easier to maintain (we were right about this, at least). The plan assumed ~25% yearly growth, which was what we were seeing in previous years.

Where we were ~2 years ago

The REAL problem appears and our plan implodes

You’re all familiar with COVID-19, and how all of that went. Guess what – we weren’t really ready (or we didn’t expect it to have the impact it had on our traffic). Interest in chess exploded when (most of) Europe went into lockdown. Fun fact – we could tell which country had gone into lockdown just by looking at our registrations by country – it was that clear. And numbers skyrocketed across the board. Funnily enough, all our databases were doing fine (not great, but they handled the traffic). But at the same time we noticed that our “reports” host was constantly struggling to keep up with the production hosts (i.e., it was often 30-60 seconds behind on replication), which prompted us to analyse our replication stream and its remaining capacity. It was close to being full, at peak traffic reaching over 95% of capacity. At that point we knew that the USA (where most of our players are from) would also go into lockdown soon, and that would mean our replicas wouldn’t be able to keep up with all the writes going to the master (and the same would happen eventually even if we just kept growing slowly). That would mean the end of chess.com, as the code isn’t ready to handle big replication delays when reading data from replicas, and sending all the selects to the master would take it down. That made our goal clear: decrease the number of writes going to the main cluster, and do it as quickly as possible! This was actually part of our aforementioned plan, but that plan stretched over 5 years. Now we had just one or two months to do it.

The solution

How do you decrease the number of writes to a single DB? Sounds simple – identify the tables that are written to the most and rip them out of the DB. That way the total number of writes stays the same, but it’s split into two separate streams. These can be either tables that get a lot of inserts, or tables with fewer inserts but records that are frequently updated. Store them elsewhere. How to do it with no downtime? As you’d imagine – not really that simple.

We first identified all the tables with the most writes (inserts, deletes or updates). Most of them were nicely grouped together based on the feature for which they were used (most game-related tables would be written at a similar pace, etc.), and we managed to gather a list of 10-15 tables across 3 different website features. As soon as we started investigating them, another problem was revealed – since we can’t “join” between databases on different hosts, we would need to move every table the feature used in order to keep the project simple (we were already aware of this, since when “the plan” first came into place, we did something similar for 3 smaller, less used tables as a PoC, and it turned out fine).
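The post doesn’t show exactly how we pulled those write counts, but if you want a quick first pass on your own MySQL cluster, performance_schema keeps per-table read/write counters. A minimal sketch, assuming performance_schema is enabled (the schema name is just an illustration):

-- Rank tables by write activity (row inserts/updates/deletes) since the last restart.
-- Counters reset on server restart, so run this on a host with decent uptime.
SELECT
    object_schema,
    object_name,
    count_write,
    count_read
FROM performance_schema.table_io_waits_summary_by_table
WHERE object_schema = 'main_db'   -- hypothetical schema name
ORDER BY count_write DESC
LIMIT 20;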

The 3 features we identified as high-traffic were logs (not really a feature, and not really logs, they were just poorly named), games (like you’d expect on a chess gaming platform) and puzzles. For logs we found just 3 very isolated tables, which meant not a lot of query/code changes were required. Similar for games, as well. But puzzles had over 15 tables to move, and the queries on those tables had lots of joins towards tables that were to remain in the main database cluster. We rallied our troops, pulled over half of our backend developers, split them into teams, and started pushing on all of these in parallel.

It took just one week to move logs into its dedicated database, running on two hosts, which gave us some breathing space, as those had the most writes by far. One more month to move games (which was oh so scary, as any mistake there would be a disaster, considering that’s the whole point of the website), and puzzles took over 2 months. After these, we were well under 80% of replication capacity at peak, which meant we had time to regroup and plan upcoming projects a bit better.

The execution

So, how did we do it?

As you might assume, there are two sides to a project like this – code and database, and both require a lot of work. On the code side, there are a few prerequisites for a project like this one. First of all, we need some sort of feature flag (aka feature toggle) system, either internal or 3rd party. Ours is custom built, pretty extensive, and maybe a topic for another post. The bare minimum the feature flag system needs to provide is allowing or denying access for a configurable percentage of checks, from 0 to 100 percent. Another really useful thing to have is good test coverage. Our entire codebase was rewritten from scratch a few years ago, so we were lucky enough to have that, and it saved us a couple of times.
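Our system is custom and not something I can paste here, but just to make the “percentage of checks” idea concrete, here’s a minimal, hypothetical sketch of what such a check could look like in PHP (the real thing is far more extensive, and whether checks are random or sticky per user is not covered here):

// Hypothetical minimal feature flag: each flag has a rollout percentage,
// and hasAccess() passes roughly that percentage of checks.
class Features
{
    /** @var array<string, int> flag name => rollout percentage (0-100) */
    private array $rollout;

    public function __construct(array $rollout)
    {
        $this->rollout = $rollout;
    }

    public function hasAccess(string $flag): bool
    {
        $percent = $this->rollout[$flag] ?? 0;

        // random_int(1, 100) <= 10 lets roughly 10% of checks through, and so on.
        return random_int(1, 100) <= $percent;
    }
}

// Usage: send ~10% of reads to the new partition.
$features = new Features(['read_logs_partition' => 10]);
if ($features->hasAccess('read_logs_partition')) {
    // query the partition cluster
}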

Between code changes and database changes, some things can be done in parallel and some require going in order. The easiest thing to start with is setting up the new database. All our new databases pulled from the main one (we call them partitions, and this process partitioning) end up on a 2-host setup (master and failover-master (replica)), but it can really be anything we want. On the partition cluster, we create a new database with a schema identical to the one we’re trying to split (just named according to our needs; in this case the first one was named logs). Then a backup of the main database is imported, after which we hook up the master of the partition cluster to be a replica of the main master (this is why we need the identical schema and backup import). This way the new cluster becomes just like any main cluster replica, only with a differently named database. Then it just sits there, looking pretty, replicating traffic, and staying up-to-date with the rest of the cluster while we work on the code side of this project.
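The post doesn’t spell out the exact replication mechanics, but here’s a hedged sketch of what hooking the partition master up as a replica might look like. Since the database on the partition cluster has a different name, something like MySQL’s replicate-rewrite-db filter would be needed; host names, log file names and positions below are placeholders:

-- In my.cnf on the partition master (database names are illustrative):
--   [mysqld]
--   replicate-rewrite-db = "main_db->logs"
--   replicate-do-db      = "logs"

-- After importing the backup, point the partition master at the main master.
-- The binlog coordinates come from the backup's metadata.
CHANGE MASTER TO
    MASTER_HOST = 'main-master.internal',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_LOG_FILE = 'mysql-bin.000123',
    MASTER_LOG_POS = 4;

START SLAVE;

-- Check that it's replicating and catching up:
SHOW SLAVE STATUS\G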

This is what the cluster looks like after adding new hosts

Before we started working on this project, we essentially had 2 connections open to the database from the code: a read-only connection hitting the replicas and a read/write connection pointing to the master. Both actually hit HAProxy to get where they need to go. The first thing we did was to create a parallel set of connections, where the read/write connection goes to the partition master and the read-only connection to the partition replica.

Chess.com is written in PHP, so I’ll use PHP examples to illustrate the needed changes, but I’ll keep it pseudo enough so that anyone can understand what’s going on (you’d be surprised how many websites in the top 1k are written in PHP, and how easy it is to scale PHP to those heights. Maybe a topic for another post.).
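To make the connection setup from above concrete, here’s a hedged sketch of the two pairs of connections (the host names are HAProxy frontends and purely illustrative; the real code surely looks different and goes through its own wrappers):

// Hypothetical connection setup: each handle points at an HAProxy frontend,
// not directly at a MySQL host.
$user = 'app';
$pass = 'secret';

// Existing pair: main cluster.
$mainWrite = new PDO('mysql:host=haproxy-main-master;dbname=main_db', $user, $pass);
$mainRead  = new PDO('mysql:host=haproxy-main-replicas;dbname=main_db', $user, $pass);

// New pair added for the partitioning project.
$partitionWrite = new PDO('mysql:host=haproxy-logs-master;dbname=logs', $user, $pass);
$partitionRead  = new PDO('mysql:host=haproxy-logs-replica;dbname=logs', $user, $pass);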

Code changes boil down to 3 things:

  • Removing query joins to tables that will no longer live in the same database
  • Aggregating, merging and sorting data in code (as we no longer can do some of those in the DB)
  • Using feature flag system to decide to which hosts queries go

Reading the data

Removing joins, even though it sounds simple, is not. In the following examples, I will assume we’re moving the game_data and user_game_data tables to a new partition DB (also, these are all pseudo examples, don’t expect them to be perfect). So, a query that looked like this:

SELECT u.user_id, u.username, u.rating, ugd.start_rating, ugd.end_rating, gd.result
FROM user_game_data ugd 
INNER JOIN game_data gd on ugd.game_id = gd.game_id
INNER JOIN users u on u.user_id = ugd.user_id 
WHERE 
    gd.finished = 1 AND
    gd.tournament_id = 1234 AND
    u.banned = 0
ORDER BY u.rating DESC
LIMIT 5;

Now becomes this, since we can no longer join the users table:

SELECT ugd.user_id, ugd.start_rating, ugd.end_rating, gd.result
FROM user_game_data ugd
INNER JOIN game_data gd on ugd.game_id = gd.game_id
WHERE
    gd.finished = 1 AND
    gd.tournament_id = 1234;

We’ve removed the join on the users table, the users columns from the SELECT, the users condition in the WHERE clause, and the sorting and LIMIT clauses, for obvious reasons. So now we need to do all of that in the code. First things first, we’re missing all the information about users, so let’s fetch it:

SELECT u.user_id, u.username, u.rating
FROM users u
WHERE
    u.user_id IN (:userIDs) AND
    u.banned = 0
ORDER BY u.rating DESC
LIMIT 5;

:userIDs is the list of user IDs we fetched in the first query. Now all that’s left is to merge these two data sets and put it all behind a feature flag check. The end result is something along these lines, and both branches return the exact same results (again, very pseudo code):

if ($this->features->hasAccess('read_logs_partition')) {
    // assume this result set is mapped by user_id, doesn't really matter how it's done
    $partitionData = $this->partitionConnection->query('
        SELECT ugd.user_id, ugd.start_rating, ugd.end_rating, gd.result
        FROM user_game_data ugd
        INNER JOIN game_data gd on ugd.game_id = gd.game_id
        WHERE
            gd.finished = 1 AND
            gd.tournament_id = 1234;
    ');

    $mainDbData = $this->mainConnection->query('
        SELECT u.user_id, u.username, u.rating
        FROM users u
        WHERE 
            u.user_id IN (:userIDs) AND 
            u.banned = 0
        ORDER BY u.rating DESC
        LIMIT 5;
    ');

    $result = [];

    foreach ($mainDbData as $singleRecord) {
        $result[] = array_merge($singleRecord, $partitionData[$singleRecord['user_id']]);
    }

    return $result;
}

return $this->mainConnection->query('
    SELECT u.user_id, u.username, u.rating, ugd.start_rating, ugd.end_rating, gd.result
    FROM user_game_data ugd
    INNER JOIN game_data gd on ugd.game_id = gd.game_id
    INNER JOIN users u on u.user_id = ugd.user_id
    WHERE
        gd.finished = 1 AND
        gd.tournament_id = 1234 AND
        u.banned = 0
    ORDER BY u.rating DESC
    LIMIT 5;
');

A purist might complain that this isn’t optimal code, as it might require multiple iterations, merging and filtering data in the app, etc. That’s all correct, but the hit the database would take is actually bigger. Overall, this is a far cheaper way to do things. One of the things that has really helped our DB scale is the fact that we’ve moved a lot of sorting, merging and filtering from the DB into the code itself.

Writing the data

This would’ve been a simple paragraph, if those darn auto-increments didn’t exist. Please take another look at that image of how the data flows after we’ve set everything up.

I’ll start with the solution to the auto-increment problem – either all writes go to the main database or all writes go to the new partition (the feature flag will be toggled from 0% to 100% at once). Nothing in between. If there is a glitch and even a single insert goes to the new partition, the partition hosts will stop replicating completely, and a few queries will have to be skipped by hand, or the whole database-level part of the work started over (the partition DB deleted and recreated). For example, say a table we’re migrating has reached an auto-increment value of 1000, and a glitch sends inserts to the new partition, which writes values 1001, 1002, etc. The glitch is fixed, and now writes go to the main database again, starting with 1001, 1002… When those queries are replicated to our new hosts, MySQL will see that those values already exist there and throw a unique constraint violation. And now we have to fix everything up.
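For completeness, “skipping a few queries by hand” usually means something along these lines – a hedged sketch for classic binlog-position replication (GTID-based setups need a different approach), and only safe if you’re certain the skipped event is one of the glitched writes:

-- On the partition master whose SQL thread stopped on the duplicate-key error:
SHOW SLAVE STATUS\G                       -- Last_SQL_Error shows the 1062 duplicate entry
SET GLOBAL sql_slave_skip_counter = 1;    -- skip the offending event
START SLAVE SQL_THREAD;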

All this is not a problem if UUIDs are used. I’ll explain in a bit more detail later in the article.

Code example (if plain SQL is used, it’s even simpler, just use different connection instances in if/else):

$game = new Game($user, $result);

if ($this->features->hasAccess('write_logs_partition')) {
    // Writes flipped to the new partition: 0% or 100%, nothing in between.
    $this->logsEntityManager->persist($game);
    $this->logsEntityManager->flush();
} else {
    // Writes still going to the main cluster.
    $this->mainEntityManager->persist($game);
    $this->mainEntityManager->flush();
}

The actual switch

Well, this is where a good DB monitoring tool (like PMM), error monitoring (like Sentry) and feature toggling do their magic. We start simple – by sending the lowest possible number of SELECTs to the new partition hosts (that number depends on the feature flag system). Monitor for errors/exceptions. Rinse and repeat until 100% is reached (we usually take over a week just for this). Using database monitoring tools, we verify that there are no more SELECTs for these tables in the main database cluster.

Slowly switching reads to the partition cluster

Now, if the tables being migrated rely on UUIDs, we’re in luck. We just repeat the same process as above, only for writes. This works because at this point all SELECTs go to the new partition hosts. And if we write directly to them in any percentage, they will have all the data (both inserted there directly and replicated from the main database master). But the main database won’t have the data that was written to the new hosts directly (it would be possible to set up bi-directional replication to solve this, but it would just add another layer of complexity to the whole project). This means that we can’t send SELECTs back there, as the data is incomplete (more on undoing the whole process later). And similar to what we did for reading, we keep increasing the percentage of queries until 100% is reached and confirmed in the monitoring tools, and the switch is officially over. Now we just cut off the replication between the new partition cluster master and the main cluster. And that’s it.

If auto-increment primary keys are used – this isn’t as smooth. Or, it is, depending on how you look at it. But it’s definitely more stressful. We need to flip our feature from 0% to 100% at once. And now it’s either all working as expected or not. (Having a few hosts lying around that can be used to test the cutover is really useful. Highly recommended 🙂 ) If everything looks good (or acceptably so), then similar to the UUID case – replication is killed and any minor issues are fixed later. If things didn’t go as expected, we need to undo everything.

Instant switch of reads from the main cluster master to the partition master (for auto-increment primary keys)
The final result

Undoing everything

Over the course of the last couple of years we’ve created 6 of these partitions, and only once did our plan fail (it was actually a failure of the feature flag system in combination with auto-increments). And here, again, the trouble you’ll have undoing things depends on whether you’re using UUIDs or auto-increments.

If you’re on UUIDs, undoing is fairly simple. Send both writes and reads (using your feature flag) back to the main cluster. Stop the replication to the new partition hosts. Now, depending on the traffic, you have two options: manually figure out which data is missing from the main cluster and copy it there, or dig into the binlogs (if enabled), get your hands dirty, and copy the missing data from there. Could’ve been worse (just check the next paragraph).

This part is about auto-increments. As above, send both writes and reads to the main cluster. But now you can’t really just copy the data, since your main cluster has already started filling up with the same PKs (auto-increments) that already exist in your new partition. You’ll need to dig deep to figure this one out. Simple data copying is probably out of the window if you store these PKs elsewhere as foreign keys. You might have more luck finding all of these in the binlogs (if enabled).
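If the binlogs are there, mysqlbinlog is the usual way in. A hedged sketch of pulling out the incident window for inspection (file name and timestamps are placeholders; with row-based binlogs, the decode-rows output is human-readable pseudo-SQL for figuring out which rows are missing, not something you can replay as-is):

mysqlbinlog \
    --database=logs \
    --start-datetime="2020-04-01 14:00:00" \
    --stop-datetime="2020-04-01 14:10:00" \
    --base64-output=decode-rows --verbose \
    mysql-bin.000123 > incident_window.sql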

Unfortunately, even though we were always prepared to undo any mess before we ever had a chance to see it for the first time, when it hit us – we decided to ignore it. We were just lucky. Because our feature flag system failed on a single host, after switching all the queries to the main cluster we only lost a few tens of thousands of automated messages sent by the system to our users. And we lost them because we decided it wasn’t worth recovering all those records, since they were using auto-increments. The problem was ongoing for maybe 5 minutes, and users themselves weren’t really affected. So, yeah, we were very, very lucky.

After all that

The sweet part. We were able to set up all of our new partitions (I think) in a way that 95% of queries go to the master, and the remaining 5% go to the replica/failover master just to keep MySQL primed (in case the failover needs to jump in and handle some real traffic). This means that when monitoring the cluster, we only need to look at the state of the master. Running alters is super simple, as we don’t have to worry about whether one host (while we’re running the alter on the other) is capable of handling all the traffic.
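The post doesn’t say how the 95/5 split is implemented, but since all connections already go through HAProxy, a weighted backend is one natural way to express it. A sketch, with names, addresses and weights purely illustrative:

# HAProxy: send ~95% of read connections to the partition master, ~5% to the failover.
listen logs-partition-read
    mode tcp
    bind *:3310
    balance roundrobin
    server logs-master   10.0.1.10:3306 check weight 95
    server logs-failover 10.0.1.11:3306 check weight 5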

The not so sweet part is in the code – you’ll need to remove all those conditions you sprinkled around your codebase 🙂 On the database side, we need to drop all the migrated tables from the main database, and drop everything except those tables from the new partition. With replication between the two no longer running, this is now safe to do.

I’ll wrap it up with a comment from our legendary SysAdmin, after Netflix’s show “The Queen’s Gambit” caused another huge wave of traffic (you can read more about it here):

Bonus 1: Some gotchas

  • SELECT queries that went to the main cluster’s master had to keep going there until we switched writes to the partition master. This is because we couldn’t risk any replication delay that might happen between the two masters.
  • Always check that feature flags got changed on every single host. As I mentioned, it happened to us that they didn’t change on one host while we were switching writes on one of the partitions. It caused us a lot of pain.
  • We had to duplicate our Doctrine entities in order to remove relations towards entities that would live in other DBs. Duplication was required because we needed to keep the “legacy” code behind a feature flag, while at the same time using new entities that don’t have joins.
  • We had to keep the new entities in a new namespace, because of how Doctrine works.
  • Caching might be a problem if you rely on code that caches based on connection/DB, etc.

Bonus 2: Some stats

If you’re anything like me, you like graphs. Here are some (all from the main cluster’s master).

MySQL command counter – We started working on cleaving data away in March. As you can see, even though our traffic kept going up, we’ve managed to reduce the number of queries over time
Disk IO stats – In July we got to a place where almost no queries trigger disk interaction (both by partitioning and by improving queries that caused temporary tables on disk)
Disk usage – It just looks beautiful, doesn’t it?
Disk utilization


17 Replies to “Your legacy database is outgrowing itself”

    1. Nothing 🙂 I know it sounds weird, we’re doing everything by ourselves. We did try a couple of solutions (Vitess and ProxySQL), but they didn’t do the trick for us for various reasons. So, we decided to be the ones to handle it all, without any 3rd party software.

      1. Interesting! I’m a Customer Success Manager at Continuent and I was thinking chess.com has similarities to some of my customers (Riot Games, Garmin, AdobeSign). https://www.continuent.com/
        Anyway I forgot to say your graphs are beautiful – thank you again for the great read!
        Cheers!
        Sara

    1. With Redis and Clickhouse at your disposal alongside MySQL in 2021, I’m shocked you’re running such a complex master/slave/replica infrastructure of “solely” MySQL.

      1. Hey Paul,

        Of course we’re not “solely” on MySQL 🙂 We do have both Redis (used to extremes) and Clickhouse, as well. We’ve also tested MongoDB, but it failed to support the traffic

  1. Hey Aleksandar, thank you for the in-depth article. I enjoyed reading about your scaling journey.

    I have a question around how you migrated writes to the new partition. You mentioned you had to make all-or-nothing kind of a switch if auto-increment keys were involved. I wanted to understand if dual-writing from your application layer would have made it simpler (and less stressful) to make this migration? This would work with UUID primary keys too. I am sure you must have considered it, but decided to not go this way maybe because of the performance hit, or some other reason.

    I am just curious to understand how you made the decision.

    Thank you again for the write up 🙂

    1. Hey Ketan,

      Yeah, dual writing from the app is definitely possible. And that’s something we did a couple of times (actually, we did one just last week 🙂 ), where applicable (lower traffic tables).

      But in high traffic tables, where we have over 100 inserts per second, auto-increment consistency becomes a problem. There’s no guarantee that records will have the same order in both hosts we would write to, and the order is important (because we have references to those records elsewhere). So, by relying on a single point of write, we let MySQL guarantee the consistency of PKs on all hosts.

      Hope that makes sense, and explains what you asked? 🙂

      1. Hi Aleksandar,

        Yes, that explains what I was asking, thank you for taking out the time to respond 🙂

        I was thinking that one of the reasons you didn’t go ahead with dual writing would be the auto-increment consistency. But, as I am sure you know, I thought one could explicitly pass a primary key while making a write. So, you would write to the primary/sharded database, and then use the ID returned to write to the other database.
        In this case, was your concern with the performance hit (because you have to do two writes in series, and that might affect the latency of your high traffic endpoints) that didn’t justify this as a viable option, or were there other/more considerations? This is what I am curious about.

        I am sorry for prodding you more 😅, I am just trying to learn from your decisions. Thank you again for your kind response 😄

        1. Yeah, you can explicitly set a PK on write, of course. The problem is PHP’s distributed nature and the number of concurrent requests/writes we have. No matter how fast things work, we would still have tons of cases where 2 or more hosts are concurrently handling inserts, and thus each one of them would try to write the next PK. This could be solved by having each host write “different” PKs (like odds and evens, just multiplied by the number of hosts – I’m not sure what the right word is in English 🙂 ), but that just adds even more complexity to the entire thing.
