Post Mortem: New database server
5 years ago
Norway

Moin everyone,

Most of you reading this will probably have noticed the sudden performance problems and inconsistency bugs throughout the site. Let me give a little background on what happened during the last two days.

We have been struggling to keep up with the growth of the speedrunning community for a while now and are constantly looking for ways to improve the site's performance. Right now the site is a big monolith, running on a single, beefy server. To reduce that server's load we rented a second dedicated machine just to host the database. On Saturday we made the switch and moved MariaDB over to the new server.

In the hours following the switch we saw performance decrease substantially, e.g. mean page load times went from 100ms to 400ms. To give the new server time to warm up its caches, we decided to let the experiment run until Monday morning, closely watching the server's metrics in the meantime.

Unfortunately the situation did not improve, and we saw a lot more failed and slow requests throughout the nights. This, plus the fact that user registration, game requests and other features broke, made us decide to revert the experiment a few hours ago. We moved the database back and the site should be behaving normally (meaning "not great, but certainly much better than over the weekend") again.

We've learned that the database indeed accounts for 80-90% of this server's load and that all other services (nginx, Redis, PHP, e-mail) are negligible in terms of memory and CPU usage. We also learned how valuable our monitoring is and improved that setup a lot over the weekend.

Our next action items are:

  • Debug why moving the database caused weird consistency errors. We configured MariaDB to run in full ACID compliance, so we don't expect transactions to just disappear without any kind of error.
  • Improve our database queries in general. We had a 100 Mbit/s connection between the two servers and our queries alone nearly saturated that link. We could also see that MariaDB spent a considerable amount of time just sending out packets. Reducing the result set sizes should free up some time for actual query logic.
  • Further improve the caching layer and make more use of Redis in general. During the weekend we saw that the API rate limiting was causing 90% of all locks in MariaDB, and just moving that logic to Redis was a small but quick win for performance.
  • We will most likely run a similar experiment with a dedicated database server in the future. We are thinking about replication and running multiple read-slaves, but still fear the additional complexity in our setup.
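
To illustrate the rate-limiting item above, here is a minimal sketch of a fixed-window counter of the kind that can live in Redis instead of in database rows. It is not the site's actual implementation: the class names, the `ratelimit:` key prefix, the limit and the window length are all assumptions made for the example. The limiter only needs a store with `incr` and `expire` methods, which a redis-py client provides; a small in-memory stand-in is included so the sketch runs without a Redis server.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for the two Redis commands used below (INCR, EXPIRE)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._values = {}   # key -> current counter value
        self._expires = {}  # key -> absolute time at which the key disappears

    def _purge_if_expired(self, key):
        deadline = self._expires.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._values.pop(key, None)
            self._expires.pop(key, None)

    def incr(self, key):
        # Like Redis INCR: a missing key counts as 0 before incrementing.
        self._purge_if_expired(key)
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def expire(self, key, seconds):
        # Like Redis EXPIRE: set a time-to-live on an existing key.
        if key in self._values:
            self._expires[key] = self._clock() + seconds


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each window (assumed policy)."""

    def __init__(self, store, limit, window_seconds):
        self.store = store
        self.limit = limit
        self.window_seconds = window_seconds

    def allow(self, client_id):
        key = f"ratelimit:{client_id}"  # hypothetical key naming scheme
        count = self.store.incr(key)
        if count == 1:
            # First request of a new window: start the countdown.
            self.store.expire(key, self.window_seconds)
        return count <= self.limit


if __name__ == "__main__":
    limiter = FixedWindowRateLimiter(InMemoryStore(), limit=3, window_seconds=60)
    for attempt in range(5):
        print(attempt, limiter.allow("example-api-key"))
```

With a real Redis client the counter never touches MariaDB, so it cannot contribute to row locks there. Note that running `incr` and `expire` as two separate calls is not atomic; a production setup would likely wrap them in a pipeline or a small Lua script.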

We apologise for the disruptions over the weekend. As they say, you can't make an omelette without breaking eggs.

-- The Team


cool but when add pm system tho

France
xDrHellx
He/Him, It/Its
5 years ago

[quote] cool but when add pm system tho [/quote]

Honestly, even if it would be nice to have that, I think it's not really necessary anymore. Everyone has Discord or Twitter nowadays (and if they don't, making accounts is fast and simple, and gives lots of advantages).

Canada

Thank you for this. Progress can be a little slow around here sometimes (not criticizing, I fully understand why. It's a big, complicated site that has a lot of needs and not a lot of people working on them), and it can be a little disconcerting not knowing what, if anything, is being done to improve the site. While the experiment was unsuccessful, it's really nice to see the active efforts being taken to fix the server load issue (as well as some insight into why it's not a simple fix).

Netherlands

@xDrHellx On the contrary, I think a message system would be very welcome. Having a single point of internal communication for every user on the website would be very nice. Discord, Twitter and other socials currently serve as alternative contact methods simply because the site does not offer such functionality, and I feel these really are also sources that serve more as a reliability rather than a necessity.

Please also realize that a lot of the work on this website is done by dedicated people in their free time during their potentially busy lives. As much as a message system would be welcome, there are topics that require more priority right now, such as the things SgtKabukiman listed.

Minnesota, USA

As much as this was needed, I think a warning could have been issued. Explaining to people what was coming would have been better, because some marathons I'm involved in are in a panic over this rollback. Glad to see efforts being made, but maybe let us know ahead of time.

Valencia, Spain

I agree with Zojalyx, at least a warning to the organizations could've been helpful.

Some marathons kept all their information, but in the case of Distant Star Cares, all the submissions received over more than 2 weeks have disappeared.

Don't misunderstand me, I really appreciate all the work and effort you're putting into the site, but a simple warning would have saved other people and organizations a lot of work and time.

Regarding the message system @xDrHellx, I found it very helpful.

Yes, you can have Twitter, Twitch, YouTube and a lot of other social media, but not everyone is active there, writes their names correctly, or updates them when they change their name or account, while at the same time the user is active here on speedrun.com.

By the same reasoning that everyone has a Twitter or Discord account, why should sr.com have Forums or Resources for each game if those resources can be found in the respective Discord server as well?

It makes no sense to reject an internal way of contact, even a simple one, that could solve a lot of miscommunication between users (especially with those leaderboard moderators who don't have any other social media or barely use it).

It could even help with the moderation requests you can see every day in this thread: https://www.speedrun.com/The_Site/thread/63nr7/339

Some of them would be really easy to solve if that PM system existed.

Germany

Can we not turn a sticky about server-side site improvements into the 20th thread about the messaging system in the past few months? We've had plenty of staff explain why and where that's stuck various times, and it has literally nothing to do with the topic at hand. Never mind that a messaging system requires a database, so it would've made the situation during the experiment a TON worse and could've led to leaking personal data shared through DMs.

European Union

[quote] could've led to leaking personal data shared through DMs [/quote]

Ya mean, like that other incident recently?
