Norway SgtKabukiman 4 years ago

Moin everyone,

most of you reading this will probably have noticed the sudden performance problems and inconsistency bugs throughout the site. Let me just give a little bit of background as to what happened during the last two days.

We have been struggling to keep up with the growth of the speedrunning community for a while now and are constantly looking for ways to improve the site's performance. Right now the site is a big monolith, running on a single, beefy server. To reduce that server's load, we rented a second dedicated machine just to host the database. On Saturday we made the switch and moved MariaDB over to the new server.

In the hours following that, we saw performance decrease substantially, e.g. mean page load times went from 100 ms to 400 ms. To give the new server time to warm up its caches, we decided to let the experiment run until Monday morning, closely watching the server's metrics in the meantime.

Unfortunately the situation did not improve, and we saw many more failed and slow requests during the nights. This, plus the fact that user registration, game requests and other features broke, made us decide to revert the experiment a few hours ago. We moved the database back and the site should be behaving normally (meaning "not great, but certainly much better than over the weekend") again.

We've learned that the database indeed requires 80-90% of this server's performance and that all other services (nginx, Redis, PHP, e-mail) are negligible in terms of memory and CPU usage. We also learned how valuable our monitoring is and improved that setup a lot over the weekend.

Our next action items are

  • Debug why moving the database caused weird consistency errors. We configured MariaDB to run in full ACID compliance, so we don't expect transactions to just disappear without any kind of error.
  • Improve our database queries in general. We had a 100 Mbit/s connection between the two servers and our queries alone nearly saturated that link. We could also see that MariaDB spent a considerable amount of time just sending out packets. Reducing the result set sizes should free up some time for actual query logic.
  • Further improve the caching layer and make more use of Redis in general. During the weekend we saw that the API rate limiting was causing 90% of all locks in MariaDB, and just moving that logic to Redis was a small, yet quick win for the performance.
  • We will most likely run a similar experiment with a dedicated database server in the future. We are thinking about replication and running multiple read-slaves, but still fear the additional complexity in our setup.
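
To illustrate the rate limiting point above: a minimal sketch of a fixed-window limiter built on the Redis commands INCR and EXPIRE. This is not the site's actual code. The key format, the limits and the `allow_request` helper are assumptions, and `FakeRedis` is a tiny in-memory stand-in so the example runs without a server. A redis-py client offers `incr` and `expire` calls of the same shape and could be passed in instead.

```python
import time


class FakeRedis:
    """In-memory stand-in offering just the two Redis commands used below."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}  # key -> [value, expires_at or None]

    def _purge(self, key):
        # Drop the key if its time-to-live has run out.
        entry = self._data.get(key)
        if entry and entry[1] is not None and entry[1] <= self._clock():
            del self._data[key]

    def incr(self, key):
        self._purge(key)
        entry = self._data.setdefault(key, [0, None])
        entry[0] += 1
        return entry[0]

    def expire(self, key, seconds):
        self._purge(key)
        if key in self._data:
            self._data[key][1] = self._clock() + seconds
            return True
        return False


def allow_request(store, client_id, limit=100, window=60):
    """Return True while client_id stays within `limit` calls per window."""
    key = f"ratelimit:{client_id}"  # hypothetical key naming scheme
    count = store.incr(key)
    if count == 1:
        # First hit of a new window: start the countdown.
        store.expire(key, window)
    return count <= limit
```

Each API call then costs one counter increment in Redis and no row lock in MariaDB. A production version would likely run INCR and EXPIRE atomically (for example in a transaction), so a crash between the two calls cannot leave a counter without an expiry.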

We apologise for the disruptions over the weekend. As they say, you can't make an omelette without breaking eggs.

-- The Team

andyrockin123, Goodigo and 20 others like this
Norway SgtKabukiman 5 years ago

I was watching Highspirits' any% run and noticed that it's classified as PS3 -- but in the VOD he says he's playing on PS4 (I cba to find the exact timestamp). From the accidental save screens it sure does look like the PS4 as well.

Norway SgtKabukiman 7 years ago

Hi everyone,

we have just changed how timezones are handled throughout the site. Until now, we used a cookie to render times in your timezone on the server. Over time, this caused all sorts of issues, starting with endless reload loops on the PS4 and some other browsers and ending in a lot of hacks to handle DST changes. It also prevented us from caching HTML output, as that output depended on each user's preferences.

On top of that, when the site was moved to a new server, we also switched from running in CET/CEST (UTC+1/2) to using UTC. The many hacks throughout the site to compensate for the original timezone led to some problems (like new run times being off by an hour), which motivated us to finally do something about it.

The old cookie-based approach was replaced by outputting HTML5 <time> tags and using JavaScript to convert those into the browser's timezone. There are two minor downsides to this approach:

  • On large pages with lots of times (like the forums index), this takes a few milliseconds and the page is delayed by a moment. In most cases you won't notice it.
  • Users without JavaScript enabled will not see localized times anymore, but rather UTC values. As many things on the site depend on JavaScript, we don't think this will affect many users.
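
As a rough illustration of the server side of this approach, here is a minimal sketch that renders a timestamp as an HTML5 <time> tag in UTC. It is not the site's code. The `render_time_tag` helper and the exact fallback text format are assumptions. A script in the browser would then read the `datetime` attribute and rewrite the visible text in the local timezone.

```python
from datetime import datetime, timedelta, timezone


def render_time_tag(moment):
    """Render a timezone-aware datetime as an HTML5 <time> element in UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    machine = utc.strftime("%Y-%m-%dT%H:%M:%SZ")   # read by the script
    fallback = utc.strftime("%Y-%m-%d %H:%M UTC")  # shown without JavaScript
    return f'<time datetime="{machine}">{fallback}</time>'


# A time given as 19:30 in CET (UTC+1) is emitted as 18:30 UTC.
cet = timezone(timedelta(hours=1))
print(render_time_tag(datetime(2016, 1, 15, 19, 30, tzinfo=cet)))
# prints: <time datetime="2016-01-15T18:30:00Z">2016-01-15 18:30 UTC</time>
```

Because the markup no longer depends on who is viewing the page, the same HTML can be cached and served to everyone.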

A note to marathon managers: We noticed that times seem off by 1-2 hours for newer marathons (with our schedule being the only source for the start time, it's hard to verify if it's correct ;-)). Please check your schedules and make sure the start time is correct. Please note that you need to configure it in UTC, not -- as earlier -- in your local timezone.

Havi, zoton2 and 5 others like this
Norway SgtKabukiman 8 years ago

Welcome back, oh dearest and most patient friends,

it took a while, but the site should now be a little bit more secure. I was focussing on the most pressing issues, trying to find a balance between reworking most of the things and making the site and its data available again.

So please, if you find security issues, do not hesitate to tell us in private. I promise, security reports will be taken seriously and we will fix them ASAP.

For now, the following things have changed:

  • In most places, instead of removing a bunch of seemingly evil characters, HTML encoding is now in place. With this, we now allow for basically all characters in game/category/variable names. Usernames are still restricted, though. This might change in the future.
  • The username/password cookies are gone. If you still have those, they will be automatically removed (so to be 100% accurate: if you are reading this, your cookies are already gone). Instead, we now issue simple session cookies that will be deleted when you close your browser. Yes, this means you now have to log in more frequently. The session cookies are httponly, so they're not accessible from JavaScript (and hence can't be stolen through XSS attacks).
  • Instead of MD5, passwords are now hashed using bcrypt (with a cost factor of 10). All existing hashes will be automatically upgraded to bcrypt on the first login of each user. Using bcrypt instead of MD5 dramatically improves password security in case an attacker gains access to the database.
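
The upgrade-on-first-login idea from the last point can be sketched roughly as follows. This is an illustration and not the site's implementation: PBKDF2 from Python's standard library stands in for bcrypt here (bcrypt is not part of the stdlib), the old hashes are assumed to be plain unsalted hex MD5 digests, and the function names, the `pbkdf2$` storage format and the iteration count are all made up for the example.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor for the stand-in hash


def md5_hash(password):
    """Legacy format (assumed): unsalted hex MD5 digest."""
    return hashlib.md5(password.encode()).hexdigest()


def strong_hash(password, salt=None):
    """PBKDF2 stand-in for bcrypt; returns 'pbkdf2$<salt>$<digest>'."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return f"pbkdf2${salt.hex()}${digest.hex()}"


def verify_and_upgrade(password, stored):
    """Return (ok, hash_to_store). Legacy hashes are replaced on success."""
    if stored.startswith("pbkdf2$"):
        _, salt_hex, _ = stored.split("$")
        candidate = strong_hash(password, bytes.fromhex(salt_hex))
        return hmac.compare_digest(candidate, stored), stored
    # Legacy path: the plaintext is only known at login, so this is the
    # one moment an old MD5 hash can be replaced with a stronger one.
    if hmac.compare_digest(md5_hash(password), stored):
        return True, strong_hash(password)
    return False, stored
```

On a successful login the caller would write the returned hash back to the user's row. Accounts that never log in again keep their old hash until they do.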

With all that being said, there are still open issues:

  • CSRF attacks are still possible. It will take time to convert all state-changing requests to POST and introduce a CSRF token. We're working on it.
  • Everything is still using HTTP. I'm not aware of concrete plans to change this. Using CloudFlare's "halfass" SSL would be an option, even though I personally would much rather see a simple cert on the site itself.
  • It's very possible that I introduced a few bugs into the site. I'm sorry, but that's the way things are. Please report them, so we can fix them.
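
For the CSRF point above, a minimal sketch of the usual session-bound token check might look like this. It is an assumption-laden illustration, not the planned implementation: the session is modelled as a plain dict, and the `csrf_token` key and both function names are hypothetical.

```python
import hmac
import secrets


def issue_csrf_token(session):
    """Store a fresh random token in the session and return it for the form."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid_post(method, session, submitted_token):
    """Accept only POST requests whose token matches the session's token."""
    expected = session.get("csrf_token")
    if method != "POST" or not expected or not submitted_token:
        return False
    # Constant-time comparison, so timing does not leak how much matched.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())
```

The token would be embedded as a hidden field in every form. A page on another site can make the browser send the session cookie, but it cannot read the token, so its forged request fails the check.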

Thanks for your patience during the outage.

Lighnat0r, Joshimuz and 14 others like this
Norway SgtKabukiman 8 years ago

I was instructed in #tdawg91 to post this:

Thank you.

Zachoholic and guywith like this
Norway SgtKabukiman 9 years ago

Hi there,

for Kabukibot it would be most awesome to have a way of getting the world records in a computer-readable format, e.g. via an RSS feed (that could be useful for others as well) or a small, read-only API. I would need the run time, the player name and the date for my bot. An RSS feed should have the video embedded as well, though.

Greetings, Sgt. Kabukiman

About SgtKabukiman
9 years ago
3 years ago