the last week of fail

published 29 Sep 2014

The last week has been quite a roller-coaster ride when it comes to keeping osu! above water. In the interest of transparency I am writing up the various problems that arose over this period.

this kind of sums things up...

Issue #1

The internal IP address of the main webserver was revealed, which led to an onslaught of DDoS attacks hitting the box directly rather than getting stopped at cloudflare. DigitalOcean has a null route policy, which means that the box is inaccessible for three hours after any DDoS hits. The attackers also managed to find the bancho server IP, likely because it hasn’t changed for many years and was revealed in the distant past.

Daily DDoS attacks knocking osu! off the internet (and null-routing the web server).

null routes are annoying, but unavoidable

Resolution

Figure out how the IP was revealed. Found a few methods and patched them all:

  • PTR records (digitalocean makes these public based on hostname, so a scan of the digitalocean IP range searching for relevant hostnames is feasible). Moral of the story: never name your digitalocean droplets anything distinguishable (use random names).
  • Postfix. This is probably the silliest of all: the IP address was in mail headers, as all mail was sent from that server. To fix this I made a new relay server for mail which removes sensitive header information before performing the final send operation.
  • phpbb. When posting a forum post, phpbb was checking image dimensions with a call to getimagesize. This would run even on remote URLs, which meant the box’s IP was revealed. I removed this dimensions lookup as it wasn’t even being enforced.
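The post doesn't say how the mail relay removes the sensitive headers, so the following is only a minimal sketch of the idea, using Python's standard `email` package. The list of header names and the `scrub_headers` helper are assumptions made for illustration.

```python
from email import message_from_string

# Assumed list of headers that commonly carry the originating host or IP.
SENSITIVE_HEADERS = ("Received", "X-Originating-IP", "X-Mailer")


def scrub_headers(raw_message: str) -> str:
    """Return the message with origin-revealing headers removed."""
    msg = message_from_string(raw_message)
    for name in SENSITIVE_HEADERS:
        # Deleting removes every occurrence; a missing header is a no-op.
        del msg[name]
    return msg.as_string()
```

A relay would run something like this on each message before the final send, so the recipient only ever sees the relay's own hop.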

To combat the DDoS attacks quickly, I ramped up around 10 web servers from a snapshot and had them waiting as hot backups should one of them get null-routed. This allowed for minimal downtime.

This meant reassigning new IP addresses to all services which may have previously been revealed.

I’m quite glad this came up, as patching the above security flaws makes me feel a lot more at ease, going forward.

Issue #2

Over the past few weeks, many of my digitalocean droplets were having sudden IO starvation, during which their IOPS would drop to zero for sometimes up to 5 minutes at a time.

Resolution

Digitalocean support suggested I redeploy the droplets, as my “old” droplets were running on an old version of their hypervisor code (and on older hardware). So I did this, which may have been my biggest - yet unavoidable - mistake.

Issue #3

Redeployed droplets were seeing very high and spiky steal% (yellow in the graph below), suggesting high host contention. I took the redeployment as an opportunity to upgrade the master database (32gb -> 64gb RAM, 12 -> 20 cores), but regardless of this it was performing so badly that the site would often come to a halt during peak.

yellow is stolen cpu time

Resolution

After a 10+ page back-and-forth with digitalocean I really didn’t get anywhere. The final solution was to keep redeploying until performance was satisfactory.

Even after redeploying I am still noticing spikes of steal%, which results in sudden CPU starvation. This is an ongoing issue to which I have no solution (although I do have some leads which I am investigating actively).
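For anyone wanting to watch steal% on their own Linux guests, here is a rough sketch of how the figure can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The helper names are made up for this example, and the sampling interval is left to the caller.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line from /proc/stat into integer counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time stolen by the hypervisor between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted in user/nice, so only sum the first eight.
    total = sum(b[:8]) - sum(a[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (b[7] - a[7]) / total
```

Reading the first line of `/proc/stat` a few seconds apart and passing both strings to `steal_percent` gives the steal share over that window.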

Keep in mind switching master database servers is not an easy task either. It requires synchronising multiple things happening at once: ensuring all slave servers are stopped at the same point in time (while the old master is in read-only mode); switching slaves to the new master and ensuring they are still in sync with it; updating configuration of all services reliant on the database; switching monitoring to understand the new database layout.

I had to do this master switch twice this week, which was a huge time-sink. I was able to partly automate the process along the way, which is a nice bonus.
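The first step of such a switch (checking that every slave has stopped at the same point as the read-only master) is the kind of thing that lends itself to automation. The sketch below is not the actual tooling; it assumes replication positions can be expressed as a log file name plus an offset, and `Position`, `lagging_slaves` and `safe_to_switch` are hypothetical names.

```python
from typing import NamedTuple


class Position(NamedTuple):
    """Assumed replication coordinates: log file name plus byte offset."""
    log_file: str
    log_pos: int


def lagging_slaves(master: Position, slaves: dict[str, Position]) -> list[str]:
    """Return names of slaves that are not at the master's final position."""
    return sorted(name for name, pos in slaves.items() if pos != master)


def safe_to_switch(master_read_only: bool, master: Position,
                   slaves: dict[str, Position]) -> bool:
    """Only proceed once the master is read-only and every slave matches it."""
    return master_read_only and not lagging_slaves(master, slaves)
```

Gating the rest of the procedure on a check like this avoids repointing a slave that would otherwise end up silently out of sync.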

Issue #4

A new kind of DDoS arose which was not being blocked by cloudflare. Someone was making use of a wordpress botnet to flood http requests at osu.ppy.sh. This totaled around 300mbit of incoming requests, which is enough to bring the most powerful of servers to a halt.

that's megabytes, not bits.

Resolution

Initial resolution was to switch cloudflare to “I’m under attack” mode, which forces every visitor’s browser to perform javascript computations before allowing access to the site. This required adding special rules to allow bancho and other services (which can’t perform javascript).

Longer term solution was to add filtering rules at an nginx level to avoid passing such bogus requests on to php workers. This reduces the bulk of the stress on the server allowing it to continue operation even under such an attack.
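The exact rules aren't given here, so as an illustration only, this is the kind of predicate such a filter boils down to, written in Python rather than nginx configuration. It assumes the flood identifies itself with a User-Agent beginning with `WordPress/`, which is an assumption about the traffic and not something stated above.

```python
from typing import Optional

# Assumption: the flood identifies itself with a WordPress-style User-Agent.
BLOCKED_UA_PREFIXES = ("WordPress/",)


def should_drop(user_agent: Optional[str]) -> bool:
    """Return True if the request should be rejected before reaching PHP."""
    if not user_agent:
        # A missing User-Agent is left alone; non-browser clients may omit it.
        return False
    return user_agent.startswith(BLOCKED_UA_PREFIXES)
```

The point of doing this at the front of the stack is that a rejected request costs a string comparison instead of a php worker.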

Update: I found out that this blocking can be done at a cloudflare level using a specific WAF rule.

Issue #5

When I was finally happy with a database deploy at digitalocean (lowish steal%), it went radio silent during peak one night, without any notice. Upon following up with digitalocean support, they said there was a problem with the hypervisor it was running on which in turn triggered a reboot of all droplets running on it.

But not only that, mine rebooted without a kernel, and thus had no networking (apparently). This happened due to another bug at DO’s end involving deploying from snapshots, but that is unimportant.

Resolution

This was completely out of my control. I’m waiting for follow-up on exactly how this happened from DO support, while also looking at my options to switch key infrastructure away (back?) to dedicated hosting, while leaving a hot backup at DO in case of failure. I will likely post an update if/when I decide to migrate to somewhere else.

So let me clarify: DigitalOcean are an amazing host. They offer computing power at prices which make sense, rather than inflated server rental rates that are oh-so-common in today’s market. They have been working very closely with me to overcome the aforementioned issues, going out of their way to do what they can. Their support is so far the most personal and expedient of any datacenter I have tried (and I’ve been around..).

Much of what has occurred has not been their direct fault; mostly a series of unlucky events which happened to overlap. If I do decide to move away, it will not be moving away from DigitalOcean, but from cloud hosting in general (returning to self-managed infrastructure). I would not even consider another cloud provider due to the unrealistic costs.

At the end of the day, I would still recommend DigitalOcean. If you decide to give them a try, you’re welcome to use my referral link to help offset the costs of osu! servers (it gives you free credit too).

Let me also mention that while I felt that my infrastructure was robust enough to withstand such failures, I have determined a few areas which can be improved. And you can count on me to improve those areas.

A quick update

published 28 Aug 2014

And so another month has passed. I probably don’t have as much to tell you as usual since I have been super-busy in real life, but it is all for a good cause! I have been working towards improving my working environment – and also allowing for expanding the osu! team – by renting a small office. This takes quite a bit of paperwork in Japan (especially as a foreigner) so it was quite a celebration for me to actually succeed in this.

Still in the process of moving stuff and getting properly set up, but afterwards I should be able to livestream a whole heap more. We actually have a live camera of the office which you can view here. Keep in mind it will only be set to public at some times. I’ll likely tweet about it if we’re doing something interesting.

As for things on the osu! front:

  • The new update system is completed and mostly live. You can switch to it from the existing test build by clicking the little popup at the main menu.
  • Due to this system going live, the old test build is now officially decommissioned (even though you can still use it for the time being).
  • This new system brings the ability to publish multiple update streams, including experimental ones (like an OpenGL-only build which may be the future of osu!) for testing purposes. Look out for these in the near future!
  • I also managed to get osu! to “install” and update from a single executable, removing the need for the “osume.exe” updater. The result is quite magical!
  • Smoogipoo is beginning the rather huge rewrite of osu!mania to fix all the small issues that exist in the current implementation. He has made good progress on the key binding system and is working on skinning currently. The end result will be a new editor, a better working play mode and an overall better experience.
  • The new osu! website has gone through another iteration and is being designed actively. All I can say for now is that it looks amazing; if you saw the “old” new design then this one is just going to blow your mind!
  • I’ve spent a lot of my time fighting infrastructure issues that are mostly not my fault and very hard to resolve. Things are still pretty stable, so I’d say I’m doing a good job even though you will never hear about it ;).
  • We have begun restructuring the team to resemble how the modding environment will be further down the line. This should hugely streamline the ranking process even before the complete new system is implemented.
  • The osu!idol karaoke contest is running again this year. Huge interest in this, so hurry if you want to take part in it!
  • I’m making good progress on restoring the osu!store and a stock of tablets. Expect to see availability again in early November, all going well. At least in time for the holiday season!
  • RBRat3 made a cool 3D version of the new osu! logo.
  • I learned that osu! can help with hearing loss.
  • Thanks to Rev3Games for an enjoyable trip along the history of the iNiS tapping series and the transition to osu!.

For those that missed it, I also answered quite a few questions in my previous post. Feel free to post more questions there if you have anything sensible to ask!

ask me things

published 15 Aug 2014

I have an aversion to ask.fm, but I do know a lot of people out there have a lot of things they would like to ask. I plan on doing an AMA on reddit some day, but until then I’d like to leave this post here to gather questions in the comments which people would like answers to. I will post follow-up entries here answering the top-voted questions (or any I feel deserve an answer).

You’re welcome to ask any questions with no scope limits.

  • I will not answer questions that are stupid.
  • I will not answer questions from anonymous or invalid email addresses.

A Quick Update

published 01 Aug 2014

What has happened in the last two months?

  • Had an amazing time at Japan Expo, meeting a huge number of French players and hopefully introducing as many new ones!
  • Gave an updated talk about the history of osu!, along with Q&A and live play in dual language (French and English). Watch it here!
  • I answered even more questions on episode 8 of osu!talk. Thanks to ztrot for having me on the show!
  • I received some open source mini-keyboard controllers for osu! from some Chinese users. You can already make them yourself, but I also hope at some point we can offer these for sale at a low price.
  • osu! saw more development activity over the last month from people who aren’t me than ever before. This is very exciting to see, and makes me a little more confident that the osu! codebase isn’t in as bad a state as I perceive it to be!
  • Progress is being made towards an open source osu!. Piece by piece I am separating git repositories of various components so they can be released separately from each other as required.
  • We hit 200k likes on facebook. Hooray!
  • I completely overhauled the banning system behind the scenes to allow for more automation, as it was getting out-of-hand for us to handle manually. The results are promisingly good (or bad, in a sad way).
  • Work continues on a new update/release system which will allow for multiple release streams to exist. Users will be able to switch between stable/beta/cutting-edge. This will also allow for migration to dotnet40 while keeping a compatibility branch on dotnet20 while people migrate across.
  • A new game intro is in the works, including a long-overdue theme song. You may also notice that sound effects have been improved. This is all already live on test build but won’t be available on public for a while.
  • Download and update mirrors are centrally managed and traffic is automatically shifted as servers become available/unavailable. DNS changes are also automated via the Cloudflare API when server issues are detected, reducing downtime to only a couple of minutes.
  • I have been a bit busy with boring stuff like restructuring the way osu! is run as a business to make sure I can keep up with the ever-increasing workload. Trying to get more hands on board to get new features out to you guys faster than ever.
  • Huge kudos to Tom94 for taking my lead and overhauling most of the song select code. The result is a more performant and slicker song select screen than ever before. And it’s only going to get better from here!
  • My sister made me an osu! stamp!
  • Someone used my design documents to make their own version of the osu!arcade unit!
  • We are running another fanart contest aiming to create a bunch of stickers, which may be used around the place in the future (both digitally and physically)!
  • Someone made a programming problem based on CtB.
  • People continue to be dishonest and unbelievably abusive.
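On the mirror management item above: the decision of which mirrors stay in rotation can be sketched roughly as below. This is a guess at the general shape and not the real system; the failure threshold, the `Mirror` class and the helper names are all hypothetical, and the step that would push the result to DNS is left out.

```python
from dataclasses import dataclass

# Assumed threshold: drop a mirror after this many failed checks in a row.
FAILURE_THRESHOLD = 3


@dataclass
class Mirror:
    hostname: str
    consecutive_failures: int = 0


def record_check(mirror: Mirror, ok: bool) -> None:
    """Reset the failure counter on success, otherwise increment it."""
    mirror.consecutive_failures = 0 if ok else mirror.consecutive_failures + 1


def active_mirrors(mirrors: list[Mirror]) -> list[str]:
    """Hostnames that should currently receive traffic."""
    healthy = [m.hostname for m in mirrors
               if m.consecutive_failures < FAILURE_THRESHOLD]
    # If everything looks down, keep all mirrors rather than publish nothing.
    return healthy or [m.hostname for m in mirrors]
```

Requiring several failures in a row before pulling a mirror keeps a single dropped health check from causing needless DNS churn.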

I’m sure I’ve missed quite a bit here, but until next time. Follow me on twitter for more regular updates!

a quick update

published 26 May 2014

I have so much I want to post about, but no time to do so. Let’s continue with quick updates because it’s much better than nothing at all!

  • I’ve decided to focus on open-sourcing the whole of osu! as soon as possible. This means that my previous post mentioning needing new team members is temporarily on hold while I restructure things. Moving forward by cleaning up internals, and releasing some side-projects ahead of time.
  • 16 person multiplayer is coming in an update over the next week. Yay!
  • Tablet orders are going out at a rate of around 50-60 a day. Still plenty in stock, but I’m limiting how many are available each day due to the speed we can ship at.
  • The new store is live in all its glory! I have done my best to make it accessible on all devices down to smartphones. Tablets, stickers and plushies are available, with more products coming soon! Excite.
  • Work continues on the new site. I am focusing on the development of this myself, in an attempt to get beatmap modding into a better state ASAP.
  • The osu! site now uses an image proxy for profile user pages and forum posts. This means it is now 100% SSL traffic, which should make your browser a lot happier (no more broken lock icons in your URL bar). It also means people can’t snoop on you accessing their profiles, and should speed things up a fair bit.
  • Images which are being proxied are also being lazy-loaded. This should reduce the bandwidth you waste on the site by a whole heap. Lazy-loading is done in a way such that it shouldn’t affect your normal browsing experience.
  • I moved the complete admin team to Slack for internal communication. It has been the biggest breakthrough in ages as far as I’m concerned. Managed to get everything integrated in one place, allowing us to handle support tickets, in-game reports, even reply to forum threads without touching the forum itself! It’s 100% linked to bancho and allows us to moderate from smartphones and receive push notifications without a cumbersome IRC client running on low-powered hardware.
  • Catch the Beat World Cup is currently running and we are streaming it each weekend on the osu!live channel. Make sure to tune in! It’s a lot of fun to watch even if you aren’t a CtB player.
  • This amusing conversation happened.
  • A teacher started using osu! in his classroom. Curious stuff.
  • I may be making a guest appearance at a large upcoming convention!

I need to make an important post about my intentions when it comes to open-sourcing osu!, so look forward to that in the coming weeks. A lot of people seem confused and worried about the implications, but you shouldn’t be. I will not fail you!
