Hi, it’s Doug here. We thought it might be an idea for me to précis what’s been going on with the site backend for the last couple of weeks, which has been, um, rather a lot! Firstly, there was the Shellshock bug that was announced a few weeks ago, which really was the starting point for a lot of what’s happened since. Basically, all of the servers that Folksy runs on (from the application servers, to routers, to the image service and cache servers) were potentially vulnerable, and had to be either patched and rebooted or, in some cases, completely scrapped and replaced. Due to the severity of the bug, this all had to be done pretty much immediately. We’re actually quite proud that we were able to rebuild and replace our whole stack within a day but, of course, some issues were bound to arise from such radical and hurried action.
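For the curious: the widely circulated one-line check for the original Shellshock flaw (CVE-2014-6271) worked by smuggling a command into an environment variable disguised as a bash function definition. The sketch below is purely illustrative, not the exact procedure we used on our servers:

```shell
# Classic check for the original Shellshock flaw (CVE-2014-6271).
# A vulnerable bash executes the command hidden after the crafted
# function definition; a patched bash only runs the intended echo.
out=$(env x='() { :;}; echo vulnerable' bash -c 'echo this is a test')
case "$out" in
  *vulnerable*) echo "bash is VULNERABLE - patch now" ;;
  *)            echo "bash appears patched" ;;
esac
```

On an unpatched system the word “vulnerable” appears in the output, which is why the fix had to be rolled out to every machine running bash, immediately.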
At more or less the same time that was happening, we also reached a few tipping points: Folksy’s application servers were beginning to struggle with demand; the image service servers had reached the end of their supported life and needed upgrading and replacing; and the image service cache was no longer big enough for our needs and was starting to fall over. These issues naturally led to a couple of instances where the site became unresponsive, and a few instances of images failing to load while the cache rebuilt itself.
In the case of the Folksy application servers, these were all replaced by servers with twice the power. These are what Folksy is currently running on, and they’re doing very nicely, thank you. There were a couple of issues arising from the replacement work (such as some features of the seller admin pages not working, and custom shop URLs briefly breaking), but these were quickly identified and fixed.
In the case of the image service, the whole backend was rebuilt to scale automatically with demand. The old version of the image service did this too, but the new version uses more up-to-date tools from Amazon Web Services and is much more responsive and easier to configure. We got it up and running within a day, but a few difficulties then persisted for a couple of days. The basic functionality was intact, but people were unable to upload larger images when listing, and were completely unable to upload shop banners and avatars. We actually created the fix for these issues quite quickly, but it was a server configuration change that was non-trivial to implement, and it took us a whole long weekend to work out and deploy on a brand-new platform. This was rather frustrating for a number of sellers, and I’d like to say here, again, that we were (and are) very grateful for your patience.
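To give a flavour of what “scaling automatically with demand” looks like in practice, here’s a minimal sketch of an AWS auto-scaling setup using the AWS command-line tools. Every name, AMI ID, instance size and number here is a made-up placeholder for illustration, not our actual configuration:

```shell
# Hypothetical sketch of an auto-scaling image service on AWS.
# All names, the AMI ID, and the sizes are illustrative placeholders.

# 1. Define what a single image-service instance looks like.
aws autoscaling create-launch-configuration \
  --launch-configuration-name image-service-lc \
  --image-id ami-xxxxxxxx \
  --instance-type m3.medium

# 2. Create a group that keeps between 2 and 8 of them running.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name image-service-asg \
  --launch-configuration-name image-service-lc \
  --min-size 2 --max-size 8 --desired-capacity 2 \
  --availability-zones eu-west-1a eu-west-1b

# 3. Add a policy that launches two more instances under load
#    (normally wired up to a CloudWatch alarm on CPU or traffic).
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name image-service-asg \
  --policy-name scale-out-on-demand \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 2
```

The appeal of this approach is that when demand spikes, new servers appear on their own, and when it drops they are quietly retired, so nobody has to be paged to add capacity by hand.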
The image cache, which sits in front of the image service, has also been replaced by a server that is twice the size and power of the old one. We’re currently monitoring that, but so far it seems to be handling everything admirably, so fingers crossed that that’s another sysadmin issue that can be ticked off for now.
I had originally scheduled all of this work for the month running up to Christmas. I’ve been to a couple of AWS (Amazon Web Services) DevOps events recently to find better ways to build and maintain the Folksy stack, and I was looking forward to implementing a full, end-to-end automated system following all the latest best practices in the month when we traditionally don’t release any new functionality. Then, of course, reality intervened by way of the Shellshock bug and the tipping points we reached, so I had to undertake it all NOW, and in a hurry. Still, the knowledge I brought back from the Amazon events was invaluable in helping to resolve the issues, and I’m still looking forward to designing a better full Folksy stack in the coming months. In particular, it would be nice to get the Folksy application servers running on an auto-scaling infrastructure similar to the one image service now runs on.
Oh, and earlier this week, we also had our search index go down for around half an hour. This was unprecedented, as the service we use for this is normally incredibly reliable (which is indeed why we use it!). They identified and fixed the issue, which was a memory leak, and added more physical memory to the cluster that runs our search index to prevent it happening again.
What all of this has meant is that the new features we were hoping to get released in the last few weeks have had to take a back seat to system administration. I hate it when this happens, because it feels like development has stalled on the site, but sometimes it really is unavoidable. However, those new features haven’t gone away, and we have even managed to keep developing them in the background, so we’re looking forward to getting those released in the next couple of weeks.
I can’t tell you what any of them are, here, but what’s a couple of weeks’ wait between friends? :)
Finally, I’d like to say another really big thank you to all of our users who have been so patient while we fixed the issues outlined above. It really is appreciated. I’m not the most visible member of the Folksy team, but knowing that the goodwill of the Folksy community is behind you really does help to take the edge off the horrible stress you feel when alarms and system alerts start ringing late at night! And I hope this post reassures you that not only have we been working really hard to fix any problems that arise, but we’ve also been working really hard to stop them arising in the future.