When Matt and Quin founded Swiftype in 2012, they chose to build the company’s infrastructure using Amazon Web Services. The cloud seemed like the best fit because it was easy to add new servers without managing hardware and there were no upfront costs.
Unfortunately, while some of the services (like Route53 and S3) ended up being really useful and incredibly stable for us, the decision to use EC2 created several major problems that plagued the team during our first year.
Swiftype’s customers demand exceptional performance and always-on availability, and our ability to provide that depends heavily on the stability and reliability of our basic infrastructure. With Amazon we experienced networking issues, hanging VM instances, unpredictable performance degradation (probably due to noisy neighbors sharing our hardware, but there was no way to know) and numerous other problems. No matter what problems we experienced, Amazon always had the same solution: pay Amazon more money by purchasing redundant or higher-end services.
The more time we spent working around the problems with EC2, the less time we could spend developing new features for our customers. We knew it was possible to make our infrastructure work in the cloud, but the effort, time and resources it would take to do so were much greater than what it would take to migrate away.
After a year of fighting the cloud, we decided to leave EC2 for real hardware. Fortunately, this no longer means buying your own servers and racking them up in a colo. Managed hosting providers offer a good balance of physical hardware, virtualized instances, and rapid provisioning. Given our previous experience with hosting providers, we chose SoftLayer. Their excellent service and infrastructure quality, provisioning speed, and customer support made them the best choice for us.
After more than a month of hard work preparing the inter-data center migration, we were able to execute the transition with zero downtime and no negative impact on our customers. The migration to real hardware resulted in enormous improvements in service stability from day one, provided a huge (~2x) performance boost to all key infrastructure components, and reduced our monthly hosting bill by ~50%.
This article will explain how we planned for and implemented the migration process, detail the performance improvements we saw after the transition, and offer insight for younger companies about when it might make sense to do the same.
Preparing For The Switch
Before the migration, we had around 40 instances on Amazon EC2. We would experience a serious production issue (instance outage, networking issue, etc.) at least 2-3 times a week, sometimes daily. Once we decided to move to real hardware, we knew we had our work cut out for us because we needed to switch data centers without bringing down the service. The preparation process involved two major steps, each of which is explained in its own section below:
Connecting EC2 and SoftLayer. First, we built a skeleton of our new infrastructure (the smallest subset of servers to be able to run all key production services with development-level load) in SoftLayer’s data center. Once the new data center was set up, we built a system of VPN tunnels between our old and our new data centers to ensure transparent network connectivity between components in both data centers.
Architectural changes to our applications. Next, we needed to change our applications so that they would work both in the cloud and on our new infrastructure. Once the application could live in both data centers simultaneously, we built a data-replication pipeline to make sure the stateful services (databases, search indexes, etc.) in the cloud infrastructure and in the SoftLayer deployment were always in sync.
Step 1: Connecting EC2 And SoftLayer
One of the first things we had to do to prepare for our migration was figure out how to connect our EC2 and our SoftLayer networks together. Unfortunately the “proper” way of connecting a set of EC2 servers to another private network – using the Virtual Private Cloud (VPC) feature of EC2 – was not an option for us since we could not convert our existing set of instances into a VPC without downtime. After some consideration and careful planning, we realized that the only servers that really needed to be able to connect to each other across the data center boundary were our MongoDB nodes. Everything else we could make data center-local (Redis clusters, search servers, application clusters, etc).
Since the number of instances we needed to interconnect was relatively small, we implemented a very simple solution that proved to be stable and effective for our needs:
Each data center had a dedicated OpenVPN server deployed in it that NAT’ed all client traffic to its private network address.
Each node that needed to be able to connect to another data center would set up a VPN channel there and set up local routing to properly forward all connections directed at the other DC into that tunnel.
Here are some features that made this configuration very convenient for us:
Since we did not control network infrastructure on either side, we could not really force all servers on either end to funnel their traffic through a central router connected to the other DC. In our solution, each VPN server decided (with the help of some automation) which traffic to route through the tunnel to ensure complete inter-DC connectivity for all of its clients.
Even if a VPN tunnel collapsed (surprisingly, this happened only a few times during the weeks of the project), only one server would lose its outgoing connectivity to the other DC (for example, one node would drop out of the MongoDB cluster, or a worker server would lose connectivity to the central Resque box). None of those one-off connectivity losses would affect our infrastructure, since all important infrastructure components had redundant servers on both sides.
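To make the per-node routing idea more concrete, here is a minimal Python sketch of the kind of automation that could decide which traffic goes into the tunnel. It is not our actual tooling: the function names, the tun0 device name and the 10.20.0.0/16 subnet are all hypothetical, and the sketch only builds the Linux "ip route" commands as strings instead of running them.

```python
import ipaddress


def tunnel_route_commands(remote_subnets, tunnel_device="tun0"):
    """Build `ip route` commands that send traffic for the other DC into the tunnel.

    `remote_subnets` holds the other data center's private networks in CIDR
    form. The subnets and the device name are illustrative assumptions.
    """
    commands = []
    for cidr in remote_subnets:
        # ip_network() validates the input and raises ValueError if malformed.
        network = ipaddress.ip_network(cidr)
        commands.append(f"ip route replace {network} dev {tunnel_device}")
    return commands


def uses_tunnel(destination, remote_subnets):
    """Return True if `destination` belongs to the other data center."""
    address = ipaddress.ip_address(destination)
    return any(address in ipaddress.ip_network(cidr) for cidr in remote_subnets)


if __name__ == "__main__":
    other_dc = ["10.20.0.0/16"]  # hypothetical private range of the other DC
    for command in tunnel_route_commands(other_dc):
        print(command)
    print(uses_tunnel("10.20.3.4", other_dc))
```

Only destinations inside the other data center's private range are sent through the tunnel, so a collapsed tunnel cuts off one node's cross-DC connections and nothing else.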
Step 2: Architectural Changes To Our Applications
There were many small changes we had to make in our infrastructure in the weeks of preparation for the migration, but having a deep understanding of each and every component helped us make appropriate decisions and reduce the chance of a disaster during the transitional period. I would argue that infrastructure of almost any complexity can be migrated, given enough time and engineering resources to carefully consider each and every network connection established between applications and backend services.
Here are the main steps we had to take to ensure smooth and transparent migration:
All stateless services (caches, application clusters, web layer) were independently deployed on each side.
For each stateful backend service (database, search cluster, async queues, etc.), we had to consider whether we wanted to (or could afford to) replicate the data to the other side, or whether we had to incur inter-data center latency for all connections. Relying on the VPN was always considered the last resort, and eventually we were able to reduce the traffic between data centers to a few small replication streams (mostly MongoDB) and connections to the primary copies of services that could not be replicated.
If a service could be replicated, we would do that and then make application servers always use or prefer the local copy of the service instead of going to the other side.
For services that we could not replicate using their built-in replication capabilities (like our search backends), we implemented replication between data centers in the application itself: we would always write every asynchronous job into the queues for both data centers, and asynchronous workers on each side would pull the data from the queue in their own data center.
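The dual-write approach for services without built-in replication can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not our production code: plain in-memory deques stand in for the real job queues, and the DualWriteQueue class, its methods and the job format are hypothetical names.

```python
from collections import deque


class DualWriteQueue:
    """Toy model of application-level replication through per-DC job queues."""

    def __init__(self, data_centers):
        # One independent queue per data center (in-memory stand-ins).
        self.queues = {dc: deque() for dc in data_centers}

    def enqueue(self, job):
        """Write the same job into the queue of every data center."""
        for queue in self.queues.values():
            queue.append(dict(job))  # each side gets its own copy of the job

    def work(self, data_center, apply_job):
        """Drain the local queue, applying each job to the local backend."""
        queue = self.queues[data_center]
        processed = 0
        while queue:
            apply_job(queue.popleft())
            processed += 1
        return processed


if __name__ == "__main__":
    jobs = DualWriteQueue(["ec2", "softlayer"])
    jobs.enqueue({"op": "index", "doc_id": 1})

    ec2_index, softlayer_index = [], []
    jobs.work("ec2", ec2_index.append)
    jobs.work("softlayer", softlayer_index.append)
    print(ec2_index == softlayer_index)
```

Because workers only read from the queue in their own data center, the only cross-DC traffic in this model is the enqueue itself, and both backends converge on the same data as long as every job reaches both queues.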
Step 3: Flipping The Switch
When both sides were ready to serve 100% of our traffic, we prepared for the final switchover by reducing our DNS TTL down to a few seconds to ensure fast traffic change.
Finally, we switched traffic to the new data center. Requests switched to the new infrastructure with zero impact on our customers. Once traffic to EC2 had drained, we disabled the old data center and forwarded all remaining connections from the old infrastructure to the new one. DNS updates take time, so some residual traffic was visible on our old servers for at least a week after the cut-off time.
A Clear Improvement: Results After Moving From EC2 To Real Hardware
Stability improved. We went from 2-3 serious outages a week down to 1-2 outages a month. Most of the earlier outages were not customer-visible, since we did our best to make the system resilient to failures, but many of them would wake someone up or force someone to abandon family time. With fewer outages, we were able to handle each one more thoroughly, spending engineering resources on making the system more resilient to failures and reducing the chance of any impact on our customer-visible availability.
Performance improved. Thanks to the modern hardware available from SoftLayer, we saw a consistent performance increase for all of our backend services (especially IO-bound ones like databases and search clusters, but CPU-bound app servers as well). More importantly, performance became much more predictable: no sudden dips or spikes unrelated to our own software’s activity. This allowed us to start working on real capacity planning instead of throwing more slow instances at every performance problem.
Costs decreased. Last, but certainly not least for a young startup, the monthly cost of our infrastructure dropped by at least 50%, which allowed us to over-provision some of the services to improve performance and stability even further, greatly benefiting our customers.
Provisioning flexibility improved, but provisioning time increased. We are now able to specify servers that exactly match their workload (lots of disk doesn’t mean we need a powerful CPU). However, we can no longer start new servers in minutes with an API call. SoftLayer can generally add a new server to our fleet within 1-2 hours. This is a big trade-off for some companies, but it is one that works well for Swiftype.
Since switching to real hardware, we’ve grown considerably – our data and query volume is up 20x – but our API performance is better than ever. Knowing exactly how our servers will perform lets us plan for growth in a way we couldn’t before.
In our experience, the cloud may be a good idea when you need to rapidly spin up new hardware, but it only works well when you’re making a huge (Netflix-level) effort to survive in it. If your goal is to build a business from day one and you do not have spare engineering resources to spend on paying the “cloud tax”, using real hardware may be a much better idea.