5 Lessons We’ve Learned Using AWS

In my last post I talked about some of the reasons we chose AWS as our computing platform. We’re about one year into our transition to AWS from our own data centers. We’ve learned a lot so far, and I thought it might be helpful to share with you some of the mistakes we’ve made and some of the lessons we’ve learned.
1. Dorothy, you’re not in Kansas anymore.
If you’re used to designing and deploying applications in your own data centers, you need to be
prepared to unlearn a lot of what you know. Seek to understand and embrace the differences operating in a cloud environment.
Many examples come to mind, such as hardware reliability. In our own data centers, session-based memory management was a fine approach, because any single hardware instance failure was rare. Managing state in volatile memory was reasonable, because it was rare that we would have to migrate from one instance to another. I knew to expect higher rates of individual instance failure in AWS, but I hadn’t thought through some of these sorts of implications.
Another example: in the Netflix data centers, we have a high capacity, super fast, highly reliable
network. This has afforded us the luxury of designing around chatty APIs to remote systems. AWS networking has more variable latency. We’ve had to be much more structured about “over the wire” interactions, even as we’ve transitioned to a more highly distributed architecture.
2. Co-tenancy is hard.
When designing customer-facing software for a cloud environment, it is all about managing down expected overall latency of response. AWS is built around a model of sharing resources; hardware, network, storage, etc. Co-tenancy can introduce variance in throughput at any level of the stack. You’ve got to either be willing to abandon any specific subtask, or manage your resources within AWS to avoid co-tenancy where you must.
Your best bet is to build your systems to expect and accommodate failure at any level, which introduces the next lesson.
3. The best way to avoid failure is to fail constantly.
We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
4. Learn with real scale, not toy models.
Before we committed ourselves to AWS, we spent time researching the platform and building test systems within it. We tried hard to simulate realistic traffic patterns against these research projects.
This was critical in helping us select AWS, but not as helpful as we expected in thinking through our architecture. Early in our production build out, we built a simple repeater and started copying full customer request traffic to our AWS systems. That is what really taught us where our bottlenecks were, and some design choices that had seemed wise on the white board turned out foolish at big scale.
We continue to research new technologies within AWS, but today we’re doing it at full scale with real data. If we’re thinking about new NoSQL options, for example, we’ll pick a real data store and port it full scale to the options we want to learn about.
5. Commit yourself.
When I look back at what the team has accomplished this year in our AWS migration, I’m truly amazed. But it didn’t always feel this good. AWS is only a few years old, and building at a high scale within it is a pioneering enterprise today. There were some dark days as we struggled with the sheer size of the task we’d taken on, and some of the differences between how AWS operates vs. our own data centers.
As you run into the hurdles, have the grit and the conviction to fight through them. Our CEO, Reed Hastings, has not only been fully on board with this migration, he is the person who motivated it! His commitment, the commitment of the technology leaders across the company, helped us push through to success when we could have chosen to retreat instead.
AWS is a tremendous suite of services, getting better all the time, and some big technology companies are running successfully there today. You can too! We hope some of our mistakes and the lessons we’ve learned can help you do it well.

C Is For Compute – Google Compute Engine (GCE)

After poking around the Google Compute Engine(GCE) documentation I had some trouble creating a mental model of how GCE works. Is it like AWS, GAE, Rackspace, just what is it? After watching Google I/O 2012 – Introducing Google Compute Engine and Google Compute Engine — Technical Details, it turns out my initial impression, that GCE is disarmingly straightforward, turns out to be the point.

The focus of GCE is on the C, which stands for Compute, and that’s what GCE is all about: deploying lots of servers to solve computationally hard problems. What you get with GCE is a Super Datacenter on Google Steroids.

If you are wondering how you will run the next Instagram on GCE then that would be missing the point. GAE is targeted at applications. GCE is targeted at:

  • Delivering a proven, pure, high performance, high scale compute infrastructure using a utility pricing model, on top of an open, secure, extensible Infrastructure-as-a-Service.
  • Delivering an experience that feels like you are in a datacenter and not at creating a massively multi-tenant cloud.
  • Allowing you to become Google. Tackle the same problems Google tackles with the same infrastructure, minus all the data and people of course.
  • Standing up VM instances quickly, do your work, and tear them down quickly.
  • Performing better and better as cluster gets bigger. Google considers large clusters to start at 10-20K instances.
  • Being a compute utility. You get resources affordably because of Google’s efficiency at scale.
  • Consistent performance.  Google has pioneered consistent performance at scale and they are making a huge deal of this and it’s mentioned several times in the demos. GCE is tuned for both high and consistent performance throughout the stack. The idea is you don’t have to design for unstable or inconsistent system, so don’t have to design for worst case. This allowed some customers to reduce their number of cores in half.
  • Giving you a set of servers you can run anyway you want.
  • Creating a technology you can bet your business on. Google is running Google business on the stack today.

Basic Overview Of GCE

  • Customers
    • Targeted at problems using large compute jobs, batch workloads, or that require high performance real-time calculations. Not building websites. In the future they plan on adding more features like load balancing.
    • Right now it’s about work that can be parallelized. Will provide vertical scaling in the future, that is 32+ cores.
    • Seem to want enterprise customers that can make use of lots of cores, not little guys.
  • Datacenters
    • Region: for geography and routing domain.
    • Zone: for fault tolerance
    • Currently operating 3 US datacenters/zones, located on the East coast of the US.
    • Working on adding more datacenters globally and adding more datacenters in the US.
  • API
    • JSON over HTTP API, REST-inspired, authorization is with OAuth2
    • Main resources: projects, instances, networks, firewalls, disks, snapshots, zones
    • Actions GET, POST (create), DELETE, custom verbs for updates
    • A command line tool (gsutil), a GUI, and a set of standard libraries gives access to the APIs. Experience is like Amazon in that you have an UI and command line tools.
    • All Google tools use the API. There is no backdoor. The web UI is built on Google App Engine, for example. App Engine is the web facing application environment and is considered an orchestration system for GCE.
    • Partners like RightScale, Puppet, and OpsCode, also use the API to provide higher level services.
    • Want people to take their code and run it on their infrastructure. Open API. No backdoors. Can extend that stack at any level.
  • Project
    • Everything happens within the context of a Project: team membership, group ownership, billing. A Project is a container for a set of resources that are owned by the Project and not by people. Every API action is traced back to a person instead of a credential.
  • Service Account
    • Synthetic identity acting as a user when performing operations in code. Connects seamlessly with GAE, Cloud Storage, Task Queues, and other Google services.
    • When launching a VM an OAuth2 scope is provided that is stored in a special metadata server that is used transparently between services. No configuration or password is required.
  • Virtual Machine
    • Linux virtual machines with root access. For security and performance reasons the kernel is locked down. The kernel is tuned to work with their networking environment.
    • Two stock versions of Linux: Ubuntu and CentOS. They say you can run whatever Linux distribution you want, but I’m not sure how that fits with locked down kernel policy.
    • Comes installed with gsutil, turned off password authentication so only use ssh authentication is used, turned on automatic security updates.
    • High performing 2.6 GHz Intel Sandy Bridge processor.
    • Available in 1,2,4, or 8 virtual CPUs. Each virtual CPU is mapped to a hyperthread.  For a 2 CPU instances you get both halves of a real physical core.
    • 3.7GB RAM per core. 420GB local/ephemeral storage.
    • 8 core instances have dedicated spindles. You are the only one reading and writing from the disk, so you have more predictable/consistent performance.
    • Invented performance unit: the Google Compute Engine Unit (GQ). Roughly matches Amazon’s compute unit. Each virtual CPU is rated at 2.75 GQs.
    • Smaller machines will be available for prototyping and debugging.
    • Big boxes because focussed on high performance computing.
  • Instances
    • A combination of KVMs (Kernel Virtual Machines) and Linux cgroups are used for the underlying hypervisor technology. Linux scheduler and memory manager are reused to handle the scheduling of the machines.
    • KVM provides virtualization. Cgroups provides resource isolation. Cgroups was pioneered by Google to keep workloads isolated from each other.
    • Internally Google can run virtualized and non-virtualized workloads on the same kernel and on the same machine, which allows them to deploy and test one single kernel.
    • Located in a zone.
    • Fast boot times: 2 minutes.
  • Instance Metadata
    • Solving the configuration problem to customize VMs at boot time.
    • A dictionary of key-value pairs are available on the instance via a private HTTP metadata server just for that machine. This metadata can be set for the instance to control its boot/configuration/role process. Can be read using curl.
    • Project wide metadata is also available that is inherited by all instances. Used to push SSH keys into VM at boot time. A default image knows how to read a special bit of metadata called SSH Keys and then installs them into the VM.
  • Startup Scripts
    • Simple bootstrapping scripts, similar to rc.local, that run on boot.
    • Use to install software and start other software.
  • Service Orientation, not Server Orientation
    • Build across zones to deal with failure.
    • Use startup scripts and metadata for automatic configuration.
    • Use local disk as a cache or scratch area.
    • Build automation using GAE or their partners.
  • Networking – VPN
    • Google considers their network a distinguishing feature. It features high cross sectional bandwidth, that is, machines can talk more directly to each other without competing with neighboring traffic on a bus. This reduces network latency and increases the consistency of performance. They won’t publish any numbers though.
    • Each project gets its own secure VPN that is unshared with anyone else. Spans across all your VMs, no matter where they are.
    • Networking traffic does not transit the Internet. It is routed over Google’s secure, high performance private network.
    • Network is all L3 using private IP addresses that are guaranteed to come from a machine on your VPN.
    • VM name = DNS name. VMs have normal looking hostnames that you can assign and use the DNS to find. This is very convenient when bringing up an arbitrary set of hosts.
    • IPv6 in the future.
    • You can have many VPNs per project, but by default there is one called default that is used by default.
    • Broadcast and multicast are not supported, which if you have a VPN removes a lot of interesting architectures. Maybe with v6?
  • Networking – Internet
    • Traffic from the Internet to your machine is shunted on to Google’s private network as soon as they can and given a “first class” ticket to your VPN. This is like an overlay network you see on CDNs.
    • 1-to-1 NAT. Every VM can be assigned an external IP address that is rewritten as it enters and exits your VPN. They don’t exist on the VM when you do an ifconfig.
    • IP addresses can be detached from a VM in one region and attached to a VM in another region and Google will make sure the traffic is routed properly.
    • Built in firewall to control who talks to what in the system.
    • Can’t use SMTP. Only UDP, TCP, and ICMP can be used to the Internet.
    • IP addresses are advertised with Anycast, then they encapsulate it, and then forward it to your VPN.
  • Storage
    • Focused on creating persistent block device that offers performance / throughput so you don’t need to push storage local.
    • Two block storage devices: Persistent Disk and Local Disk.
  • Persistent disk
    • Off instance durably replicated storage medium. High consistency. High throughput solution. Secure. Backing store for database. Built from scratch to be highly performant and gives good 99.95 percentile performance.
    • Allocated to a zone.
    • Can be mounted read/write to a single instance or read only to a set of instances.
    • Data is transparently encrypted when it leaves your VM, before it is written to disk. Using new processors there’s very little to no overhead. It seems to use Google keys and not your keys.
    • Less than 3% variance in IO bandwidth when doing 4K random reads and writes. This is their consistency theme. Less variance than a local disk, which can vary by 13%.
    • For large block read and writes there’s triple the local bandwidth compared to local disk.
  • Local/ephemeral disk
    • Ephemeral on reboot. When the VM goes away the data goes away.
    • It’s encrypted using a VM specific key.
    • Currently all instances boot of off local disk, looking to boot off of persistent disk in the future.
    • 3.5TB with the 8 CPU instance.
    • With larger instances (4-8 core) you get dedicated spindles. One spindle with the 4 core instance and 2 spindles with the 8 core instance.
  • Google Cloud Storage
    • Enterprise grade Internet object store.
    • HTTP API for getting and setting values.
    • Don’t have to worry about managing data. Replication is happening for you.
    • Publicly readable objects are cached close to where they will be used. Sounds a bit like a CDN. Data will be replicated to where it is needed and available quickly.
    • Uses Google global high performance internet backbone.
    • Read your writes consistency.
    • Bulk data. Useful for getting data in and out of Google’s cloud using Google’s high capacity pipes.
  • Pricing
    • 50% more compute for your money when compared to AWS.
    • Billed on demand by the hour.
    • SLA and support open to commercial customers.

Examples Of GCE Usage

Invite Media

Runs a real-time ad exchange that has a very high volume of traffic, 400K QPS,  and as with all real-time markets requires consistently low latencies, 150ms end-to-end, in order to calculate the best deals. For each ad request they have time budget of 10ms to find a backend server to serve the request and establish a connection.

Found the GCE model familiar. You have Linux VMs, you have disks, you can assign static IPs, create startup scripts, and have a nice API. Took two weeks to port their system to GCE.

Comparing existing provider with GCE, using 8 core instances:

  • 350 QPS vs 650 QPS (while respecting latency requirements)
  • 284 machines vs 140 machines
  • 5% connection errors vs < .05%
  • 11% of requests timed out vs 6% – means 5 percent more ad requests they can buy for advertisers

Decided to migrate entire operation to GCE.

Hadoop On GCE

This is example code created by someone at Google and will be released in the future.

  • Can run from command line or GAE.
  • Launch a coordinator has an API to set up all the other VMs in the cluster (100 nodes), monitor, etc.
  • Booting from a fresh Ubuntu image the setup was pretty fast. The coordinator installs Hadoop and launches nodes. Took a while, but relatively quick.
  • Launched a job on Hadoop master to process 60GB of compressed wikipedia revision history. Slices data in CSV format. Took 1.5 minutes writing 70GB of data.
  • The CSV is piped into Big Query to answer questions like which wikipedia article had the most edits, who are the top editors, and other interactive questions.

Video Transcoding

This is very common cloud demo.

  1. Video loaded into a job queue.
  2. Consumers, and you can run a lot on GCE, take job and perform the transcode.
  3. Transcoded video is sent to the Google storage service.

MapR On Terasort

MapR ran the Terasort benchmark on a 1250 node cluster in 1:20 minutes at a cost of $16. This was near record performance and they estimate to buy the same hardware to run the test locally would cost nearly $6 million.

They found GCE blazing fast with great disk drive disk and network bandwidth. They were able to provision thousands of VMs in minutes

BuildFax

Put their database and production servers on GCE. They are very pleased with the consistent performance. Their service delivers insurance related data points to customers at the time they write policies. Results were returned in less than 4 seconds with a very low variance. Again, this is the consistent performance claim.

Observations

  1. With GCE Google has designed an experience familiar to Amazon users, with some nice second system improvements in configuration and operations, and a lot of special Google sauce in performance.
  2. Better late than never. GCE is late to the game, but it has a strong performance, pricing, and development model story that often helps with customers wins over first to market entrants. If you need huge scale and/or great performance then why wouldn’t you consider GCE? Performance requires carefull design from the start. It’s hard to add in later. And after all of Google’s bragging about their cool infrastructure this is your chance to give it a spin and see what it is made of.
  3. Kind of bummed that it’s not targeted more at front facing websites. There’s no reason you can’t run a website in GCE it seems, but unlike AWS you won’t get a lot of help. Like in the early days of EC2 it’s all up to you, but that’s probably OK for a lot of people.
  4. As Google deals with more and more customers can they maintain quality? As we’ve seen, most things go bad when problems occur and a lot of traffic is flowing through the system. Shared state is the system killer and Google still has plenty of that. Google has yet to test their cloud infrastrucure in this way.
  5. Where will egress pricing end up once the low promotional pricing ends? Google lockin will occur if it’s expensive to transfer your data out of Google’s cloud. Google pricing in general is a bit scary.
  6. Will AWS Direct Connect be avaialble to GCE?
  7. Is GCE a target for migration or integration? BigData jobs are an obvious target for GCE, but we’ve also seen examples where real-time services benefit from GCE, so running a few select services in GCE might be a good toe in the water strategy. Concerns over data transfer costs are part of the ecosystem lockin play. Resilience alone however argues for implementing systems in more than one cloud.
  8. Amazon has a huge advantage in services. Will Google go upstack as Amazon has done? Or is this your cloud equivalent of a chance to tap the Android market while everyone else is creating apps for theiPhone?

(via HighScalability.com)

The Clever Ways Chrome Hides Latency By Anticipating Your Every Need

Ilya Grigorik has written another wonderful article lavishly detailing the extraordinary tactics Chrome employs to hide network latency from users: Chrome Networking: DNS Prefetch & TCP Preconnect. Ilya springs some surpising factoids on us, revealing how the web has slowed and super sized:

  • The size of an average page has grown to 1059kB and is now composed of over 80 subresource requests.
  • An average DNS lookup takes between 60 and 120ms. This creates a 100-200ms of latency before a request can be sent because of th full round-trip (RTT) to perform the TCP handshake.
  • Slow mobile experiences are largely due to the much higher RTT’s (200-1000ms) on wireless networks. Reducing the number of outbound connections and the total byte size of your pages is the single best optimization you can make for mobile today.

Chrome reduces apparent latency using a host of clever anticipatory mechanisms:

  • Learns the network topology as you use it via a Predictor object that anticipates user behavior and future resources based on historical browsing data, heuristics, and many other hints from the browser to anticipate the requests.
  • Speculatively pre-resolves hostnames (DNS prefetching).
  • Speculatively opens connections (TCP preconnect).
  • Uses a pool of 8 threads to resolve DNS requests through the OS. A faster resolver is in the works.
  • Resolves the first 10 most used URLs on startup.
  • Specutively connects to the search engine when it looks like a search is being typed into the omnibar.
  • Specutively connects to the most likely host when a URL is typed into the omnibar.
  • Uses a speculative PreloadScanner that looks ahead in the document and queues up remote resources before the parser sees them.
  • Learns the most visited resource domains for each hostname and preemptively resolves and preconnects to these resource hosts before the parser even sees them.
  • Actions such as hovering over a link can kick off a prefetch.
  • Adding a link element in the head of a document with rel=dns-prefetch hints to the browser to pre-resolve the DNS name of that host. A good and often overlooked reason for doing this is to pre-resolve redirects: if you know that a specific host request will return a 3XX to a different host, then you can also pre-resolve that via dns-prefetch.

These obviously sophisticated tactics may not apply directly to your application, but similar opportunities to hide latency could be hiding in your app. As Ilya says, “Small wins, but it all adds up!”

(via HighScalability.com)