Google launches Compute Engine

Google has announced plans to start offering a compute-on-demand service that rivals Amazon’s Elastic Compute (EC2) service. Google has offered many what it calls higher-level cloud services such Google storage, BigQuery and Google App Engine in the past, but now the company believes it needs to sell a more prosaic Infrastructure as a Service (IAAS) offering whose primary target is attracting more developers to Google’s cloud platform. The news of this new service was first reported by my colleague Derrick Harris in May and wasconfirmed later by me with additional details this past week.

“The Google Compute Engine, we believe, has been the missing piece,” said, Urs Hölzle, Google’s senior vice president of technical infrastructure, during a broad conversation this week. He said that building an infrastructure-as-a-service isn’t a trivial task, as the demands on such a service are quite intensive. Google has been working on this new service for some time now, he added.

The focus of the Google Compute engine is on performance, scale and value. In order to show its performance and scale, Google is planning to show off a genomic app that runs on 600,000 cores. Another app will use 10,000 virtual machines. And if that isn’t enough, the company says it will offer 50 percent more compute resources compared to other shared cloud infrastructures. Translation: It’s a shot across the bow of Amazon Web Services’ EC2 offering. (See image below for pricing)

The developers can run any stack and any software on this service. The company is partnering with third-party services such as RightScale to add more tools and services to its platform. Google is going to initially offer its service in limited preview and will sell it through its sales force if customers are looking for 100 or more cores. Eventually the service will be accessible with a credit card and a browser like most cloud services.  From Google’s blog post:

The capabilities of Google Compute Engine include:

  • Compute. Launch Linux VMs on-demand. 1, 2, 4 and 8 virtual core VMs are available with 3.75GB RAM per virtual core.
  • Storage. Store data on local disk, on our new persistent block device, or on our Internet-scale object store, Google Cloud Storage.
  • Network. Connect your VMs together using our high-performance network technology to form powerful compute clusters and manage connectivity to the internet with configurable firewalls.
  • Tooling. Configure and control your VMs via a scriptable command line tool or web UI. Or you can create your own dynamic management system using our API.

At launch, we have worked with a number of partners — such as RightScale, Puppet Labs, OpsCodeNumerateCliqr and MapR – to integrate their products with Google Compute Engine.  These partners offer management services that make it easy for you to move your applications to the cloud and between different cloud environments.

A company spokeswoman said that anyone can “sign up today, but we will be accepting customers who are focusing on larger workloads. In some cases we would accept smaller workloads as well. “ During the early phase, Google will offer Google compute service only to the U.S.-based developers, but will eventually roll out the platform to customers globally. Hölzle said that the company was using Google’s current infrastructure stack to offer the on-demand compute service.

Better late than never?

When asked if he believed that Google was a tad late to the party, Hölzle pointed out that while there have been many existing players offering cloud infrastructure services, there is ample opportunity for Google as the shift to cloud is more cyclical and long term. “This really isn’t about stealing marketing share from other players,” he said.

“I think we are early because the whole industry itself is in [its] infancy,” he said. “If you look at it, in the grand scheme of things, nearly 99 percent of the companies are not in the cloud.” Hölzle, however, said it was the right time for Google to enter the market. “More and more apps are being built for the web and mobile and the original storage and services are all moving to the cloud,” he said. The emergence of Chrome OS, Android and iPhone have led to the point where cloud clients are becoming “stateless.”

“It is very early in the market and, frankly, five years from now you will have a whole different kind of cloud and services.” He declined to outline what the cloud will look like in five years.

Hölzle was reticent about predicting the level of adoption as well, but was not shy of pointing out that Google has been in the infrastructure business for years and it is one of the key advantages for the company. “The market will show,” he said, and invited me to ask him the same question “two years from now.”  

The great (cloud) game

Despite Google’s dismissal, one can’t help but notice Amazon’s looming shadow on Google. Amazon Web Services, thanks to being an aggressive and early believer in the cloud as we know it, has carved itself a nice niche and is rumored to be bringing in over a billion dollars in revenue. But it is not just revenue that has a whole industry jealous of Amazon’s cloud business.

Success in cloud services has made Amazon attractive to startups and independent app developers, who are embracing Amazon’s stack of cloud services. These code-tinkerers are the kingmakers in this new world, especially now that Amazon has forked Android and has been pretty public about its grand mobile ambitions.

The battle for developers and locking them into cloud and mobile platforms is literally the trillion-dollar question of the 21st century. Microsoft Azure, Apple’s iCloud, Amazon Web Services and now Google Compute Engine are essentially trying to get their hooks into the developers. Frankly, I am surprised that Facebook hasn’t announced its own efforts to do the same.

Amazon, for now, is the king of the hill. At our Structure 2012 conference, when I asked Amazon CTO Werner Vogels about the next five years, he talked about a new layer of services emerging and Amazon being the trendsetter. It is a distinct advantage for the Seattle-based company that has angered its partners but has been focused on making sure it keeps pushing the envelope. He understands — and so does Google — that there is an opportunity to take away the dollars spent on IT dinosaurs such as Hewlett-Packard.

These giants of the past should be waking up with a migraine, for the entry of Google makes life tougher for them. I wonder what this news does to smaller cloud players such as Rackspace that have been inching their way toward Amazon’s heels.

Nevertheless, Amazon knows it has no time to rest on its laurels. For instance, it is not going to let Google press the price advantage for long. “If you look back, we’ve lowered pricing 20 times, so the best thing to look forward to is we’ll continue to do that,” Vogels said in our onstage conversation. “That’s at least our goal.”

And whichever way you look at it, Google’s entry into the business is a good thing for the developers and startups.

(via gigaom.com)

Amazon installs first European HPC cluster in Ireland

Amazon Web Services (AWS) has installed a cluster system at its EU West facility in Ireland, allowing customers in the region to run high performance computing (HPC) instances in the cloud.

The Cluster Compute Eight Extra Large (cc2.8xl) instances went live on Friday 22 June, the company said in a blog post. This is the first time that AWS’s HPC cloud technology has been available outside of the US.

Each instance includes a pair of Intel Xeon processors, 60.5GB of RAM, and 3.37TB of instance storage, according to AWS. Each processor has 8 cores and hyperthreading is enabled, so users can execute up to 32 threads in parallel.

The instances are connected to a 10GbE network, and can be grouped together for low-latency connectivity to other CC2 instances.

The new HPC cluster will appeal to organisations that are working with large, complex datasets, including financial services, oil and gas, life sciences, social gaming, advertising, e-commerce and media, according to Matt Wood, product manager for high performance and data intensive computing at Amazon Web Services.

“The arrival of the cc2.8xlarge instance size in the EU West region allows customers who store data in that geography to compute, analyse, ask questions and find insight from that data using high performance Intel Xeon E5 processors,” said Wood.

“This brings the power of a purpose designed HPC environment with 10GbE fully bisectional networking to customers with the same on-demand, utility metering offered by other AWS services, providing faster turn arounds, larger computational runs and a shorter time to market.”

The CC2 instances in Ireland are priced at $2.70 (£1.73) per hour for Linux and $2.97 (£1.90) per hour for Windows, said Amazon.

Speaking at the launch of Intel’s Xeon E5 processor family earlier this year, AWS technology evangelist Dr. Matt Wood said that we are now entering the era of utility supercomputing, whereby anyone can dial up computational resources and massive storage requirements on the fly.

He said that scientific and financial organisations with massive computational demands will increasingly be able to rent resources from the cloud to be able to do their work – whether it happens to be product modelling, simulation or informatics – without having to invest in huge infrastructure.

“Traditionally these organisations would have to provision for 10-15 percent over the peaks in demand, but the cloud allows for bursty scalability, lowering the barriers to entry and allowing them to spend at least 70 percent of their time on differentiated work, rather than keeping the lights on,” he said at the time.

Amazon said that cc2.8xl instances had formed the basis of a computing cluster which clocked a maximum speed of 240 teraflops. The cluster, which contains 17,024 cores and 65.968 TB of RAM currently ranks number 72 on the Top500 list.

(via Techworld.com)

StubHub Architecture: The Surprising Complexity Behind The World’s Largest Ticket Marketplace

StubHub is an interesting architecture to take a look at because, as market makers for tickets, they are in a different business than we normally get to consider.

StubHub is surprisingly large, growing at 20% a year, serving 800K complex pages per hour, selling 5 million tickets per year, and handling 2 million API calls per hour.

And the ticket space is surprisingly rich in complexity. StubHub’s traffic is tricky. It’s bursty, centering around unpredictable game outcomes, events, schedules, and seasons. There’s a lot of money involved. There are a lot of different actors involved. There are a lot of complex business processes involved. And StubHub has several complementary but very different parts of their business: they have an ad server component serving ads to sites like ESPN, a rich interactive UI, and a real-time ticket market component.

Most interesting to me is how StubHub is bringing into the digital realm the once quintessentially high-touch physical world of tickets, point-of-sale systems, FedEx delivery, buyers and sellers, and money. They are making it happen with deep electronic integration into organizations (like Major League Baseball) and a Lifecycle Bus that moves complex business processes out of the application space.

It’s an interesting problem made more complex by having to move forward while dealing with legacy systems built when getting business building features out the door was the priority. Let’s see how StubHub makes it all work…

Source

Business Model

  • StubHub is an eBay for tickets. Providing a marketplace for buyers and sellers of tickets. They are not like Ticketmaster.
  • An escrow model is used to provide trust and safety for  buyers a sellers. A credit card is on file at StubHub, when the order is confirmed as received only then is the money transferred.
  • Tickets are not commodities. Buyers want specific tickets, not a bleacher seat, for example. There are limited quantities, there is no backorder for a ticket. Sellers are constantly updating prices and quantity in reaction to market forces. It’s a very active market.

Statistics

  • 5 Million orders per year
  • 2 Million active listings/tickets.
  • Tickets revolve around 45k events. Major League Baseball post season and the Super Bowl are the most active periods.
  • Sales growing at 20% per year.
  • Serves 600k – 800k complex pages per hour. Burst to a million per hour during post season.
  • Traffic is very bursty in a narrow window of the 10-12 core waking hours in the US. When a playoff game finishes, for example, there’s a buyer frenzy for tickets for the next game.
  • 2 million API calls from affiliates per hour.
  • 24-36 engineers.
  • It’s a high touch business. Support call to transaction was 1-1 in the past, but is a lot better now. Biggest part of staff is customer service and back office business operations support.

Platform

  • Cold Fusion
  • ActiveMQ
  • SEDA (Staged Event-Driven Architecture)
  • Lucene/Solr
  • Jboss
  • Memcache
  • Infiniband - fast network to SAN
  • XSL
  • Advanced Queuing from Oracle
  • TeamWorks - An IBM workflow builder
  • Splunk
  • Apache HttpClient
  • Log4j (using message format)
  • Tapestry

Architecture

  • Three sources of ticket purchases: Web, Point of Sale Systems, Bulk Upload. Bulk upload allows sheets of tickets to be uploaded into the system.
  • A Manager layer provides a business object abstraction above the Ticket database. It mediates all conversations with the tickets table.
  • The Ticket table gets hammered with the activity of buyers and sellers and the natural traffic burstiness around events.
  • An active market can cause the mobile clients to become out of date with the current state of the system so buyers are reacting to old data.
  • Two data pumps feed data from the Tickets database to internal and external systems: My Account, Find, and a Public Feed. My Account is an interface to a user’s account. Find is a search capability. The Public Feeds powers sites like ESPN and eBay.
    • Internal Feed – Contains sensitive information, like account information, that is used for dashboards, who sellers are, what sales are happening, sales velocities, heat maps, etc.  It also feeds parts of the home page that are based on sensitive market data, like what’s hot, that they don’t want to share with the public.
    • External Feed (LCS) - Advertisers, like ESPN, are fed through this feed. Ads aregeo mapped by IP address.
  • LCS (Listing Catalog Service):
    • My apologies here, the slides were a little off and it was difficult to  match the speaker with the presentation. All errors are mine.
    • Triggers are used to keep a modification table up to do date with changes as they occur in the database.
    • A Change Data Capture job constantly polls for changes and injects message into an ActiveMQ broker. This is the first level of routing and contains small payloads: object ID, object type, and change date.
    • Change data is routed to the Master which is the fundamental mechanism for replicating between datacenters. A secondary datacenter subscribes to these topics and that’s how data is replicated between datacenters.
    • Once in the Master data is injected into a SEDA queue for processing (more on that later).
    • The Manager is not used because many legacy systems exist that don’t use the Manager and go directly to the database, so the database is the common point to distribute deltas.
    • Most of their ads are flash, but some use HTML rendering.
    • Shopping experiences like selecting tickets from an interactive graphic of the stadium are powered by LCS. Solr makes adding new features like this pretty easy.
  • SEDA (Staged Event-Driven Architecture) use in the Master
    • SEDA is a technique for smoothing out load. It has been effective for them from a resource management perspective. The idea is to break work down into small enough pieces that slow pieces of work won’t steal threads from other users. Work is modeled in stages and each stage has its own thread pool, which acts as a throttle on work.
    • The Master receives the small updates and figures out how to build the thing that goes into memcache for eventual routing to Solr.
    • Messages are cached using a protocol buffer format in memcached.
    • Messages are sent to a second broker which disperses messages out to the edge, which is Lucene/Solr.
    • A message consumer receives messages from a broker
      • Loads the entity from the database
      • Determine if there are any cascading impacts from the update. Because Solrand other NoSQL databases don’t do joins, if a band name is changed, for example, that change must be propagated to all events.
      • Serialize entity. Stores it in memcache.
      • Sends a message to a second ActiveMQ broker which routes messages to the edges, which is Solr.
      • The broker is listened in on by a seperate process that does the routing. This was once in Jboss, but Jboss would get overwhelmed and they ran into starvation issues so they moved it out of Jboss. This listener became a useful valve in the system for operational management. If they needed to swap out a new Solr index to put in a new schema, they could turn the valve off, let messages back up in the message broker, and open up the valve again an messages would start flowing again. The valves had a huge impact on their operational stability when recovering from Solr failures, copying indexes around, and updating schemas.
    • All these are blocking operations so using the thread pools prevents the storming of database connections.
  • Solr
    • Solr provides a nice document store and nature language text query feature.
    • All search is built on Solr, including faceting.
    • FAST. Queries returned in 10msec or less. They use an Infiniband network to their SAN and found they didn’t need to cache data in memory, they could it serve it off their SAN fast enough with the fast network.
    • POTENT. Flexible query language, full text search, and geo-special searching.
    • Supports many output formats: XML, Atom, RSS, CSV, Json. Makes it easier to integrate with various clients.
    • Not so great with high frequency writes. Replication doesn’t seem well integrated under high frequency writes. Doing an rsync when you are seeing 10s of thousands of changes doesn’t work. You get really stale data. So they had to build out their own replication mechanism.
    • Flat data structures. Still pretty much row oriented. They would like support for documents with more complex structures.
  • DCL - Double Click intro Browse
    • URLs are mapped onto IDs: genre ID, geo ID, render type ID.
    • Genre ID and Geo ID are used by DCL, which is just XSL, to create queries forLCS. The returned data can then be generically rendered. So all football teams can have a similar page generated for them with the exact same structure using URL mapping, XSL, and LCS.
    • Huge productivity gain, it makes it much easier to add new features. Each block in a page is an individual asset that is managed via their CMS. They are chunks ofXSL that get rendered against a context document which is retrieved from LCS.
    • A RenderChunkByName call makes it easy to render events on other services likeFacebook.
    • Doing all this on the backend for SEO purposes.  When search engines can index Ajax they may not need to do this.
  • Edge caching of gifs, style sheets, etc, but  data is cached on their servers.
  • Reducing the number of customer interactions per transaction:
    • Customer interaction is the biggest drain on their operating income. When things go bad it takes a lot of working with buyers and sellers to straighten problems out.
    • Increasing customer self service. Customers want to know when they get their money and tickets. The MyAccount screens allow customers to check on the progress of their order without using customer service.
    • Expose APIs to sellers so they can integrate these features into their systems.
    • IVR - integrated voice recognition systems support customers calling up to find the state of their account.
    • Cash flow is important to vendors so they work on getting payments out quicker.
    • Electronic integration with Major League Baseball so they can transfer a ticket directly to buyer from a seller before the seller has physical possession of the tickets. Instant delivery and huge improvement in customer satisfaction and removal of failure points.
  • Lifecycle Bus
    • Used to prevent hardwiring complex workflow in applications. A check-out application doesn’t want to have to worry about all the different business processes living downstream. You just want to worry about validated credit cards and ordering, not things like fulfillment and email.
    • Useful in dealing with legacy issues and managing changes to the site.
    • There are topics for all the major lifecycle events. Agents listen on Oracle’s queue for topics. When an order is cut it goes into an unconfirmed state. A listener listens on the “unconfirmed” topic and emails the seller to come to the site and confirm the order. When the seller confirms the order there agents that will capture the money from the buyer’s escrow account, email the buyer saying the seller has confirmed and where to find the tickets. When the tickets are confirmed payment is released to the seller.
    • All this logic is decoupled from the end user facing part of the web site. These are all backoffice engines.
    • TeamWorks models these processes, finds weakness, monitors the processes, check SLAs, and triggers actions. Helps better optimize backoffice business processes. As they grow 20% per year they don’t want to grow the operations team 20% year over year.
    • FedEx was the original mode of fulfillment. Electronic Fulfillment was added. The business process looks like: unconfirmed -> auto confirm; confirmed -> barcode reissue and Disburse PDF; fulfilled. All you have to do is write agents that live off the same order lifecycle to implement new fulfillment modes. This logic is not in the application. It’s in the agent and is an individually deployable and testable unit.
    • Fraud avoidance. Uses the same lifecycle model for fulfillment, but adds two new states: purchased and approved. They didn’t have to change anything to add fraud avoidance. They just had to change the state machine. The agent decided to move it to an approved state or not approved state.
  • Point-of-Sale System Integration
    • Uses a two phase commit: reserve a ticket on the external system, mark it as claimed in StubHub, commit purchase on external system.
    • Looking to generalize this so other systems can buy tickets as part of their transactions. Bundle a ticket with a trip or hotel purchase.
  • Splunk and Dye
    • One of the biggest ROI projects at StubHub. Saves so much time debugging problems.
    • Dye – an artifact is placed into the HTTP header of each request.
    • These are logged using Log4j.
    • Using Splunk, if there’s a problem with an order, you can look at the log lines using Dye markers to pull back and see all the calls that were part of the request, including secondary calls to other services like LCS. It’s easy to retrace activity.
    • Really like Splunk. It’s like a document store for log lines. Put dye marker and order IDs into log lines, like a series of key-value pairs, and Splunk makes it really easy to look at logs. Their dashboards are written using Splunk, for stats like transactions per minute, failures per minute. You can slice and dice the data anyway you want.
    • Use Log4j with message format so it won’t do dynamic string creations.

Lessons Learned

  • Scalability is specialization. Every problem space has its unique characteristics and any system must be built to solve that particular problem. StubHub is constrained by the need for a safe buying experience, the unique nature of the ticket market, burstytraffic, and the vagaries of events. Their system must reflect those requirements.
  • Use an abstraction layer from the start. Otherwise you’ll be stuck supporting legacy clients far past your tolerance point.
  • Bake-off in production. Implement multiple solutions and have a bake-off in production to determine which version works better. StubHub tested two different data pump versions in production to see which was a better fit. You don’t want to support multiple infrastructures.
  • Move work out of Jboss. Huge numbers of requests could cause Jboss to starve, so they moved work outside of Jboss.
  • Percolate dye through the causal chain. Instrument requests so that a request can be traced through an entire stack. It’s a huge ROI to be able to debug problems across the stack.
  • Optimize business processes. Electronic integration between systems.  By acting as the coordinator between sellers, buyers, and MLB, they were able to increase customer satisfaction and remove a large number of possible failure points in the transaction. The bought out a popular point-of-resale program so they could integrate with it.
  • Build on your own APIs. They are spending a lot of time trying to build their own application on their own APIs so they can better govern the experience for their users and partners.
  • Define assets generically. Defining assets such that they could be rendered on any context made it easy to compose pages in different formats and render events on other websites.
  • Maximize development from an ROI perspective. Look for projects that improve developer ROI. Using Solr was a huge win for them. It’s easy to use, fast, and very functional, many types of queries could be satisfied off the shelf.
  • SEDA is good for blocking reads. A lot of their system is based on blocking reads, so SEDA is a good fit for that use case. Threads pools prevent overwhelming the resources they are trying to use.
  • Render on the client. For really cool interactive maps like stadium maps, handling allthe UI interaction on the client will save a lot of server load. Even for larger events with 10-20K listings, it was a better solution to download the whole listing.
  • Prefer thin over heavy frameworks. Heavy frameworks are easy to use and abuse. Hiding complexity makes you lose control over your site really fast. Examples: Hibernate and Component-based frameworks. Validation and business logic can leak into the presentation layer. Make a conscious decision. Understand what you are biting off in terms of a legacy issues.
  • Bad experiences are the best training. There’s nothing like failing to teach how to do things right. The guys that started StubHub were building a business by getting features out as fast as possible, but this left a legacy. The key to managing a legacy is managing the dependencies. Using agent based Lifecycle Bus style solution helps them understand the dependencies on legacy systems.
  • Decouple state machines from applications using workflow. Don’t embed complex flows in application logic. Externalize logic so business processes can be couples together in more flexible ways. This makes the system infinitely more flexible and adaptable going forward.
  • Avoid ETL. It introduces a lot of dependencies you would rather not deal with. It’s a risk. A legacy data model can really suck away resource when you are trying to figure out if a change will great they system you financially depend on.
  • Don’t short-change CM and deployment. Current biggest waste of time for developers now. It’s very painful. Invest now in your CM and deployment system.
  • Invest in continuous improvement. It does not come for free. Run a post-mortem on projects. Make sure issues don’t come up again. As your company grows this stuff does not scale. Make the right decisions now or it will take 3x-5x more in the future to fix.
  • Build operational valves into the system. If, for example, you need to swap out a new schema, have a valve where you can turn off events and the restart them again.

(via HighScalability.com)