What is a Monolith?

There is currently a strong trend for microservice-based architectures and frequent discussions comparing them to monoliths. There is much advice about breaking up monoliths into microservices and also some amusing fights between proponents of the two paradigms – see the great Microservices vs Monolithic Melee. The term ‘Monolith’ is increasingly being used as a generic insult in the same way that ‘Legacy’ is!

However, I believe that there is a great deal of misunderstanding about exactly what a ‘Monolith’ is and those discussing it are often talking about completely different things.

A monolith can be considered an architectural style or a software development pattern (or anti-pattern if you view it negatively). Styles and patterns usually fit into different Viewtypes (a viewtype is a set, or category, of views that can be easily reconciled with each other [Clements et al., 2010]) and some basic viewtypes we can discuss are:

  • Module – The code units and their relation to each other at compile time.
  • Allocation – The mapping of the software onto its environment.
  • Runtime – The structure of the software elements and how they interact at runtime.

A monolith could refer to any of the basic viewtypes above.

Module Monolith

If you have a module monolith then all of the code for a system is in a single codebase that is compiled together and produces a single artifact. The code may still be well structured (classes and packages that are coherent and decoupled at a source level rather than a big-ball-of-mud) but it is not split into separate modules for compilation. Conversely, a non-monolithic module design may have code split into multiple modules or libraries that can be compiled separately, stored in repositories and referenced when required. There are advantages and disadvantages to both, but this tells you very little about how the code is used – the split is primarily done for development management.

[Figure: Module monolith]

Allocation Monolith

For an allocation monolith, all of the code is shipped/deployed at the same time. In other words, once the compiled code is ‘ready for release’, a single version is shipped to all nodes. All running components have the same version of the software at any point in time. This is independent of whether the module structure is a monolith. You may have compiled the entire codebase at once before deployment OR you may have created a set of deployment artifacts from multiple sources and versions. Either way, this version of the system is deployed everywhere at once (often by stopping the entire system, rolling out the software and then restarting).

A non-monolithic allocation would involve deploying different versions to individual nodes at different times. This is again independent of the module structure as different versions of a module monolith could be deployed individually.

[Figure: Allocation monolith]

Runtime Monolith

A runtime monolith will have a single application or process performing the work for the system (although the system may have multiple, external dependencies). Many systems have traditionally been written like this (especially line-of-business systems such as Payroll, Accounts Payable, CMS etc).

Whether the runtime is a monolith is independent of whether the system code is a module monolith or not. A runtime monolith often implies an allocation monolith if there is only one main node/component to be deployed (although this is not the case if a new version of software is rolled out across regions, with separate users, over a period of time).

[Figure: Runtime monolith]

Note that my examples above are slightly forced to fit the viewtypes, and the distinctions won’t be as hard-and-fast in the real world.

Conclusion

Be very careful when arguing about ‘Microservices vs Monoliths’. A direct comparison is only possible when discussing the Runtime viewtype and properties. You should also not assume that moving away from a Module or Allocation monolith will magically enable a Microservice architecture (although it will probably help). If you are moving to a Microservice architecture then I’d advise you to consider all these viewtypes and align your boundaries across them i.e. don’t just code, build and distribute a monolith that exposes subsets of itself on different nodes.

(Via Codingthearchitecture.com)

Async Fragments: Rediscovering Progressive HTML Rendering with Marko

At eBay, we take site speed very seriously and are always looking for ways to allow developers to create faster-loading web apps. This involves fully understanding and controlling how web pages are delivered to web browsers. Progressive HTML rendering is a relatively old technique that can be used to improve the performance of websites, but it has been lost in a whole new class of web applications. The idea is simple: give the web browser a head start in downloading and rendering the page by flushing out early and multiple times. Browsers have always had the helpful feature of parsing and responding to the HTML as it is being streamed down from the server (even before the response is ended). This feature allows the HTML and external resources to be downloaded earlier, and for parts of the page to be rendered earlier. As a result, both the actual load time and the perceived load time improve.

In this blog post, we will take an in-depth look at a technique we call “Async Fragments” that takes advantage of progressive HTML rendering to improve site speed in ways that do not drastically complicate how web applications are built. For concrete examples we will be using Node.js, Express.js and the Marko templating engine (a JavaScript templating engine that supports streaming, flushing, and asynchronous rendering). Even if you are not using these technologies, this post can give you insight into how your stack of choice could be further optimized.

To see the techniques discussed in this post in action, please take a look at the accompanying sample application.

Background

Progressive HTML rendering is discussed in the post The Lost Art of Progressive HTML Rendering by Jeff Atwood, which was published back in 2005. In addition, the “Flush the Buffer Early” rule is described by the Yahoo! Performance team in their Best Practices for Speeding Up Your Web Site guide. Stoyan Stefanov provides an in-depth look at progressive HTML rendering in his Progressive rendering via multiple flushes post. Facebook discussed how they use a technique they call “BigPipe” to improve page load times and perceived performance by dividing up a page into “pagelets.” Those articles and techniques inspired many of the ideas discussed in this post.

In the Node.js world, its most popular web framework, Express.js, unfortunately recommends a view rendering engine that does not allow streaming and thus prevents progressive HTML rendering. In a recent post, Bypassing Express View Rendering for Speed and Modularity, I described how streaming can be achieved with Express.js; this post is largely a follow-up to discuss how progressive HTML rendering can be achieved with Node.js (with or without Express.js).

Without progressive HTML rendering

A page that does not utilize progressive HTML rendering will have a slower load time because the bytes will not be flushed out until the complete HTML response is built. In addition, after the client finally receives the complete HTML it will then see that it needs to download additional external resources (such as CSS, JavaScript, fonts, and images), and downloading these external resources will require additional round trips. Pages that do not utilize progressive HTML rendering will also have a slower perceived load time, since the screen will not update until the complete HTML is downloaded and the CSS and fonts referenced in the <head> section are downloaded. Without progressive HTML rendering, a server/client waterfall chart might be similar to the following:

[Figure: Waterfall chart without progressive HTML rendering]

The corresponding page controller might look something like this:

var async = require('async'); // the 'async' utility library used for parallel loading

function controller(req, res) {
    async.parallel([
            function loadSearchResults(callback) {
                ...
            },
            function loadFilters(callback) {
                ...
            },
            function loadAds(callback) {
                ...
            }
        ],
        function() {
            // All data has been loaded; render the full page in one shot
            ...
            var viewModel = { ... };
            res.render('search', viewModel);
        });
}

As you can see in the above code, the page HTML is not rendered until all of the data is asynchronously loaded.

Because the HTML is not flushed until all back-end services are completed, the user will be staring at a blank screen for a large portion of the time. This will result in a sub-par user experience (especially with a poor network connection or with slow back-end services). We can do much better if we flush part of the HTML earlier.

Flushing the head early

A simple trick to improve the responsiveness of a website is to flush the head section immediately. The head section will typically include the links to the external CSS resources (i.e. the <link> tags), as well as the page header and navigation. With this approach the external CSS will be downloaded sooner and the initial page will be painted much sooner, as shown in the following waterfall chart:

[Figure: Waterfall chart with the head flushed early]

As you can see in the chart above, flushing the head early reduces the time to render the initial page. This technique improves the responsiveness of the page, but it does not significantly reduce the total time it takes to make the page fully functional. With this approach, the server is still waiting for all back-end services to complete before flushing the final HTML. In addition, downloading of external JavaScript resources will be delayed since <script> tags are placed at the end of the page (assuming you are following best practices) and don’t get sent out until the second and final flush.
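
To make this concrete, here is a minimal sketch (my own illustration, not from the original article) of a controller that flushes the head early. The renderHead/renderBody helpers and the parallel tasks are stand-ins for real templates and back-end calls:

// A minimal sketch: split the page into a "head" part that needs no data and
// a "body" part that does, then flush the head immediately.
var async = require('async');

function renderHead() {
    return '<html><head><title>Search</title>' +
           '<link rel="stylesheet" href="/css/site.css"></head>' +
           '<body><header>Site header / navigation</header>';
}

function renderBody(viewModel) {
    return '<main>' + viewModel.resultsHtml + '</main></body></html>';
}

function controller(req, res) {
    res.setHeader('Content-Type', 'text/html; charset=utf-8');

    // First flush: the <head> (CSS <link> tags) plus header/navigation.
    // res.write() sends this chunk right away using chunked transfer encoding,
    // so the browser can start downloading CSS while the back-end calls run.
    res.write(renderHead());

    async.parallel({
        searchResults: function(callback) { /* call search service  */ callback(null, []); },
        filters:       function(callback) { /* call filters service */ callback(null, []); },
        ads:           function(callback) { /* call ads service     */ callback(null, []); }
    }, function(err, data) {
        // Second and final flush: the rest of the page.
        res.end(renderBody({ resultsHtml: JSON.stringify(data.searchResults) }));
    });
}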

Multiple flushes

Instead of flushing only the head early, it is often beneficial to flush multiple times before ending the response. Typically, a page can be divided into multiple fragments where some of the fragments may depend on data asynchronously loaded from various back-end services while others may not depend on any asynchronously loaded data. The fragments that depend on asynchronously loaded data should be rendered asynchronously and flushed as soon as possible.

For now, we will assume that these fragments need to be flushed in the proper HTML order (versus the order that the data asynchronously loads), but we will also show how out-of-order flushing can be used to further improve both page load times and perceived performance. When using “in-order” flushing, fragments that complete out of order will need to be buffered until they are ready to be flushed in the proper order.

In-order flushing of async fragments

As an example, let’s assume we have divided a complex page into the following fragments:

[Figure: A complex page divided into numbered fragments]

Each fragment is assigned a number based on the order that it appears in the HTML document. In code, our output HTML for the page might look like the following:

<html>
<head>
    <title>Clothing Store</title>
    <!-- 1a) Head <link> tags -->
</head>
<body>
    <header>
        <!-- 1b) Header -->
    </header>
    <div class="body">
        <main>
            <!-- 2) Search Results -->
        </main>
        <section class="filters">
            <!-- 3) Search filters -->
        </section>
        <section class="ads">
            <!-- 4) Ads -->
        </section>
    </div>
    <footer>
        <!-- 5a) Footer -->
    </footer>
    <!-- 5b) Body <script> tags -->
</body>
</html>

The Marko templating engine provides a way to declaratively bind template fragments to asynchronous data provider functions (or Promises). An asynchronous fragment is rendered when the asynchronous data provider function invokes the provided callback with the data. If the asynchronous fragment is ready to be flushed, then it is immediately flushed to the output stream. Otherwise, if the asynchronous fragment completed out of order then the rendered HTML is buffered in memory until it is ready to be flushed. The Marko templating engine ensures that fragments are flushed in the proper order.

Continuing with the previous example, our HTML page template with asynchronous fragments defined will be similar to the following:

<html>
<head>
    <title>Clothing Store</title>
    <!-- Head <link> tags -->
</head>
<body>
    <header>
        <!-- Header -->
    </header>
    <div class="body">
        <main>
            <!-- Search Results -->
            <async-fragment data-provider="data.searchResultsProvider"
                var="searchResults">
 
                <!-- Do something with the search results data... -->
                <ul>
                    <li for="item in searchResults.items">
                        $item.title
                    </li>
                </ul>
 
            </async-fragment>
        </main>
        <section class="filters">
 
            <!-- Search filters -->
            <async-fragment data-provider="data.filtersProvider"
                var="filters">
                <!-- Do something with the filters data... -->
            </async-fragment>
 
        </section>
        <section class="ads">
 
            <!-- Ads -->
            <async-fragment data-provider="data.adsProvider"
                var="ads">
                <!-- Do something with the ads data... -->
            </async-fragment>
 
        </section>
    </div>
    <footer>
        <!-- Footer -->
    </footer>
    <!-- Body <script> tags -->
</body>
</html>

The data provider functions should be passed to the template as part of the view model as shown in the following code for a sample page controller:

function controller(req, res) {
    template.render({
            searchResultsProvider: function(callback) {
                performSearch(req.params.category, callback);
            },
 
            filtersProvider: function(callback) {
                ...
            },
 
            adsProvider: function(callback) {
                ...
            }
        },
        res /* Render directly to the output HTTP response stream */);
}

In this particular example, the “search results” async fragment appears first in the HTML template, and it happens to take the longest time to complete. As a result, all of the subsequent fragments will need to be buffered on the server. The resulting waterfall with in-order flushing of async fragments is shown below:

[Figure: Waterfall chart with in-order flushing of async fragments]

While the performance of this approach might be fine, we can enable out-of-order flushing for further performance gains as described in the next section.

Out-of-order flushing of async fragments

Marko achieves out-of-order flushing of async fragments by doing the following:

  • Instead of waiting for an async fragment to finish, a placeholder HTML element with an assigned id is written to the output stream.
  • Out-of-order async fragments are rendered before the ending </body> tag in the order that they complete.
  • Each out-of-order async fragment is rendered into a hidden <div> element.
  • Immediately after the out-of-order fragment, a <script> block is rendered to replace the placeholder DOM node with the DOM nodes of the corresponding out-of-order fragment.
  • When all of the out-of-order async fragments complete, the remaining HTML (e.g. </body></html>) is flushed and the response is ended.

To clarify, here is what the output HTML might look like for a page with out-of-order flushing enabled:

<html>
<head>
    <title>Clothing Store</title>
    <!-- 1a) Head <link> tags -->
</head>
<body>
    <header>
        <!-- 1b) Header -->
    </header>
    <div class="body">
        <main>
            <!-- 2) Search Results -->
            <span id="asyncFragment0Placeholder"></span>
        </main>
        <section class="filters">
            <!-- 3) Search filters -->
            <span id="asyncFragment1Placeholder"></span>
        </section>
        <section class="ads">
            <!-- 4) Ads -->
            <span id="asyncFragment2Placeholder"></span>
        </section>
    </div>
    <footer>
        <!-- 5a) Footer -->
    </footer>
 
    <!-- 5b) Body <script> tags -->
 
    <script>
    window.$af=function(){
    // Small amount of code to support rearranging DOM nodes
    // Unminified:
    // https://github.com/raptorjs/marko-async/blob/master/client-reorder-runtime.js
    };
    </script>
 
    <div id="asyncFragment1" style="display:none">
        <!-- 4) Ads content -->
    </div>
    <script>$af(1)</script>
 
    <div id="asyncFragment2" style="display:none">
        <!-- 3) Search filters content -->
    </div>
    <script>$af(2)</script>
 
    <div id="asyncFragment0" style="display:none">
        <!-- 2) Search results content -->
    </div>
    <script>$af(0)</script>
 
</body>
</html>

One caveat with out-of-order flushing is that it requires JavaScript running on the client to move each out-of-order fragment into its proper place in the DOM. Thus, you would only want to enable out-of-order flushing if you know that the client’s web browser has JavaScript enabled. Also, moving DOM nodes may cause the page to be reflowed, which can be visually jarring to the user and result in more client-side CPU usage. If reflow is an issue then there are tricks that can be used to avoid a reflow (e.g., reserving space as part of the initial wireframe). Marko also allows alternative content to be shown while waiting for an out-of-order async fragment.

To enable out-of-order flushing with Marko, the client-reorder="true" attribute must be added to each <async-fragment> tag, and the <async-fragments> tag must be added to the end of the page to serve as the container for rendered out-of-order fragments. Here is the updated <async-fragment> tag for the search results fragment:

<async-fragment data-provider="data.searchResultsProvider"
    var="searchResults"
    client-reorder="true">
    ...
</async-fragment>

The updated HTML page template with the new <async-fragments> tag is shown below:

<html>
<head>
    <title>Clothing Store</title>
    <!-- Head <link> tags -->
</head>
<body>
    ...
 
    <!-- Body <script> tags -->
 
    <async-fragments/>
</body>
</html>

In combination with out-of-order flushing, it may be beneficial to move <script> tags that link to external resources to the end of the first chunk (before all of the out-of-order chunks). While the server is busy preparing the rest of the page, the client can start downloading the external JavaScript required to make the page functional. As a result, the user will be able to start interacting with the page sooner.

Our final waterfall with out-of-order flushing will now be similar to the following:

[Figure: Waterfall chart with out-of-order flushing of async fragments]

The final waterfall shows that the strategy of out-of-order flushing of asynchronous fragments can significantly improve the load time and perceived load time of a page. The user will be met with a progressive loading of a page that is ready to be interacted with sooner.

Additional considerations

HTTP transport and HTML compression

To allow HTML to be served in parts, chunked transfer encoding should be used for the HTTP response. Chunked transfer encoding uses delimiters to break up the response, and each flush results in a new chunk. If gzip compression is enabled (and it should be) then flushing the pending data to the gzip stream will result in a gzip data frame being written to the response as part of each chunk. Flushing too often will negatively impact the effectiveness of the compression algorithm, but without periodic flushing progressive HTML rendering will not be available. By default, Marko will flush at the beginning of an <async-fragment> block (in order to send everything that has already completed), as well as when an async fragment completes. This default strategy results in efficient progressive loading of an HTML page as long as there are not too many async fragments.
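
For illustration, here is a minimal Express sketch of the flushing behavior described above; it assumes the compression middleware (which adds res.flush()) and a hypothetical loadSearchResultsHtml back-end call:

var express = require('express');
var compression = require('compression');

var app = express();
// gzip the response; res.flush() (added by this middleware) forces the
// pending compressed data out as its own chunk of the chunked HTTP response.
app.use(compression());

app.get('/search', function(req, res) {
    res.setHeader('Content-Type', 'text/html; charset=utf-8');

    res.write('<html><head><link rel="stylesheet" href="/css/site.css"></head><body>');
    res.flush(); // first chunk: head flushed early

    loadSearchResultsHtml(function(err, html) {
        res.write(html || '<p>Search is unavailable right now.</p>');
        res.flush();               // second chunk: the async fragment
        res.end('</body></html>'); // final chunk ends the response
    });
});

// Hypothetical stand-in for a back-end service call.
function loadSearchResultsHtml(callback) {
    setTimeout(function() { callback(null, '<main>results...</main>'); }, 200);
}

app.listen(8080);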

Binding behavior

For improved usability and responsiveness, there should not be a long delay between rendered HTML being displayed to the user in the web browser and behavior being attached to the associated DOM. At eBay, we use the marko-widgets module to bind behavior to DOM nodes. Marko Widgets supports binding behavior to rendered widgets immediately after each independent async fragment, as illustrated in the accompanying sample app. For immediate binding to work, the required JavaScript code must be included earlier in the page. For more details, please see the marko-widgets module documentation.

Error handling

It is important to note that as soon as a byte is flushed for the HTTP body, the response is committed; no additional HTTP headers can be sent (e.g., no server-side redirects or cookie-setting), and the HTML that has been sent cannot be “unsent”. Therefore, if an asynchronous data provider errors or times out, the app must be prepared to show alternative content for that particular async fragment. Please see the documentation for the marko-async module for additional information on how to show alternative content in case of an error.
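
As a rough sketch of the kind of defensive data provider this implies (my own illustration; performSearch is a hypothetical back-end call and the 2-second budget is arbitrary):

// A data provider that never leaves the fragment hanging: it returns either
// real data, or empty fallback data after a timeout or an error.
function searchResultsProvider(callback) {
    var finished = false;

    function finish(data) {
        if (finished) { return; }
        finished = true;
        clearTimeout(timer);
        callback(null, data);
    }

    var timer = setTimeout(function() {
        finish({ items: [], degraded: true }); // timed out: render fallback content
    }, 2000);

    performSearch(function(err, results) {
        if (err) {
            // The response may already be committed, so no redirect or error
            // page: just render the fragment with fallback data.
            return finish({ items: [], degraded: true });
        }
        finish(results);
    });
}

// Hypothetical stand-in for the real search back-end call.
function performSearch(callback) {
    setTimeout(function() { callback(null, { items: [] }); }, 100);
}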

Summary

The Async Fragments technique allows web developers to maximize the benefits of progressive HTML rendering to produce web pages that have improved actual and perceived load times. Developers at eBay have found the concept of binding HTML template fragments to asynchronous data providers easy to grasp and utilize. In addition, the flexibility to support both in-order and out-of-order flushing of async fragments makes this technique applicable for all web browsers and user agents.

The Marko templating engine is being used as part of eBay’s latest Node.js stack to improve performance while also simplifying how pages are constructed on both the server and the client. Marko is one of a few templating engines for Node.js and web browsers that support streaming, flushing, and asynchronous rendering. Marko has a simple HTML-based syntax, and the Marko compiler produces small and efficient JavaScript modules as output. We encourage you to try Marko online and in your next Node.js project. Because Marko is a key component of eBay’s internal Node.js stack, and given that it is heavily documented and tested, you can be confident that it will be well supported.

(Via Calendar.perfplanet.com)

HTTP 2.0 is coming!

Why do we need another version of the HTTP protocol?

HTTP has been in use by the World-Wide Web global information initiative since 1990. However, it is December 2014 and we no longer have simple pages of cross-linked HTML documents as we used to. Instead, we have Web applications, some of them very heavy and requiring a lot of resources. And unfortunately, the version of the HTTP protocol currently in use – 1.1 – has issues.

HTTP is actually very simple – the browser sends a request to the server, the server provides the response, and that is it. Very simple, but if you check the chart below you’ll see that there is not just one request and one response, but multiple requests and responses – about 80–100 requests and 1.8 MB of data:

[Chart: requests and transfer size for an average page]

Data provided by HTTP Archive.

Now, imagine we have a server in Los Angeles and our client is in Berlin, Germany. All those 80–100 requests have to travel from Berlin to L.A. and back. That is not fast – for example, the round-trip time between London and New York is about 56 ms, and from Berlin to Los Angeles it is even more. And as we know, first page load is latency bound; latency is the constraining factor for today’s applications.

In order to speed up downloading the needed resources, browsers currently open multiple connections to the server (typically 6 per domain). However, opening a connection is expensive – there is domain name resolution, the socket connect, more round trips if TLS has to be established, and so on. From the browser’s point of view, this also means more consumed memory, connection management, and heuristics for when to open a new connection and when to reuse an existing one.

Web engineers have also tried to make sites load as fast as possible. They invented many different workarounds (aka “optimizations”) like image sprites, domain sharding, resource inlining, concatenating files, combo services and so on. Inventing more and more tricks may work up to a point, but what if we could fix the issues at the protocol level and avoid these workarounds?

HTTP/2 in a nutshell

HTTP/2 is a binary protocol in which the browser and the server exchange frames. Each frame belongs to a stream, identified by a stream identifier. The key point is that these streams are multiplexed; they have priorities, the priorities are specified by the client (browser), and they can be changed at runtime. A stream can depend on another stream.

In contrast to HTTP/1.1, in HTTP/2 the headers are compressed. A special algorithm, called HPACK, was invented for that purpose.

Server push is a feature of HTTP/2 which deserves special attention. Web developers have actually been implementing the same idea for years – the inlining of resources mentioned above is an example of that. Since this is now at the protocol level, instead of embedding CSS and JavaScript files or images directly in the page, the server can explicitly push these resources to the browser in relation to a previously made request.

What does an HTTP/2 connection look like?

[Figure: An HTTP/2 connection with multiplexed streams]

Image by Ilya Grigorik

In this example we have three streams:

  1. A client-initiated request for page.html
  2. A stream which carries script.js – the server initiated this stream, since it already knows the content of page.html and that script.js will be requested by the browser as soon as it parses the HTML.
  3. A stream which carries style.css – initiated by the server, since this CSS is considered critical and will be required by the browser as soon as it parses the HTML file.

HTTP/2 is a huge step forward. It is currently at Draft 16, and the final specification will be ready very soon.

Optimizing the Web stack for HTTP/2 era

Does the above mean that we should discard all the optimizations we have been doing for years to make our Web applications as fast as possible? Of course not! It just means we have to forget about some of them and start applying others. For example, we should still send as little data as possible from the server to the client, take care of caching, and store files offline. In general, we can split the needed optimizations into two parts:

Optimize the content being served to the browser:

  • Minifying JavaScript, CSS and HTML files
  • Removing redundant data from images
  • Optimizing critical-path CSS
  • Removing CSS that is not needed on the page, using tools like UnCSS, before sending the page to the client
  • Properly specifying ETags for files and setting far-future Expires headers (see the sketch after this list)
  • Using HTML5 offline storage to keep already-downloaded files and minimize traffic on the next page load
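
As one hedged example of the ETag/Expires point above (my own sketch using Express's static middleware, not something prescribed by the article):

var express = require('express');
var app = express();

// Serve versioned/fingerprinted assets with a far-future max-age (1 year here).
// express.static sends ETag and Last-Modified headers by default, so
// conditional requests can still be answered with 304 Not Modified.
app.use('/assets', express.static(__dirname + '/public', {
    maxAge: 365 * 24 * 60 * 60 * 1000   // in milliseconds
}));

// HTML documents themselves should stay revalidatable.
app.get('/', function(req, res) {
    res.setHeader('Cache-Control', 'no-cache');
    res.sendFile(__dirname + '/views/index.html');
});

app.listen(8080);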

Optimize the server and TCP stack:

  • Check your server and make sure the value of TCP’s Initial Congestion Window (initial cwnd) is 10 segments (IW10). If you use GNU/Linux, just upgrade to kernel 3.2+ to get this feature and another important update – Proportional Rate Reduction for TCP
  • Disable Slow-Start Restart after idle
  • Check that Window Scaling is enabled
  • Consider using TCP Fast Open (TFO)

(for more information check the wonderful book “High Performance Browser Networking” by Ilya Grigorik)

We may consider removing the following “optimizations”:

  • Joining files – nowadays many companies are striving for continuous deployment, which makes this challenging: a single changed line of code invalidates the whole bundle. It also forces the browser to wait until the whole file arrives before it can start processing it
  • Domain sharding – loading resources from different domains in order to avoid the browser’s limit of connections per domain (usually 6) is the first “optimization” to be removed. It causes retransmissions and unnecessary latency
  • Resource inlining – it prevents caching and inflates the document in which the resources are stored. Instead, consider leaving CSS, JavaScript and images as external resources
  • Image sprites – the problem with cache invalidation is present here too. Apart from that, image sprites force the browser to consume more CPU and memory while decoding the entire sprite
  • Using cookie-free domains

The new modules API from ES6 and HTTP/2

For years we have been splitting our JavaScript code into modules, stored in different files. Since JavaScript did not provide a module system prior to version 6, the community invented two main formats – AMD and CommonJS. Of course, custom formats, like the one used by YUI Loader, existed too. In ECMAScript 6, however, we have a brand new way of creating modules. The API for loading them looks like this:

Declarative syntax:

import {foo, bar} from 'file.js';

Programmatic loader API:

System.import('my_module')
  .then(my_module => {
    // ...
  })
  .catch(error => {
    // ...
  });

Imagine this module has 10 different dependencies, each of them stored in a separate file.
If we don’t change anything at build time, the browser will request the file that contains the main module, and then it will make many additional requests in order to download all the dependencies.
Since we now have HTTP/2, the browser won’t open multiple connections. However, in order to process the main module, the browser still has to wait for all dependencies to be downloaded over the network. This means: download one file, parse it, discover that it requires another module, download that file, and so on, until all dependencies are resolved.

One possible fix for the above issue is to change things at build time: concatenate all modules into one file and then rewrite the originally specified import statements to look for modules in that joined file. However, this has drawbacks, and they are the same as when joining files in the HTTP/1.1 era.

Another fix that may be considered is to leverage HTTP/2 push promises. In the example above, this means you may try to push the dependencies when the main module is requested. If the browser already has them, it may abort (reset) the stream by sending an RST_STREAM frame.

Patrick Meenan, however, pointed me to a very interesting issue – in practice the browser may not be able to abort the stream quickly enough. By the time the pushed resources hit the client and are validated against the cache, the entire resource will already be in buffers (on the network and in the server) and the whole file will be downloaded anyway. It will work if we can be sure that the resources aren’t in the browser cache; otherwise we will end up sending them anyway. This is an interesting point for further research.

HTTP/2 implementations

You may start playing with HTTP/2 today. There are many server implementations – grab one and start experimenting.

The main browser vendors support HTTP/2 already:

  • Internet Explorer supports HTTP/2 from IE 11 on Windows 10 Technical Preview,
  • Firefox has enabled HTTP/2 by default in version 34 and
  • the current version of Chrome supports HTTP/2, but it is not enabled by default. It may be enabled by adding the command line flag --enable-spdy4 when launching Chrome, or via chrome://flags.

Currently only HTTP/2 over TLS is implemented in all browsers.

Other interesting protocols to keep an eye on

QUIC is another protocol, started by Google as a natural extension of the research on protocols like SPDY and HTTP/2. I won’t give many details here; it is enough to mention that it has all the features of SPDY and HTTP/2, but it is built on top of UDP. The goal is to avoid the head-of-line blocking present in SPDY and HTTP/2 and to establish a connection much faster than TCP+TLS is capable of.

For more information about HTTP/2 and QUIC, please watch my JSConfEU talk.

(Via Calendar.perfplanet.com)

Auth0 Architecture – Running In Multiple Cloud Providers And Regions


Auth0 provides authentication, authorization and single sign on services for apps of any type: mobile, web, native; on any stack.

Authentication is critical for the vast majority of apps. We designed Auth0 from the beginning with multiple levels of redundancy. One of these levels is hosting. Auth0 can run anywhere: our cloud, your cloud, or even your own servers. And when we run Auth0, we run it on multiple cloud providers and in multiple regions simultaneously.

This article is a brief introduction of the infrastructure behind app.auth0.com and the strategies we use to keep it up and running with high availability.

Core Service Architecture

The core service is relatively simple:

  • Front-end servers: these consist of several x-large VMs, running Ubuntu on Microsoft Azure.

  • Store: MongoDB, running on dedicated memory-optimized X-large VMs.

  • Intra-node service routing: nginx

All components of Auth0 (e.g. Dashboard, transaction server, docs) run on all nodes. All identical.

Multi-Cloud / High Availability

[Figure: Multi-cloud architecture]

Last week, Azure suffered a global outage that lasted for hours. During that time our HA plan activated and we switched over to AWS.

  • The service runs primarily on Microsoft Azure (IaaS), with secondary nodes on stand-by, always ready, on AWS.

  • We use Route53 with a failover routing policy and a TTL of 60 seconds. A Route53 health check probes the primary DC; if it fails (3 times at a 10-second interval), the DNS entry is changed to point to the secondary DC – see the sketch after this list. So the maximum downtime in case of a primary failure is ~2 minutes.

  • We use puppet to deploy on every “push to master”. Using puppet allows us to be cloud independent on the configuration/deployment process. Puppet Master runs on our build server (TeamCity currently).

  • MongoDB is replicated frequently to the secondary DC, and the secondary DC is configured as read-only.

  • While running on the secondary DC, only runtime logins are allowed and the dashboard is set to “read-only mode”.

  • We replicate all the configuration needed for a login to succeed (application info, secrets, connections, users, etc). We don’t replicate transactional data (tokens, logs).

  • In case of failover, some logging records might be lost. We are planning to improve that by having a real-time replica across Azure and AWS.

  • We use our own version of chaos monkey to test the resiliency of our infrastructure https://github.com/auth0/chaos-mona
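
As a rough illustration of the failover routing policy mentioned above (my own sketch with the AWS SDK for Node.js; the zone id, record names and health check id are placeholders, not Auth0's actual configuration):

var AWS = require('aws-sdk');
var route53 = new AWS.Route53();

// PRIMARY failover record: points at the primary (Azure) deployment and is
// only served while the associated health check passes; TTL 60 bounds the
// time it takes clients to pick up a switch to the secondary.
route53.changeResourceRecordSets({
    HostedZoneId: 'Z_EXAMPLE_ZONE_ID',                    // placeholder
    ChangeBatch: {
        Changes: [{
            Action: 'UPSERT',
            ResourceRecordSet: {
                Name: 'app.example.com.',
                Type: 'CNAME',
                SetIdentifier: 'primary-azure',
                Failover: 'PRIMARY',
                TTL: 60,
                ResourceRecords: [{ Value: 'primary.azure.example.com' }],
                HealthCheckId: '11111111-2222-3333-4444-555555555555' // placeholder
            }
        }]
    }
}, function(err, data) {
    if (err) { return console.error('Route53 update failed', err); }
    console.log('Change submitted:', data.ChangeInfo.Id);
});

A matching record with Failover: 'SECONDARY' would point at the AWS deployment and is served only while the primary's health check is failing.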


Automated Testing

  • We have 1000+ unit and integration tests.

  • We use saucelabs to run cross-browser (desktop/mobile) integration tests for Lock, our JavaScript login widget.

  • We use phantomjs/casper for integration tests. We test, for instance, that a full flow login with Google and other providers works fine.

  • All these run before every push to production.

CDN

Our use case is simple: we need to serve our JS library and its configuration (which providers are enabled, etc.). Assets and configuration data are uploaded to S3. The CDN has to support TLS on our own custom domain (https://cdn.auth0.com). We ended up building our own CDN.

  • We tried 3 reputable CDN providers, but ran into a whole variety of issues. We tried the first one when we didn’t have our own domain for the CDN. At some point we decided we needed our own domain over SSL/TLS, and this CDN was too expensive if you wanted SSL and a custom domain at that point (600/mo). We also had issues configuring it to work with gzip and S3: since S3 cannot serve both versions (zipped and not) of the same file and this CDN doesn’t do content negotiation, some browsers (cough, IE) don’t play well with this. So we moved to another CDN which was much cheaper.

  • With the second CDN we had a handful of issues, and we couldn’t understand their root cause. Their support was chat-based and it took time to get answers. Sometimes it seemed to be S3 issues, sometimes they had routing issues, and so on.

  • We decided to spend more money and we moved to a third CDN. Given that this CDN is being used by high-load services like GitHub, we thought it was going to be fine. However, our requirements were different from GitHub’s. If the CDN doesn’t work for GitHub, you won’t see an image on the README.md. In our case, our customers depend on the CDN to serve the Login Widget, which means that if it doesn’t work, their customers can’t log in.

  • We ended up building our own CDN using nginx, varnish and S3. It’s hosted in every AWS region and so far it has been working great (no downtime). We use Route53 latency-based routing.

Sandbox (Used To Run Untrusted Code)

One of the features we provide is the ability to run custom code as part of the login transaction. Customers can write these rules and we have a public repository for commonly used rules.

  • The sandbox is built on CoreOS, Docker and etcd.

  • There is a pool of Docker instances that gets assigned to a tenant on-demand.

  • Each tenant gets its own docker instance and there is a recycling policy based on idle time.

  • There is a controller that applies the recycling policy and a proxy that routes requests to the right container.

[Figure: custom code execution flow]

[Figure: sandbox VMs]

More information about the sandbox is in this JSConf presentation Nov 2014: https://www.youtube.com/watch?feature=player_detailpage&v=I4VkZ5H9PE8#t=7015 and slides: http://tjanczuk.github.io/about/sandbox.html

Monitoring

Initially we used pingdom (we still use it), but we decided to develop our own health check system that can run arbitrary health checks based on node.js scripts. These run from all AWS regions.

  • It uses the same sandbox we developed for our service. We call the sandbox via an http API and send the node.js script to run as an HTTP POST.

  • We monitor all the components and we also do synthetic transactions against the service (e.g. a login transaction), as sketched below.
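
As a sketch of what one of those node.js health check scripts might look like (my own illustration; the hostname, path and timeout are placeholders, and the real checks run inside the sandbox described above):

// A tiny synthetic check: fetch the hosted login page and verify it answers
// quickly with HTTP 200.
var https = require('https');

function checkLoginPage(callback) {
    var req = https.get({ hostname: 'example.auth0.com', path: '/login' }, function(res) {
        res.resume(); // drain the body; we only care about the status code
        callback(res.statusCode === 200 ? null
                                        : new Error('unexpected status ' + res.statusCode));
    });
    req.setTimeout(5000, function() { req.abort(); }); // fail if it takes too long
    req.on('error', callback);
}

checkLoginPage(function(err) {
    if (err) { console.error('health check failed:', err.message); process.exit(1); }
    console.log('health check OK');
});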


If a health check fails we get notified through Slack. We have two Slack channels #p1 and #p2. If the failure happens 1 time, it gets posted to #p2. If it happens 2 times in a row it gets posted to #p1 and all members of devops get an SMS (via Twilio).
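
A minimal sketch of that escalation logic (my own illustration; the webhook URL is a placeholder and the Twilio call is left as a comment):

var https = require('https');
var url = require('url');

var SLACK_WEBHOOK = 'https://hooks.slack.com/services/T000/B000/XXXX'; // placeholder

var consecutiveFailures = 0;

function postToSlack(channel, text) {
    var opts = url.parse(SLACK_WEBHOOK);
    opts.method = 'POST';
    opts.headers = { 'Content-Type': 'application/json' };
    var req = https.request(opts);
    req.on('error', function(err) { console.error('Slack post failed', err); });
    req.end(JSON.stringify({ channel: channel, text: text }));
}

// Called after every health check run.
function report(checkName, failed) {
    if (!failed) { consecutiveFailures = 0; return; }

    consecutiveFailures++;
    if (consecutiveFailures === 1) {
        postToSlack('#p2', checkName + ' failed once');
    } else {
        postToSlack('#p1', checkName + ' failed ' + consecutiveFailures + ' times in a row');
        // here we would also send an SMS to the devops on-call list (e.g. via Twilio)
    }
}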

For detailed performance counters and response times we use statsd and we send all the metrics to Librato. This is an example of a chart you can create.

[Chart: example Librato chart]

We also set up alerts based on derivative metrics (i.e. how much something grows or shrinks in a time period). For instance, we have one based on logins: if Derivative(logins) > X, send an alert to Slack.


Finally, we have alerts coming from NewRelic for infrastructure components.


For logging we use ElasticSearch, Logstash and Kibana. We are storing logs from nginx and mongodb at this point. We are also parsing mongo logs using logstash in order to identify slow queries (anything with a high number of collscans).


Website

  • All related web properties: the auth0.com site, our blog, etc. run completely separate from the app and runtime, on their own Ubuntu + Docker VMs.

Future

This is where we are going:

  • We are moving to CoreOS and Docker. We want to move to a model where we manage clusters as a whole instead of doing configuration management over individual nodes. Docker also helps by removing some moving parts through image-based deployment (and the ability to roll back at that level as well).

  • MongoDB will be geo-replicated across DCs between AWS and Azure. We are testing latency.

  • For all the search-related features we are moving to ElasticSearch to provide search based on any criteria. MongoDB didn’t work out well in this scenario (given our multi-tenancy).

(Via HighScalability.com)

Spark Sets New Record in Sort Performance

Databricks, a company founded by the creators of Apache Spark, has recently announced a new record in the Daytona GraySort contest using the Spark processing engine. The Daytona GraySort contest is a 3rd-party benchmark measuring how fast a system can sort 100 Terabytes of data. Databricks posted a throughput of 4.27 TB/min over a cluster of 206 machines for their official run, which constitutes a 3x performance improvement using 10x fewer machines compared to the previous record, submitted by Yahoo! running Hadoop MapReduce.

In a blog post announcing their submission to the Daytona GraySort contest, Databricks explained some of the technological improvements recently introduced to Spark that allowed it to sustain such a large throughput.

Spark 1.1 introduced a new shuffle implementation called sort-based shuffle. The previous shuffle implementation required an in-memory buffer for each partition in the shuffle, which led to notable memory overhead. The new sort-based shuffle requires only one in-memory buffer at a time. This significantly reduced the memory usage and allowed considerably more tasks to be run concurrently on the same hardware.

In addition to the new shuffle algorithm, the network module was revamped based on Netty’s native Epoll socket transport, which maintains its own pool of memory, bypassing the JVM’s memory allocator and reducing the impact of garbage collection. The new network module was then used to build an external shuffle service to allow shuffle files to be served even during garbage collection pauses in the main Spark executor.

Finally, Spark 1.1 included TimSort as its new default sorting algorithm. TimSort is derived from merge sort and insertion sort and performs better than quicksort in most real-world datasets, especially for datasets that are partially ordered.

All of these improvements allowed the Spark cluster to sustain 3 GB/s/node I/O activity during the map phase, and 1.1 GB/s/node network activity during the reduce phase, which saturated the 10 Gbps Ethernet link.

Spark is an advanced execution engine born out of research done at the AMPLab at UC Berkeley. It allows programs to run up to 10x faster than Hadoop MapReduce when data is on disk, and up to 100x faster when data resides in memory. Spark supports programs written in Java, Scala or Python and uses familiar functional programming constructs to build data processing flows.

Spark has garnered significant attention as a next-generation execution platform for Hadoop and is seen by some as a replacement for MapReduce. It graduated to a top-level Apache project in February, and since then it has been included in the Cloudera, Hortonworks and MapR Hadoop distributions. More recently, Hortonworks announced they will support running Hive on Spark as part of their Stinger.next initiative.

Databricks was founded in 2013 as a commercial entity supporting Spark and its associated projects. Those projects include Spark Streaming for stream processing, Spark SQL for querying Hive data and MLlib for machine learning.

(Via InfoQ.com)

Scaling Docker with Kubernetes

Kubernetes is an open source project to manage a cluster of Linux containers as a single system, managing and running Docker containers across multiple hosts, offering co-location of containers, service discovery and replication control. It was started by Google and now it is supported by Microsoft, RedHat, IBM and Docker amongst others.

Google has been using container technology for over ten years, starting over 2 billion containers per week. With Kubernetes it shares its container expertise creating an open platform to run containers at scale.

The project serves two purposes. Once you are using Docker containers, the next question is how to scale and start containers across multiple Docker hosts, balancing the containers across them. It also adds a higher-level API to define how containers are logically grouped, allowing you to define pools of containers, load balancing and affinity.

Kubernetes is still at a very early stage, which translates to lots of changes going into the project, some fragile examples, and some cases for new features that need to be fleshed out, but the pace of development, and the support by other big companies in the space, is highly promising.

Kubernetes concepts

The Kubernetes architecture is defined by a master server and multiple minions. The command line tools connect to the API endpoint in the master, which manages and orchestrates all the minions, Docker hosts that receive the instructions from the master and run the containers.

  • Master: Server with the Kubernetes API service. Multi master configuration is on the roadmap.
  • Minion: Each of the multiple Docker hosts with the Kubelet service that receives orders from the master and manages the containers running on that host.
  • Pod: Defines a collection of containers tied together that are deployed in the same minion, for example a database and a web server container.
  • Replication controller: Defines how many pods or containers need to be running. The containers are scheduled across multiple minions.
  • Service: A definition that allows discovery of services/ports published by containers, and external proxy communications. A service maps the ports of the containers running on pods across multiple minions to externally accessible ports.
  • kubecfg: The command line client that connects to the master to administer Kubernetes.

[Figure: Kubernetes architecture]

Kubernetes is defined by states, not processes. When you define a pod, Kubernetes tries to ensure that it is always running. If a container is killed, it will try to start a new one. If a replication controller is defined with 3 replicas, Kubernetes will try to always run that number, starting and stopping containers as necessary.

The example app used in this article is the Jenkins CI server, in a typical master-slaves setup to distribute the jobs. Jenkins is configured with the Jenkins swarm plugin to run a Jenkins master and multiple Jenkins slaves, all of them running as Docker containers across multiple hosts. The swarm slaves connect to the Jenkins master on startup and become available to run Jenkins jobs. The configuration files used in the example are available in GitHub, and the Docker images are available as csanchez/jenkins-swarm, for the master Jenkins, extending the official Jenkins image with the swarm plugin, and csanchez/jenkins-swarm-slave, for each of the slaves, just running the slave service on a JVM container.

Creating a Kubernetes cluster

Kubernetes provides scripts to create a cluster with several operating systems and cloud/virtual providers: Vagrant (useful for local testing), Google Compute Engine, Azure, Rackspace, etc.

The examples will use a local cluster running on Vagrant, using Fedora as OS, as detailed in the getting started instructions, and have been tested on Kubernetes 0.5.4. Instead of the default three minions (Docker hosts) we are going to run just two, which is enough to show the Kubernetes capabilities without requiring a more powerful machine.

Once you have downloaded Kubernetes and extracted it, the examples can be run from that directory. In order to create the cluster from scratch the only command needed is ./cluster/kube-up.sh.

$ export KUBERNETES_PROVIDER=vagrant
$ export KUBERNETES_NUM_MINIONS=2
$ ./cluster/kube-up.sh

Get the example configuration files:

$ git clone https://github.com/carlossg/kubernetes-jenkins.git

The cluster creation will take a while depending on machine power and internet bandwidth, but it should eventually finish without errors and it only needs to be run once.

Command line tool

The command line tool to interact with Kubernetes is called kubecfg, with a convenience script in cluster/kubecfg.sh.

In order to check that our cluster is up and running with two minions, just run the kubecfg list minions command and it should display the two virtual machines in the Vagrant configuration.

$ ./cluster/kubecfg.sh list minions

Minion identifier
----------
10.245.2.2
10.245.2.3

Pods

The Jenkins master server is defined as a pod in Kubernetes terminology. Multiple containers can be specified in a pod, and they will be deployed on the same Docker host, with the advantage that containers in a pod can share resources, such as storage volumes, and use the same network namespace and IP. Volumes are by default empty directories, of type emptyDir, that live for the lifespan of the pod, not the specific container, so if the container fails the persistent storage will live on. Another volume type is hostDir, which will mount a directory from the host server in the container.

In this Jenkins specific example we could have a pod with two containers, the Jenkins server and, for instance, a MySQL container to use as database, although we will only focus on a standalone Jenkins master container.

In order to create a Jenkins pod we run kubecfg with the Jenkins container pod definition, using Docker image csanchez/jenkins-swarm, ports 8080 and 50000 mapped to the container in order to have access to the Jenkins web UI and the slave API, and a volume mounted in /var/jenkins_home. You can find the example code in GitHub as well.

The Jenkins web UI pod (pod.json) is defined as follows:

{
  "id": "jenkins",
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "desiredState": {
    "manifest": {
      "version": "v1beta1",
      "id": "jenkins",
      "containers": [
        {
          "name": "jenkins",
          "image": "csanchez/jenkins-swarm:1.565.3.3",
          "ports": [
            {
              "containerPort": 8080,
              "hostPort": 8080
            },
            {
              "containerPort": 50000,
              "hostPort": 50000
            }
          ],
          "volumeMounts": [
            {
              "name": "jenkins-data",
              "mountPath": "/var/jenkins_home"
            }
          ]
        }
      ],
      "volumes": [
        {
          "name": "jenkins-data",
          "source": {
            "emptyDir": {}
          }
        }
      ]
    }
  },
  "labels": {
    "name": "jenkins"
  }
}

And create it with:

$ ./cluster/kubecfg.sh -c kubernetes-jenkins/pod.json create pods

Name                Image(s)                           Host                Labels              Status
----------          ----------                         ----------          ----------          ----------
jenkins             csanchez/jenkins-swarm:1.565.3.3   <unassigned>        name=jenkins        Pending

After some time, depending on your internet connection, as it has to download the Docker image to the minion, we can check its status and in which minion it is started.

$ ./cluster/kubecfg.sh list pods
Name                Image(s)                           Host                    Labels              Status
----------          ----------                         ----------              ----------          ----------
jenkins             csanchez/jenkins-swarm:1.565.3.3   10.0.29.247/10.0.29.247   name=jenkins        Running

If we ssh into the minion that the pod was assigned to, minion-1 or minion-2, we can see how Docker started the container defined, amongst other containers used by Kubernetes for internal management (kubernetes/pause and google/cadvisor).

$ vagrant ssh minion-2 -c "docker ps"

CONTAINER ID        IMAGE                              COMMAND                CREATED             STATUS              PORTS                                              NAMES
7f6825a80c8a        google/cadvisor:0.6.2              "/usr/bin/cadvisor"    3 minutes ago       Up 3 minutes                                                           k8s_cadvisor.b0dae998_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_28df406a
5c02249c0b3c        csanchez/jenkins-swarm:1.565.3.3   "/usr/local/bin/jenk   3 minutes ago       Up 3 minutes                                                           k8s_jenkins.f87be3b0_jenkins.default.etcd_901e8027-759b-11e4-bfd0-0800279696e1_bf8db75a
ce51fda15f55        kubernetes/pause:go                "/pause"               10 minutes ago      Up 10 minutes                                                          k8s_net.dbcb7509_0d38f5b2-759c-11e4-bfd0-0800279696e1.default.etcd_0d38fa52-759c-11e4-bfd0-0800279696e1_e4e3a40f
e6f00165d7d3        kubernetes/pause:go                "/pause"               13 minutes ago      Up 13 minutes       0.0.0.0:8080->8080/tcp, 0.0.0.0:50000->50000/tcp   k8s_net.9eb4a781_jenkins.default.etcd_901e8027-759b-11e4-bfd0-0800279696e1_7bd4d24e
7129fa5dccab        kubernetes/pause:go                "/pause"               13 minutes ago      Up 13 minutes       0.0.0.0:4194->8080/tcp                             k8s_net.a0f18f6e_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_659a7a52

And, once we know the container id, we can check the container logs with vagrant ssh minion-1 -c "docker logs cec3eab3f4d3"

We should also see the Jenkins web UI at http://10.245.2.2:8080/ or http://10.0.29.247:8080/, depending on what minion it was started in.

Service discovery

Kubernetes allows defining services, a way for containers to use discovery and proxy requests to the appropriate minion. With this definition in service-http.json we are creating a service with id jenkins pointing to the pod with the label name=jenkins, as declared in the pod definition, and forwarding the port 8888 to the container’s 8080.

{
  "id": "jenkins",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 8888,
  "containerPort": 8080,
  "selector": {
    "name": "jenkins"
  }
}

Creating the service with kubecfg:

$ ./cluster/kubecfg.sh -c kubernetes-jenkins/service-http.json create services

Name                Labels              Selector            IP                  Port
----------          ----------          ----------          ----------          ----------
jenkins                                 name=jenkins        10.0.29.247         8888

Each service is assigned a unique IP address tied to the lifespan of the Service. If we had multiple pods matching the service definition the service would load balance the traffic across all of them.

Another feature of services is that a number of environment variables are available for any subsequent containers run by Kubernetes, providing the ability to connect to the service container, in a similar way to running linked Docker containers. This will prove useful for finding the master Jenkins server from any of the slaves.

JENKINS_PORT='tcp://10.0.29.247:8888'
JENKINS_PORT_8080_TCP='tcp://10.0.29.247:8888'
JENKINS_PORT_8080_TCP_ADDR='10.0.29.247'
JENKINS_PORT_8080_TCP_PORT='8888'
JENKINS_PORT_8080_TCP_PROTO='tcp'
JENKINS_SERVICE_PORT='8888'
SERVICE_HOST='10.0.29.247'

Another tweak we need to make is to open port 50000, which is needed by the Jenkins swarm plugin. It can be achieved by creating another service, service-slave.json, so that Kubernetes forwards traffic on that port to the Jenkins server container.

{
  "id": "jenkins-slave",
  "kind": "Service",
  "apiVersion": "v1beta1",
  "port": 50000,
  "containerPort": 50000,
  "selector": {
    "name": "jenkins"
  }
}

The service is created with kubecfg again.

$ ./cluster/kubecfg.sh -c kubernetes-jenkins/service-slave.json create services

Name                Labels              Selector            IP                  Port
----------          ----------          ----------          ----------          ----------
jenkins-slave                           name=jenkins        10.0.86.28          50000

And all the defined services are available now, including some internal Kubernetes ones:

$ ./cluster/kubecfg.sh list services

Name                Labels              Selector                                  IP                  Port
----------          ----------          ----------                                ----------          ----------
kubernetes-ro                           component=apiserver,provider=kubernetes   10.0.22.155         80
kubernetes                              component=apiserver,provider=kubernetes   10.0.72.49          443
jenkins                                 name=jenkins                              10.0.29.247         8888
jenkins-slave                           name=jenkins                              10.0.86.28          50000

Replication controllers

Replication controllers allow running multiple pods in multiple minions. Jenkins slaves can be run this way to ensure there is always a pool of slaves ready to run Jenkins jobs.

In a replication.json definition:

{
  "id": "jenkins-slave",
  "apiVersion": "v1beta1",
  "kind": "ReplicationController",
  "desiredState": {
    "replicas": 1,
    "replicaSelector": {
      "name": "jenkins-slave"
    },
    "podTemplate": {
      "desiredState": {
        "manifest": {
          "version": "v1beta1",
          "id": "jenkins-slave",
          "containers": [
            {
              "name": "jenkins-slave",
              "image": "csanchez/jenkins-swarm-slave:1.21",
              "command": [
                "sh", "-c", "/usr/local/bin/jenkins-slave.sh -master http://$JENKINS_SERVICE_HOST:$JENKINS_SERVICE_PORT -tunnel $JENKINS_SLAVE_SERVICE_HOST:$JENKINS_SLAVE_SERVICE_PORT -username jenkins -password jenkins -executors 1"
              ]
            }
          ]
        }
      },
      "labels": {
        "name": "jenkins-slave"
      }
    }
  },
  "labels": {
    "name": "jenkins-slave"
  }
}

The podTemplate section allows the same configuration options as a pod definition. In this case we want the Jenkins slave to connect automatically to our Jenkins master, instead of relying on Jenkins multicast discovery. To do so we execute the jenkins-slave.sh command with the -master parameter to point the slave at the Jenkins master running in Kubernetes. Note that we use the Kubernetes-provided environment variables for the Jenkins service definition (JENKINS_SERVICE_HOST and JENKINS_SERVICE_PORT). The image command is overridden to configure the container this way, which is useful for reusing existing images while taking advantage of the service environment variables. The same can be done in pod definitions too.
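
To make this concrete, inside the slave container those variables expand to the service addresses created earlier, so the effective command is roughly the following (values taken from the service listings above; they will differ in your cluster):

/usr/local/bin/jenkins-slave.sh -master http://10.0.29.247:8888 \
  -tunnel 10.0.86.28:50000 -username jenkins -password jenkins -executors 1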

Create the replicas with kubecfg:

$ ./cluster/kubecfg.sh -c kubernetes-jenkins/replication.json create replicationControllers

Name                Image(s)                            Selector             Replicas
----------          ----------                          ----------           ----------
jenkins-slave       csanchez/jenkins-swarm-slave:1.21   name=jenkins-slave   1

Listing the pods now would show new ones being created, up to the number of replicas defined in the replication controller.

$ ./cluster/kubecfg.sh list pods

Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3    10.245.2.3/10.245.2.3   name=jenkins         Running
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.2/10.245.2.2   name=jenkins-slave   Pending

The first time the jenkins-swarm-slave image is run the minion has to download it from the Docker repository, but after a while (depending on your internet connection) the slaves should automatically connect to the Jenkins server. On the minion where a slave is started, docker ps should show the container running, and docker logs is useful for debugging any problems during container startup.

$ vagrant ssh minion-1 -c "docker ps"

CONTAINER ID        IMAGE                               COMMAND                CREATED              STATUS              PORTS                    NAMES
870665d50f68        csanchez/jenkins-swarm-slave:1.21   "/usr/local/bin/jenk   About a minute ago   Up About a minute                            k8s_jenkins-slave.74f1dda1_07651754-4f88-11e4-b01e-0800279696e1.default.etcd_11cac207-759f-11e4-bfd0-0800279696e1_9495d10e
cc44aa8743f0        kubernetes/pause:go                 "/pause"               About a minute ago   Up About a minute                            k8s_net.dbcb7509_07651754-4f88-11e4-b01e-0800279696e1.default.etcd_11cac207-759f-11e4-bfd0-0800279696e1_4bf086ee
edff0e535a84        google/cadvisor:0.6.2               "/usr/bin/cadvisor"    27 minutes ago       Up 27 minutes                                k8s_cadvisor.b0dae998_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_588941b0
b7e23a7b68d0        kubernetes/pause:go                 "/pause"               27 minutes ago       Up 27 minutes       0.0.0.0:4194->8080/tcp   k8s_net.a0f18f6e_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0.default.file_cadvisormanifes12uqn2ohido76855gdecd9roadm7l0_57a2b4de
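
If a slave fails to connect, docker logs can be pointed at the container ID reported by docker ps above (output omitted here):

$ vagrant ssh minion-1 -c "docker logs 870665d50f68"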

The replication controller can be resized to any desired number of replicas:

$ ./cluster/kubecfg.sh resize jenkins-slave 2

And again the pods are updated to show where each replica is running.

$ ./cluster/kubecfg.sh list pods
Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.2/10.245.2.2   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.3/10.245.2.3   name=jenkins-slave   Pending
jenkins                                csanchez/jenkins-swarm:1.565.3.3    10.245.2.3/10.245.2.3   name=jenkins         Running

Scheduling

Right now the default scheduler is random, but resource-based scheduling will be implemented soon. At the time of writing there are several open issues to add scheduling based on memory and CPU usage. There is also work in progress on an Apache Mesos based scheduler. Apache Mesos is a framework for distributed systems that provides APIs for resource management and scheduling across entire datacenter and cloud environments.

Self healing

One of the benefits of using Kubernetes is the automated management and recovery of containers.

If the container running the Jenkins server dies for any reason, for instance because the process crashes, Kubernetes will notice and will create a new container after a few seconds.

$ vagrant ssh minion-2 -c 'docker kill `docker ps | grep csanchez/jenkins-swarm: | sed -e "s/ .*//"`'
51ba3687f4ee


$ ./cluster/kubecfg.sh list pods
Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3    10.245.2.3/10.245.2.3   name=jenkins         Failed
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.2/10.245.2.2   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.3/10.245.2.3   name=jenkins-slave   Running

And some time later, typically no more than a minute…

Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3    10.245.2.3/10.245.2.3   name=jenkins         Running
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.2/10.245.2.2   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.3/10.245.2.3   name=jenkins-slave   Running

By running the Jenkins data directory in a volume we guarantee that the data is kept even after the container dies, so we do not lose any Jenkins jobs or data. And because Kubernetes proxies the services on each minion, the slaves will reconnect to the new Jenkins server automagically no matter where they run! Exactly the same happens if any of the slave containers dies: the system automatically creates a new container and, thanks to the service discovery, it automatically rejoins the Jenkins server pool.
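
The volume itself is declared in the Jenkins master pod definition, which is not reproduced in full here. As a rough, illustrative sketch of what such a declaration could look like in the v1beta1 manifest format used throughout this article (the /var/jenkins_home mount path and the emptyDir source are assumptions, not the exact contents of the original pod.json):

"containers": [
  {
    "name": "jenkins",
    "image": "csanchez/jenkins-swarm:1.565.3.3",
    "volumeMounts": [
      { "name": "jenkins-data", "mountPath": "/var/jenkins_home" }
    ]
  }
],
"volumes": [
  { "name": "jenkins-data", "source": { "emptyDir": {} } }
]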

If something more drastic happens, such as a minion dying, Kubernetes does not yet offer the ability to reschedule the containers onto the other existing minions; it just shows the pods as Failed.

$ vagrant halt minion-2
==> minion-2: Attempting graceful shutdown of VM...
$ ./cluster/kubecfg.sh list pods
Name                                   Image(s)                            Host                    Labels               Status
----------                             ----------                          ----------              ----------           ----------
jenkins                                csanchez/jenkins-swarm:1.565.3.3    10.245.2.3/10.245.2.3   name=jenkins         Failed
07651754-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.2/10.245.2.2   name=jenkins-slave   Running
a22e0d59-4f88-11e4-b01e-0800279696e1   csanchez/jenkins-swarm-slave:1.21   10.245.2.3/10.245.2.3   name=jenkins-slave   Failed

Tearing down

kubecfg offers several commands to stop and delete the replication controller, pod and service definitions.

To stop the replication controller, setting the number of replicas to 0 and causing the termination of all the Jenkins slave containers:

$ ./cluster/kubecfg.sh stop jenkins-slave

To delete it:

$ ./cluster/kubecfg.sh rm jenkins-slave

To delete the Jenkins server pod, causing the termination of the Jenkins master container:

$ ./cluster/kubecfg.sh delete pods/jenkins

To delete the services:

$ ./cluster/kubecfg.sh delete services/jenkins
$ ./cluster/kubecfg.sh delete services/jenkins-slave

Conclusion

Kubernetes is still a very young project, but it is highly promising for managing Docker deployments across multiple servers and simplifying the execution of long-running and distributed Docker containers. By abstracting infrastructure concepts and working on desired states instead of processes, it makes it easy to define clusters, including self-healing capabilities, out of the box. In short, Kubernetes makes management of Docker fleets easier.

About the Author

Carlos Sanchez has been working on automation and quality of software development, QA and operations processes for over 10 years, from build tools and continuous integration to deployment automation, DevOps best practices and continuous delivery. He has delivered solutions to Fortune 500 companies, working at several US-based startups, most recently MaestroDev, a company he cofounded. Carlos has been a speaker at several conferences around the world, including JavaOne, EclipseCON, ApacheCON, JavaZone, Fosdem and PuppetConf. Very involved in open source, he is a member of the Apache Software Foundation amongst other open source groups, contributing to several projects such as Apache Maven, Fog and Puppet.

(Via InfoQ.com)

DDN Pushes the Envelope for Parallel Storage I/O

Today at Supercomputing 2014, DataDirect Networks lifted the veil a bit more on Infinite Memory Engine (IME), its new software that will employ Flash storage and a bunch of smart algorithms to create a buffer between HPC compute and parallel file system resources, with the goal of improving file I/O by up to 100x. The company also announced the latest release of EXAScaler, its Lustre-based storage appliance lineup.

Data patterns have been changing at HPC sites in a way that creates bottlenecks in I/O. While many HPC shops may think they're primarily working with large, sequential files, the reality is that most data is relatively small and random, and that fragmented I/O creates problems when moving data across the interconnect, says Jeff Sisilli, Sr. Director of Product Marketing at DataDirect Networks.

“Parallel file systems were really built for large files,” Sisilli tells HPCwire. “What we’re finding is 90 percent of typical I/O in HPC data centers utilizes small files, those less than 32KB. What happens is, when you inject those into a parallel file system, it starts to really bring down performance.”

DDN says IME overcomes the restrictions in how parallel file systems were designed by creating a storage tier above the file system, providing a “fast data” layer between the compute nodes in an HPC cluster and the backend file system. The software, which resides on the I/O nodes in the cluster, utilizes any Flash solid state drives (SSDs) or other non-volatile memory (NVM) storage resources available, creating a “burst buffer” to absorb peak loads and eliminate I/O contention.

[Figure: IME diagram]
IME works in two ways. First, it removes limitations of the POSIX layer, such as file locks, that can slow down communication. Second, its algorithms bundle the small and random I/O operations into larger files that can be moved into the file system more efficiently.

In lab tests at a customer site, DDN ran IME against the S3D turbulent flow modeling software. The software was designed for larger sequential files, but is often used in the real world with smaller and random files. In the customer’s case, these “mal-aligned and fragmented” files were causing I/O throughput across the InfiniBand interconnect to drop to 25 MB per second.

After introducing IME, the customer was able to ingest data from the compute cluster onto IME’s SSDs at line rate. “This customer was using InfiniBand, and we were able to fill up InfiniBand all the way to line rate, and absorb at 50 GB per second,” Sisilli says.

The data wasn’t written back into the file system quite that quickly. But because the algorithms were able to align all those small files and convert fragments into full-stripe writes, it still provided a large speed-up over the original 25 MB per second. “We were able to drain out the buffer and write to the parallel file system at 4GB per second, which is two orders of magnitude faster than before,” Sisilli says.

The “net net” of IME, Sisilli says, is it frees up HPC compute cluster resources. “From the parallel file system side, we’re able to shield the parallel file system and underlying storage arrays from fragmented I/O, and have those be able to ingest optimized data and utilize much less hardware to be able to get to the performance folks need up above,” he says.

IME will work with any Lustre- or GPFS-based parallel file system. That includes DDN’s own EXAScaler line of Lustre-based storage appliances, or the storage appliances of any other vendor. No application modifications are required to use IME, which also features data erasure coding capabilities typically found in object stores. The only requirements are that the application is POSIX compliant or uses the MPI job scheduler. DDN also provides an API that customers can use if they want to modify their apps to work with IME; the company plans to create an ecosystem of compatible tools around this API.

There are other vendors developing similar Flash-based storage buffer offerings. But DDN says its open, software-based approach gives customers an advantage over vendors that require the purchase of specialized hardware, or whose products work only with certain types of interconnects.

[Figure: IME burst buffer]

IME isn’t available yet; it’s still in technology preview mode. But when it becomes available, scalability won’t be an issue. The software will be able to corral and make available petabytes’ worth of Flash or NVM storage resources living across thousands of nodes, Sisilli says. “What we’re recommending is [to have in IME] anywhere between two to three [times the] amount of your compute cluster memory to have a great working space within IME to accelerate your applications and do I/O,” he says. “That can be all the way down to terabytes, and for supercomputers, it’s multi petabytes.”

IME is still undergoing tests, and is expected to become generally available in the second quarter of 2015. DDN will offer it as an appliance or as software.

DDN also unveiled a new release of EXAScaler today. With version 2.1, DDN has improved read and write I/O performance by 25 percent. That will give DDN a comfortable advantage over competing file systems for some time, says Roger Goff, Sr. Product Manager for DDN.

“We know what folks are about to announce because they pre-announce those things,” Goff says. “Our solution is tremendously faster than what you will see [from other vendors], particularly on a per-rack performance basis.”

Other new features in version 2.1 include support for self-encrypting drives; improved rebuild times; InfiniBand optimizations; and better integration with DDN’s Storage Fusion Xcelerator (SFX Flash Caching) software.

DDN has also standardized on the Lustre file system from Intel, called Intel Enterprise Edition for Lustre version 2.5. That brings it several new capabilities, including a new MapReduce connector for running Hadoop workloads.

“So instead of having data replicated across multiple nodes in the cluster, which is the native mode for HDFS, with this adapter you can run those Hadoop applications and take advantage of the single-copy nature of a parallel file system, yet have the same capability of a parallel file system to scale to thousands and thousands of clients accessing that same data,” Goff says.

EXAScaler version 2.1 is available now across all three EXAScaler products, including the entry-level SFA7700, the midrange ES12k/SFA12k-20, and the high-end SFA12KX/SFA212k-40.

(Via HPCwire.com)