WhatsApp Partners With Open WhisperSystems To End-To-End Encrypt Billions Of Messages A Day

Good news for people who love privacy and security! Bad news for black-hat hackers and government surveillance agencies. WhatsApp, the wildly popular messaging app, has partnered with the crypto gurus at Open WhisperSystems to implement strong end-to-end encryption on all WhatsApp text messages–meaning not even Zuck himself can pry into your conversations, even if a court order demands it.

WhatsApp, of course, has hundreds of millions of daily users worldwide, and was purchased by Facebook earlier this year for an eye-popping $19 billion. Open WhisperSystems creates state-of-the-art end-to-end encryption systems such as Signal, for voice, and TextSecure, for text. (They have nothing to do with the messaging app Whisper.) As per the EFF’s Secure Messaging Scorecard, TextSecure is a huge upgrade from WhatsApp’s previous encryption.

And I do mean huge. Right now the TextSecure protocol has only rolled out to WhatsApp for Android–other clients, and group/media messaging, are coming soon–but already “billions of encrypted messages are being exchanged daily … we believe this already represents the largest deployment of end-to-end encrypted communication in history,” to quote OWS founder Moxie Marlinspike.

This comes in the wake of Apple, similarly, encrypting iOS 8 devices such that Apple cannot retrieve data stored on them (albeit in a closed-source, unverifiable way). FBI director James Comey subsequently claimed “the post-Snowden pendulum” has “gone too far” amid concerns that the world is “going dark” to wiretaps and surveillance. These and other three-letter agencies–along with black-hat hackers and governments worldwide, thanks to WhatsApp’s immense global reach–won’t be happy about their new inability to pry into the contents of WhatsApp text messages.

However, Comey’s claims appear to suffer from the slight disadvantage of being completely false:

and I think the tech industry’s response to the FBI and NSA’s dismay at the widespread use of end-to-end encryption by more and more companies, and the notion that the tech industry is “picking a fight” with them, can be summed up as:

And it’s fair to say that, in a world where surveillance technology seems to grow more powerful and pervasive every week, any meaningful blow that can be struck for privacy and anonymity is a welcome rebalancing. Kudos to WhatsApp for making this happen–Marlinspike, who approached them with the idea, stresses how impressed he’s been by their eagerness, dedication, and thoroughness–and to Open WhisperSystems for making it possible.

Marlinspike says OWS’s own goal is to keep producing strong, open-source privacy and security tools that companies (and individuals) can easily incorporate into their own services, without having to do their own crypto research and/or protocol design. They’ll keep working on Signal and TextSecure as reference implementations, using them to push the envelope and prove new ideas; and in the meantime, in one fell swoop, their TextSecure protocol suddenly has hundreds of millions of daily users, more than all but a tiny handful of companies.


[Translated from Vietnamese] How to Write Anti-Virus Software – Step 1: Scanning Files by Hash

Introduction: I (VVM) worked on anti-virus software (also called an anti-virus, or av for short) for about 6 years at CMC InfoSec. I no longer work on av, but when talking with friends in the IT industry I found that most of them cannot picture how an av is built and assume it must be very hard, so I am writing this series, “How to build anti-virus software”, to give you an idea of how a (simple) av is put together.

From its earliest days, an av was essentially just something that scanned files and checked whether each file was a virus or not. To this day, file scanning remains the most basic and most important feature of an av. So let us start with a few questions.
How do we know whether a file is a virus? There are many ways, but the most basic and most common one is to compare the hash (e.g. MD5, SHA1, SHA2, …) of the file being checked against the set of virus samples we have. If the two hashes match, we can conclude the file contains a virus, and if the file does not hook deeply into the system, simply deleting the infected files is usually enough to be safe again.
  *hash: a unique code generated from a given piece of content, meaning each distinct content yields a different hash (in theory there is still some probability of collision, but it is small enough to ignore).
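
As an illustration of this hash-lookup idea, here is a minimal Python sketch (my own example, not part of the original article); the signature hashes and the file name are placeholders.

import hashlib

# Hypothetical signature set: hex MD5 digests of known virus samples (placeholders).
KNOWN_BAD_MD5 = {
    "44d88612fea8a8f36de82e1278abb02f",   # commonly cited MD5 of the EICAR test file
    "098f6bcd4621d373cade4e832627b4f6",   # placeholder entry
}

def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files do not have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_infected(path):
    return md5_of_file(path) in KNOWN_BAD_MD5

print(is_infected("suspect.bin"))   # True only if the file's MD5 is in the signature set
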
The next question is: where do virus samples come from? Common sources are specialized forums, industry organizations, services such as virustotal.com, sample-exchange programs between anti-virus vendors, and samples collected from users.
Suppose you have somewhere between a few hundred and a few thousand virus samples; how do you search that sample set? The simplest way is to compare the file’s hash against each sample hash one by one, but it is better to sort the hashes with a simple sorting algorithm (e.g. Quick Sort, Bubble Sort, Merge Sort, …) and then use binary search to look them up.
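
A minimal sketch of the sort-then-binary-search variant (again my own illustration; the signature list is made up):

import bisect

signatures = [
    "0cc175b9c0f1b6a831c399e269772661",   # placeholder sample hashes
    "44d88612fea8a8f36de82e1278abb02f",
    "098f6bcd4621d373cade4e832627b4f6",
]
signatures.sort()   # sort once, when the signature DB is loaded

def contains(sorted_sigs, file_hash):
    """Binary search: O(log n) comparisons instead of one per sample."""
    i = bisect.bisect_left(sorted_sigs, file_hash)
    return i < len(sorted_sigs) and sorted_sigs[i] == file_hash

print(contains(signatures, "44d88612fea8a8f36de82e1278abb02f"))   # True
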
A follow-up question: how do you search efficiently against a sample set of several million entries (sometimes tens of millions)? At that point we are into optimization territory, but in the most basic case you can use a binary search tree, and many products use exactly this to build their own signature database (DB). If you use a binary search tree, prefer a self-balancing one (a red-black tree is one example); alternatively, if you are only experimenting, or you are offering an online/cloud service, you can simply use an existing DB (RocksDB, Redis, Riak, MongoDB, PostgreSQL, …) and skip writing and testing your own.
That covers a single file, but what about a whole folder or a disk with hundreds of thousands (sometimes millions) of files of varying sizes? First, to scan many files you can use multiple threads/processes to walk the tree and check files (for better performance you can split the work into stages, e.g. some threads scan files, some compute hashes, some search and compare against the DB, and tune the number of threads/processes to the system’s resources). Second, for large files you can work in two phases: phase one hashes only the first few KB to check whether it might match something in the sample DB; if it does, phase two takes an additional hash plus the file size, or hashes the whole file, to confirm the match against the DB (with this approach, large virus samples must be indexed the same way).
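
Here is a rough sketch of scanning a directory tree with a pool of worker threads, including the two-phase check for large files described above. It is purely illustrative; the signature sets, the 64 KB prefix, the 50 MB threshold and the thread count are all assumptions, not values from the article.

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

KNOWN_BAD = set()                 # full-file hashes of known samples (placeholder)
KNOWN_BAD_PREFIX = set()          # hashes of the first 64 KB of large samples (placeholder)
LARGE_FILE = 50 * 1024 * 1024     # arbitrary "large file" threshold

def md5_bytes(data):
    return hashlib.md5(data).hexdigest()

def md5_file(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check(path):
    try:
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            head = f.read(64 * 1024)
        if size > LARGE_FILE and md5_bytes(head) not in KNOWN_BAD_PREFIX:
            return path, False                         # phase 1: cheap prefix check says "clean"
        return path, md5_file(path) in KNOWN_BAD       # phase 2 (or small file): full hash
    except OSError:
        return path, False                             # unreadable file: skip

def scan(root, workers=4):
    paths = (os.path.join(d, name)
             for d, _, names in os.walk(root) for name in names)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for path, infected in pool.map(check, paths):
            if infected:
                print("infected:", path)
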
Can we optimize further? Yes, we can. If you understand how CPUs compute you will know that integer arithmetic is fast, so, if you take care to limit collisions, you can use CRC32/CRC64 combined with the file size as a check that is considerably faster than MD5/SHA1/SHA2. You should also take advantage of memory-allocation optimizations, lock-free/lockless algorithms and inline functions to speed up the computation. And if you later add a Realtime Engine, you can use it to speed scanning up even further (by keeping a history of file states, …).
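
The CRC-plus-size trick can be sketched like this: the cheap (CRC32, file size) pair filters out almost every file, and only candidate matches pay for the full cryptographic hash. The values in the table below are placeholders, not real signatures.

import hashlib
import os
import zlib

# Hypothetical signature DB: (crc32, size) -> full MD5 digest of the sample.
SIGNATURES = {
    (0x12345678, 68): "44d88612fea8a8f36de82e1278abb02f",   # placeholder values
}

def quick_check(path):
    with open(path, "rb") as f:
        data = f.read()
    key = (zlib.crc32(data) & 0xFFFFFFFF, os.path.getsize(path))
    expected = SIGNATURES.get(key)
    if expected is None:
        return False                                        # fast path: no CRC/size match
    return hashlib.md5(data).hexdigest() == expected        # confirm, since CRC32 collides easily
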
And how is the virus sample DB managed? If you build the DB yourself, this also needs careful thought. The sample DB can be organized in a fixed format so that after an update the av can load it directly without an extra import step. Adding and removing samples happens all the time, and to avoid re-downloading parts that have not changed you can split the sample DB into many subsets stored as separate files (classified, for example, by time, severity or sample source). Note also that the sample DB is usually quite small, since it only holds the virus hash, the virus name (one you pick yourself or borrow from other sources), the severity, and an id used to look up further information when needed.

Conclusion: We now know how to use hashes to check whether a file is a virus. Scanning for file-infecting viruses, also known as polymorphic viruses, is a special case that I will cover in later posts.

Next post: What is the Realtime Engine in an av and how does it work?

(Due to time constraints I have not yet drawn diagrams to make this easier to picture; I will add them later.)

>VVM.


World’s First 1,000-Processor Chip Said to Show Promise Across Multiple Workloads


A blindingly fast microchip, the first to contain 1,000 independent processors and said to show promise for digital signal processing, video processing, encryption and datacenter/cloud workloads, has been announced by a team at the University of California, Davis. The “KiloCore” chip has a maximum computation rate of 1.78 trillion instructions per second and contains 621 million transistors, according to the development team, which presented the microchip at the 2016 Symposium on VLSI Technology and Circuits this month in Honolulu.

By way of comparison, if the KiloCore’s area were the same as a 32 nm Intel Core i7 processor, it would contain approximately 2300-3700 processors and have a peak execution rate of 4.1 to 6.6 trillion independent instructions per second, according to the design team.

Although Bevan Baas, professor of electrical and computer engineering at UC/Davis who led the chip architectural design team, told HPCwire’s sister publication, EnterpriseTech, that there are no current plans to commercialize the processor, he said it has important commercial implications, with several applications already developed for the chip.


“It has been shown to excel spectacularly with many digital signal processing, wireless coding/decoding, multimedia and embedded workloads, and recent projects have shown that it can also excel at computing kernels for some datacenter/cloud and scientific workloads,” he said. He said the KiloCore innovates in a number of areas covering architectures, application development and mapping, circuits, and VLSI design.

“We hope multiple aspects of KiloCore will influence the design of future computing systems,” he said. “For workloads that can be mapped to its architecture, it could very well have a place in exascale-class computing.”

The design team claims that KiloCore has the highest clock rate of any processor ever designed in a university. And while other multiple-processor chips have been created, none exceed about 300 processors, according to the team.

The KiloCore chip was fabricated by IBM using their 32 nm CMOS technology.

Beyond throughput performance, Baas said, KiloCore is also the most energy-efficient many-core processor ever reported. For example, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery. The KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor.

Each processor core can run its own small program independently of the others, Baas explained, which he said is a fundamentally more flexible approach than Single-Instruction-Multiple-Data approaches utilized by processors such as GPUs. The idea is to break an application up into many small pieces, each of which can run in parallel on different processors, enabling high throughput with lower energy use.
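
KiloCore itself is programmed with the team’s own compiler and mapping tools, but the break-the-work-into-many-small-communicating-pieces idea can be illustrated with ordinary software. The sketch below is only an analogy of that MIMD, message-passing style (my own example, not KiloCore code): three small independent “programs” connected by queues, each doing its own piece of the work in parallel.

from multiprocessing import Process, Queue

def producer(out_q):
    for i in range(10):
        out_q.put(i)          # each piece of work is a small message
    out_q.put(None)           # sentinel: no more work

def squarer(in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        out_q.put(item * item)

def printer(in_q):
    while True:
        item = in_q.get()
        if item is None:
            break
        print(item)

if __name__ == "__main__":
    q1, q2 = Queue(), Queue()
    stages = [Process(target=producer, args=(q1,)),
              Process(target=squarer, args=(q1, q2)),
              Process(target=printer, args=(q2,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()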

The KiloCore architecture is an example of a “fine-grain many-core” processor array, Baas said. Processors are kept as simple as possible so they occupy a small chip area, with numerous cores per chip. “Short low-capacitance wires result in high efficiency,” he said, and they “operate at high clock frequencies (high performance in terms of high throughput and low latency).” The cores dissipate low power when both active and idle – in fact, he said, they dissipate zero active power when there is no work to do. Energy efficiency also is achieved by operation at low supply voltages and a relatively simple architecture consisting of a single-issue 7-stage pipeline with a small amount of memory per core and a message-passing-based inter-processor interconnect rather than a cache-based shared-memory model.

Baas said the team has completed a compiler and automatic program mapping tools for use in programming the chip.

(via HPCWire.com)

Logging for Unikernels

Unikernels are the next step in virtualized computing. There is a lot of hubbub right now in the tech-o-sphere about unikernels. In fact, in many circles unikernels are thought to be the Next Big Thing. There is a lot of recent news to justify this perspective. Docker, a key player in Linux Containers, has acquired Unikernel Systems. MirageOS, the library that you use to create unikernels, has solid backing from Xen and the Linux Foundation as an incubator project.

Unikernels are not going away anytime soon. Having a clear understanding of unikernel technology is important to anybody working in large scale, distributed computing. Thus, I share. In this article I am going to explain the basic concepts behind unikernels and describe how they sit in the landscape of virtualized computing. Also, I am going to talk about an approach to logging from within unikernels that I learned from industry expert Adam Wick, Ph.D.

So let’s begin.

The Path From Virtual Machines to Unikernels.

A unikernel is a unit of binary code that runs directly on a hypervisor, very much in the same way that a virtual machine (VM) runs on a hypervisor. However, there is a difference. When you create a virtual machine, you are for all intents and purposes creating an abstract computer. As with any computer, your virtual machine will have RAM, disk storage, CPUs and network I/O. (Please see Figure 1, below.)

[Figure 1: Virtual machine architecture]

When you create a virtual machine, the resources you allocate from the host computer are static. For example, as shown above, when you create a VM with 16GB of RAM, a 4 core CPU and 1 TB of storage, that configuration is set. The VM owns the resources and will not give them back. Also, the operating system on a VM can be different than the OS on the host. For example, if your host computer is using Ubuntu, creating a VM that uses Windows is no problem at all.

The important thing to remember is that a VM is a full blown computer with dedicated resources. However, there is a drawback. At runtime, should your 4 core, 16 GB VM use only 1 CPU, 1 GB of RAM, the other resources will sit around unused. Remember, there’s no giving back. Your VM owns the resources whether it uses them or not. Thus, there is the possibility that you can be paying for computing that is never used. Inefficient use of resources is a very real hazard on the Virtual Machine landscape.

Enter Unikernels

While full blown VMs may be useful for some work, there are situations where you don’t need all the overhead that a VM requires. Let’s say you publish a web service that recommends stocks to buy based on hourly activity on the Internet. What do you need? You can get by with threading capability, a web client, a web server and some code that provides analysis intelligence. That’s it. This is where a unikernel comes in.

As mentioned above, a unikernel is a unit of binary code that runs directly on a hypervisor. Think of a unikernel as an application that has been pulled into the operating system’s kernel. However, a unikernel is a special type of application. It’s compiled code that contains not only application code, but also the parts of the operating system that are essential to its purpose. A unikernel thinks it is the only show in town. And because it is compiled, you don’t have to go through a whole lot of installation hocus pocus. It’s a single file. (Please see Figure 2, below.)

[Figure 2: Unikernel architecture]

Unikernels are very small. For example, a DNS server weighs in at around 449 KB; a web server at ~674 KB. A unikernel contains only the pieces of the operating system required, no more, no less. There are no unnecessary drivers, utilities or graphical rendering components.

Unikernels load into memory very fast. Start times can be measured in milliseconds. Thus, they are well suited to provide transient microservice functionality. A transient microservice is a microservice that is loaded and unloaded into memory to meet an immediate need. The service is not meant to have a long lifespan. It’s loaded, work gets done and then the service is terminated.

Given the small size of a unikernel and the fast loading speed, you can deploy thousands of them on a single machine, bringing them on and offline as the need requires. And because they are compiled, binary artifacts, they are less susceptible to security breaches, provided the base compilation is clean.

A unikernel can be assigned a unique IP address. A unique IP address means that the unikernel is discoverable on the Internet. The implication is that you can have thousands of unikernels running on a machine, each communicating with the others via standard network protocols such as HTTP.

What about Containers?

A container is a virtualization technology in which components are assembled into a single deployment unit called a container. You do not have to use yum or apt-get to install application dependencies. Everything the container needs exists in the container.

A container is an isolated process. Thus, conceptually a container is like a VM in that it thinks that it’s the only show in town.

A container leverages the operating system of the host computer. Hence, there is no mixing and matching. You cannot have a Windows host computer running a Linux container.

Similar to a unikernel, a container uses only the resources it needs. Containers are highly efficient.

Unlike VMs and unikernels, which are managed by a hypervisor, containers are managed by a container manager. Docker and Kubernetes are popular container managers.

You can read more about containers here.

 

As mentioned above, a microservice lives behind an IP address that makes it available to other services. Typically the microservice provides a single piece of functionality that is stateless. Unlike VMs and Containers that can provide their own storage and authentication mechanisms, the architectural practice for unikernels is to have a strong boundary around the unikernel’s area of concern. The best practice is to have a unikernel use other services to meet needs beyond its concern boundary. For example, going back to the stock analysis service we described above, the purpose of that unikernel is to recommend stocks. If we want to make sure that only authorized parties are using the service we’d make it so our Stock Recommender unikernel uses an authorization/authentication service to allow users access to the service. (Please see Figure 3, below.)

[Figure 3: A unikernel delegating authentication/authorization to a separate service]

Unikernel technology is best suited for immutable microservices. The internals of a unikernel are written in stone. You do not update parts; there are no external parts. When it comes time to change a unikernel, you create a new binary and deploy that binary.

So far, so good? Excellent! Now that you have an understanding about the purpose and nature of unikernels, let’s move onto logging from within a unikernel.

The Importance of Logging

A unikernel is a compiled binary, similar to a low-level compilation unit such as a C++ executable. Therefore, as of this writing, debugging is difficult. Whereas with managed languages such as Java and C# you can decompile the deployment unit (class/jar for Java, dll/exe for C#) to take a look at the code and debug it if necessary, in a unikernel things are locked up tight. If anything goes wrong, there is no way to peek inside. The only way to get a sense of what is going on, or what’s gone wrong, is to read the logs. So the question becomes: how do you log from within a unikernel?

The answer is, carefully. I know, it’s a smartass answer. But it’s an answer that’s appropriate. Unikernel technology is still emerging. There is a lot of activity around unikernel loggers, but as of this writing no off-the-shelf technology on the order of Log4J, Log4Net or NLog has come forward. However, there are some people working in the space and I had the privilege to talk to one of them, Adam Wick, Ph.D.

Adam is Research Lead at Galois, Inc. The folks at Adam’s company solve big problems for organizations such as DARPA and the Department of Homeland Security. Galois does work that is strategic and complex.

Adam has been working with unikernels for a while and shared some wisdom with me. The first thing that Adam told me is that they stick to the Separation of Concerns mantra that I described above. The unikernels they implement do only one thing, and when it comes to logging they write loggers that do no more than send log entries to a log service. All their loggers do is create log data. Storing and analyzing log data is done by a third-party service. (Logentries is one of the logging services that Adam tests his logger against.) (Please see Figure 4, below.)

[Figure 4: Unikernels emitting log entries to an external logging service]

In terms of creating the logger for the unikernel, Adam acknowledges that at the moment, it’s still pretty much a roll your own affair. Nonetheless, Adam recommends that no matter what logger you create, you will do well to support the syslog protocol. To quote Adam,

“syslog is a tremendously easy protocol to implement. You basically just need to prepare a simple string in the right format and send it to the right place. Not a lot of tricky bits.”

The syslog protocol defines a log entry format that supports a predefined set of fields. The protocol has gone through a revision from RFC 3164 to RFC 5424, and there are some alternative versions out there, so there is a bit of confusion around implementations. Still, there is a usual set of fields that you can plan to support when implementing log emission. Table 1 below describes these usual fields, and a small sketch of emitting such an entry follows the table.

Table 1: The fields of the Syslog protocol

Field     | Description                                                                         | Example
Facility  | The source of the message, such as a hardware device, a protocol, or a module of the system software | Log Alert, Line Printer subsystem
Severity  | The severity of the entry                                                           | Debug, Informational, Notice, Warning, Error, Critical, Alert, Emergency
Hostname  | The host generating the entry                                                       | 198.162.0.1, MyHostName
Timestamp | The local time, in MMM DD HH:MM:SS format                                           |
Message   | A free-form message that provides additional information relevant to the log entry | “I am starting up now.”
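
For illustration, here is a minimal Python sketch (my own, not Galois code) that builds an entry roughly along the older RFC 3164 lines and ships it to a collector over UDP. The collector address, facility/severity numbers, hostname and tag are all assumptions to be replaced with your own values.

import socket
import time

def send_syslog(message, host="127.0.0.1", port=514,
                facility=1, severity=6, hostname="unikernel-01", tag="stockrec"):
    pri = facility * 8 + severity                    # syslog PRI = facility * 8 + severity
    timestamp = time.strftime("%b %d %H:%M:%S")      # "MMM DD HH:MM:SS"
    line = "<%d>%s %s %s: %s" % (pri, timestamp, hostname, tag, message)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line.encode("utf-8"), (host, port))  # fire-and-forget UDP datagram
    sock.close()

send_syslog("I am starting up now.")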

Once you get logging implemented in your unikernel, the question becomes what to log. Of course you’ll log typical events that are important to the service your unikernel represents. For example, if your microservice does image manipulation, you’ll want to log when and how the images are processed. If your unikernel is providing data analysis, you’ll want to log the what and when events relevant to the analysis.

However, remember that your unikernel will be one of thousands sitting in a single computer. More is required. To quote Adam Wick again:

“… the problem that we run into is not logging enough information.

I suspect this is true of many systems — in fact, I know I’ve seen at least a few conference talks on it — but understanding complex distributed systems is just a very hard problem. The more information you have to understand why something failed, or why something bogged down, the better. For example, in some of our early work, we didn’t log a very simple message: a “hello, I’m booted and awake!” message. This turned out to be a problem, later, because we were seeing some weird behavior, and it wasn’t clear if it was because of some sort of deep logic problem, or if it was just because our DHCP server was slow.

In hindsight this is probably obvious, but having information like ‘I’m up and my IP configuration is … ‘ and  ‘I’m about to crash.’ have proved to be very useful in a couple different situations. Most high-level programming languages have a mechanism to catch exceptions that occur. Catching and sending them out on a log seems obvious, but we did have to remember to make it a practice.”
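
A tiny sketch of that advice (my own illustration, not Galois code): say hello with the IP configuration at boot, and make sure any uncaught exception is emitted to the log service before the process dies. It uses Python’s standard SysLogHandler; the collector address is a placeholder.

import logging
import logging.handlers
import socket
import traceback

log = logging.getLogger("stockrec")
log.setLevel(logging.INFO)
log.addHandler(logging.handlers.SysLogHandler(address=("127.0.0.1", 514)))  # placeholder collector

def main_loop():
    raise RuntimeError("simulated failure")    # stand-in for the real service loop

if __name__ == "__main__":
    # "hello, I'm booted and awake", with the IP configuration we ended up with
    log.info("I'm up and my IP configuration is %s",
             socket.gethostbyname(socket.gethostname()))
    try:
        main_loop()
    except Exception:
        # "I'm about to crash": get the reason out to the log service before dying
        log.error("I'm about to crash: %s", traceback.format_exc())
        raise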

Another area that Adam covered is ensuring storage availability of your log data. Remember, your unikernel will not store data, only emit it. Thus, as we’ve mentioned, we need to rely upon a service to store the data for later analysis. Logentries allows you to store your data for days or months so you can quickly analyze it and set alerts on your unikernel events. If you need to store the data for longer, you can use Logentries S3 archiving features to maintain a backup of your log data for years (or indefinitely). And not only does Logentries provide virtually unlimited storage for an unlimited amount of time, but the service is highly available. It’s always there at the IP endpoint. Internally, if a node within Logentries goes down, the system is designed to transfer traffic to another node immediately without any impact to the service consumer. High availability of the logging service is critical. Remember, the logging service is all your unikernel has to store log data.

Given the closed, compiled nature of unikernels, logging within a unikernel is important practice. In fact, it’s a ripe opportunity for an entrepreneurial developer.  It’s not unreasonable to think that as the technology spreads, someone will publish an easy to use unikernel logger. Who knows, that someone might be you.

Putting it all together

Unikernels are going to be with us for a while. They are small and load fast, making them a good choice for implementing transient microservices. And, because they are small, you can load thousands into a machine. Each unikernel has a unique IP address, which means that you can cluster them to a common purpose behind a load balancer. Also, the IP address allows you to use standard communication protocols to have the unikernels talk to each other.

But for all the activity going on in the space, unikernels are still an emerging technology. The toolset to create unikernels is growing, and players such as Docker and Xen are active in unikernel development.

Given that a unikernel is a compiled binary that is nearly impossible to debug, logging information out of your unikernel implementation becomes critical. However, unikernel loggers are still a roll-your-own undertaking, for now anyway. Those in the know suggest that when you make a logging client, you emit log data according to the syslog protocol and use logging services such as Logentries to store log data for later analysis.

As I’ve mentioned over and over, unikernels are going to be with us for a while. Those who understand the details of unikernel technology and have made a few mission critical microservices using unikernels will be greatly in demand in no time at all. Unikernels are indeed the next big thing and the next big opportunity.

(Via Blog.logentries.com)

Design Of A Modern Cache

Caching is a common approach for improving performance, yet most implementations use strictly classical techniques. In this article we will explore the modern methods used by Caffeine, an open-source Java caching library, that yield high hit rates and excellent concurrency. These ideas can be translated to your favorite language and hopefully some readers will be inspired to do just that.

Eviction Policy

A cache’s eviction policy tries to predict which entries are most likely to be used again in the near future, thereby maximizing the hit ratio. The Least Recently Used (LRU) policy is perhaps the most popular due to its simplicity, good runtime performance, and a decent hit rate in common workloads. Its ability to predict the future is limited to the history of the entries residing in the cache, preferring to give the last access the highest priority by guessing that it is the most likely to be reused again soon.

Modern caches extend the usage history to include the recent past and give preference to entries based on recency and frequency. One approach for retaining history is to use a popularity sketch (a compact, probabilistic data structure) to identify the “heavy hitters” in a large stream of events. Take, for example, the CountMin sketch, which uses a matrix of counters and multiple hash functions. The addition of an entry increments a counter in each row, and the frequency is estimated by taking the minimum value observed. This approach lets us trade off space, efficiency, and the error rate due to collisions by adjusting the matrix’s width and depth.
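
To make the idea concrete, here is a toy CountMin sketch. Caffeine’s real implementation is in Java and uses packed 4-bit counters; this is just a simplified Python illustration.

import random

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=1):
        rnd = random.Random(seed)
        self.width = width
        self.salts = [rnd.getrandbits(64) for _ in range(depth)]   # one "hash" per row
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, salt, key):
        return hash((salt, key)) % self.width

    def increment(self, key):
        for salt, row in zip(self.salts, self.rows):
            row[self._index(salt, key)] += 1

    def estimate(self, key):
        # The minimum across rows bounds the over-count caused by collisions.
        return min(row[self._index(salt, key)]
                   for salt, row in zip(self.salts, self.rows))

sketch = CountMinSketch()
for _ in range(5):
    sketch.increment("user:42")
print(sketch.estimate("user:42"))   # at least 5, usually exactly 5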


Window TinyLFU (W-TinyLFU) uses the sketch as a filter, admitting a new entry if it has a higher frequency than the entry that would have to be evicted to make room for it. Instead of filtering immediately, an admission window gives an entry a chance to build up its popularity. This avoids consecutive misses, especially in cases like sparse bursts where an entry may not be deemed suitable for long-term retention. To keep the history fresh, an aging process is performed periodically or incrementally to halve all of the counters.
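
The admission decision and the aging step are simple once frequency estimates exist. A rough sketch (my own, with a plain Counter standing in for the popularity sketch; the sample size is arbitrary):

from collections import Counter

freq = Counter()            # stand-in for the popularity sketch
ops_since_reset = 0
SAMPLE_SIZE = 10_000        # arbitrary aging period

def record_access(key):
    global ops_since_reset
    freq[key] += 1
    ops_since_reset += 1
    if ops_since_reset >= SAMPLE_SIZE:
        for k in list(freq):           # aging: halve every counter so history decays
            freq[k] //= 2
            if freq[k] == 0:
                del freq[k]
        ops_since_reset = 0

def admit(candidate, victim):
    # Admit the newcomer only if it has been historically more popular
    # than the entry it would push out of the cache.
    return freq[candidate] > freq[victim]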


W-TinyLFU uses the Segmented LRU (SLRU) policy for long term retention. An entry starts in the probationary segment and on a subsequent access it is promoted to the protected segment (capped at 80% capacity). When the protected segment is full it evicts into the probationary segment, which may trigger a probationary entry to be discarded. This ensures that entries with a small reuse interval (the hottest) are retained and those that are less often reused (the coldest) become eligible for eviction.
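
A compact way to picture SLRU is as two LRU lists, with a second access promoting an entry from probation to the protected segment and protected overflow falling back to probation. A keys-only Python sketch (illustrative, with arbitrary capacities):

from collections import OrderedDict

class SLRU:
    def __init__(self, capacity=100, protected_ratio=0.8):
        self.protected_cap = int(capacity * protected_ratio)
        self.probation_cap = capacity - self.protected_cap
        self.probation = OrderedDict()    # entries seen once
        self.protected = OrderedDict()    # entries seen at least twice

    def access(self, key):
        if key in self.protected:                  # hot entry: just refresh recency
            self.protected.move_to_end(key)
            return
        if key in self.probation:                  # second access: promote
            del self.probation[key]
            self.protected[key] = True
            if len(self.protected) > self.protected_cap:
                demoted, _ = self.protected.popitem(last=False)
                self.probation[demoted] = True     # overflow drops back to probation
        else:                                      # brand-new entry starts in probation
            self.probation[key] = True
        if len(self.probation) > self.probation_cap:
            self.probation.popitem(last=False)     # coldest probationary entry is the victim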

[Figures: hit-rate curves for the database and search traces]

As the database and search traces show, there is a lot of opportunity to improve upon LRU by taking into account recency and frequency. More advanced policies such as ARC, LIRS, and W-TinyLFU narrow the gap to provide a near optimal hit rate. For additional workloads see the research papers and try our simulator if you have your own traces to experiment with.

Expiration Policy

Expiration is often implemented as a variable duration per entry, with expired entries evicted lazily due to a capacity constraint. This pollutes the cache with dead items, so sometimes a scavenger thread is used to periodically sweep the cache and reclaim free space. This strategy tends to work better than ordering entries by their expiration time on an O(lg n) priority queue, because it hides the cost from the user instead of incurring a penalty on every read or write operation.

Caffeine takes a different approach by observing that most often a fixed duration is preferred. This constraint allows entries to be organized on O(1) time-ordered queues: a time-to-live duration maps to a write-order queue, and a time-to-idle duration maps to an access-order queue. The cache can reuse the eviction policy’s queues and the concurrency mechanism described below, so that expired entries are discarded during the cache’s maintenance phase.
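
The fixed-duration observation is what makes O(1) expiration possible: if every entry lives for the same TTL, the oldest write is always at the head of a FIFO queue, so only the head ever needs to be examined. A minimal sketch of a write-order queue (mine, not Caffeine’s code):

import time
from collections import OrderedDict

class TtlCache:
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self.data = OrderedDict()          # insertion order == write order == expiration order

    def put(self, key, value):
        self.data.pop(key, None)           # a rewrite moves the entry to the tail
        self.data[key] = (value, time.monotonic())

    def get(self, key):
        self._expire()
        item = self.data.get(key)
        return None if item is None else item[0]

    def _expire(self):
        now = time.monotonic()
        while self.data:
            _, (_, written) = next(iter(self.data.items()))
            if now - written < self.ttl:
                break                      # head not expired, so nothing behind it is either
            self.data.popitem(last=False)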

Concurrency

Concurrent access to a cache is viewed as a difficult problem because in most policies every access is a write to some shared state. The traditional solution is to guard the cache with a single lock. This might then be improved through lock striping by splitting the cache into many smaller independent regions. Unfortunately that tends to have a limited benefit due to hot entries causing some locks to be more contended than others. When contention becomes a bottleneck the next classic step has been to update only per entry metadata and use either a random sampling or a FIFO-based eviction policy. Those techniques can have great read performance, poor write performance, and difficulty in choosing a good victim.

An alternative is to borrow an idea from database theory where writes are scaled by using a commit log. Instead of mutating the data structures immediately, the updates are written to a log and replayed in asynchronous batches. This same idea can be applied to a cache by performing the hash table operation, recording the operation to a buffer, and scheduling the replay activity against the policy when deemed necessary. The policy is still guarded by a lock, or a try lock to be more precise, but shifts contention onto appending to the log buffers instead.

In Caffeine separate buffers are used for cache reads and writes. An access is recorded into a striped ring buffer where the stripe is chosen by a thread-specific hash, and the number of stripes grows when contention is detected. When a ring buffer is full an asynchronous drain is scheduled and subsequent additions to that buffer are discarded until space becomes available. When the access is not recorded due to a full buffer the cached value is still returned to the caller. The loss of policy information does not have a meaningful impact because W-TinyLFU is able to identify the hot entries that we wish to retain. By using a thread-specific hash instead of the key’s hash, the cache keeps popular entries from causing contention by spreading the load out more evenly.
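
A much-simplified model of that read path: record the access into the buffer chosen by a per-thread hash, drop the record if that buffer is full, and let a drain replay the batch into the policy later. Caffeine’s real buffers are lock-free Java ring buffers; this Python sketch only illustrates the shape of the idea.

import threading

NUM_STRIPES = 4
CAPACITY = 16
buffers = [[] for _ in range(NUM_STRIPES)]
locks = [threading.Lock() for _ in range(NUM_STRIPES)]

def record_read(key):
    # Stripe chosen from the reading thread, not the key, so one hot key
    # does not funnel every thread onto the same buffer.
    stripe = hash(threading.get_ident()) % NUM_STRIPES
    with locks[stripe]:
        if len(buffers[stripe]) < CAPACITY:
            buffers[stripe].append(key)        # best effort: dropped when full
        full = len(buffers[stripe]) >= CAPACITY
    if full:
        drain(stripe)                          # in Caffeine this is scheduled asynchronously

def drain(stripe):
    with locks[stripe]:
        batch = buffers[stripe][:]
        buffers[stripe].clear()
    for key in batch:
        pass                                   # replay the recorded accesses against the policy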


In the case of a write a more traditional concurrent queue is used and every change schedules an immediate drain. While data loss is unacceptable, there are still ways to optimize the write buffer. Both types of buffers are written to by multiple threads but only consumed by a single one at a given time. This multiple producer / single consumer behavior allows for simpler, more efficient algorithms to be employed.

The buffers and fine grained writes introduce a race condition where operations for an entry may be recorded out of order. An insertion, read, update, and removal can be replayed in any order and if improperly handled the policy could retain dangling references. The solution to this is a state machine defining the lifecycle of an entry.


In benchmarks the cost of the buffers is relatively cheap and scales with the underlying hash table. Reads scale linearly with the number of CPUs at about 33% of the hash table’s throughput. Writes have a 10% penalty, but only because contention when updating the hash table is the dominant cost.


Conclusion

There are many pragmatic topics that have not been covered. They include tricks to minimize the memory overhead, testing techniques to retain quality as complexity grows, and ways to analyze performance to determine whether an optimization is worthwhile. These are areas that practitioners must keep an eye on, because once they are neglected it can be difficult to restore confidence in one’s own ability to manage the ensuing complexity.

The design and implementation of Caffeine is the result of numerous insights and the hard work of many contributors. Its evolution over the years wouldn’t have been possible without the help from the following people: Charles Fry, Adam Zell, Gil Einziger, Roy Friedman, Kevin Bourrillion, Bob Lee, Doug Lea, Josh Bloch, Bob Lane, Nitsan Wakart, Thomas Müeller, Dominic Tootell, Louis Wasserman, and Vladimir Blagojevic.

(via HighScalability.com)

ejabberd Massive Scalability: 1 Node — 2+ Million Concurrent Users

How Far Can You Push ejabberd?

From our experience, we all get the idea that ejabberd is massively scalable. However, we wanted to provide benchmark results and hard figures to demonstrate our outstanding performance level and give a baseline about what to expect in simple cases.

That’s how we ended up with the challenge of fitting a very large number of concurrent users on a single ejabberd node.

It turns out you can get very far with ejabberd.

Scenario and Platforms

Here is our benchmark scenario: the target was to reach 2,000,000 concurrent users, each with 18 contacts on the roster and a session lasting around 1 hour. The scenario involves 2.2M registered users, so almost all contacts are online at the peak load. Presence packets were therefore broadcast for those users, adding traffic on top of the packets for handling user connections and managing sessions. In that situation, the scenario produced 550 connections per second and thus 550 logins per second.

Database for authentication and roster storage was MySQL, running on the same node as ejabberd.

For the benchmark itself, we used Tsung, a tool dedicated to generating large loads to test server performance. We used a single large instance to generate the load.

Both ejabberd and the test platform were running on Amazon EC2 instances. ejabberd was running on a single node of instance type m4.10xlarge (40 vCPU, 160 GiB). Tsung instance was identical.

Regarding ejabberd software itself, the test was made with ejabberd Community Server version 16.01. This is the standard open source version that is widely available and widely used across the world.

The connections were not using TLS to make sure we were focusing on testing ejabberd itself and not openSSL performance.

Code snippets and comments regarding the Tsung scenario are available for download: tsung_snippets.md

Overall Benchmark Results


We managed to surpass the target: we support more than 2 million concurrent users on a single ejabberd node.

For XMPP servers, the main limitation to handle a massive number of online users is usually memory consumption. With proper tuning, we managed to handle the traffic with a memory footprint of 28KB per online user.

The 40 CPUs were almost evenly used, with the exception of the first core, which was handling all the network interrupts. It was more loaded by the operating system and thus less loaded by the Erlang VM.

In the process, we also optimized our XML parser, released now as Fast XML, a high-performance, memory efficient Expat-based Erlang and Elixir XML parser.

Detailed Results

ejabberd Performance


The benchmark shows that we reached 2 million concurrent users after one hour. We were logging in about 33k users per minute, producing session traffic of a bit more than 210k XMPP packets per minute (this includes the stanzas to do the SASL authentication, binding, roster retrieval, etc). The maximum number of concurrent users is reached shortly after the 2 million mark, by design in the scenario. At this point, we still connect new users but, as the first users start disconnecting, the number of concurrent users gets stable.

As we try to reproduce common client behavior, we set up Tsung to send “keepalive pings” on the connections. Since each session sends one such whitespace ping each minute, the number of these requests grows proportionally with the number of connected users. And while idle connections consume few resources on the server, it is important to note that at this scale they start to be noticeable. Once you have 2M users, you will be handling 33K/sec of such pings just from idle connections. They are not represented on the graphs, but are an important part of the real-life traffic we were generating.

ejabberd Health


At all times, ejabberd health was fine. Typically, when ejabberd is overloaded, TCP connection establishment time and authentication tend to grow to an unacceptable level. In our case, both of those operations performed very fast throughout the benchmark, in under 10 milliseconds. There were almost no errors (the rare occurrences are artefacts of the benchmark process).

Platform Behavior


Good health and performance are confirmed by the state of the platform. CPU and memory consumption were totally under control, as shown in the graph. CPU consumption stays far from system limits. Memory grows proportionally to the number of concurrent users.

We also need to mention that the CPU values are slightly overestimated as seen by the OS, since Erlang schedulers do a bit of busy waiting when they run out of work.

Challenge: The Hardest Part

The hardest part is definitely tuning the Linux system for ejabberd and for the benchmark tool, to overcome the default limitations. By default, Linux servers are not configured to allow you to handle, nor even generate, 2 million TCP sockets. It required quite a bit of network setup not to have problems with exhausted ports on the Tsung side.

On a similar topic, we worked with the Amazon server team, as we have been pushing the limits of their infrastructure like no one before. For example, we had to use a second Ethernet adapter with multiple IP addresses (2 x 15 IPs, spread across 2 NICs). It also helped a lot to use the latest Enhanced Networking drivers from Intel.

All in all, it was a very interesting process that helped make progress on Amazon Web Services by testing and tuning the platform itself.

What’s Next?

This benchmark was intended to demonstrate that ejabberd can scale to a large volume and serve as a baseline reference for more complex and full-featured platforms.

Next step is to keep on going with our benchmark and optimization iteration work. Our next target is to benchmark Multi-User Chat performance and Pubsub performance. The goal is to find the limits, optimize and demonstrate that massive internet scale can be done with these ejabberd components as well.

(via Blog.process-one.net)

Server-Side Architecture. Front-End Servers And Client-Side Random Load Balancing


Chapter by chapter Sergey Ignatchenko is putting together a wonderful book on the Development and Deployment of Massively Multiplayer Games, though it has much broader applicability than games. Here’s a recent chapter from his book.

Enter Front-End Servers

[Enter Juliet]
Hamlet:
Thou art as sweet as the sum of the sum of Romeo and his horse and his black cat! Speak thy mind!
[Exit Juliet]

— a sample program in Shakespeare Programming Language

 


Our Classical Deployment Architecture (especially if you do use FSMs) is not bad, and it will work, but there is still quite a bit of room for improvement for most of the games out there. More specifically, we can add another row of servers in front of the Game Servers, as shown on Fig VI.8:

[Fig VI.8: Front-End Servers added in front of the Game Servers]

As you see, compared to the Classical Deployment Architecture (see Fig VI.4 above) we’ve just added a row of Front-End Servers in front of our Game Servers. These additional Front-End Servers are intended to deal with all the communication stuff when it comes from the clients. All those pesky “whether the player is connected or not” questions (including keep-alive handling where applicable, see Chapter [[TODO]] for details on keep-alives), all that client-to-server encryption (if applicable), with all those keys etc., all those rather more-or-less strange reliable-UDP protocols (again, if applicable), and of course, routing messages between the clients and different Game Servers – all the communication with clients is handled here.

In addition, usually these Front-End servers store a copy of relevant Game Worlds when it is necessary, and are acting as “concentrators” for the game world updates; i.e. even if a Game Server has 100’000 people watching some game (like the final of some tournament or something), it will need to send updates only to a few Front-End servers, and the Front-End servers will take care of data distribution to all the 100’000 people. This ability comes in really handy when you have some kind of Big Final game, with thousands of people willing to watch it (and you don’t really need to make it a video broadcast, which is not-so-convenient for existing players and damn expensive, but you can do it right within your client). More on it below, and implementation of this “concentrator” paradigm is discussed in more detail in Chapter [[TODO]].

We’ll discuss the implementation of our Front-End servers a bit later, but for now let’s note that most importantly,

Front-End Servers MUST Be Easily Replaceable Without Significant Inconveniences To Players

That is, if any of Front-End Servers fails for whatever reason – the most a player should see, is a disconnect for a few seconds. While still disruptive, it is very much better than scenarios such as “the whole game world went down and we need to restore it from backup”. In other words, whenever a Front-End server crashes for whatever reason, all the clients who were connected there need to detect the crash (or even worse, “black hole”) and automagically reconnect to some other Front-End server; in this case all the player can see, is a momentary disconnect (which is also a nuisance, but is much better than to see your game hang).

Front-End Servers: Benefits

Whenever we’re adding another layer of complexity, there is always a question “Do we really need it?” From what I’ve seen, having easily replaceable Front-End Servers in front of your Game Servers is very valuable and provides quite a few benefits. More specifically:

  • Front-End Servers take some load off your Game Servers, while being easily replaceable
    • it means that you can have fewer Game Servers
      • this, combined with the observation that Front-End Servers are easily replaceable, means that you improve reliability of your site as a whole; instances when some of your Game World Servers go down, will occur more rarely (!)
    • having a copy of relevant game world(s) on your Front-End Servers takes even more load off your Game Servers, and makes Game Server load independent of the number of observers
  • you can use really cheap boxes for your Front-End Servers; strictly speaking, you don’t even need ECC and RAID for them (and you certainly do need them for your Game Servers). As noted above, Front-End Servers are easily replaceable, so if one goes down – its load is automagically redistributed among the others (see Chapter [[TODO]] for further details). If you’re going to deploy into the cloud – you may want to consider cheaper offers for your front-end servers (even if they’re coming from a different CSP).1
  • they allow your client to have a single connection point to the whole site; benefits of this approach include better control over player’s “last mile” so that priorities between different data streams can be controlled, eliminating difficult-to-analyze “partial connections”, and hiding more implementation details of your site from the hostile world outside; more on single client connection in Chapter [[TODO]]
  • they allow for trivial client-side load balancing (no hardware load balancers needed, etc. etc.), more discussion on the load balancing below in “On Client-Side Load Balancing and Law of Big Numbers” section below
  • having a copy of relevant game world(s) on your Front-End Servers allows for a virtually unlimited number of observers who want to watch some of the games being played on your site (such as a Big Final or something2). Best of all, this will happen without affecting game server’s performance (!). Moreover, usually you won’t need to organize anything for your Big Final; the system (if built properly) can take care of it itself, in (roughly) the following manner:
    • whenever somebody comes to watch a certain game, his client requests this game from the Front End Server
    • if the Front End Server doesn’t have a copy of the requested game, it requests it from the relevant Game Server, along with updates to the game world state
    • from this point on, Front End Server will keep an “in-sync” copy of the game world, providing it (with updates) to all the clients which have requested it
    • it means that from this point on, even if you have 100’000 observers watching some game on this Game Server, all the additional load is handled by your Front-End Servers, without affecting your Game Server
    • for further details, see Chapter [[TODO]].
  • Front-End Servers allow for better security later on (acting essentially as a kind of DMZ, see Chapter [[TODO]] for details).

1 keep in mind that you still need top-notch connectivity

2 and as Big Finals are a good way to attract attention, this does provide you an edge over your competitors, etc. etc.

 

Front-End Servers: Latencies And Inter-Player Latency Differences

 As for the negative side of having Front End Servers, I can think only of two such drawbacks. The first one is additional latency introduced by your Front End Server. More specifically, we’re speaking about the time which is necessary for the packet incoming from a client at application layer, to get processed by your Front End Server, to go into TCP stack on Front End Server side,3 to get out of TCP stack on Game Server side, and to reach application layer in your Game Server (plus the time necessary to go in the opposite direction).

Let’s take a look at this additional latency. From my experience, if you’re using a reasonably good communication layer library, you can have processing time of your Front End server application-layer of the order of single-digit microseconds.4 Then, we have an end-to-end TCP connection from your Front End Server to your Game Server; latencies of such a connection (over 10 Gigabit Ethernet) have been measured at around 8 µs [Larsen2007]. Adding these two delays together and multiplying it by two to get RTT, would mean that we’re still staying well below 100 µs. However, there are some further considerations (such as switch delays, differences between different operating systems, differences between games, etc.) which make me uncomfortable to say that you will have no problem achieving 100 µs delay (i.e. either you may, or you may not). On the other hand, I am ready to say that if you’re careful enough with your implementation, reducing the delay introduced by Front-End Servers down to 1 ms is achievable in all but the most weird cases.

To summarize:

  • if additional latency of around 1 millisecond is ok for you – don’t worry about additional latencies and go for Front-End Servers; this certainly covers all genres with the only potential exception being MMOFPS
  • if additional latency you can live with, is well below 1 millisecond (which is difficult for me to imagine as it is still over an order of magnitude less than 1/60 sec frame update time, but in the MMOFPS world pretty much anything can happen) – think about it a bit more and try to find out (ideally – experimentally) what kind of latency you can achieve in practice; if your experiments show that latencies are indeed unacceptable, you MIGHT need to drop those Front-End Servers because of the latency they’re introducing5
  • YMMV, no warranties of any kind, batteries not included

The second (IMHO more theoretical, but as usual, YMMV) potential issue with having Front-End Servers would arise if some of your Front-End Servers are overloaded (or they’re running using significantly different hardware), so those players connected to less-loaded Front-End Servers, will have lower latencies, and therefore will have an advantage.

 On the one hand, I didn’t see situations where it makes any practical difference in real-world deployments (i.e. as I’ve seen it, if some of the Front-End Servers are overloaded, it means that most of the other ones are already at 90%+ of capacity, which you should avoid anyway; see [[TODO!]] section for further discussion of load balancing). On the other hand, YMMV and in theory you might get hit by such an effect (though I certainly don’t see it coming into play for anything but MMOFPS).

If such inter-player latency differences become the case (and only when/if it becomes a real problem), you MAY need to implement some kind of affinity for players of certain Game Worlds to certain Front-End servers (more on affinity in “On Affinity” section below). However, keep in mind that large-scale affinity tends to remove most of the benefits provided by Front-End Servers, so if you feel that you’re going to implement affinity for each-and-every-game – you’ll probably be better without Front-End Servers (implementing affinity only for a small percentage of your games, such as “high profile tournaments” will cause less trouble, see “On Affinity” section below for further discussion).


3 yes, I’m arguing for TCP connections for inter-server communications in most cases, see “On Inter-Server Communication” section above. On the other hand, UDP is also possible if you really really prefer it.

4 note that this might become a non-trivial exercise, see further discussion in Chapter [[TODO]]. On the other hand, I’ve done it myself.

5 in theory, you may also want to experiment with something like Infiniband, which BTW would fit nicely in overall QnFSM architecture with communications neatly isolated from the rest of the code, but most likely it won’t be worth the trouble

Client-Side Random Balancing And Law Of Big Numbers

As soon as you have several Front-End servers where your clients are coming, you have a question “how to ensure that all the Front-End Servers are loaded equally”, i.e. a typical load balancing question. Load balancing in general has been quite a big topic for at least the last 20 years. The three most common techniques out there are the following: DNS Round-Robin, Client-Side Random Balancing, and Server-Side (usually hardware-based) Load Balancers. With the industry producing those hardware boxes behind the last one, it is no wonder that it has become more and more popular, at least in the enterprise world. Still, let’s take a closer look at these load balancing solutions.

DNS Round-Robin

DNS round-robin is based on traditional DNS requests. Whenever a client requests that the address frontend.yoursite.com be resolved into an IP address, a DNS request is sent (this stands with or without DNS round-robin) to your (or “your DNS provider’s”) DNS server. If the DNS server is configured for DNS round-robin, it returns different IP addresses to different DNS requests, in a round-robin fashion,6 hence the name.

DNS Round-Robin, when applied to balancing browsers across different web servers, has two major disadvantages. First of all, there is a problem with caching DNS servers along the path of the request (which is a very standard part of DNS handling). That is, even if your server is faithfully returning all your IPs in a round robin fashion, one of these returned IPs can get cached by a Big Fat DNS server (think Comcast or AT&T), and then get distributed to many thousands of clients; in this case distribution of your clients across your servers will be skewed towards that “lucky” IP which got cached by the Big Fat DNS server :-( . The second problem with using DNS round-robin for web servers is that if one of your servers is down, the usual web browser won’t try another server on the list, so in the web server realm round-robin DNS usually doesn’t provide server fault tolerance.

Fortunately, as we DO have a client, we can solve both these problems very easily. Moreover, these techniques will also work for your browser-based games (that is, after you’ve got your JS loaded and it started execution).


6 strictly speaking, it is a little bit more complicated than that, as DNS packets contain a list of servers, but as virtually everybody out there ignores all the entries in returned packet except for the very first one, it is more or less equivalent to returning only one IP per request – that is, unless you have your own client which can do the choice itself, see “Client-Side Balancing”

 

Client-Side Random Balancing

To improve on DNS round-robin, a very simple idea can be used. We won’t rotate anything on the server side; instead, we will distribute exactly the same list of servers to all the clients. This list may be hardcoded into your clients (and that’s what I’ve used personally with big success), or the list can be distributed via DNS as a simple list of IPs for the desired name (and retrieved on the client via getaddrinfo() or equivalent). Which way to prefer – doesn’t matter to us now, but we’ll discuss relevant issues in Chapter [[TODO]].

As soon as the client gets the list of IPs, everything is very simple. The client simply takes a random item from the IP list and tries connecting to this randomly chosen IP. If the connection attempt is unsuccessful (or the connection is lost, etc.), the client picks another random item from the list and tries connecting again.

One note of caution: while you don't really need a cryptographic-quality random number generator to choose the IP from the list, you DO want to avoid situations where your random number generator (the one used for this purpose) is essentially just a function of coarse-grained time. One Really Bad example would be something like:

#include <cstdlib>  // srand(), rand()
#include <ctime>    // time()

int myrand() { // DON'T DO THIS!
  srand(time(0)); // re-seeded with whole-second time on EVERY call
  return rand();
}

In such a case, if you get a mass disconnect (and as a result all your players attempt to reconnect at about the same time), your IP distribution will likely get skewed due to too few differences between the clients trying to get their IP addresses; if all the clients attempt to connect within 5 seconds, with such a bad myrand() function you'll get at most 5 different IPs (fewer if you're unlucky). Other than such extremely bad cases, pretty much any RNG should be fine for this purpose. Even a trivial linear congruential generator, seeded with time(0) at the moment the program was launched (and NOT at the moment of the request, as in the example above), should do in practice, though adding some kind of milliseconds or other random-looking or client-specific data to the mix is advisable "just in case".
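
As a minimal illustration of the advice above (C++11; try_connect() is a hypothetical helper returning a connected socket or -1, and the list is assumed to be non-empty), a client-side picker with a properly seeded generator might look like this:

#include <chrono>
#include <random>
#include <string>
#include <vector>

// seeded ONCE at program start, NOT per request; mixing in a
// non-deterministic source avoids the coarse-grained-time trap above
static std::mt19937 g_rng(std::random_device{}() ^
    static_cast<unsigned>(std::chrono::high_resolution_clock::now()
                              .time_since_epoch().count()));

int try_connect(const std::string& ip);  // hypothetical: returns socket fd, or -1

int connect_to_random_frontend(const std::vector<std::string>& ip_list) {
  std::uniform_int_distribution<size_t> pick(0, ip_list.size() - 1);
  for (;;) {  // keep picking random entries until one of them answers
    int sock = try_connect(ip_list[pick(g_rng)]);
    if (sock >= 0)
      return sock;
    // a production version would also want a retry limit and/or back-off here
  }
}

Note that the same loop also provides the fault tolerance discussed below: if one server is down, the client just picks another one.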

Law of Large Numbers. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. — Wikipedia —

Client-Side Random Balancing: The Law Of Large Numbers, And Comparison With DNS Round-Robin

Unlike DNS round-robin (which in theory provides "ideal" balancing), client-side random balancing relies on the statistical Law of Large Numbers to achieve a flat distribution of clients between the servers. What the law basically says is that for independent measurements, the more experiments you perform, the flatter the distribution you get. More specifically, if N clients each pick one of k servers uniformly at random, the number of clients landing on any given server follows a binomial distribution with mean N/k and relative standard deviation sqrt((k-1)/N); for example, with N = 100'000 clients and k = 10 servers, the per-server load typically stays within about 1% of the average.

In practice, despite being "non-ideal" in theory, client-side random balancing achieves a much flatter distribution than DNS round-robin. The reason for it is two-fold. First, as soon as the number of clients is large (hundreds and up), client-side random balancing becomes sufficiently flat for practical purposes (and if your system is provisioned for thousands of players and only a few have come yet, the distribution won't be too flat, but the inequality involved won't be able to hurt you, and the balance will improve as the number grows). Second, client-side random balancing doesn't suffer from the DNS caching issue described above. Even if you're using DNS to distribute the IP lists (and this list gets cached) – with client-side balancing all the IP lists circulating in the system are identical by design, so caching (unlike with DNS round-robin) doesn't change the client distribution at all.
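
If you want to convince yourself how quickly this flattening happens, a quick back-of-the-envelope simulation (plain C++11; the client and server counts are picked arbitrarily) can go along these lines; with 100'000 clients and 10 servers it typically reports every server within a couple of percent of the average:

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
  const size_t kServers = 10;
  std::mt19937 rng(std::random_device{}());
  std::uniform_int_distribution<size_t> pick(0, kServers - 1);

  for (size_t clients : {1000u, 10000u, 100000u}) {
    std::vector<size_t> load(kServers, 0);
    for (size_t i = 0; i < clients; ++i)
      ++load[pick(rng)];  // each simulated client picks a server at random
    double avg = double(clients) / kServers;
    printf("%zu clients: min %.1f%%, max %.1f%% of average load\n", clients,
           100.0 * *std::min_element(load.begin(), load.end()) / avg,
           100.0 * *std::max_element(load.begin(), load.end()) / avg);
  }
  return 0;
}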

To summarize: personally, I would be very cautious about using DNS Round-Robin for production load balancing. On the other hand, I've seen Client-Side Random Balancing work extremely well for a game which grew from a few hundred simultaneous players into hundreds of thousands; it worked without any problems whatsoever, providing almost-perfect balancing all the time. That is, if the average load across the board was 50%, you could find some servers at 48% and some at 52%, but not more than that.7

As for the second disadvantage mentioned above for DNS Round-Robin as applied to web browsers (the inability of most browsers to provide fault tolerance when one of the servers crashes) – it evaporates as soon as we have the whole list on the client side, can detect the failure, and can select another item from the list.


7 this, of course, stands only when you have run your servers identically for sufficient time; if one of the servers has just entered service, it will take some hours until it reaches the same load level as the others. If really necessary, this effect can be mitigated, though the mitigation is rather ugly and I've never seen it necessary in practice

 Server-Side Load Balancers

An approach which is very different from both round-robin DNS and client-side random balancing is to use server-side load balancers. A load balancer is usually an additional box, sitting in front of your servers and doing, as advertised, load balancing.


Server-side load balancers do have significantly more balancing capabilities with regard to scenarios where different clients cause very different loads (so that server-side balancers can work even if the Law of Large Numbers no longer applies). However, on the one hand, these additional balancing capabilities are usually completely unnecessary for games (where the Law of Large Numbers tends to stand very firmly), and on the other hand, such load balancer boxes tend to be damn expensive (double that if you want redundancy, and you certainly want it), they do not allow inter-datacenter balancing and fault tolerance (by design), and they introduce additional not-so-well-controlled latencies.8

Oh, and BTW – when speaking about redundancy and the cost of their boxes, quite a few hardware manufacturers will tell you "hey, you can use our balancer in active/active configuration, so you won't waste anything!". Well, while you can indeed use many server-side load balancers in active/active configuration, you still MUST have at least one redundant box's worth of spare capacity to handle the load if one of those boxes fails. In other words, if all you have is two boxes in active/active configuration, then while both are working, the load on each of them MUST stay well below 50% of its capacity (so that the surviving box can take over the full load); there is no way around this if you want redundancy.

As a result of all the considerations above, for game load-balancing purposes I have never seen any practical uses for server-side load balancer boxes (as always, YMMV and batteries are not included). Even if you’re using Web-Based Deployment Architecture (in the way described above), you should be able to stay away from them (though YMMV even more).


8 most load balancers are designed to balance web sites, where anything below 100ms is pretty much nothing; so at the very least, make sure to discuss and measure the latency (under your kind of load!) before buying such a box

Balancing Summary

From my experience, client-side random balancing (aimed at front-end servers) worked really well, and I've never seen any reason to use something different. Round-robin DNS is almost universally inferior to client-side balancing, and hardware-based server-side balancers are too complicated and expensive, usually without any real reason to use them in a gaming environment. As noted above, the one exception where you MAY need server-side balancers is if you're using a Web-Based Deployment Architecture.

CDN. A content delivery network or content distribution network (CDN) is a globally distributed network of proxy servers deployed in multiple data centers. — Wikipedia —

One last word about load balancing: it is possible to use more than one of the methods listed here (and it might even work for you); however, the implications of such combined use of more than one load balancing method are way too convoluted to discuss in this book.

Front-End Servers As A CDN

It is possible to use Front-End Servers as a kind of CDN (or even use them to build your own CDN). Even if you’re running all your Game Servers from one single datacenter, for certain kinds of games it might be a good idea to have your Front-End Servers sitting in different datacenters (and acting as different “entry points” to your clients), as shown on Fig VI.9:

[Fig. VI.9]

 

The idea here is pretty much like the one behind a classical CDN: to reduce latencies for end-users. On the other hand, we need to note that

unlike a classical CDN, the content in our game-sorta-CDN is not static, so a gain in latency is possible only because of better peering, with gains usually being in the single-digit milliseconds

There is still a different reason to use such deployment architectures – in case you want to protect yourself from the Internet connectivity in your primary datacenter going down (provided that "Some Connectivity" survives); in practice, if you have a decent datacenter, it should never happen. More precisely – your datacenter WILL occasionally experience transient faults around 1.5-2 minutes long (typical BGP convergence time), so if you're looking for excuses to use the nice diagram in Fig VI.9, and your client can detect the fault and redirect to a different datacenter significantly faster than that, it MAY make some difference to your players.

Implementation-wise, there are several considerations for such CDN-like multi-datacenter Front-End Server configurations:

  • you MUST have very good connectivity between your data centers (“some connectivity” on Fig. VI.9). At the very least, you should have inter-ISP peering explicitly set by both of your ISPs (to each other) to ensure the best data flow for this critical path
    • strictly speaking, “some connectivity” does not necessarily need to be Internet-based; you often can save additional few milliseconds by getting something like “dedicated” Frame-Relay between your datacenters, but this will likely cost you in the range of tens of thousands per month :-( .
  • traffic on “some connectivity” can be an order (or even two) of magnitude lower than that going to the clients due to Front-End Servers acting as “concentrators”
  • you SHOULD account for the secondary datacenter going down (in particular, in case of the inter-datacenter connectivity going down). The simplest way to deal with it is to have enough capacity in your primary datacenter (both traffic-wise and CPU-wise) to handle all of your clients, but this tends to be expensive. As an alternative, shutting down some activities in case of such a failure may be possible, depending on the specifics of your game.

Bottom line for CDN-like arrangements. CDN-like arrangements of Front-End Servers may save some of your players a few milliseconds in latency (that is, if you have a really good connection between datacenters), which in turn may allow you to level the playing field a bit with regard to latency. From my experience, it was hardly worth the trouble (because you cannot really improve MUCH in terms of latency, as the packets still need to go all the way to the Game Server and back), but keep the possibility in mind. For example, it may come in handy in some really strange scenarios where you're legally required to keep your game servers in a strange location (hey casino guys!) where you simply don't have enough bandwidth to serve your clients directly.

Front-End Servers + Game Servers As A Kinda-CDN

On the other hand, if you’re really concerned about latencies, it is usually much better to bring your Game World Servers closer to players (while leaving DB Server behind), as shown on Fig VI.10:

[Fig. VI.10]

Here, we’re moving the most time-critical stuff (which is usually your Game World Servers) towards the end-user, providing significantly better latencies to those players who’re in the vicinity of corresponding datacenter. Maintaining such infrastructure is quite a Big Headache, but is doable, so if you’re really concerned about latencies – you may want to deploy in such a manner. A word of caution – if going this way, you will end up with “regional servers”, which have their own share of troubles (you’ll need to ensure that clients in the region go only to the relevant Front-End Servers, security on inter-datacenter connections becomes quite an issue, etc., etc.); once again – it is doable, but go this way only if youreally need it.

On Affinity

In some cases, you may decide that you need to have a kind of “affinity” so that some specific players (usually those playing in a specific game world) are coming to specific Front-End Servers.


 Note when we’re speaking about our Front-End Servers, “affinity” is quite different from classical affinity (usually referred to as “persistence” or “stickiness”) used on load balancers for web servers. In the web world persistence/stickiness is about having the same client coming to the same server (to deal with sessions and per-client caches). For our Front-End Servers, however, affinity has a very different motivation, and is usually about Front-End-Server-to-game-world affinity (for players or for players+observers) rather than client-to-server affinity (see “Front-End Servers: Latencies and Inter-Player Latency Differences” section above for one reason where you MIGHT need such affinity).

Technically, implementing Front-End-Server-to-game-world affinity is not that difficult, but the real problems will start after you deploy it. In short – things will go smoothly as long as the number of game worlds which use affinity is small. On the other hand, as soon as a significant chunk of your players is connected using the affinity rules, you will find that achieving a reasonable load balance between different Front-End Servers becomes difficult :-( . When there is no affinity, the balance is near-perfect just because of the Law of Large Numbers; as you introduce affinity rules, you start to skew this near-perfectly-flat distribution, and the more players are affected by affinity, the more you deviate from the ideal distribution, so managing those rules while achieving load balance can become a Big Fat Challenge.

Bottom line: avoid affinity as long as possible (and most likely you will be able to get away without it).

Front-End Servers: Implementation

Now let’s discuss ways how our Front-End Servers can be implemented. As mentioned above, the key property of our Front-End Servers is that they’re easily replaceable in case of failure. To achieve this behavior,

you MUST ensure that there is NO original game-world state on any of your Front-End Servers. In other words, Front-End Servers should have only a replica of the original game-world state, with the original game-world state kept by Game Servers

There is no need to worry too much about it if you're using generic publisher/subscriber (or state replication) kind of stuff, but be extremely careful if you're introducing any custom logic to your Front-End Servers, because you may lose the all-important "easily replaceable" property above. See Chapter [[TODO]] for further discussion of this potential issue.
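
Just as an illustration (all names here are hypothetical; the real mechanics belong to the publisher/subscriber machinery mentioned above), the per-game-world state kept on a Front-End Server might look along these lines – note that it is nothing but a replica which a replacement server can always rebuild by re-subscribing:

#include <cstdint>
#include <vector>

// hypothetical replica of one game world, as cached on a Front-End Server;
// it is ONLY ever modified by applying updates published by a Game Server,
// so losing this box loses nothing that cannot be restored by re-subscribing
struct GameWorldReplica {
  uint64_t world_id = 0;
  uint32_t last_applied_seq = 0;   // sequence number of the last applied update
  std::vector<uint8_t> state;      // opaque replicated game-world state

  void apply_published_update(uint32_t seq, const std::vector<uint8_t>& delta) {
    // merging published deltas is fine; ORIGINATING game-world changes here
    // is not, as it would destroy the "easily replaceable" property
    last_applied_seq = seq;
    state.insert(state.end(), delta.begin(), delta.end());  // placeholder "merge"
  }
};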

Front-End Servers: QnFSM Implementation

One implementation of a Front-End Server under the pure Queues-and-FSMs architecture (see Chapter V for details on QnFSM, state machines, and queues) is shown in Fig VI.11:

[Fig. VI.11]

 

Here, we have TCP- and UDP-related threads similar to those described for Game Servers in the "Implementing Game Servers under QnFSM architecture" section above, plus one or more Routing&Data Threads (with at least one Routing&Data FSM each), which are responsible for routing all the packets and for caching data (such as "game world" data). Let's discuss these routing-related FSMs in a bit more detail.

Routing&Data FSMs. Each Routing&Data FSM has its own data that it handles (and updates if applicable). For example, one such Routing&Data FSM may contain the state of one game world. Other Routing&Data FSMs may handle routing of point-to-point packets from players to (and from) one specific Game Server. Further details of the data types handled by Routing&Data FSMs will be discussed in Chapter [[TODO]], but generally there will be three different types of Routing&Data FSMs (a minimal sketch follows right after the list):

  • generic connection handlers (to handle point-to-point communications including player input and server-to-server connections)
  • generic publisher/subscriber handlers (to cache and handle generic but structured data such as a list of available games, if players are allowed to select the game)
  • specific game world handlers (to cache and handle game world data if the required functionality doesn’t fit into generic handler). In many cases you’ll be able to live without specific game world handlers, but if you want to implement some kind of server-side filtering, like server-side fog-of-war to avoid sending data to those players who shouldn’t see it (so no hack of the client can possibly lift fog-of-war) – specific game world handlers become a necessity.
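
To make this a bit more tangible, here is a minimal sketch (C++; all names are hypothetical, and the Queue plumbing from Chapter V is omitted) of how these three flavors of Routing&Data FSMs might be declared:

#include <cstdint>
#include <vector>

// hypothetical QnFSM-style message: addressing info plus an opaque payload
struct Message {
  uint64_t target_id;              // e.g. connection ID or game-world ID
  std::vector<uint8_t> payload;
};

// base class for all Routing&Data FSMs: strictly event-driven,
// with no state shared with any other FSM or thread
class RoutingDataFSM {
public:
  virtual ~RoutingDataFSM() = default;
  virtual void process_event(const Message& msg) = 0;  // the only entry point
};

// the three flavors described in the list above
struct ConnectionHandlerFSM : RoutingDataFSM {  // point-to-point routing
  void process_event(const Message& msg) override { (void)msg; /* forward to the right Queue */ }
};
struct PubSubHandlerFSM : RoutingDataFSM {      // generic publisher/subscriber cache
  void process_event(const Message& msg) override { (void)msg; /* update cache, fan out to subscribers */ }
};
struct GameWorldHandlerFSM : RoutingDataFSM {   // game-specific, e.g. server-side fog-of-war filtering
  void process_event(const Message& msg) override { (void)msg; /* filter per player, then fan out */ }
};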

BB_emotion_0023

It is possible (and often advisable) to have more than one Routing&Data FSM within single Routing&Data Thread

It is possible (and often advisable) to have more than one Routing&Data FSM within a single Routing&Data Thread to reduce the unnecessary load caused by an exceedingly high number of threads (and unnecessary thread context switches). How to combine those Routing&Data FSMs into specific threads depends on your game significantly, but usually generic connection handlers are extremely fast, and all of them can be combined into one thread. As for generic publisher/subscriber and specific game world handlers, their distribution into different threads should take into account the typical load and allowed latencies. The rule of thumb is (as usual) the following: the more FSMs per thread – the more latency and the less thread-related overhead you get; unfortunately, the rest depends too much on the specifics of your game to discuss here.

Routing&Data Factory Thread. The Routing&Data Factory Thread is responsible for creating Routing&Data Threads (and Routing&Data FSMs), according to requests coming from TCP/UDP threads. A typical life cycle of a Routing&Data FSM may look as follows (a sketch of the messages involved is shown after the list):

  • One of the TCP/UDP FSMs needs to route some message (or to provide synchronization of some state), and realizes that the Routing&Data FSM it needs to route the message to is not in its own cache.
  • TCP/UDP FSM sends a request to Routing&Data Factory FSM
  • Factory FSM creates Routing&Data Thread (with an appropriate Routing&Data FSM)
  • Factory FSM reports ID of the Queue, where the messages towards appropriate Routing&Data FSM should be sent, back to the requesting TCP/UDP Thread
    • TCP/UDP FSM (the one mentioned above) sends the message to the appropriate Queue (using ID rather than pointer to enable deterministic “recording”/”replay”, see Chapter V for details).
  • Whenever the TCP/UDP FSM no longer needs the Routing&Data FSM, it reports this to the Factory FSM
    • if it was the last TCP/UDP FSM needing this Routing&Data FSM, the Factory FSM may instruct the appropriate Routing&Data Thread to destroy it
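
As a sketch of what the messages in this life cycle might look like (again, all names and field choices here are hypothetical; the only essential point is that Queues are addressed by IDs rather than by pointers):

#include <cstdint>

// however Chapter V chooses to identify a Queue; an ID, never a raw pointer,
// to keep deterministic "recording"/"replay" possible
using QueueID = uint64_t;

// TCP/UDP FSM -> Factory FSM: "I need a Routing&Data FSM for this game world"
struct CreateRoutingFSMRequest {
  uint64_t world_id;     // the game world (or Game Server connection) to be routed
  QueueID  reply_to;     // where the Factory FSM should send its reply
};

// Factory FSM -> TCP/UDP FSM: "send messages for that world to this Queue"
struct CreateRoutingFSMReply {
  uint64_t world_id;
  QueueID  routing_queue;
};

// TCP/UDP FSM -> Factory FSM: "I no longer need that Routing&Data FSM"
struct ReleaseRoutingFSMRequest {
  uint64_t world_id;
  QueueID  reply_to;
};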

Routing&Data FSMs In Game Servers And Clients

I need to confess that personally I am positively in love with these Routing&Data FSMs. I love them so much that I usually have them not only on Front-End Servers, but also on Game Servers, and on Clients too; while they're not strictly necessary there (and are not shown on the appropriate diagrams to avoid unnecessary clutter), they did help me to simplify things quite a bit, making all the communications very uniform. Still, it is pretty much your choice whether you want to have Routing&Data stuff on your Game Servers and/or Clients.

Front-End Servers Summary


To summarize the section on Front-End Servers:

  • As a rule of thumb, Front-End Servers are a Good Thing™. In particular:
    • they take the load off your Game Servers
      • which often makes the system cheaper (as Front-End Servers are cheap)
      • and also improves overall system reliability (as Front-End Servers are easily replaceable)
    • they facilitate single client connection (which is generally a good thing to have, see Chapter [[TODO]] for further discussion)
    • they facilitate client-side load balancing
    • they allow you to handle 100’000+ observers for your Big Event easily (actually, the sky is the limit)
    • their drawbacks are pretty much limited to the additional latency, and this additional latency is firmly in sub-millisecond range
  • Client-side load balancing is usually the best option for games
    • one potential exception is Web-Based Deployment Architectures, where you MAY need server-side balancers
    • large-scale affinity is to be avoided
  • CDN-like arrangements are possible, but not without caveats
  • Front-End Servers can (and IMHO SHOULD) be implemented in QnFSM architecture, as described above

(via Highscalability.com)