Cloud Bursting - Mikelangelo - Horizon 2020 Project on Virtualization, Cloud Computing, and HPC

This use case will showcase how MIKELANGELO’s software stack can address the problem of cloud bursting. A cloud burst is a web phenomenon in which many users suddenly start using a service and outstrip its capacity. We will leverage sKVM, OSv, and our integration with cloud middleware to implement cloud elasticity that reacts faster than current cloud stacks. To validate our approach, we will implement a demonstration using memcached, a ubiquitous stateful service.

Cloud bursts occur when a much larger number of users than usual suddenly request a service. For example, a positive review of a startup’s fledgling service by a popular technology website often leads to a sudden surge in users. Such a surge causes a sharp rise in service requests, which in turn quickly outstrips the service’s resources. This effect is also known as the “Slashdot effect”, named after the popular tech website Slashdot. Social media, with their complex user dynamics, are another source of cloud bursts. Whereas websites can roll out the review of a new service progressively to spare the service, social media cannot be controlled in this way. Instead, social media develop their own dynamics, which frequently lead to surges. Further sources of cloud bursts include DDoS attacks, popular events, and media consumption.

Companies that provide a new service rely crucially on handling cloud bursts well to build a good reputation for the service. Especially for startups, handling a cloud burst gracefully can be a make-or-break situation.

Before the advent of cloud computing, service providers could not adapt their capacity to large changes in workload. To preempt service collapses under high demand, IT infrastructure was usually provisioned with excess capacity. With cloud computing, and specifically its elasticity, service providers can in theory react quickly to arbitrarily large changes in demand. In practice, however, current cloud stacks react too slowly to cope with cloud bursts. During a cloud burst, network and storage resources are strained to capacity, and VMs cannot be booted quickly enough. It is especially hard to distribute state to new service instances during a cloud burst. Coping with cloud bursts therefore requires solving three problems: fast VM provisioning, short boot times, and efficient state distribution.

As an example of current state-of-the-art elasticity, we refer to a case study with Google Compute Engine. This case study describes how Google Compute Engine is able to start up 300 VMs to serve one million Cassandra write requests per second. It took the infrastructure 70 minutes to provision the whole setup from scratch. During a cloud burst, 70 minutes is enough time to bring down a service and to ruin a startup’s reputation.

We propose to use MIKELANGELO’s whole software stack to cope with realistic cloud bursts. In this use case, we intend to use sKVM, OSv, and the cloud middleware integration. At the bottom of the architecture, sKVM will ensure that I/O performance exceeds the current state of the art. Fast I/O will enable the cloud to distribute state to newly spawned service instances without wasting many CPU cycles on virtualized I/O. Additionally, fast I/O will increase the maximum I/O request rate of each service instance. OSv plays a central role in this use case, since it allows quick large-scale deployment, fast boot times, and an optimized distribution of state. OSv can easily be deployed at large scale because of its minimal image size. OSv boots quickly by design, as discussed in section X. Our tight integration of OSv with the cloud middleware and applications allows us to distribute service state efficiently to new instances. Furthermore, the integration with the cloud middleware will monitor for bursts and provision services as required.
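To illustrate what monitoring for bursts and provisioning as required could look like, the following is a minimal sketch and not MIKELANGELO’s actual middleware logic. It assumes that the middleware receives periodic request-rate samples and flags a burst when a sample exceeds a multiple of a slowly moving baseline. The class `BurstDetector`, the function `instances_needed`, and the parameters `factor` and `alpha` are hypothetical names and values chosen for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class BurstDetector:
    """Flags a burst when the request rate exceeds a multiple of a baseline.

    Hypothetical sketch: names and default thresholds are assumptions.
    """
    factor: float = 3.0               # rate > factor * baseline counts as a burst
    alpha: float = 0.1                # smoothing weight for the moving baseline
    baseline: Optional[float] = None  # learned from normal traffic

    def observe(self, rate: float) -> bool:
        """Feed one sample (requests/s); return True if it looks like a burst."""
        if self.baseline is None:
            # The first sample only initializes the baseline.
            self.baseline = rate
            return False
        is_burst = rate > self.factor * self.baseline
        if not is_burst:
            # Update the baseline only with normal traffic, so that a burst
            # does not inflate the reference it is compared against.
            self.baseline = (1 - self.alpha) * self.baseline + self.alpha * rate
        return is_burst


def instances_needed(rate: float, per_instance_capacity: float) -> int:
    """Instances required to serve `rate`, rounded up, and at least one."""
    return max(1, math.ceil(rate / per_instance_capacity))


if __name__ == "__main__":
    detector = BurstDetector()
    for rate in [100, 110, 90, 105, 400, 120]:
        if detector.observe(rate):
            # Assumed capacity of 150 requests/s per instance.
            print("burst at", rate, "req/s, need", instances_needed(rate, 150), "instances")
```

In a real deployment, the provisioning step would call into the cloud middleware instead of printing, and the thresholds would have to be tuned against measured traffic.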

We will verify the cloud bursting scenario with a scalable, stateful service related to web hosting. Potential candidates for our validation scenario are memcached, redis, and Cassandra. As part of the verification scenario, we will set up a cluster to generate load against the service under test. Furthermore, to recognize cloud bursts early and to react properly, we will extend the cloud middleware for this use case. To verify the benefits of MIKELANGELO in cloud bursting, we will perform the same cloud bursting scenario twice. First, to provide a baseline, we will use a standard public cloud, such as Amazon EC2, with the given setup and a suitable guest OS other than OSv. Second, we will run the same experiment on a cloud with MIKELANGELO’s cloud-bursting-ready stack.

We expect to beat the deployment time and the request rate described in the case study using Google Compute Engine. Instead of a deployment time of 70 minutes, we target deployment times in the range of tens of seconds to a few minutes. We also expect MIKELANGELO’s stack to achieve the same request rate as the baseline scenario with significantly fewer VMs. We base these targets on sKVM’s I/O performance and on OSv’s small image size, zero-configuration approach, and small footprint. If successful, this use case will allow cloud customers to save costs through increased agility. Increased agility enables faster reaction with smaller safety margins on resource utilization. More efficient resource utilization, in turn, leads directly to lower costs in typical cloud pricing models.
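The link between provisioning time and safety margin can be made concrete with a back-of-the-envelope sketch. It assumes, purely for illustration, that demand grows linearly during a burst, so the spare capacity needed is the growth rate multiplied by the provisioning time. The per-VM rate is derived naively by dividing the case study’s one million writes per second by its 300 VMs. The growth rate of 10 requests/s per second and the function names are hypothetical.

```python
import math


def required_headroom(growth_per_s: float, provisioning_s: float) -> float:
    """Spare capacity (requests/s) needed while new instances come up.

    Assumes demand grows linearly at `growth_per_s` requests/s per second.
    """
    return growth_per_s * provisioning_s


def spare_instances(headroom: float, per_instance_capacity: float) -> int:
    """Idle instances that must be kept running to provide `headroom`."""
    return math.ceil(headroom / per_instance_capacity)


# Naive per-VM rate from the case study: 1,000,000 writes/s on 300 VMs.
PER_VM_RATE = 1_000_000 / 300

# Hypothetical burst growth: 10 additional requests/s every second.
GROWTH = 10.0

slow = spare_instances(required_headroom(GROWTH, 70 * 60), PER_VM_RATE)  # 70 minutes
fast = spare_instances(required_headroom(GROWTH, 60), PER_VM_RATE)       # 1 minute
print(slow, fast)
```

Under these assumptions, a 70-minute provisioning time requires 13 idle VMs as a safety margin, whereas a one-minute provisioning time requires only one. The absolute numbers depend entirely on the assumed growth rate; the point is that the margin scales with provisioning time.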

In conclusion, in this use case we will use and extend MIKELANGELO’s stack to deal with cloud bursts. Cloud bursts are currently an unsolved problem in IT management, with potentially disastrous economic effects for service providers. The use case builds on MIKELANGELO’s whole stack, which we believe is well equipped to deal with cloud bursting. We will verify this hypothesis by comparing experimental results from our solution with those from state-of-the-art clouds. If successful, we believe that the results of this use case will find wide application in cloud computing.