This article introduces the architecture of the MIKELANGELO project, describing our main innovations. It is also the first article in the MIKELANGELO architecture series. The remaining articles dive deep into the individual innovations.
MIKELANGELO’s architecture builds on the experience of dozens of researchers across 9 renowned organisations to optimise VM performance.
Each article in our series covers one part in the architecture diagram below:
- the I/O-optimised hypervisor (sKVM),
- the lean guest OS (OSv),
- the I/O-aware cloud middleware (OpenStack),
- the virtualised batch system (Torque),
- the holistic monitoring system (still secret).
For each of those components we mention the limitations of the state of the art and how we intend to overcome those limitations.
Tuning the Hypervisor: sKVM
We call our hypervisor super-KVM (sKVM). It features three novel components that will push the state of the art: IOcm for fast I/O, vRDMA for fast inter-VM communication, and SCAM for security (yeah, really: SCAM!).
These three components tackle three serious limitations of KVM:
- low VM performance for I/O-heavy workloads,
- high I/O overhead for inter-VM communication,
- big attack surface due to side-channel attacks.
IOcm for Fast I/O
IOcm, which stands for I/O core manager, targets the low I/O performance of type 2 open source hypervisors.
What is a type 2 hypervisor? Thanks for asking! The diagram below shows the main difference between type 1 and type 2 hypervisors. Type 1 hypervisors provide a special operating system dedicated and optimised to host virtual machines. In contrast, type 2 hypervisors, such as KVM, leverage operating system services, such as memory management and I/O, to host virtual machines.
And then there is container-based virtualisation. Container-based virtualisation isolates a part of the host OS, which users then can access. As a consequence, containers perform nearly as well as the host OS. However, containers do not allow you to install arbitrary kernels. Furthermore, with containers new security and management issues arise.
Currently, open source type 2 hypervisors reach poor VM performance for I/O-heavy applications. This low performance limits their use. Especially I/O-intensive applications such as big data and HPC perform poorly in VMs. However, type 2 hypervisors allow for more flexible management and tighter security.
To improve the I/O performance for virtual network interfaces and block devices, we are extending KVM. Specifically, we are optimising vhost, which is a para-virtual driver that improves the I/O performance of KVM. If you run VMs with KVM, chances are you already use vhost. However, we can further optimise KVM and vhost. And we will do so in sKVM.
Furthermore, sKVM will allocate I/O resources according to the VMs’ needs at runtime. This adaptive mechanism will ensure an optimal balance between I/O and computational performance.
Finally, sKVM will exchange status messages with the management layers above it, such as OpenStack in the cloud. First, sKVM will inform the cloud middleware about the utilisation of I/O resources and I/O performance. Then, the middleware will integrate the data in a control loop, to optimise performance across the boundaries of a single host.
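To make the adaptive mechanism concrete, here is a toy sketch of the kind of decision IOcm has to make: how many host cores to dedicate to I/O processing, given the current I/O load. All names, numbers, and thresholds are illustrative assumptions for this sketch, not sKVM’s actual interface.

```python
# Hypothetical sketch of IOcm's adaptive policy: choose how many host cores
# to dedicate to I/O threads, based on the observed I/O load.
# All names and thresholds here are invented, not sKVM's API.

def plan_io_cores(io_load_mbps, total_cores, mbps_per_core=2000, max_share=0.5):
    """Return how many cores to dedicate to I/O processing.

    io_load_mbps:  current aggregate I/O throughput of all VMs
    total_cores:   number of physical cores on the host
    mbps_per_core: assumed throughput one dedicated I/O core can sustain
    max_share:     cap, so compute always keeps at least half the cores
    """
    wanted = -(-io_load_mbps // mbps_per_core)  # ceiling division
    cap = int(total_cores * max_share)
    return max(0, min(int(wanted), cap))

# A light load needs few dedicated I/O cores; a heavy one gets several,
# but never more than half of the host.
print(plan_io_cores(500, 16))    # -> 1
print(plan_io_cores(12000, 16))  # -> 6
print(plan_io_cores(50000, 16))  # -> 8 (capped at half of 16)
```

The point of the cap is exactly the balance mentioned above: dedicating every core to I/O would starve the computational side.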
vRDMA for Fast Communication
The vRDMA component in sKVM targets the overhead of inter-VM communication. The diagram below shows the two important scenarios of inter-VM communication:
- intra-host, between VM1 and VM2,
- inter-host, between VM1 and VM3.
For intra-host communication we can bypass several layers of the networking stack and operate on the host’s shared memory. In particular, we want to use IVSHMEM (Inter-VM Shared Memory) for this scenario. Using IVSHMEM will lead to a considerable speedup for this common use case.
For inter-host communication vRDMA will enable VMs to use RDMA with RoCE NICs (RDMA over Converged Ethernet). Especially for HPC, vRDMA will significantly reduce the overhead of inter-host, inter-VM communication over InfiniBand.
With these approaches combined, inter-VM communication, intra-host and inter-host, will perform way better than currently possible.
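The two scenarios above boil down to a simple routing decision, which the following sketch illustrates. The function and the transport names are stand-ins for this illustration, not the actual vRDMA implementation.

```python
# Illustrative sketch (not the actual vRDMA code): pick the fastest
# available transport for a VM-to-VM channel, mirroring the two
# scenarios in the diagram above.

def choose_transport(src_host, dst_host, roce_available=True):
    """Return which mechanism a hypothetical vRDMA layer would pick."""
    if src_host == dst_host:
        # Intra-host: bypass the networking stack entirely and use
        # shared memory on the host (IVSHMEM).
        return "ivshmem"
    if roce_available:
        # Inter-host: use RDMA over Converged Ethernet.
        return "roce"
    # Fallback: plain TCP through the full networking stack.
    return "tcp"

print(choose_transport("host-a", "host-a"))  # VM1 -> VM2: ivshmem
print(choose_transport("host-a", "host-b"))  # VM1 -> VM3: roce
```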
SCAM for Security
The SCAM component of sKVM focuses on side-channel attacks. Currently, various popular hypervisors offer a surprisingly large attack surface for side-channel attacks. These attacks allow a malicious VM to extract a secret from another VM on the same host, as in the diagram below.
Our approach with SCAM comprises two phases. The first phase includes monitoring and profiling to identify malicious behaviour. The second phase includes mitigation techniques to reduce and eliminate the hypervisor’s attack surface.
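To give a feel for the two phases, here is a deliberately simplified sketch: profile a per-VM metric, flag outliers, and hand suspects to a mitigation step. The metric, the threshold, and the function names are invented for this illustration; SCAM’s real detection is still under wraps.

```python
# Hypothetical illustration of SCAM's two phases. The metric
# (cache-miss rate), the threshold, and all names are invented.

def detect_suspects(cache_miss_rates, threshold=2.0):
    """Phase 1: flag VMs whose cache-miss rate is far above the host average.

    A prime-and-probe side-channel attack tends to thrash shared caches,
    so an unusually high miss rate is a (crude) warning sign.
    """
    avg = sum(cache_miss_rates.values()) / len(cache_miss_rates)
    return [vm for vm, rate in cache_miss_rates.items() if rate > threshold * avg]

def mitigate(vm):
    """Phase 2 placeholder: e.g. add noise, partition the cache, or migrate."""
    return f"isolating {vm}"

profile = {"vm1": 0.02, "vm2": 0.03, "vm3": 0.40}
for suspect in detect_suspects(profile):
    print(mitigate(suspect))  # -> isolating vm3
```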
If any of your VMs contains any private data, such as a private key, SCAM’s promise should get you excited.
Engineering the Perfect Cloud Guest: OSv
Current clouds have two major shortcomings due to the guest operating system (OS): performance and application management. We intend to overcome both limitations based on OSv, which has been built to run in the cloud. Cloudius Systems, now known as ScyllaDB, the inventors of OSv, lead the work on the guest OS in MIKELANGELO.
Improving I/O Performance and Reducing the VM Footprint
To improve the performance of services running in a VM, we need to improve I/O performance and reduce the footprint of the guest OS.
Currently, most cloud guests are generic OSs. These OSs carry a lot of legacy code, which is what makes them so generic. However, when you deploy a generic OS as a cloud guest, much of that legacy stands in the way. In MIKELANGELO, we build on OSv, which is an OS engineered for the cloud from scratch.
Leaving behind many concepts of generic OSs leads to improved I/O performance and a small footprint. The small footprint shows itself in an image size of 30MB and boot times of under 1s.
Managing Services Comfortably
Engineers have invented cloud computing to manage services effortlessly. After years of development the basics of the cloud, such as multi-tenancy, VM management, and monitoring work well enough. Now we finally can get to work on managing those services.
During the last two years it became apparent that Docker’s model for service management strikes a good balance. It’s somewhere near the sweet spot between manual installation and an inflexible, heavy-weight PaaS solution.
To deploy an application with OpenStack, currently you have to install it by hand, use pre-baked images or use a combination of standard images with configuration tools such as Puppet or Ansible. These approaches are inflexible, error-prone, and time consuming.
In contrast, we will develop a simple packaging mechanism that will use manifests and installation scripts. This workflow will allow simple application packaging, deployment, and measurement.
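To illustrate the manifest idea, here is a toy example: a small description of the application plus installation commands, turned into an ordered build plan. The manifest format and all field names are invented for this sketch, not MIKELANGELO’s actual packaging format.

```python
# Toy illustration of manifest-based packaging. The manifest format
# and field names are invented for this sketch.

import json

MANIFEST = """
{
  "name": "wordcount",
  "base_image": "osv-base",
  "install": ["cp wordcount.so /usr/lib/"],
  "run": "/usr/lib/wordcount.so --port 8080"
}
"""

def build_plan(manifest_text):
    """Turn a manifest into an ordered list of image-build steps."""
    m = json.loads(manifest_text)
    steps = [f"FROM {m['base_image']}"]          # start from a base image
    steps += [f"RUN {cmd}" for cmd in m["install"]]  # install the app
    steps.append(f"CMD {m['run']}")              # record how to launch it
    return steps

for step in build_plan(MANIFEST):
    print(step)
```

The appeal of this model, as with Docker, is that the whole deployment is captured in one small, reviewable file instead of manual steps or heavyweight tooling.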
To align ourselves with proven technology, we are building on OpenStack Heat. To showcase responsive application management, we will use OSv with some extensions.
Speeding Up the Cloud
Cloud computing, especially if deployed on premise, reaches its limitations when it comes to I/O optimisation and dynamic VM management.
To improve the VM performance in OpenStack, we will tackle these two limitations:
- monitoring is neither scalable nor dynamic,
- OpenStack does not optimise for I/O.
These limitations severely limit what your big data and HPC applications can achieve in the cloud!
To overcome those limitations, we follow two approaches: use a novel monitoring architecture and implement dynamic scheduling.
First, we intend to collect data about the I/O usage of VMs and the host in a monitoring system. To move beyond the capabilities of OpenStack’s monitoring system, we will integrate a novel monitoring system. Besides standard metrics, this monitoring system will store data about the I/O performance of hosts and VMs, which leads us to our second approach.
Second, we will implement online scheduling of VMs to make the best use of the I/O capabilities of the host and to fulfil the VMs’ thirst for I/O. Furthermore, our additions to OpenStack will manage vRDMA in VMs seamlessly. In the ideal case, you’ll get fast inter-VM communication without even trying.
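The placement decision can be sketched as a filter-and-weigh step, loosely modelled on how OpenStack’s scheduler works: discard hosts that cannot satisfy the VM’s I/O demand, then prefer the one with the most headroom. The metrics, numbers, and function names are illustrative assumptions, not the actual MIKELANGELO scheduler.

```python
# Sketch of an I/O-aware placement decision, loosely modelled on
# filter-and-weigh scheduling. Metrics and names are illustrative.

def pick_host(hosts, vm_io_demand_mbps):
    """Choose the host whose free I/O capacity best fits the VM's demand."""
    # Filter: only hosts that can satisfy the VM's I/O thirst remain.
    candidates = [h for h in hosts if h["free_io_mbps"] >= vm_io_demand_mbps]
    if not candidates:
        return None  # no host can satisfy the demand
    # Weigh: prefer the host with the most I/O headroom after placement.
    best = max(candidates, key=lambda h: h["free_io_mbps"] - vm_io_demand_mbps)
    return best["name"]

hosts = [
    {"name": "host-a", "free_io_mbps": 800},
    {"name": "host-b", "free_io_mbps": 4000},
]
print(pick_host(hosts, 1000))  # -> host-b (host-a lacks the capacity)
```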
GWDG, the service provider of the Max Planck Society, leads the work on cloud computing in MIKELANGELO.
Virtualising the Batch System: Torque
The missing adoption of virtualisation limits HPC’s reach and popularity. Without virtualisation, HPC remains confined to the pre-cloud era in terms of flexibility.
HPC operators have not embraced virtualisation yet for these three reasons:
- lacking support for high performance hardware,
- poor VM performance for I/O-heavy applications,
- missing integration with batch schedulers.
sKVM already resolves the first two limitations via IOcm and vRDMA. However, for you to use HPC on a virtual infrastructure, we still need to integrate virtualisation with typical HPC middleware.
Currently, no integration of VMs with a popular open source batch scheduler such as Torque or SLURM exists. Thus, users have to cope with the HPC environment as provided by the operator and adapt their implementations to fit the cluster. This introduces an unnecessary adoption barrier for HPC.
In contrast, our approach integrates the creation and deployment of VMs with the job submission and management workflow. This will let clients use a custom execution environment while retaining high performance. Furthermore, our HPC integration will feed our novel monitoring system.
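Conceptually, the integrated workflow wraps the usual batch job in a VM lifecycle: boot one VM per allocated node, run the job inside, then tear everything down. The sketch below is purely illustrative; every function is a stand-in, and a real Torque integration would issue actual scheduler and hypervisor calls instead.

```python
# Conceptual sketch of integrating VM deployment with batch job
# submission. Every function body is a stand-in; a real integration
# would talk to Torque and the hypervisor.

def run_virtualised_job(nodes, image, job_script):
    """Deploy VMs on the allocated nodes, run the job, and clean up."""
    vms = [f"vm-{node}" for node in nodes]   # stand-in: boot `image` per node
    try:
        # Stand-in: execute the job script inside each VM.
        return [f"{vm}: ran {job_script}" for vm in vms]
    finally:
        vms.clear()                          # stand-in: tear the VMs down

for line in run_virtualised_job(["n01", "n02"], "osv-hpc.img", "solver.sh"):
    print(line)
```

The key property is that the user brings a custom execution environment (the image), while the operator’s batch scheduler stays in control of node allocation.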
HLRS, one of the largest public HPC providers in Germany, leads this work on HPC.
Making Monitoring Magnificent
Currently, two aspects limit monitoring systems:
- scalability and
- dynamic monitoring.
Too bad, because for system operators more system data is always better.
Even high-performing time series databases such as InfluxDB and OpenTSDB limit the amount of data that can be analysed to an unsatisfactory level. To boost the scale of data analysis, we will introduce a novel approach to monitoring that builds on those systems and allows you to analyse more data at the same time.
Dynamic monitoring, the second limitation, has no widespread implementation yet. Dynamic monitoring is motivated by the question: Why monitor all services and metrics with equal resolution? Why not focus instead on the services and metrics that are important right now? Dynamic monitoring uses your monitoring resources in a smarter way.
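The idea can be sketched in a few lines: spend the sampling budget where it matters, so metrics that recently changed a lot get polled often and quiet ones rarely. The intervals, the change measure, and the threshold are illustrative assumptions, not the secret sauce.

```python
# Sketch of the dynamic-monitoring idea: adapt each metric's polling
# interval to how much it is changing. All numbers are illustrative.

def next_interval(recent_values, fast=1, slow=60, jitter_threshold=0.1):
    """Return the next polling interval in seconds for one metric."""
    if len(recent_values) < 2:
        return fast  # no history yet: watch closely
    lo, hi = min(recent_values), max(recent_values)
    spread = (hi - lo) / (abs(hi) + 1e-9)  # relative variation
    # Volatile metrics get sampled every second, stable ones rarely.
    return fast if spread > jitter_threshold else slow

print(next_interval([0.50, 0.52, 0.51]))  # stable: poll every 60 s
print(next_interval([0.10, 0.90, 0.30]))  # volatile: poll every second
```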
Do these claims sound spurious to you? That’s ok. The concepts behind these claims and their implementation are still secret (so don’t tell anybody). But all of it will be released to the public some time in 2015.
Intel leads the work on holistic monitoring.
Map of Articles
Are you finally curious about our work? Then check out our articles below. We’ll post them one after the other during the next couple of weeks. Until we’ve published all articles, check out our most current project reports.
Each of those articles covers the limitations of the state of the art and our approaches to overcome those limitations.
- Hypervisor (check out our related article about sKVM at the KVM Forum)
- Guest OS
Do you worry you might miss one of those great articles? Then follow us on Twitter. We’ll keep you posted.