If you read our previous post on sKVM, you know that I/O is the main source of overhead in virtualized environments today. All hypervisors suffer from reduced I/O throughput, and KVM is no exception. As a result, people with I/O-intensive workloads tend to avoid virtualization wherever possible, because they know they will take a performance hit when moving off of bare metal machines. Regardless of the type of I/O (disk, network, etc.), the faster you try to go, the more you will feel that virtual machines are holding you back.
That is all well and good, but buying (or even leasing) bare metal machines is very expensive. Especially if you are a smaller company that does not require heavy computing power 24×7, it is hard to justify putting out the cash to get the system you need to run simulations periodically. So then virtualization looks more attractive, since you can share computing resources more easily between departments in a company, or even use a public cloud when needed. But despite the appealing price, you still wish there were some way to get the I/O performance you need.
This is where our improvements to KVM, which we call sKVM, come into focus. The goal is simple: improve the virtual I/O mechanism to near bare metal levels. Then we can enjoy the best of both worlds: all the benefits of a virtualized environment (fast job setup time, migration, checkpointing, etc.) with the performance we expect from modern performance-oriented servers.
KVM’s I/O subsystem
To understand how and where sKVM can improve the I/O performance, let us take a cursory look at how the I/O subsystem in KVM is designed.
Hypervisors offer a few options in terms of virtual I/O devices, namely emulated, paravirtual, and passthrough. Each of these options has its relative strengths, but for the majority of us, paravirtual is the clear choice.
In KVM, that means using a protocol called virtio. Virtio was initially designed by a guy at IBM named Rusty Russell to unify and standardize the various methods of virtual I/O. Eventually it became the method used in KVM, saw varying adoption among the other hypervisors, and is now an official standard. Virtio specifies the “language” that guest virtual machines use to send and receive I/O. That means the guest OS needs special drivers that speak this protocol. The drivers use the virtio protocol to communicate with the hypervisor (that is the meaning of paravirtual: the guest knows that it is a virtual machine, and uses that knowledge to its advantage).
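To give a feel for the protocol, here is a simplified sketch of a virtqueue descriptor, the basic building block of virtio's shared rings. The layout follows the virtio specification (compare include/uapi/linux/virtio_ring.h in Linux); treat it as an illustration rather than a complete definition:

```c
#include <stdint.h>

/* One entry in a virtqueue descriptor table. The guest fills these in
 * to describe its buffers; the hypervisor side walks them to find the
 * data it should consume or fill. */
struct vring_desc {
    uint64_t addr;  /* guest-physical address of the buffer */
    uint32_t len;   /* length of the buffer in bytes */
    uint16_t flags; /* e.g. VRING_DESC_F_NEXT (chained), VRING_DESC_F_WRITE */
    uint16_t next;  /* index of the next descriptor in a chain */
};
```

The guest places descriptors like these in memory shared with the hypervisor, so handing an I/O request to the other side is mostly a matter of updating ring indices rather than copying request structures around.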
The hypervisor must also implement the other side of the virtio protocol to be able to receive requests from the guest and return replies. There are two main implementations of this hypervisor side.
The first implementation (and arguably the more primitive of the two) lives inside the QEMU process. Wait — we are talking about KVM, so why are parts implemented in QEMU? Good question. KVM (in the broader sense) refers to the type II (hosted) hypervisor implemented on top of Linux. This hypervisor arguably includes the entire Linux kernel, as well as some user space applications such as QEMU. QEMU provides system emulation for the hypervisor, so each time the guest VM thinks it is using the PCIe bus to talk to a daughter card, it is actually QEMU making it look like there is a PCIe bus there.
So when virtio was first implemented for KVM, it was natural to extend the system emulation to include the backend of the paravirtual I/O devices. The Linux community decided that was not good enough (chiefly because every request had to take a round trip through user space), and that virtio should be implemented again as a kernel module. So a guy from Red Hat named Michael Tsirkin wrote vhost — an in-kernel implementation of the virtio backend. This in-kernel implementation performs much better than its QEMU counterpart, making it the obvious choice when configuring a virtual environment for I/O intensive workloads.
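For the curious, here is a rough sketch of how a user-space hypervisor hands a virtqueue over to the in-kernel backend through the /dev/vhost-net character device. This is heavily abridged (the real handshake also sets up the guest memory table with VHOST_SET_MEM_TABLE, eventfds for kicks and interrupts, and feature negotiation), and error handling is omitted:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Sketch: attach one virtqueue to the in-kernel vhost-net backend.
 * 'vring' and 'tap_fd' are assumed to have been set up elsewhere. */
void attach_vhost(struct vhost_vring_addr *vring, int tap_fd)
{
    int vhost_fd = open("/dev/vhost-net", O_RDWR);

    /* Claim this vhost instance for the calling process. */
    ioctl(vhost_fd, VHOST_SET_OWNER, NULL);

    /* Tell the kernel where the guest's virtqueue lives in memory... */
    ioctl(vhost_fd, VHOST_SET_VRING_ADDR, vring);

    /* ...and which tap device its packets should flow to and from. */
    struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
    ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
}
```

Once the backend is attached, the guest's network I/O is serviced by a kernel thread, without bouncing through QEMU for every packet.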
Even with the advances vhost brings over its QEMU predecessor, I/O performance under KVM still falls short of the holy grail of bare metal performance. The question is: how close can we really expect to get?
To get an idea of where the overhead is coming from, we have to remember that in the bare metal scenario, drivers communicate with hardware directly, and the hardware is usually the limiting factor. However, this is not the case in a virtualized environment. When using paravirtual I/O devices, software plays a big role, and is directly involved in sending and receiving every packet and block of data. Software is inherently slower than hardware, so we have a real uphill battle. If you think about what software really is — a set of instructions to general-purpose hardware (the CPU) that is able to accomplish a wide range of tasks — then it is easy to see that hardware specially designed for a specific purpose can easily outperform software designed for the same task. But not all is lost!
Sometimes when writing software, we take the easiest path rather than the most efficient one, and there are actually several good reasons to take the “easy way”: code maintainability, legibility, programming time, and reuse of generic software modules are just a few. These are tradeoffs that must be faced each time we write a software module. Sometimes during the lifetime of a product the tradeoffs change, and it turns out that a critical portion of code that was written to be easily maintainable should now be rewritten to be as efficient as possible, even at the cost of more complicated code.
What does that mean for vhost?
In the current implementation, each virtual device gets its own vhost thread. This is a very simple programming model, since threads are a convenient abstraction, but it is not necessarily the most efficient one. As the number of virtual machines increases, so does the number of virtual devices, and in turn the number of vhost threads. At some point, all of these threads start to affect each other: the scheduler's overhead in deciding which thread to run, where, and for how long starts to get in the way of the threads doing useful work. One idea that has been proposed is to share a single vhost thread among multiple devices, which can reduce this overhead and improve efficiency, as the sketch below illustrates.
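As a rough illustration (not sKVM's actual implementation), here is what a shared vhost worker loop might look like. The helpers vq_has_work and vq_process are hypothetical stand-ins for the real vhost work-queue machinery:

```c
struct virtqueue;                        /* opaque; details omitted */
int  vq_has_work(struct virtqueue *vq);  /* stand-in: pending descriptors? */
void vq_process(struct virtqueue *vq);   /* stand-in: service one batch */

/* One shared vhost thread servicing many devices, instead of one
 * thread per device. With this thread pinned to a dedicated core,
 * the scheduler no longer juggles dozens of per-device threads. */
void shared_vhost_worker(struct virtqueue **vqs, int nvqs)
{
    for (;;) {
        for (int i = 0; i < nvqs; i++) {
            if (vq_has_work(vqs[i]))
                vq_process(vqs[i]);
        }
        /* A real implementation would block or poll adaptively here
         * rather than spinning when all queues are idle. */
    }
}
```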
How to make it more efficient
Sometimes the I/O load is too much for a single core to handle, and sometimes the I/O is so light that a dedicated core is not needed. In both cases, a static configuration of the I/O cores yields sub-optimal performance. To avoid these situations, we control the vhost threads from a user-space application called the IOManager, which constantly monitors system performance and automatically tunes the configuration accordingly. By creating (or destroying) vhost threads and assigning them to dedicated cores, we control how many resources are dedicated to I/O processing. The IOManager also balances the assignment of virtual devices to vhost threads to ensure an even load across I/O cores. All of this balancing and configuration is done transparently to the user and the VMs, ensuring maximum I/O performance even in the face of changing workloads.
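In spirit, the IOManager's control loop looks something like the sketch below. Every name and threshold here is illustrative rather than taken from our implementation, which uses a considerably more nuanced policy:

```c
#include <unistd.h>

#define HIGH_WATER 0.90  /* illustrative: I/O cores saturated */
#define LOW_WATER  0.30  /* illustrative: I/O cores underused */

/* Hypothetical hooks into the monitoring and vhost-control layers. */
extern double io_core_utilization(void); /* average load on I/O cores */
extern int    num_io_cores(void);
extern void   add_io_core(void);         /* spawn a vhost thread, pin it */
extern void   remove_io_core(void);      /* retire a vhost thread */
extern void   rebalance_devices(void);   /* even out devices per thread */

int main(void)
{
    for (;;) {
        double util = io_core_utilization();

        if (util > HIGH_WATER)
            add_io_core();       /* I/O is the bottleneck: grow */
        else if (util < LOW_WATER && num_io_cores() > 1)
            remove_io_core();    /* cores are idle: return them to VMs */

        rebalance_devices();     /* keep the load even across I/O cores */
        sleep(1);                /* sampling period (illustrative) */
    }
}
```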
To evaluate and showcase the performance on a real application, we use the Apache HTTP Server. We drive it using ApacheBench (ab), which is distributed with Apache and measures how many requests per second the web server can sustain at a given concurrency level. We use 16 concurrent requests per VM (ab's -c option) for different file sizes, ranging from 64 bytes to 1 MB.
The figure above presents the baseline result alongside the best of the various dedicated-core configurations, depicted as “optimum”. Additionally, we present our automatic IOManager (denoted io-manager), which reallocates I/O cores based on the current state of the system.
We have shown how dynamically configured, dedicated I/O resources can improve performance for KVM. However, a change like this raises many questions, such as:
- Does this affect the security of the virtual machine’s I/O?
- Does this complicate the kernel code too much?
- Does the benefit outweigh the cost of implementing the change?
These are some of the questions we face, and are trying to answer, in the sKVM project. We are investigating several approaches to implementing a shared vhost thread, which touches on questions such as which kernel threading mechanism should be used, and how a system administrator can control the degree of sharing in a user-friendly, almost autonomic way.
Do you have an opinion?
We post results like these regularly. To stay up to date with our work, follow us on Twitter.