Universal Unikernel - OSv - Mikelangelo - Horizon 2020 Project on Virtualization, Cloud Computing, and HPC

OSv is a new operating system designed specifically for running a single application in a VM. OSv is limited to running a single application (i.e., it is a unikernel) because the hypervisor already supports isolation between VMs, so an additional layer of isolation inside a VM is redundant and hurts performance. As a result, OSv does not support fork() (i.e., processes with separate address spaces) but does fully support multi-threaded applications and multi-core VMs.

While some of the other unikernels that appeared recently each focus on a specific programming language or application, the goal behind OSv’s design was to be able to run a wide range of unmodified (or only slightly modified) Linux executables. As a consequence, many dozens of different applications and runtime environments can run on OSv today, including MIKELANGELO’s use cases, and much of the work in WP2 and WP4 revolved around improving OSv’s compatibility with additional Linux applications and runtime environments.

The MIKELANGELO cloud is based on 64-bit x86 VMs and the KVM hypervisor, so OSv needs to support those. But we did not want OSv to be limited to that configuration, so today OSv supports most common cloud configurations: both x86 and ARM CPUs (64-bit only in both cases) are supported, and so are most common hypervisors: KVM, Xen, VMware, and VirtualBox. Support for Hyper-V is currently being added by an external contributor (who wants to run OSv in Microsoft’s Azure cloud).

In the rest of this chapter we will provide a high-level description of the architecture of OSv’s core - its kernel and its Linux compatibility. Two separate components that touch both OSv and the hypervisor - vRDMA and UNCLOT - will have separate chapters below. So will MIKELANGELO’s framework for building and deploying OSv-based images - LEET.

The work plan for OSv for the last half year of the project (M30-M36) is mostly to continue to polish and debug the components we already have, focusing on correctly and efficiently supporting additional applications related to, or similar to, the MIKELANGELO use cases.

The Loader

As in all x86 operating systems, OSv’s bootstrap code starts with a real-mode boot-loader running on one CPU, which loads the rest of the OSv kernel into memory (a compressed kernel is uncompressed prior to loading it). The loader then sets up all the CPUs (OSv fully supports multi-core VMs) and all the OS facilities, and ends by running the actual application, as determined by a “command line” stored in the disk image.

Virtual Hardware Drivers

General-purpose operating systems such as Linux need to support thousands of different hardware devices, and thus have millions of lines of driver code. But OSv only needs to implement drivers for the small number of (virtual) hardware devices presented by the sKVM hypervisor used in MIKELANGELO. This includes a minimal set of traditional PC hardware (PCI, IDE, APIC, serial port, keyboard, VGA, HPET), and paravirtual drivers: kvmclock (a paravirtual high-resolution clock much more efficient than HPET), virtio-net (for network) and virtio-blk (for disk).

Filesystem

OSv’s filesystem design is based on the traditional Unix “VFS” (virtual file system). VFS is an abstraction layer, first introduced by Sun Microsystems in 1985, on top of a more concrete file system. The purpose of VFS is to allow client applications to access different types of concrete file systems (e.g., ZFS and NFS) in a uniform way.

OSv currently has five concrete filesystem implementations: devfs (implements the “/dev” hierarchy for compatibility with Linux), procfs (similarly, for “/proc”), ramfs (a simple RAM disk), ZFS, and NFS.
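
The role of the VFS layer can be illustrated with a minimal sketch in Python. This is a toy model, not OSv’s kernel code: the class names RamFs, DevFs and Vfs and their methods are hypothetical, and only show how a mount table lets one read() call reach different concrete filesystems.

```python
class RamFs:
    """Toy concrete filesystem keeping file contents in a dict."""
    def __init__(self, files=None):
        self.files = dict(files or {})

    def read(self, relpath):
        return self.files[relpath]


class DevFs:
    """Toy /dev-like filesystem with a single synthetic device."""
    def read(self, relpath):
        if relpath == "null":
            return b""
        raise FileNotFoundError(relpath)


class Vfs:
    """Dispatches a path to the filesystem mounted at the longest matching prefix."""
    def __init__(self):
        self.mounts = {}

    def mount(self, mountpoint, fs):
        self.mounts[mountpoint.rstrip("/") or "/"] = fs

    def resolve(self, path):
        # Try the longest mount points first so "/dev" wins over "/".
        for mp in sorted(self.mounts, key=len, reverse=True):
            prefix = mp if mp.endswith("/") else mp + "/"
            if path == mp or path.startswith(prefix):
                return self.mounts[mp], path[len(prefix):]
        raise FileNotFoundError(path)

    def read(self, path):
        fs, relpath = self.resolve(path)
        return fs.read(relpath)


vfs = Vfs()
vfs.mount("/", RamFs({"etc/hosts": b"127.0.0.1 localhost\n"}))
vfs.mount("/dev", DevFs())
```

The caller uses the same vfs.read() for “/etc/hosts” and “/dev/null” and never needs to know which concrete filesystem served the request.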

ZFS is a sophisticated filesystem and volume manager implementation, originating in Solaris. We use it to implement a persistent filesystem on top of the block device or devices given to us (via virtio-blk) by the host.

We added NFS filesystem support (i.e., an NFS client), to allow applications to mount remote NFS shared storage, which is a common requirement for HPC applications.

The ELF Dynamic Linker

OSv executes unmodified Linux executables. Currently we only support relocatable dynamically-linked executables, so an executable for OSv must be compiled as a shared object (“.so”) or as a position-independent executable (PIE). Re-compiling an application as a shared object or PIE is usually as straightforward as adding the appropriate compilation and linking parameters (-fPIC and -shared, or -fpie and -pie respectively) to the application’s Makefile. Existing shared libraries can be used without re-compilation or modification.

The dynamic linker maps the executable and its dependent shared libraries to memory (OSv has demand paging), and does the appropriate relocations and symbol resolutions necessary to make the code runnable - e.g., functions used by the executable but not defined there are resolved from OSv’s code, or from one of the other shared libraries loaded by the executable. ELF thread-local storage (gcc’s “__thread” or C++11’s thread_local) is also fully supported.

The ELF dynamic linker is what makes OSv into a library OS: there are no “system calls”, and none of the special overheads associated with them. When the application calls read(), the dynamic linker resolves this call to the read() implementation inside the kernel, and it’s just a function call. The entire application runs in one address space, and in the kernel privilege level (ring 0).

OSv’s ELF dynamic linker also supports the concept of “ELF namespaces” - loading several different applications (or several instances of the same application) even though they may use the same symbol names. OSv ensures that when an application running in one ELF namespace resolves a dynamically-linked symbol, it is looked up in the same ELF namespace and not in one belonging to another application. We added the ELF namespaces feature as a response to MIKELANGELO’s requirement of running Open MPI-based applications: Open MPI traditionally runs the same executable once on each core, each in a separate process. In OSv, we run those as threads (instead of processes), but we need to ensure that although each thread runs the same executable, each gets a separate copy of the global variables. The ELF namespace feature ensures that.

The ELF namespace feature also allowed us to give each of the Open MPI threads its own separate environment variables. This works by ensuring that getenv() is resolved differently in each of those ELF namespaces.
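
The lookup rule described above can be sketched as a toy model in Python. This is not OSv’s dynamic linker: the Namespace class, its methods, and the “RANK” variable are hypothetical names used only to show private globals, a private getenv(), and a fallback to shared kernel symbols.

```python
class Namespace:
    """Toy model of one ELF namespace: private symbols plus shared kernel symbols."""
    def __init__(self, kernel_symbols, environ):
        self.kernel_symbols = kernel_symbols      # shared by all namespaces
        self.environ = dict(environ)              # private copy per namespace
        # getenv is resolved per namespace, so each instance sees its own variables.
        self.symbols = {"getenv": self.environ.get}

    def load(self, program_globals):
        # Loading the same program into two namespaces gives independent globals.
        self.symbols.update(program_globals)

    def lookup(self, name):
        # Resolve inside this namespace first, then fall back to the kernel.
        if name in self.symbols:
            return self.symbols[name]
        return self.kernel_symbols[name]


kernel = {"read": lambda fd, n: b""}
rank0 = Namespace(kernel, {"RANK": "0"})
rank1 = Namespace(kernel, {"RANK": "1"})
for ns in (rank0, rank1):
    ns.load({"counter": 0})       # the "same executable", loaded twice

rank0.symbols["counter"] += 1     # only rank0's copy of the global changes
```

After this runs, rank0 sees counter equal to 1 while rank1 still sees 0, and getenv resolves “RANK” to a different value in each namespace, while both share the same kernel read().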

C Library

To run Linux executables, we needed to implement in OSv all the traditional Linux system calls and glibc library calls, in a way that is 100% ABI-compatible (i.e., binary-compatible) with glibc. We implemented many of the C library functions ourselves, and imported some of the others - such as the math functions and stdio functions - from the musl-libc project - a BSD-licensed libc implementation. Strict binary compatibility with glibc for each of these functions is essential, because we want to run unmodified shared libraries and executables compiled for Linux.

The glibc ABI is very rich - it contains hundreds of different functions, with many different parameters and use cases for each of them. It is therefore not surprising that OSv’s reimplementation missed some of those functions, or implemented some of them imperfectly. As a result, much of the development effort that MIKELANGELO put into OSv revolved around fixing functions which were either missing or incorrectly implemented, so that we could correctly run additional applications and use cases on OSv. More details about this work and what was improved every year are given in the WP4 series of deliverables.

Memory Management

OSv maintains a single address space for the kernel and all application threads. It supports both malloc() and mmap() memory allocations. For efficiency, malloc() allocations are always backed by huge pages (2 MB pages), while mmap() allocations are also backed by huge pages if large enough.

Disk-based mmap() supports demand paging as well as page eviction - these are assumed by most applications using mmap() for disk I/O. This use of mmap() is popular, for example, in Java applications due to Java’s heap limitations, and also because mmap() outperforms techniques using explicit read()/write() system calls, thanks to fewer system calls and zero copy.
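
As a generic illustration of file-backed mmap() (using Python’s standard mmap module, not OSv-specific code), the sketch below reads a range of a file through a mapping instead of explicit read() calls. The helper name read_via_mmap and the temporary demo file are assumptions made for the example.

```python
import mmap
import os
import tempfile


def read_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` by mapping the file instead of calling read()."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing touches only the pages holding these bytes;
            # the OS pages them in from disk on first access.
            return m[offset:offset + length]


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    demo_path = tmp.name

chunk = read_via_mmap(demo_path, 5000, 10)
os.remove(demo_path)
```

Only the pages covering the requested range need to be brought into memory, which is the demand-paging behaviour that applications using mmap() for disk I/O rely on.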

Despite their advantages, memory-mapped files are not the most efficient way to access disk asynchronously. In particular, a page-cache miss - needing to read a page from disk, or needing to write when memory is low - always blocks the running thread, so the application needs multiple threads, and context switches between them, to keep doing useful work in the meantime. When we introduce the Seastar library in the following section, we explain that for this reason Seastar applications use AIO (asynchronous I/O), not mmap().

OSv does not currently have full support for NUMA (non-uniform memory access). On NUMA (a.k.a. multi-socket) VMs, the VM’s memory and cores are divided into separate “NUMA nodes” - each NUMA node is a set of cores and part of the memory which is “closest” to these cores. A core may also access memory which does not belong to its NUMA node, but this access will be slower than access to memory inside the same node. Linux, which has full support for NUMA, provides APIs through which an application can ensure that a specific thread runs only on cores belonging to one NUMA node, and additionally only allocates memory from that node’s memory. High-performance applications, including the Open MPI HPC library, make use of these APIs to run faster on NUMA (multi-socket) VMs. OSv does not yet have full support for these APIs, so Open MPI performance on multi-socket VMs suffers as cores use memory which isn’t in their NUMA node. To avoid this issue, we recommend that very large, multi-socket VMs be split into separate single-socket (but multi-core) VMs, and shared memory be used to communicate between those VMs (this technique is explained in the UNCLOT section below).
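
For reference, the CPU-affinity half of what such Linux APIs provide can be sketched in Python. This runs on Linux, not on OSv, and the helper names parse_cpulist and pin_to_numa_node are hypothetical. It covers only CPU placement; binding memory allocations to a node (for example through libnuma) is not available in Python’s standard library and is not shown.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8,10-11" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node):
    """Restrict the caller to the cores of one NUMA node (Linux only)."""
    # Linux exposes each node's cores in sysfs as a cpulist string.
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)   # 0 selects the caller
    return cpus
```

Calling pin_to_numa_node(0) on a multi-socket Linux machine keeps the caller on node 0’s cores, which is the kind of placement Open MPI relies on to keep cores and memory in the same node.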

Thread Scheduler

OSv does not support processes, but does have complete support for SMP (multi-core) VMs, and for threads, as almost all modern applications use them.

Our thread scheduler multiplexes N threads on top of M CPUs (N may be much higher than M), and guarantees fairness (competing threads get equal share of the CPU) and load balancing (threads are moved between cores to improve global fairness). Thread priorities, real-time threads, and other user-visible features of the Linux scheduler are also supported, but internally the implementation of the scheduler is very different from that of Linux. A longer description of the design and implementation of OSv’s scheduler can be found in our paper “OSv — Optimizing the Operating System for Virtual Machines”.
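
The fairness goal can be illustrated with a deliberately simplified, single-CPU sketch in Python. It is not OSv’s scheduling algorithm (which the paper cited above describes). The function name and the fixed time slice are assumptions, and the only rule is “run whichever thread has received the least CPU time so far”.

```python
import heapq


def simulate_fair_scheduler(thread_names, time_slice, total_time):
    """Always run the thread that has received the least CPU time so far."""
    # Heap entries are (cpu_time_used, name); the smallest cpu_time_used runs next.
    heap = [(0, name) for name in thread_names]
    heapq.heapify(heap)
    elapsed = 0
    while elapsed < total_time:
        used, name = heapq.heappop(heap)
        heapq.heappush(heap, (used + time_slice, name))
        elapsed += time_slice
    return {name: used for used, name in heap}


shares = simulate_fair_scheduler(["a", "b", "c"], time_slice=10, total_time=300)
```

With three competing threads and 300 time units, each thread ends up with 100 units, i.e., an equal share of the CPU.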

One of the consequences of our simpler and more efficient scheduler implementation is that in microbenchmarks, we measured OSv’s context switches to be 3 to 10 times faster than those on Linux. However, because good applications were typically written knowing that context switches are slow, and made an effort to reduce the number of context switches, the practical benefit of this speedup in most real-life applications is small.

Synchronization Mechanisms

OSv does not use spin-locks, which are a staple building block of other SMP operating systems. This is because spin-locks cause the so-called “lock-holder preemption” problem when used on virtual machines: If one virtual CPU is holding a spin-lock and then momentarily pauses (because of an exit to the host, or the host switching to run a different process), other virtual CPUs that need the same spin-lock will start spinning instead of doing useful work. The “lock-holder preemption” problem is especially problematic in clouds which over-commit CPUs (give a host’s guests more virtual CPUs than there are physical CPUs), but occurs even when there is no over-commitment, if exits to the host are common.

Instead of spin-locks, OSv has a unique implementation of a lock-free mutex, as well as an extensive collection of lock-free data structures and an implementation of the RCU (“read-copy-update”) synchronization mechanism.

Network Stack

OSv has a full-featured TCP/IP network stack on top of a network driver, such as virtio-net, which handles raw Ethernet packets.

The TCP/IP code in OSv was originally imported from FreeBSD, but has since undergone a major overhaul to use Van Jacobson’s “network channels” design which reduces the number of locks, lock operations and cache-line bounces on SMP VMs compared to Linux’s more traditional network stack design. These locks and cache-line bounces are very expensive (compared to ordinary computation) on modern SMP machines, so we expect (and indeed measured) significant improvements to network throughput thanks to the redesigned network stack.

We have so far implemented the netchannels idea only for TCP, but similar techniques could also be used for UDP, if the need arises.

The basic idea behind netchannels is fairly simple to explain:

In a traditional network stack, we commonly have two CPUs involved in reading packets: We have one CPU running the interrupt or “soft interrupt” (a.k.a. “bottom half”) code, which receives raw Ethernet packets, processes them and copies the data into the socket’s data buffer. We then have a second CPU which runs the application’s read() on the socket, and now needs to copy that socket data buffer. The fact that two different CPUs need to read and write the same buffer means cache-line bounces and locks, which are slow even if there is no contention (and very slow if there is).
In a netchannels stack (like OSv’s), the interrupt-time processing does not access the full packet data. It only parses the header of the packet to determine which connection it belongs to, and then queues the incoming packet into a per-connection “network channel”, or queue of packets, without reading the packet data. Only when the application calls read() (or poll(), etc.) on the socket is the TCP processing finally done on the packets queued on the network channel. When the read() thread does this, there are no cache bounces (the interrupt-handling CPU has not read the packet’s data!), and no need for locking. We still need some locks (e.g., to protect multiple concurrent read()s, which are allowed in the socket API), but fewer than in the traditional network stack design.
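
The division of work in a netchannels stack can be sketched as a toy model in Python. This is not OSv’s network code: the class NetChannelStack, the packet dictionary layout and the method names are hypothetical, and real TCP processing (sequence numbers, acknowledgements, and so on) is reduced to concatenating payloads.

```python
from collections import defaultdict, deque


class NetChannelStack:
    """Toy model: the interrupt side only classifies; the reader does the protocol work."""
    def __init__(self):
        self.channels = defaultdict(deque)   # one packet queue per connection

    def on_interrupt(self, packet):
        # Look only at the header fields to find the connection; never touch the payload.
        conn = (packet["src"], packet["sport"], packet["dst"], packet["dport"])
        self.channels[conn].append(packet)

    def read(self, conn):
        # Runs on the application's CPU: the deferred "TCP processing" happens here.
        data = bytearray()
        queue = self.channels[conn]
        while queue:
            data += queue.popleft()["payload"]
        return bytes(data)


stack = NetChannelStack()
conn_a = ("10.0.0.1", 40000, "10.0.0.2", 80)
conn_b = ("10.0.0.3", 40001, "10.0.0.2", 80)
for src, sport, payload in [("10.0.0.1", 40000, b"GET "),
                            ("10.0.0.3", 40001, b"PING"),
                            ("10.0.0.1", 40000, b"/index")]:
    stack.on_interrupt({"src": src, "sport": sport,
                        "dst": "10.0.0.2", "dport": 80, "payload": payload})
```

Packet payloads are first touched inside read(), on the CPU running the application thread, which is what avoids the cache-line bounces described above.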

DHCP Client

OSv contains a built-in DHCP client, so it can find its IP address and hostname without being configured manually. For more extensive configuration, we also have cloud-init, described next.

Cloud-init Client

OSv images can optionally include a “cloud-init” client, a common approach for contextualization of VM images, i.e., a technique by which a VM which is running identical code to other VMs can figure out what its intended role is. Cloud-init allows the cloud management software to provide each VM with a unique configuration file in one of several ways (over the network, or as a second disk image), and specifies the format of this configuration file. The cloud management software can specify, for example, the VM’s name, what NFS directory it should mount, what it should run, and so on.

REST API

OSv images can optionally include an “httpserver” module which allows remote monitoring of an OSv VM. “httpserver” is a small and simple HTTP server that runs in a thread, and implements a REST API, i.e., an HTTP-based API, for remote inquiries or control of the running guest. The reply to each of these HTTP requests is in JSON format.

The complete REST API is described below, but two requests are particularly useful for monitoring a running guest:

“/os/threads” returns the list of threads on the system, and some information and statistics on each thread. This includes each thread’s numerical id and string name, the CPU number on which it last ran, the total amount of CPU time this thread has used, the number of context switches and preemptions it underwent, and the number of times it migrated between CPUs.

The OSv distribution includes a script, scripts/top.py, which uses this API to let a user get “top”-like output for a remote OSv guest: It makes a “/os/threads” request every few seconds, and subtracts each thread’s total CPU time in the previous iteration from that in the current one; dividing the difference by the time between the two requests gives the percentage of CPU used by each thread. We can then sort the threads and show the top CPU-using ones (like in Linux’s “top”), along with some statistics on each (e.g., a similar subtraction and division gives the number of context switches per second for each of those threads).
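
The subtraction and division just described can be written as a short Python sketch. This is not the code of scripts/top.py: the function name cpu_percent and the representation of a sample as a mapping from thread id to cumulative CPU milliseconds are assumptions made for illustration.

```python
def cpu_percent(prev, curr, interval_ms):
    """prev/curr map thread id -> cumulative CPU time in ms; returns (id, percent) pairs."""
    usage = {}
    for tid, cpu_ms in curr.items():
        delta = cpu_ms - prev.get(tid, 0)   # threads new in this sample start from 0
        usage[tid] = 100.0 * delta / interval_ms
    # Highest consumers first, as in "top".
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)


# Two hypothetical samples taken 2000 ms apart.
previous = {1: 1000, 2: 500}
current = {1: 1500, 2: 2300, 3: 100}
ranking = cpu_percent(previous, current, interval_ms=2000)
```

Here thread 2 used 1800 ms of CPU during a 2000 ms interval, so it is listed first with 90%, followed by thread 1 with 25% and the newly created thread 3 with 5%.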

“/trace/count” enables counting of a specific tracepoint, or returns the counts of all enabled tracepoints.

OSv’s tracepoints are a powerful debugging and statistics mechanism, inspired by a similar feature in Linux and Solaris: In many places in OSv’s source code, a “trace” call is embedded. For example, we have a “memory_malloc” trace in the beginning of the malloc() function, and a “sched_switch” trace when doing a context switch. Normally, this trace doesn’t do anything - it appears in the executable as a 5-byte “NOP” (do-nothing) instruction and has almost zero impact on the speed of the run. When we want to enable counting of a specific tracepoint, e.g., count the number of sched_switch events, we replace these NOPs by a jump to a small piece of code which increments a per-cpu counter. Because the counter is per-cpu, and has no atomic-operation overhead (and moreover, usually resides in the CPU’s cache), counting can be enabled even for extremely frequent tracepoints occurring millions of times each second (e.g., “memory_malloc”) - with a hardly noticeable performance degradation of the workload. Only when we actually query the counter, do we need to add these per-cpu values to get the total one.
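
The per-cpu counting scheme can be modelled with a few lines of Python. This is a toy model rather than OSv’s implementation (which patches machine code in place): the class PerCpuCounter and its methods are hypothetical and only show that each CPU updates its own slot and that the slots are summed on query.

```python
class PerCpuCounter:
    """Toy model of a tracepoint counter with one slot per CPU."""
    def __init__(self, ncpus):
        self.slots = [0] * ncpus

    def hit(self, cpu):
        # Each CPU writes only its own slot, so no atomic operation or lock is needed.
        self.slots[cpu] += 1

    def total(self):
        # Summation happens only when the counter is queried.
        return sum(self.slots)


sched_switch = PerCpuCounter(ncpus=4)
for cpu in (0, 0, 1, 3, 3, 3):
    sched_switch.hit(cpu)
```

The hot path (hit) touches a single CPU-local value, while the comparatively rare query (total) pays the cost of reading every slot.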

The OSv distribution includes a script, scripts/freq.py, which uses this API to enable one or more counters, to retrieve their counts every few seconds, and display the frequency of the event (subtraction of count at two different times, divided by the time interval’s length). This script makes it very convenient to see, for example, the total number of context switches per second while the workload is running, and how it relates, for example, to the frequency of mutex_lock_wait, and so on.

The list of tracepoints supported by OSv at the time of this writing includes over 300 different tracepoints, and for brevity is omitted here - it was already shown in deliverable D2.16.
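
The frequency calculation used for tracepoint counts amounts to the following Python sketch. It is not the code of scripts/freq.py: the function name event_rates and the sample counts are made up for illustration, while the tracepoint names are the ones mentioned in the text.

```python
def event_rates(prev_counts, curr_counts, interval_s):
    """Return events per second for each tracepoint present in both samples."""
    return {
        name: (curr_counts[name] - prev_counts[name]) / interval_s
        for name in curr_counts
        if name in prev_counts
    }


# Two hypothetical samples of enabled counters, taken 2 seconds apart.
before = {"sched_switch": 1000, "mutex_lock_wait": 40}
after = {"sched_switch": 6000, "mutex_lock_wait": 140}
rates = event_rates(before, after, interval_s=2.0)

# How many context switches occur per mutex wait in this interval.
switches_per_wait = rates["sched_switch"] / rates["mutex_lock_wait"]
```

With these numbers the guest performs 2500 context switches and 50 mutex waits per second, i.e., 50 context switches for every mutex wait.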

Beyond these two useful REST API requests, OSv supports many more requests, summarized below. Note that this summary omits a lot of important information, such as the parameters that each request takes, or the type of its return value. For the full information, please refer to the modules/httpserver/api-doc/listings directory in OSv’s source distribution. OSv also optionally provides a “swagger” GUI to help a user determine exactly which REST API requests exist, and which parameters they take.

/api/batch: Perform batch API calls in a single command. Commands are performed sequentially and independently. Each command has its own response code.
/api/stop: Stop the API server, causing it to terminate. If the API server runs as the main application, this causes the system to terminate.
/app: Run an application with its command line parameters.
/env: List environment variables, return the value of a specific environment variable, or modify or delete one - depending on whether the HTTP method used is GET, POST, or DELETE respectively.
/file: Return information about an existing file or directory, delete one, create one, rename one, or upload one.
/fs/df: Report filesystem usage of one mount point or all of them.
/hardware/processor/flags: List all present processor features.
/hardware/firmware/vendor
/hardware/hypervisor: Returns name of the hypervisor OSv is running on.
/hardware/processor/count
/network/ifconfig: Get the configuration and data of all network interfaces.
/network/route
/os/name
/os/version
/os/vendor
/os/uptime: Returns the number of seconds since the system was booted.
/os/date: Returns the current date and time.
/os/memory/total: Returns total amount of memory usable by the system (in bytes).
/os/memory/free: Returns the amount of free memory in the system (in bytes).
/os/poweroff
/os/shutdown
/os/reboot
/os/dmesg: Returns the operating system boot log.
/os/hostname: Get or set the host’s name.
/os/cmdline: Get or set the image’s default command line.
/trace/status, /trace/event, /trace/count, /trace/sampler, /trace/buffers: Enable, disable and query tracepoints.
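
As a sketch of how a client might drive such requests from Python, the example below builds /env requests with the three HTTP methods mentioned above. Several details are assumptions and should be checked against the api-doc listings: the guest address and port in BASE_URL, the “/env/<name>” URL layout, and the “val” query parameter. The helper names env_request and send are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://192.168.122.89:8000"   # assumed guest address and port


def env_request(name, method="GET", value=None):
    """Build (but do not send) a request for the /env API."""
    url = f"{BASE_URL}/env/{urllib.parse.quote(name)}"   # assumed URL layout
    if value is not None:
        url += "?" + urllib.parse.urlencode({"val": value})   # assumed parameter name
    return urllib.request.Request(url, method=method)


def send(request, timeout=5):
    """Send a request to a running guest and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


get_req = env_request("FOO")                    # read a variable
set_req = env_request("FOO", "POST", "bar")     # modify it
del_req = env_request("FOO", "DELETE")          # delete it
```

Passing one of these objects to send() performs the actual call against a running guest; building the request separately keeps the method-to-operation mapping easy to inspect.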

The full-stack MIKELANGELO Instrumentation and Monitoring system has been designed with a flexible plugin architecture. An OSv monitoring plugin was developed so that it can retrieve data from an OSv guest, or multiple OSv guests, using the OSv REST API just described. The available monitoring data, including thread and tracepoint data as well as hardware configuration, can be discovered at runtime and only the specific data of interest captured, processed and published to the monitoring back-end.