Introduction

Over the course of the MIKELANGELO project, different attempts have been made to improve the performance of applications running inside the OSv operating system. In the context of this work, shorter latency and higher throughput are the main metrics of network performance. The MIKELANGELO project has already made significant improvements in supporting the functional requirements of the workloads running in OSv. This has been demonstrated by the Cancellous bones and Aerodynamics use cases, which were successfully executed in OSv early in the project. On the other hand, OSv still lacks performance improvements that would allow it to be used in the high performance computing domain. One of its most limiting factors is its inability to interpret and exploit the NUMA (Non-Uniform Memory Access) topology of modern processors. Because OSv treats memory from different NUMA domains equally, it is unable to optimise CPU and memory pinning the way general purpose operating systems do.

One way of addressing this problem is the adoption of the Seastar framework, which handles CPU and memory pinning internally and thus works around this limitation. However, this requires a complete rewrite of the application logic, which is typically impractical or even impossible for large HPC simulation code bases. To this end, this section introduces a novel communication path for collocated virtual machines, i.e. VMs running on the same physical host. With the support of the hypervisor (KVM), the operating system (OSv) is able to choose the optimal communication path between a pair of VMs. The default mode of operation is to use the standard TCP/IP network stack. However, when VMs residing on the same host start communicating, the standard TCP/IP stack is replaced with direct access to a shared memory pool initialised by the hypervisor. This allows the cloud and HPC management layer to distribute VMs optimally between NUMA nodes. This section explains the design in detail, together with the findings based on the prototype implementation.

Another driver for investigating an additional approach to optimising intra-host communication is the rise of so-called serverless operations. Serverless architectures are becoming widely adopted by all major public cloud providers (Amazon, Google and Microsoft) in the form of so-called cloud functions (Amazon Lambda, Google Cloud Functions and Azure Cloud Functions). This has been made possible by tremendous technical advances in container technology and container management layers. Cloud functions are stateless services communicating with each other and with other application components over established channels using standard APIs and contracts. The very nature of these services allows for seamless optimisation of workloads, service placement management and reuse, as well as fine-grained control over the billable metrics in the infrastructure.

OSv-based unikernels are ideal for the implementation of such serverless architectures because they are extremely lightweight, boot very fast and support running most existing workloads, i.e. they mimic the advantages of containers. Furthermore, all the benefits of the hypervisor are preserved (for example, support for live migration, high security, etc.). In light of these facts, optimising the networking stack for collocated unikernels will allow for even better resource utilisation, because interrelated services will be able to bypass the complete network stack when sharing data with each other.

Background

The serverless operations described in the introduction are an extreme example of separating large monolithic applications into smaller services, each specialising in a specific piece of business logic. Even when not targeting serverless deployments, modern applications are typically split into multiple weakly coupled services which communicate via an IP network. Figure 1 presents the typical communication paths used by applications running inside virtual machines. In this case, the server application, e.g. a database, is running in VM-1, while the client applications, e.g. data processing services, run in VM-2 and VM-3. Virtual machines VM-1 and VM-2 are collocated on the same host, while VM-3 is placed on a different host.

Figure 1. Traditional communication paths used by applications running inside virtual machines

When the client application (VM-3) runs on a different host than the server application (VM-1), the data traverses many layers. First, the data goes through the entire network stack of the guest operating system and is sent to the software switch on the client's virtualization host. From there, the packet is sent to the physical network. On the other side, the virtualization host of the server receives the packet from the physical network and forwards it to its software switch. The packet finally goes through the server's network stack and is delivered to the server application.

When the server and client machines are collocated on the same virtualization host (VM-1 and VM-2), the whole process is greatly simplified. In this case, the physical network is not used because the virtual switch in the host automatically detects the collocated virtual machine and delivers the packet directly into the network stack of the target VM. This allows higher throughput at lower resource usage.

The work proposed in this section is expected to further improve the latter case, i.e. intra-host TCP/IP-based communication, by eliminating most of the IP processing. This is particularly important because the TCP protocol was designed on top of an unreliable, lossy, packet-based network. As such, it implements retransmission on errors, acknowledgements and automatic window scaling in order to guarantee a reliable and lossless data stream. However, it is safe to assume that communication between collocated VMs is highly reliable.

To eliminate most of the aforementioned processing overhead of the network stack, we aim to implement TCP/IP-like communication based on shared memory. Instead of sending actual IP packets via the software switch, the source VM puts data into memory shared between VMs on the same host. The target VM can then access the data directly. Multiple approaches to bypassing the network stack have been analysed by Ren et al.[i]. The approaches are classified according to the layer of the network stack at which the bypass is built in. In general, the higher in the application/network stack the bypass is implemented, the higher the throughput and the lower the latency that can be achieved, because more layers of the networking stack are avoided. On the other hand, the higher the layer at which the bypass is implemented, the less transparency is preserved. For example, replacing all send() system calls with a simple memcpy() will result in fast communication between two VMs. However, this performance benefit comes with significant additional constraints for the application: every part of the application that communicates with other VMs has to be changed in the source code, and the memory-only communication prevents distributed deployment of such an application. This approach thus requires additional development and debugging effort for each application that should benefit from the modifications.
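To make the loss of transparency concrete, the following deliberately naive sketch shows what such an application-level bypass amounts to: the portable send() call is replaced by a direct copy into a region shared with the target VM. The names shm_region and shm_head are hypothetical and assumed to be set up elsewhere (e.g. via an ivshmem mapping); this is not part of the proposed implementation.

    #include <atomic>
    #include <cstddef>
    #include <cstring>

    // Hypothetical application-level bypass: instead of the portable POSIX call
    //   send(sock_fd, buf, len, 0);
    // the application copies the payload straight into a region shared with the
    // target VM. Bounds checking and ring wrap-around are deliberately omitted.
    extern char *shm_region;                     // region shared with the peer VM
    extern std::atomic<std::size_t> shm_head;    // producer index

    void bypass_send(const void *buf, std::size_t len)
    {
        std::memcpy(shm_region + shm_head.load(), buf, len);
        shm_head.fetch_add(len);                 // no retransmissions, no ACKs
    }

The copy itself is fast, but the application is now tied to collocated deployment and every communication site in its source code has to be adapted, which is exactly the intrusiveness described above.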

Design Considerations

Our main interest lies in massively parallel workloads on top of Open MPI and in microservices. MPI-based workloads typically use shared memory for interprocess communication between parallel workers running on the same host, and InfiniBand or similar interconnects when deployed in an HPC environment. Both of these techniques reduce latency and improve the overall throughput of the communication. However, when workloads are deployed in virtual machines, these mechanisms are no longer supported out of the box. While vRDMA brings InfiniBand into virtual machines by providing paravirtual device drivers (Linux and OSv), communication between virtual machines running on the same host has not been addressed in OSv. Standard TCP/IP networking is used when VMs on the same host need to share data. This adds a significant overhead because the communication must adhere to protocols that have been designed for unreliable media. When it comes to microservice and serverless architectures, this problem escalates even further, because services that need to collaborate to fulfil their tasks typically reside in different containers or virtual machines.

Both of the above types of workloads typically use TCP (Transmission Control Protocol) as the transport protocol. This allows us to focus only on bypassing the TCP protocol using a shared memory approach, while leaving UDP (User Datagram Protocol) and any other protocol intact.

For Open MPI, we intend to use the bypassed TCP to overcome the current OSv limitation regarding NUMA memory pinning support. When used on a server with multiple NUMA nodes (e.g. with multiple physical CPU sockets), Open MPI uses memory pinning to ensure that CPU cores use memory with the lowest possible latency. MPI processes within one server communicate via the shared memory based BTL (Byte Transfer Layer), while MPI processes on different servers communicate via a dedicated interconnect (InfiniBand) BTL or via the TCP/IP BTL.

We have already mentioned in the introduction that OSv is currently not NUMA memory aware. An OSv VM spanning multiple NUMA nodes correctly pins MPI threads to CPU cores, however the memory for a specific thread is allocated from the memory pool of the entire virtual machine. The memory of a thread pinned to one NUMA node can thus be allocated on a different (foreign) NUMA node, resulting in deteriorated performance (according to our measurements, typically around 30%). To overcome this, we intend to exploit the network stack bypass and deploy one OSv VM per NUMA node. The libvirt domain specification used to configure the virtual machines will ensure that each VM gets memory from a single NUMA node. Just as before, MPI processes running in the same VM will communicate with each other via the shared memory BTL[ii]. Processes running on different physical hosts will also use the same channels as before (either a dedicated interconnect or TCP/IP). The main difference will be for the processes running in VMs on the same physical host. These will still use the TCP/IP BTL provided by Open MPI; however, OSv, with the support of the hypervisor, will ensure that the channel is properly bypassed, exploiting the memory shared between the VMs. This approach is expected to improve the performance of MPI computing in OSv by improving the communication performance itself.
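For illustration only, a libvirt domain definition for such a per-NUMA-node VM could pin its virtual CPUs and memory roughly as sketched below; the CPU set and node numbers are placeholders that depend on the actual host topology, and the rest of the domain definition is elided.

    <domain type='kvm'>
      ...
      <vcpu placement='static' cpuset='0-7'>8</vcpu>   <!-- cores of host NUMA node 0 -->
      <numatune>
        <!-- allocate all guest memory strictly from host NUMA node 0 -->
        <memory mode='strict' nodeset='0'/>
      </numatune>
      ...
    </domain>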

As discussed above, the higher in the communication stack the shared memory is introduced, the higher the expected performance gain. In the case of Open MPI this means that optimal performance would be achieved by replacing the communication channel either in the application (for example OpenFOAM) or by introducing an additional BTL component. However, this would result in a loss of transparency and would only work for a single workload. While this would be sufficient for the use cases of the MIKELANGELO project, we decided to place the bypass one layer below, i.e. to build it directly into the network stack of the OSv kernel. The existing networking API (the POSIX socket API) is fully preserved, offering existing applications the ability to use the new communication channel transparently, whenever it is available.

The implementation of this bypass requires changes to OSv ensuring that standard functions such as socket(), bind(), listen(), connect(), send() and recv() are reimplemented in order to support shared memory as a communication backend.

The following diagram shows the high-level design. Neither the server nor the client application needs to know about the change in the communication channel. The aforementioned API functions transparently choose either the standard network stack or shared memory, depending on the IP address of the packet's destination.

Figure 2. High-level architecture of UNCLOT.
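To make the transparent selection concrete, the sketch below shows the kind of dispatch a reimplemented send() might perform. All helper functions (peer_address(), is_collocated(), shm_channel_send(), net_stack_send()) are hypothetical and only illustrate the idea; the actual OSv code paths differ.

    #include <netinet/in.h>
    #include <sys/types.h>
    #include <cstddef>

    // Hypothetical helpers: peer_address() returns the destination of the
    // connected socket, is_collocated() consults the registry of VMs on this
    // host, and the two *_send() functions stand for the shared memory path
    // and the unmodified TCP/IP path, respectively.
    struct in_addr peer_address(int fd);
    bool is_collocated(struct in_addr peer);
    ssize_t shm_channel_send(int fd, const void *buf, size_t len, int flags);
    ssize_t net_stack_send(int fd, const void *buf, size_t len, int flags);

    // Sketch of a reimplemented send(); the caller sees the usual POSIX contract.
    ssize_t send(int fd, const void *buf, size_t len, int flags)
    {
        if (is_collocated(peer_address(fd))) {
            // the peer VM runs on the same host: use the shared memory channel
            return shm_channel_send(fd, buf, len, flags);
        }
        // otherwise fall through to the regular TCP/IP network stack
        return net_stack_send(fd, buf, len, flags);
    }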

The current IP bypass will be implemented for KVM virtualization only. KVM/QEMU ivshmem[iii] will be used to establish shared memory between VMs on the same host. In addition to basic read/write access to the shared memory, a notification mechanism will have to be implemented. This will be used by the sending VMs to notify the receiving ones of the availability of new data. Different types of notification mechanisms are going to be examined (e.g. IPI interrupts, ivshmem interrupts or simple polling).
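As an illustration of the design space, the data placed into the ivshmem region could be organised as a simple single-producer/single-consumer ring buffer, with the notification question reduced to how the consumer learns that the head index has advanced. The sketch below uses plain polling; the layout, sizes and names are illustrative assumptions, not the final design.

    #include <algorithm>
    #include <atomic>
    #include <cstddef>

    // Illustrative layout of one unidirectional channel inside the ivshmem
    // region; the real design may differ (interrupt-based notification,
    // different ring size, per-socket channels, ...).
    struct shm_ring {
        std::atomic<std::size_t> head;   // written by the producer VM
        std::atomic<std::size_t> tail;   // written by the consumer VM
        char data[64 * 1024];            // payload area (power-of-two size)
    };

    // Consumer side with simple polling as the notification mechanism.
    std::size_t ring_read(shm_ring *r, char *out, std::size_t max)
    {
        std::size_t head, tail = r->tail.load(std::memory_order_relaxed);
        while ((head = r->head.load(std::memory_order_acquire)) == tail)
            ;                            // poll until the producer advances head
        std::size_t n = std::min(max, head - tail);
        for (std::size_t i = 0; i < n; ++i)
            out[i] = r->data[(tail + i) % sizeof(r->data)];
        r->tail.store(tail + n, std::memory_order_release);
        return n;
    }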

The detection of collocated VMs should be done automatically. Collocation detection assumes that the IP address is used as the unique identifier of a given virtual machine. A central registry of all VMs running on the same host (and their IP addresses) has to be integrated, allowing the host (or the VMs on that host) to check whether an IP address belongs to a VM on the same host. This registry needs to be updated by libvirt start/stop hooks, along with some other changes to the VM networking. The mechanisms are thus built into the network stack of the underlying operating system which, with the support of the hypervisor, automatically identifies and configures the networking channels between VMs depending on their location.
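Assuming the registry is exported to the guests, for example as a small table in a read-only shared region, the lookup performed by the guest network stack can be as simple as the following sketch (the structure and names are hypothetical):

    #include <netinet/in.h>
    #include <cstddef>

    // Hypothetical registry exported by the host: the IPv4 addresses of all
    // VMs currently running on this physical machine, maintained by the
    // libvirt start/stop hooks.
    struct host_registry {
        std::size_t count;
        in_addr_t   addrs[256];   // addresses in network byte order
    };

    // Returns true when the destination belongs to a VM on the same host,
    // i.e. when the bypassed shared memory channel may be used.
    bool is_collocated(const host_registry *reg, in_addr_t dst)
    {
        for (std::size_t i = 0; i < reg->count; ++i)
            if (reg->addrs[i] == dst)
                return true;
        return false;
    }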

An important consequence of this is that the live migration of all or just a subset of the virtual machines running on a single node is not prevented by the introduction of these mechanisms. Similarly to vRDMA, certain prerequisites have to be met in order to ensure a smooth migration:

  • Disable writing to the socket
  • Wait for the old data in the socket to be consumed, or, alternatively, copy the data to the normal TCP/IP socket buffer
  • Disable the bypass for that socket and revert to normal TCP/IP, reusing the same file descriptor
  • Continue with the live migration

After a successful migration, the following steps are required to restore the proper functioning of the networking stack (both hand-over sequences are sketched after the list below):

  • Identify sockets suitable for the bypass
  • Temporarily stop writing to those sockets
  • Wait for the old data in the socket to be consumed by the receiver, or, alternatively, copy the data to the bypassed socket buffer
  • Enable the bypass for the corresponding sockets
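Both hand-over sequences could be driven by a small per-socket state machine; the sketch below is a minimal illustration with hypothetical names and without error handling or locking.

    // Minimal sketch of the per-socket hand-over around live migration.
    // All names are hypothetical assumptions, not the actual implementation.
    enum class chan_state { BYPASS, DRAINING, TCP_FALLBACK };

    struct bypass_socket {
        chan_state state;
    };

    // Hypothetical helpers implementing the individual steps listed above.
    void drain_or_copy_to_tcp_buffer(bypass_socket &s);
    bool peer_is_collocated(const bypass_socket &s);
    void move_pending_to_shared_ring(bypass_socket &s);

    // Before migration: stop writing, drain the shared ring, fall back to TCP/IP.
    void prepare_migration(bypass_socket &s)
    {
        s.state = chan_state::DRAINING;        // writers are paused or redirected
        drain_or_copy_to_tcp_buffer(s);        // consume or move any pending data
        s.state = chan_state::TCP_FALLBACK;    // same file descriptor, plain TCP/IP
    }

    // After migration: re-detect collocated peers and re-enable the bypass.
    void resume_after_migration(bypass_socket &s)
    {
        if (peer_is_collocated(s)) {
            s.state = chan_state::DRAINING;    // pause writers again
            move_pending_to_shared_ring(s);    // or wait for the receiver to drain
            s.state = chan_state::BYPASS;
        }
    }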

For IP-bypassed sockets, the IOCTLs (Input/Output Control calls) for selecting blocking/non-blocking mode, enabling/disabling Nagle's algorithm, etc., should also work. Some of the socket options specified by the POSIX API have little effect when used over shared memory. For example, Nagle's algorithm[iv] was developed in 1984 to improve performance by reducing the number of packets sent over the network, a feature that is not needed when communicating over shared memory. However, other options have to be implemented in a transparent way. This will allow existing applications to rely on the same API contract as they do right now. For example, reading from a blocking or a non-blocking socket behaves differently when there is no data to read, and existing applications were written with that in mind.
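For instance, a reimplemented option handler might simply accept TCP_NODELAY on a bypassed socket while delegating everything else to the regular implementation. The sketch below is illustrative; is_bypassed() and fallback_setsockopt() are assumed helpers.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    // Hypothetical helpers of the bypassed socket implementation.
    bool is_bypassed(int fd);
    int  fallback_setsockopt(int fd, int level, int optname,
                             const void *optval, socklen_t optlen);

    // Sketch of option handling: TCP_NODELAY is accepted but has no effect on
    // a bypassed socket, while all other options keep their usual semantics.
    int handle_setsockopt(int fd, int level, int optname,
                          const void *optval, socklen_t optlen)
    {
        if (is_bypassed(fd) && level == IPPROTO_TCP && optname == TCP_NODELAY) {
            // Nagle's algorithm is meaningless over shared memory: report
            // success so the application-visible API contract is preserved.
            return 0;
        }
        return fallback_setsockopt(fd, level, optname, optval, optlen);
    }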

The initial testing and benchmarking will be done using common synthetic tools such as iperf[v], netperf[vi] and the Apache benchmark[vii]. Additional synthetic tests will be implemented for functional testing, ensuring that the specific mechanisms of the network bypass are properly supported and compliant with the POSIX socket API. This should cover the basic socket API functions (socket, bind, listen, accept, connect, send, write, recv, read, etc.) and ioctl() behaviour. As many new applications use select/poll/epoll and event-based processing, the behaviour of those calls should be checked too. The final evaluation will be performed using real-world workloads such as OpenFOAM.
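A typical functional test exercises the unmodified POSIX calls from an ordinary application. The minimal echo client below contains nothing bypass-specific; the same binary must behave identically whether the peer VM is collocated (shared memory path) or remote (regular TCP/IP). The address and port are placeholders.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <cstdio>

    // Minimal echo client used for functional testing of the bypassed stack.
    // 192.168.122.10 and port 5001 are placeholder values.
    int main()
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in srv{};
        srv.sin_family = AF_INET;
        srv.sin_port = htons(5001);
        inet_pton(AF_INET, "192.168.122.10", &srv.sin_addr);

        if (connect(fd, (sockaddr *)&srv, sizeof(srv)) != 0)
            return 1;

        const char msg[] = "ping";
        send(fd, msg, sizeof(msg), 0);          // uses shared memory if collocated

        char buf[64];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        printf("received %zd bytes\n", n);
        close(fd);
        return 0;
    }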

[i] Yi Ren, Ling Liu, Qi Zhang, Qingbo Wu, Jianbo Guan, Jinzhu Kong, Huadong Dai, and Lisong Shao. 2016. Shared-memory optimizations for inter virtual machine communication. ACM Comput. Surv. 48, 4, Article 49 (February 2016), 42 pages. http://www.cc.gatech.edu/grads/q/qzhang90/docs/YiRen-CSUR.pdf

[ii] Open MPI Shared Memory BTL, https://www.open-mpi.org/faq/?category=sm.

[iii] QEMU ivshmem device specification, https://github.com/qemu/qemu/blob/master/docs/specs/ivshmem-spec.txt

[iv] Nagle’s algorithm, https://en.wikipedia.org/wiki/Nagle%27s_algorithm

[v] Iperf tool, https://iperf.fr

[vi] Netperf tool, https://linux.die.net/man/1/netperf

[vii] Apache bench tool, http://httpd.apache.org/docs/current/programs/ab.html