I’m Not Dead Yet! The Role of the
Operating System in a Kernel-Bypass Era
Irene Zhang, Jing Liu, Amanda Austin, Michael Lowell Roberts, and Anirudh Badam
Microsoft Research
miroberts@microsoft.com, anirudh.badam@microsoft.com
ACM Reference Format:
Irene Zhang, Jing Liu, Amanda Austin, Michael Lowell Roberts, and Anirudh Badam. 2019. I'm Not Dead Yet! The Role of the Operating System in a Kernel-Bypass Era. In Workshop on Hot Topics in Operating Systems (HotOS '19), May 13–15, 2019, Bertinoro, Italy. ACM, New York, NY, USA, 8 pages.
In contrast to classic I/O accelerators, modern datacenter accelerators commonly offer kernel bypass along with their other features, as the kernel adds significant overhead to every I/O access [5, 31, 51]. These kernel-bypass accelerators implement the needed OS features – multiplexing, isolation, address translation – to let applications safely access I/O without going through the kernel. While today's kernel-bypass accelerators have some limitations, eventually accelerators will eliminate the OS kernel from the fast I/O path, relegating the kernel to slow, control-path operations.
This datacenter trend raises an important question for operating system researchers: What role does the operating system play in the upcoming kernel-bypass era? As shown in Figure 1, kernel-bypass accelerators remove the OS kernel from the I/O data path but do not replace all of its functionality. Importantly, they lack high-level, device-agnostic abstractions offered by OS kernels, like files, sockets, and pipes.
Figure 1. Comparison of traditional server architecture and kernel-bypass server architecture. Kernel-bypass accelerators let applications safely access I/O devices but do not replace the bypassed OS functionality. Importantly, there is no longer a high-level, device-agnostic I/O abstraction.
This paper argues for an evolution of the datacenter operating system to provide a high-level kernel-bypass I/O abstraction. Just because kernel-bypass accelerators eliminate the OS kernel does not mean that application programmers must do without the benefits of an OS. We discuss how kernel-bypass devices have changed the datacenter and how operating systems should change as well. We propose a new OS architecture, the Demikernel, and discuss design challenges for the Demikernel and future datacenter OSes.
3.2 Provide a New I/O Abstraction
While user-level libraries that preserve the POSIX API (e.g., mTCP [25], F-stack [19]) are easy to use with existing applications, the legacy I/O abstraction imposes too much overhead. First, the POSIX abstraction requires applications to copy data from kernel buffers into application buffers. This copy is both inefficient (copying a 4 KB page takes 1 µs on a 4 GHz CPU, adding 50% overhead to Redis) and unnecessary (the data is already in the user-level address space). Second, UNIX pipes force applications to operate on streams of data; however, applications like Redis operate on atomic units of data. Redis can only process a read operation after the entire request has arrived; by the time Redis has inspected a pipe and found that its read operation is incomplete, it could have processed a request that was ready. The incompatibility of the existing POSIX interface with high-performance I/O processing calls for a new abstraction.
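To make the stream problem concrete, consider what a server must do just to assemble one request from a POSIX socket. The sketch below assumes a hypothetical 4-byte length-prefixed request format (not Redis's actual wire protocol): every read() pays a copy out of kernel buffers, and a partial result forces the server to park the connection and try again later.

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define MAX_REQ 4096

/* Returns 1 once a full request is buffered, 0 if it is still partial,
 * and -1 on error. buf must hold MAX_REQ bytes; *len tracks bytes so far. */
int try_read_request(int fd, char *buf, size_t *len)
{
    /* read() copies data out of kernel buffers into the application buffer. */
    ssize_t n = read(fd, buf + *len, MAX_REQ - *len);
    if (n <= 0)
        return -1;
    *len += (size_t)n;
    if (*len < sizeof(uint32_t))
        return 0;                  /* even the length prefix is incomplete */
    uint32_t req_len;
    memcpy(&req_len, buf, sizeof(req_len));
    /* Dispatch is possible only once the entire request has arrived. */
    return (*len >= sizeof(uint32_t) + req_len) ? 1 : 0;
}

A queue abstraction that delivers whole application-level data units eliminates both the copy and this re-assembly loop.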
3.3 Implement Differing OS Functionality
For performance, the Demikernel library OSes try to preserve the application data unit on the device if possible. Demikernel queues are not bound by hardware limitations (e.g., limited-capacity queues, fixed packet sizes) and are uniform across different devices. Since I/O devices commonly use hardware queues to interact, we found that the queue abstraction is general enough to apply to a wide range of I/O accelerators. The queue abstraction also lets applications express application-specific functions that can be offloaded to the I/O device through queue filter and map functions.
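As a toy, software-only illustration of the filter idea (all names here, such as toy_queue and pop_filtered, are ours, not Demikernel's), a filtered view of a queue delivers only the elements that satisfy a predicate; a libOS could equally push the same predicate down into a smart NIC:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define QCAP 8

typedef struct { const char *items[QCAP]; int head, tail; } toy_queue;

static void push(toy_queue *q, const char *msg) { q->items[q->tail++ % QCAP] = msg; }
static const char *pop(toy_queue *q)
{
    return (q->head == q->tail) ? NULL : q->items[q->head++ % QCAP];
}

/* Pop only elements that satisfy the predicate, skipping the rest;
 * on an accelerator, the predicate itself is the offloaded function. */
static const char *pop_filtered(toy_queue *q, bool (*pred)(const char *))
{
    const char *m;
    while ((m = pop(q)) != NULL)
        if (pred(m))
            return m;
    return NULL;
}

static bool is_get(const char *m) { return strncmp(m, "GET", 3) == 0; }

int main(void)
{
    toy_queue q = { 0 };
    push(&q, "SET k v");
    push(&q, "GET k");
    printf("%s\n", pop_filtered(&q, is_get));   /* prints "GET k" */
    return 0;
}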
4.3 Demikernel System Call Interface
The merge system call (line 14, Figure 3) returns a new queue that merges two existing queues: a pop on the merged queue returns data that arrives on either underlying queue, and a push to the merged queue results in a push to both underlying queues.
The filter system call (line 15, Figure 3) returns a new queue containing only the filtered elements from the original queue. A pop from the original queue results in a pop from the new queue if the element passes the filter.
// control path: network queue
int err = listen(int qd, ...);
int err = connect(int qd, ...);
// control path: file queue
// data path queue
// identical to a push, followed by a wait on the returned qtoken
// identical to a pop, followed by a wait on the returned qtoken

Figure 3. The Demikernel system call interface (excerpt).
Zero-copy I/O requires applications to coordinate shared memory access with the I/O devices; that is, the application cannot write or free any memory currently being accessed by an I/O device. Similarly, when a device finishes processing an I/O request, it needs to notify the application that it can modify or free the buffer. Such coordination between device and application must often be done across threads or components, making it difficult for applications to accomplish on their own.
To minimize this coordination, the Demikernel interface provides free-protection for I/O memory buffers. Applications can free buffers while they are in use by a device, but the libOS will not deallocate the buffer until the device completes its I/O. The Demikernel interface does not offer write-protection for I/O buffers, which would be too expensive. Thus, applications must still wait until their I/O completes (i.e., push returns or a wait on a qtoken completes) before modifying buffers, as they do for traditional zero-copy I/O.
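One plausible way for a libOS to implement free-protection is per-buffer reference counting; the sketch below is our illustration of the idea, not Demikernel's actual mechanism. The application's free drops one reference and each device completion drops another, so memory is returned only after both the application and the device are done with it:

#include <stdatomic.h>
#include <stdlib.h>

typedef struct io_buf {
    atomic_int refs;    /* one reference for the app, one per in-flight I/O */
    char data[];
} io_buf;

io_buf *io_buf_alloc(size_t len)
{
    io_buf *b = malloc(sizeof(io_buf) + len);
    if (b)
        atomic_init(&b->refs, 1);           /* the application's reference */
    return b;
}

static void io_buf_put(io_buf *b)
{
    if (atomic_fetch_sub(&b->refs, 1) == 1)
        free(b);                            /* last reference: deallocate */
}

void io_buf_start_io(io_buf *b)    { atomic_fetch_add(&b->refs, 1); }
void io_buf_io_complete(io_buf *b) { io_buf_put(b); }   /* device finished */
void io_buf_free(io_buf *b)        { io_buf_put(b); }   /* app may call anytime */

Because there is no write-protection, this scheme cannot stop an application from modifying a buffer mid-I/O; it only guarantees that the memory is not recycled underneath the device.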
Library OSes face a similar challenge in providing efficient access to storage. If Demikernel library OSes used a custom disk layout for performance, any application would have to find a compatible libOS to read stored data. However, existing disk layouts (e.g., ext4) may impose unnecessary overhead: each Demikernel libOS supports only a single application, which may not require an entire UNIX file system. Future work could include the design of an accelerator-specific storage layout.
6 Related Work
References

[5] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proc. of OSDI, 2014.
[6] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. ACM SIGARCH Computer Architecture News, 28(5):117–128, 2000.
P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, et al. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review, 44(3):87–95, 2014.
[11] Y. Chen, Z. Wang, B. Zang, et al.
[13] J. Corbet, A. Rubini, and G. Kroah-Hartman. Linux Device Drivers: Where the Kernel Meets the Hardware. O'Reilly Media, Inc., 2005.
[14] A. Currid. TCP offload to the rescue. ACM Queue, 2004.
[15] Data Plane Development Kit. https://www.dpdk.org/.
[19] F-Stack. http://www.f-stack.org/.
[20] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, 2004.
[21] H. Gilmore. The Cloud as a Tectonic Shift in IT: The Death of Operating Systems (as We Know Them). CloudBees, July 2012.
[25] E. Jeong, S. Woo, M. A. Jamshed, H. Jeong, S. Ihm, D. Han, and K. Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In Proc. of NSDI, 2014.
[28] A. Kalia, M. Kaminsky, and D. G. Andersen. Datacenter RPCs can be general and fast. In Proc. of NSDI, 2019.
[29] A. Kalia, M. Kaminsky, and D. G. Andersen. Using RDMA efficiently for key-value services. ACM SIGCOMM Computer Communication Review, 44(4):295–306, 2014.
[33] A. Kaufmann, T. Stamler, S. Peter, N. K. Sharma, T. Anderson, and A. Krishnamurthy. TAS: TCP acceleration as a service. In Proc. of EuroSys, 2019.
[36] Y. Kwon, H. Fingler, T. Hunt, S. Peter, E. Witchel, and T. Anderson. Strata: A cross media file system. In Proc. of SOSP, 2017.
B. Leslie, P. Chubb, N. FitzRoy-Dale, S. Götz, C. Gray, L. Macpherson, D. Potts, Y.-T. Shen, K. Elphinstone, and G. Heiser. User-level device drivers: Achieved performance. Journal of Computer Science and Technology, 20(5):654–664, 2005.
[39] libevent: An event notification library. http://libevent.org/.
[40] S. McCanne and V. Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In Proc. of USENIX Winter, 1993.
[43] Mellanox. Innova Flex SmartNIC.
R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy, S. Ratnasamy, and S. Shenker. Revisiting network support for RDMA. In Proc. of SIGCOMM, 2018.
[54] L. Rizzo. netmap: A novel framework for fast packet I/O. In Proc. of USENIX ATC, 2012.
[55] SolarFlare.
[56] Storage Performance Development Kit. https://spdk.io/.