Project Bardiche: Framed I/O

The series so far

If you’re getting started you’ll want to see the previous entries on Project Bardiche:


Framed I/O is a new abstraction that we’re currently experimenting with through Project Bardiche. We call this framed I/O, because the core concept is what we call a frame: a variable amount of discrete data that has a maximum size. In this article, we’ll call data that fits this framed. For example, Ethernet devices work exactly this way. They have a maximum size based on their MTU, but there may very well be less data available than the maximum. There are a few overarching goals that led us down this path:

The primary use case of framed I/O is for vnd and virtual machines. However, we believe that the properties here make it desirable to other portions of the stack which operate in terms of frames. To understand why we’re evaluating this abstraction, it’s worth talking about the other existing abstractions in the system.

read(2) and write(2)

Let’s start with the traditional and most familiar series of I/O interfaces: read(2), write(2), readv(2), and writev(2). These are the standard I/O system calls that most C programmers are familiar with. read(2) and write(2) originated in first edition UNIX. readv(2) and writev(2) supposedly came about during the development of 4.2 BSD. The read and write routines operate on streams of data. The callers and file descriptors have no inherent notion of data being framed and all framing has to be built into consumption of the data. For a lot of use cases, that is the correct abstraction.

The readv(2) and writev(2) interfaces allowed that stream to be vectored. It’s hard to say if these were vectored I/O abstraction in Operating Systems, but it certainly is one of the most popular ones from early systems that’s still around. Where read(2) and write(2) map a stream to a single buffer, these calls map a stream to a series of arbitrarily sized vectors. The act of vectorizing data is not uncommon and can be very useful. Generally, this is done when combining what may be multiple elements into one discrete stream for transferring. For example, if a program maintains one buffer for a protocol header and another buffer is used for the payload, then being able to specify a vector that includes both of these in one call can be quite useful.

When operating with framed data, these interfaces fall a bit short. The problem is that you’ve lost information that the system had regard the framing. It may be that the protocol itself includes the delineations, but there’s no guarantee that that data is correct. For example, if you had a buffer of size 1500, not only would something like read(2) only give you the total number of bytes returned, you wouldn’t be able to get the total number of frames. A return value of 1500 could be one large 1500 byte frame, it could be multiple 300 byte frames or anything in between.

getmsg(2) and putmsg(2)

The next set of APIs that are worth looking at are getmsg(2) and putmsg(2). These APIs are a bit different from the normal read(2) and write(2) APIs, they’re designed around framed messages. These routines use a struct strbuf which has the following members:

    int    maxlen;      /* maximum buffer length */
    int    len;         /* length of data */
    char   *buf;        /* ptr to buffer */

These interfaces allows for the consumer to properly express the maximum size of the frame that they expect and the amount of data that the given frame actually includes. This is very useful for framed data. Unfortunately, this API has some deficiencies. It doesn’t have the ability to break down the data into vectors nor do systems really have a means of working with multiple vectors at a time.

sendmsg(2) and recvmsg(2)

The next set of APIs that I explored and looked at focused on were the recvmsg(2) family, particularly the extensions that were introduced into the Linux kernel via sendmmsg(2) and [recvmmsg(2). The general design of the msghdr structure is good, though it understandably is designed around the socket interface. Unfortunately something like sendmsg(2) is not something that device drivers in most systems get, it currently only works for socket file systems, and a lot of things don't look like sockets. Things like ancillary data and the optional addresses are not as useful and don't have meaning for other styles of messages or if they do, they may not fit the abstraction that's been defined there.

Framed I/O

Based on our evaluations with the above APIs, a few of us chatted around Joyent's San Francisco office and tried to come up with something that might have the properties we felt made more sense for something like KVM networking. To help distinguish it from traditional Socket semantics or STREAMS semantics, we named it after the basic building block of the frame. The general structure is called a frameio_t which itself has a series of vector structures called a framevec_t. The structures roughly look like:

typedef struct framevec {
    void    *fv_buf;        /* Buffer with data */
    size_t  fv_buflen;      /* Size of the buffer */
    size_t  fv_actlen;      /* Amount of buffer consumed, ignore on error */
} framevec_t;

typedef struct frameio {
    uint_t  fio_version;    /* Should always be FRAMEIO_CURRENT_VERSION */
    uint_t  fio_nvpf;       /* How many vectors make up one frame */
    uint_t  fio_nvecs;      /* The total number of vectors */
    framevec_t fio_vecs[];  /* C99 VLA */
} frameio_t;

The idea here is that, much like a struct msgbuf, each vector component has a notion of what it's maximum size is and then the actual size of data consumed. These vectors can then be constructed into series of frames in multiple ways through the fio_nvpf and fio_nvecs members. The fio_nvecs field describes the total number of vectors and the fio_nvpf describes how many vectors are in a frame. You might think of the fio_nvpf member as basically describing how many iovec structures make up a single frame.

Consider that you have four vectors to play with, you might want to rig it up in one of several ways. You might want to map each message to a signle vector, meaning that you could read four messsages at once. You might want the opposite and map a single message to all four vectors. In that case you'd only ever read one message at a time, broken into four components. You could also break it down such that you always broke down a message into two vectors, that means that you'd be able to read two messages at a time. The following ASCII art might help.

1:1 Vector to Frame mapping

 +-------+  +-------+  +-------+  +-------+
 | msg 0 |  | msg 1 |  | msg 2 |  | msg 3 |
 +-------+  +-------+  +-------+  +-------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

4:1 Vector to Frame mapping

 |                  msg 0                 |
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

2:1 Vector to Frame Mapping

 +------------------+  +------------------+
 |       msg 0      |  |       msg 1      |
 +------------------+  +------------------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

Currently the maximum number of vectors allowed in a given call is limited to 32. As long as the total number evenly divides the number of frames per vector, than any value is alright.

By combining these two different directions, we believe that this'll be a useful abstraction and suitable for other parts of the system that operate on framed data, for example a USB stack. Another thing that this design lets us do is that by not constraining the content of the vectors, it would be possible to replicate something like the struct msghdr where the protocol header data was actually in the first vector.

Framed I/O now and in the Future

Today we've started plumbing this through in QEMU to account for its network device backend APIs that allow one to operate on traditional iovec. However, there's a lot more that can be done with this. For example, one of the things that is on our minds is writing a vhost-net style driver for illumos that can easily map data between the framed I/O representation and the virtio driver. With this, it'd even be possible to do something that's mostly zero-copy as well. Alternatively, we may also explore just redoing a lot of QEMU's internal networking paths to make it more friendly for sending and receiving multiple packets at once. That should certainly help with the overhead today of networking I/O in virtual machines.

We think that this might fit in other parts of the system as well, for example, it may make sense to be used as part of the illumos USB3 stack's design as the unit that we send data in. Whether it makes sense as anything more than just for the vnd device and this style of I/O time will tell.

Today vnd devices are exposed in libvnd through the vnd_frameio_read(3VND) and vnd_frameio_write(3VND) interfaces. So these can also be used for someone who's trying to develop their own services using vnd, for example, user land switches, firewalls, etc.

Next in the Bardiche Series

Next in the bardiche series, we'll be delving into some of the additional kernel subsystems and new DLPI abstractions that were created. Following those, we'll end with a recap entry on bardiche as a whole and what may come next.

Posted on March 25, 2014 at 8:57 am by rm · Permalink
In: Bardiche, Networking · Tagged with: ,