Tales from a Core File


I’m happy to announce the second development preview of my network virtualization work for illumos (or, if you prefer buzzwords, software defined networking). Like the previous entry, the goal is to give folks something to play around with and a sense of what this looks like for a user and administrator.

The dev-overlay branch of illumos-joyent has all of the source code and has been merged up with illumos and illumos-joyent as of September 22nd.

This is a development preview, so it’s using a debug build of illumos. This is not suitable for production use. There are bugs; expect panics.

How we got here

It’s worth taking a little bit of time to understand the class of problems that we’re trying to solve. At the core of this work is a desire to have multiple logical layer two networks all use one physical, or underlay, network. This means that you can run multiple virtual machines that each have their own independent set of VLANs and private address space: both Alice and Bob can have their own VMs using the same private IP addresses, say 10.1.2.3/24, and be confident that they will not see each other’s traffic.

What’s in this Release

This release builds on the last release, which had simple point-to-point tunnels. This release adds support for the following:

  • snoop support for decoding VXLAN frames
  • Kernel overlay driver support for dynamic plugins
  • A new files backend for varpd that supports proxy ARP, NDP, and DHCPv4

This release also has a similar set of known issues:

  • All overlay devices are temporary
  • Overlay device deletion still isn’t 100% there
  • Overlay devices only work in the global zone
  • It is missing manual pages

Dynamic Plugins

In the first release, overlay devices only supported the direct plugin, which always sent all traffic to a single destination. While useful, it meant that a given tunnel was limited to point-to-point operation. The notion of a dynamic plugin changes this entirely: traffic can be encapsulated and sent to different hosts based on its destination MAC address. Instead of getting a single destination from userland at device creation, the kernel asks userland to supply the destination on demand.
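As a sketch of that contract, the question a dynamic plugin answers boils down to a lookup from a destination MAC address to an underlay destination. The names and table layout below are our own invention for illustration, not varpd’s actual interfaces:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative sketch: given a destination MAC address on the overlay,
 * where (underlay IP, UDP port) should the encapsulated frame be sent?
 */
typedef struct vl2_entry {
	uint8_t		ve_mac[6];	/* destination MAC on the overlay */
	const char	*ve_ip;		/* underlay IP to tunnel to */
	uint16_t	ve_port;	/* underlay UDP port (VXLAN: 4789) */
} vl2_entry_t;

/* Return the entry for a MAC, or NULL if no mapping is known. */
const vl2_entry_t *
vl2_lookup(const vl2_entry_t *tbl, size_t n, const uint8_t mac[6])
{
	for (size_t i = 0; i < n; i++) {
		if (memcmp(tbl[i].ve_mac, mac, 6) == 0)
			return (&tbl[i]);
	}
	return (NULL);	/* unknown: the kernel must ask again or drop */
}
```

The point of the indirection is that this table can live anywhere userland likes: a flat file, a centralized database, or a distributed system.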

Allowing an answer to be supplied this way makes it much easier to write different ways of answering the question in userland. As individuals and organizations figure out their own strategies, it becomes much easier to interface with existing centralized databases or extant distributed systems.

In addition, as part of writing a simple files backend, I wrote several routines that can be used to inject proxy ARP, proxy NDP, and proxy DHCPv4 requests. Having these primitives in the common library makes it much easier for different backends which don’t support multicast or broadcast traffic to have something to use.

The files plugin format

In the next section we’ll show an example of getting started and having three different VMs use the same file for understanding our virtual network’s layout. Here’s a copy of the file /var/tmp/hosts.json that I’ve been using:

# cat /var/tmp/hosts.json
{
        "de:ad:be:ef:00:00": {
                "arp": "10.55.55.2",
                "ip": "10.88.88.69",
                "ndp": "fe80::3",
                "port": 4789
        },
        "de:ad:be:ef:00:01": {
                "arp": "10.55.55.3",
                "dhcp-proxy": "de:ad:be:ef:00:00",
                "ip": "10.88.88.70",
                "ndp": "fe80::4",
                "port": 4789
        },
        "de:ad:be:ef:00:02": {
                "arp": "10.55.55.4",
                "ip": "10.88.88.71",
                "ndp": "fe80::5",
                "port": 4789
        }
}

In this JSON blob, each key is the MAC address of a VNIC. Each key must have members named ip and port, which the plugin uses to answer the question: where should a packet with this MAC address be sent? The ip member may be either an IPv4 or IPv6 address.

Machines send packets to a specific MAC address, and they look up the mapping between a MAC address and an IP address using different mechanisms for IPv4 and IPv6. IPv4 uses ARP to get this information, which devolves into using broadcast frames, while neighbor discovery is built into IPv6 on top of ICMPv6, with messages generally sent to specific multicast addresses. However, because this backend does not support broadcast or multicast traffic, we need to do something a little different.

When the kernel encounters a destination MAC address that it doesn’t recognize, it asks userland where the packet should be sent. Userland in turn looks at the layer two header and determines what to do. When it sees something that suggests an ARP or NDP packet, it pulls down a copy of the entire packet; if it confirms that the packet is in fact ARP or NDP, it generates a response on its own using the information encoded in the JSON file above and injects that response into the overlay device for delivery.

The system determines the mapping between an IPv4 address and its MAC address from the IP address supplied in the arp field. It determines the mapping between an IPv6 address and its MAC address using the ndp field.
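To make the proxy ARP step concrete, here is a hedged sketch of generating the reply itself. It fills in the 28-byte ARP payload used for Ethernet and IPv4; the function name is ours, not varpd’s, and real code would of course also build the enclosing Ethernet header:

```c
#include <stdint.h>
#include <string.h>

/*
 * Synthesize an ARP is-at reply: "answer_ip is at answer_mac",
 * addressed back to the requester. Offsets follow the standard
 * 28-byte ARP payload for Ethernet hardware and IPv4 protocol
 * addresses (RFC 826).
 */
void
proxy_arp_reply(uint8_t pkt[28], const uint8_t answer_mac[6],
    const uint8_t answer_ip[4], const uint8_t req_mac[6],
    const uint8_t req_ip[4])
{
	pkt[0] = 0x00; pkt[1] = 0x01;	/* hardware type: Ethernet */
	pkt[2] = 0x08; pkt[3] = 0x00;	/* protocol type: IPv4 */
	pkt[4] = 6;			/* hardware address length */
	pkt[5] = 4;			/* protocol address length */
	pkt[6] = 0x00; pkt[7] = 0x02;	/* opcode 2: reply (is-at) */
	memcpy(&pkt[8], answer_mac, 6);	/* sender hardware address */
	memcpy(&pkt[14], answer_ip, 4);	/* sender protocol address */
	memcpy(&pkt[18], req_mac, 6);	/* target hardware address */
	memcpy(&pkt[24], req_ip, 4);	/* target protocol address */
}
```

With the hosts.json above, a who-has for 10.55.55.2 would be answered with de:ad:be:ef:00:00 without any broadcast ever crossing the underlay.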

Finally, to better explore this prototype, I implemented a DHCP proxy capability. While DHCP has a defined system of relaying, the relay expects to be able to receive layer two broadcast packets. Instead, if we see a UDP broadcast packet that’s making a DHCP query, we rewrite the frame to send it explicitly to the MAC address listed in the dhcp-proxy member. In this case, if I run a DHCPv4 server on the host listed in the first entry, it will properly serve an address to the MAC address that has the dhcp-proxy entry. An important caveat: even though DHCP was able to assign an address, the client still needs to be able to perform ARP, so if the assigned address doesn’t match the one in the files entry, it will not work. To do that properly, you need to write a plugin that’s a bit more sophisticated than the files backend.
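The rewrite itself is small. A minimal sketch, assuming a 14-byte Ethernet header and helper names of our own invention (the real backend also has to verify the packet really is a DHCP query before touching it):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Is the 14-byte Ethernet header addressed to the broadcast MAC? */
bool
eth_is_broadcast(const uint8_t hdr[14])
{
	static const uint8_t bcast[6] =
	    { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

	return (memcmp(hdr, bcast, 6) == 0);
}

/*
 * Redirect a broadcast DHCP frame: overwrite the destination MAC
 * with the MAC named in the dhcp-proxy member, turning the frame
 * into something the point-to-point lookup machinery can route.
 */
void
dhcp_proxy_rewrite(uint8_t hdr[14], const uint8_t proxy_mac[6])
{
	memcpy(hdr, proxy_mac, 6);
}
```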

Getting Started

This development release of SmartOS comes in three flavors:

Once you boot this version of SmartOS, you should be good to go. As an example, I’ll show how I set up three individual hosts, which we’ll call zlan, zlan2, and zlan3. I put the JSON file shown above at /var/tmp/hosts.json on each host. On the host zlan I ran the following commands:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip=10.88.88.69 -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:00 -l overlay0 foo0
# ifconfig foo0 plumb up 10.55.55.2/24
# ifconfig foo0 inet6 plumb up
# ifconfig foo0 inet6 fe80::3

On the host zlan2 I ran the following:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip=10.88.88.70 -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:01 -l overlay0 bar0
# ifconfig bar0 plumb up 10.55.55.3/24
# ifconfig bar0 inet6 plumb up
# ifconfig bar0 inet6 fe80::4

And finally on the host zlan3, I ran the following:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip=10.88.88.71 -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:02 -l overlay0 baz0
# ifconfig baz0 plumb up 10.55.55.4/24
# ifconfig baz0 inet6 plumb up
# ifconfig baz0 inet6 fe80::5

With all that done, all three hosts could ping and access network services on each other.

Concluding

The dynamic plugins allow us to start building and experimenting with something a bit more interesting than the point-to-point tunnel. From here, there isn’t much core functionality left to add, but there’s a lot of stability work and improvement needed throughout the stack. In addition, I’ll be experimenting with some more distributed systems to make the next dynamic plugin much more dynamic.

If you have any feedback, suggestions, or anything else, please let me know. You can find me on IRC (rmustacc in #smartos and #illumos on irc.freenode.net) or on the smartos-discuss mailing list. If you’d like to work on support for other encapsulation methods such as NVGRE or want to see how implementing a dynamic mapping service might be, reach out to me.

At Joyent I’ve been spending my time designing and building support for network virtualization in the form of protocols like VXLAN. I’ve gotten far enough along that I’m happy to announce the first SmartOS development preview of this work. The goal is just to give folks something to play around with and start getting a sense of what this looks like. If you have any feedback, please send it my way!

All the development of this is going on in its own branch of illumos-joyent: dev-overlay. You can see all of the development, including a README that gives a bit of introduction and background, on that branch.

The development preview below is a debug build of illumos. This is not suitable for production use. There are bugs. Expect panics.

What’s in this release

This release adds the foundation for overlay devices and their management in user land. With this you can create and list point-to-point VXLAN tunnels and create VNICs on top of them, all through dladm. This release also includes a preliminary version of the varpd daemon, which manages user land lookups and will be used for custom lookup mechanisms in the future.

However, there are known things that don’t work:

  • All overlay devices are temporary — not persisted with dlmgmtd
  • Overlay device deletion isn’t properly wired up with varpd
  • Overlay devices only work in the global zone

Getting Started

This development release comes in the standard SmartOS flavors:

Once you boot this version of the platform, you’ll find that most things look the same. You’ll find a new service has been created and should be online — varpd. You can verify this with the svcs command. Next, I’ll walk through an example of starting everything up, creating an overlay device, and a VNIC on top of that.

[root@00-0c-29-ca-c7-23 ~]# svcs varpd
STATE          STIME    FMRI
online         21:43:00 svc:/network/varpd:default
[root@00-0c-29-ca-c7-23 ~]# dladm create-overlay -e vxlan -s direct \
    -p vxlan/listen_ip=10.88.88.69 -p direct/dest_ip=10.88.88.70 \
    -p direct/dest_port=4789 -v 23 demo0
[root@00-0c-29-ca-c7-23 ~]# dladm show-overlay
LINK         PROPERTY            PERM REQ VALUE       DEFAULT     POSSIBLE
demo0        mtu                 rw   -   0           --          --
demo0        vnetid              rw   -   23          --          --
demo0        encap               r-   -   vxlan       --          vxlan
demo0        varpd/id            r-   -   1           --          --
demo0        vxlan/listen_ip     rw   y   10.88.88.69 --          --
demo0        vxlan/listen_port   rw   y   4789        4789        1-65535
demo0        direct/dest_ip      rw   y   10.88.88.70 --          --
demo0        direct/dest_port    rw   y   4789        --          1-65535
[root@00-0c-29-ca-c7-23 ~]# dladm create-vnic -l demo0 foo0
[root@00-0c-29-ca-c7-23 ~]# ifconfig foo0 plumb up 10.55.55.2/24

Let’s take this apart. The first thing we did is create an overlay device. The -e vxlan option says that we should use VXLAN for encapsulation; currently only VXLAN is supported. The -s direct option specifies that we should use the direct, or point-to-point, module for determining where packets flow. In other words, there’s only a single destination.

Following this, we set three required properties: vxlan/listen_ip, which tells the device what IP address to listen on; direct/dest_ip, which tells it which IP to send encapsulated packets to; and direct/dest_port, which says what port to use. We didn’t set the vxlan/listen_port property because VXLAN specifies a default port, 4789.

Finally, we specified a virtual network id with -v, in this case 23. And then we ended it all with a name.

After that, the device became visible in the dladm show-overlay output, which displayed everything that we wanted. You’ll want to take similar steps on another machine; just make sure to swap the IP addresses around.

Concluding

This is just the tip of the iceberg here. There’s going to be a lot more functionality and a lot more improvements down the road. I’ll be doing additional development previews along the way.


The series so far

If you’re getting started you’ll want to see the previous entries on Project Bardiche:

The illumos Networking Stack

This blog post is going to dive into more detail about what the ‘fastpath’ is in illumos networking, what it means, and a bit more about how it works. We’ll also cover some of the additions we made as part of this project. Before we go too much further, let’s take another look at the picture of the networking stack from the entry on the architecture of vnd:

             +---------+----------+----------+
             | libdlpi |  libvnd  | libsocket|
             +---------+----------+----------+
             |         ·          ·    VFS   |
             |   VFS   ·    VFS   +----------+
             |         ·          |  sockfs  |
             +---------+----------+----------+
             |         |    VND   |    IP    |
             |         +----------+----------+
             |            DLD/DLS            |
             +-------------------------------+
             |              MAC              |
             +-------------------------------+
             |             GLDv3             |
             +-------------------------------+

If you don’t remember what some of these components are, you might want to refresh your memory with the vnd architecture entry. Importantly, almost everything is layered on top of the DLD and DLS modules.

The illumos networking stack comes from a long lineage of technical work done at Sun Microsystems. Initially the networking stack was implemented using STREAMS, a message passing interface where message blocks (mblk_t) are sent from one module to the next. For example, there are modules for things like arp, tcp/ip, and udp. These are chained together and can be seen in mdb using the ::stream dcmd. Here’s an example from my development zone:

> ::walk dld_str_cache | ::print dld_str_t ds_rq | ::q2stream | ::stream

+-----------------------+-----------------------+
| 0xffffff0251050690    | 0xffffff0251050598    |
| udp                   | udp                   |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x20204022      | flg = 0x20244032      |
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffff02510523f8    | 0xffffff0251052300    | if: net0
| ip                    | ip                    |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x00004022      | flg = 0x00004032      |
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffff0250eda158    | 0xffffff0250eda060    |
| vnic                  | vnic                  |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x00244062      | flg = 0x00204032      |
+-----------------------+-----------------------+
...

If I sent a UDP packet, it would first be processed by the udp STREAMS module, then the ip STREAMS module, and finally make its way to the DLD/DLS layer, which is represented by the vnic entry here. The means of this communication is part of DLPI, which defines several different kinds of messages and responses that can be found in the illumos source code. The general specification is also available, though there’s a lot more to it than is worth reading. In illumos, it’s been distilled down into libdlpi.

Recall from the vnd architecture entry that the way devices and drivers communicate with a datalink is by initially using STREAMS modules and by opening a device in /dev/net/. Each data link in the system is represented by a dls_link_t. When you open a device in /dev/net, you get a dld_str_t which is an instance of a STREAMS device.

DLPI allows consumers to bind to what it calls a SAP, or service access point. What this means depends on the kind of data link: in the case of Ethernet, it refers to the ethertype. In other words, a given dld_str_t can be bound to something like IP, ARP, or LLDP. If this were something other than Ethernet, that name space would be different.

For a given data link, only one dld_str_t can be actively bound to a given SAP (ethertype) at a time. An active bind refers to something that is actively consuming and sending data; for example, when you create an IP interface using ifconfig or ipadm, that does an active bind. Another example of an active bind is a daemon used for LLDP. There are also passive binds, used by things that capture packets, like snoop or tcpdump. A passive bind allows a consumer to capture data without blocking someone else from using that access point.
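To make the rule concrete, here is a toy model of one-active-bind-per-SAP; the structure and function names are our own, not the actual dld implementation:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	MAX_SAPS	8

/* Per-data-link record of which SAPs have an active binder. */
typedef struct link_binds {
	uint16_t	lb_sap[MAX_SAPS];
	size_t		lb_nactive;
} link_binds_t;

/* Try to take the active bind for a SAP; fails if already taken. */
bool
bind_active(link_binds_t *lb, uint16_t sap)
{
	for (size_t i = 0; i < lb->lb_nactive; i++) {
		if (lb->lb_sap[i] == sap)
			return (false);	/* e.g. IP already plumbed */
	}
	if (lb->lb_nactive == MAX_SAPS)
		return (false);
	lb->lb_sap[lb->lb_nactive++] = sap;
	return (true);
}

/* A passive (snoop-style) bind only observes; it never conflicts. */
bool
bind_passive(const link_binds_t *lb, uint16_t sap)
{
	(void) lb;
	(void) sap;
	return (true);
}
```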

Speeding things up

While the fundamentals of DLPI are sound, the implementation in STREAMS, particularly for sending data, left something to be desired. It greatly complicated locking, and it was hard to get the performance needed to saturate 10 GbE networks with TCP traffic. For all the details and a good background, I’ll refer you to Sunay Tripathi’s blog, where he covers a lot of what changed in Solaris 10 to fix this.

There are two parts to what folks generally call the ‘IP fastpath’. One part we leverage for vnd; the other is still firmly used by IP. We’ll touch first on the part that eliminates sending STREAMS messages and instead uses direct callbacks. Today this happens by negotiating with DLPI messages that discover the capabilities of devices and then enable them. The vnd driver does this, as does the ip driver. Specifically, you first send down a DL_CAPABILITY_REQ message; the response contains a list of capabilities that exist.

If the capability DL_CAPAB_DLD is returned, then you can enable direct function calls to the DLD and DLS layer. The returned values give you a function pointer, which you can use to do several things, ultimately requesting to enable DLD_CAPAB_DIRECT. When you make the enable call, you specify a function pointer for DLD to call directly when a packet is received. It in turn gives you a series of functions for things like checking flow control and transmitting a packet. These functions allow the system to bypass the issues with STREAMS and transmit packets directly.
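The shape of that handoff can be sketched as follows. The names here are hypothetical, not the actual DLD capability interfaces; the point is only that after negotiation, delivery becomes a plain function call rather than a message pass:

```c
#include <stddef.h>

/* Callback the consumer registers for direct packet receive. */
typedef void (*rx_cb_t)(void *arg, const void *pkt, size_t len);

/* State the lower layer keeps after the direct path is enabled. */
typedef struct dl_direct {
	rx_cb_t	dd_rx;	/* invoked directly per received packet */
	void	*dd_arg;	/* consumer state, e.g. an ip instance */
} dl_direct_t;

/* What enabling DLD_CAPAB_DIRECT conceptually amounts to. */
void
direct_enable(dl_direct_t *dd, rx_cb_t cb, void *arg)
{
	dd->dd_rx = cb;
	dd->dd_arg = arg;
}

/* On receive: one direct call, no STREAMS message. */
void
direct_deliver(const dl_direct_t *dd, const void *pkt, size_t len)
{
	dd->dd_rx(dd->dd_arg, pkt, len);
}

/* Example consumer callback: counts delivered packets. */
void
count_rx(void *arg, const void *pkt, size_t len)
{
	(void) pkt;
	(void) len;
	(*(int *)arg)++;
}
```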

The second part of the ‘IP fastpath’ is used primarily by the IP module. The IP module has a notion of a neighbor cache entry, or nce, which describes how to reach another host. When that host is found, the nce asks the lower layers of the stack to generate a layer two header appropriate for this traffic. For an Ethernet device, this means generating the MAC header, including the source and destination MAC addresses, the ethertype, and a VLAN tag if there should be one. The IP stack then uses this pre-generated header every time rather than creating a new one from scratch for every packet. In addition, the IP module subscribes to change events that are generated when something like a MAC address changes, so that it can regenerate these headers when the administrator makes a change to the system.
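A minimal sketch of the pre-generated header idea, with invented names: build the 14-byte Ethernet header once, then each transmit is a single copy of the cached header in front of the payload.

```c
#include <stdint.h>
#include <string.h>

/* Cached layer two header, regenerated only on change events. */
typedef struct nce_cache {
	uint8_t	nc_hdr[14];	/* dst MAC, src MAC, ethertype */
} nce_cache_t;

/* Slow path: generate the header once when the neighbor is found. */
void
nce_generate(nce_cache_t *nce, const uint8_t dst[6],
    const uint8_t src[6], uint16_t ethertype)
{
	memcpy(&nce->nc_hdr[0], dst, 6);
	memcpy(&nce->nc_hdr[6], src, 6);
	nce->nc_hdr[12] = (uint8_t)(ethertype >> 8);
	nce->nc_hdr[13] = (uint8_t)(ethertype & 0xff);
}

/* Fast path: one memcpy of the prebuilt header per packet. */
size_t
nce_prepend(const nce_cache_t *nce, uint8_t *frame, size_t cap,
    const void *payload, size_t len)
{
	if (cap < 14 + len)
		return (0);
	memcpy(frame, nce->nc_hdr, 14);
	memcpy(frame + 14, payload, len);
	return (14 + len);
}
```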

New Additions

Finally, it’s worth taking a little bit of time to talk about the new DLPI additions that came with project bardiche. We needed to solve two problems. Specifically:

  • We needed a way for a consumer to try and claim exclusive access to a data link, not just a single ethertype
  • We needed a way to tell the system when using promiscuous mode not to loop back the packets we sent

To solve the first case, we added a new request called DL_EXCLUSIVE_REQ. This adds a new mode for the bind state of the dld_str_t: in addition to being active or passive, it can now be exclusive. Exclusive access can only be requested if no one is actively using the device; if, for example, an IP interface has already been created, the DL_EXCLUSIVE_REQ will fail. The opposite is true as well: if someone is using the dld_str_t in exclusive mode, a request to bind to the IP ethertype will fail. The exclusive claim lasts until the consumer closes the dld_str_t.

When a vnd device is created, it makes an explicit request for exclusive access to the device, because it needs to send and receive on all of the different ethertypes. If an IP interface is already active, it doesn’t make sense for a vnd device to be created there. Once the vnd device is destroyed, then anything can use the data link.

Solving our second problem was actually quite simple. The core logic to not loop back packets that were transmitted was already present in the MAC layer. To do that, we created a new promiscuous option that could be specified in the DLPI DL_PROMISCON_REQ called DL_PROMISC_RX_ONLY. Enabling this would pass along the flag MAC_PROMISC_FLAGS_NO_TX_LOOP down to the mac layer which actually does the heavy lifting of duplicating the necessary amount of packets.

Conclusion

This gives a rather rough introduction to the fastpath in the illumos networking stack. The devil, as always, is in the details.

In the next entries, we’ll go over the other new extensions that were added as part of this work: the sdev plugin interface and generalized serialization queues. Finally, we’ll finish with a recap and go over what’s next.


Background

Framed I/O is a new abstraction that we’re currently experimenting with through Project Bardiche. We call it framed I/O because the core concept is the frame: a variable amount of discrete data with a maximum size. In this article, we’ll call data that fits this model framed. For example, Ethernet devices work exactly this way: they have a maximum size based on their MTU, but there may well be less data available than the maximum. A few overarching goals led us down this path:

  • The system should be able to transmit and receive several frames at once
  • The system should be able to break down a frame into a series of vectors
  • The system should be able to communicate back the actual size of the frame and vector

The primary use case of framed I/O is for vnd and virtual machines. However, we believe that the properties here make it desirable to other portions of the stack which operate in terms of frames. To understand why we’re evaluating this abstraction, it’s worth talking about the other existing abstractions in the system.

read(2) and write(2)

Let’s start with the traditional and most familiar series of I/O interfaces: read(2), write(2), readv(2), and writev(2). These are the standard I/O system calls that most C programmers are familiar with. read(2) and write(2) originated in first edition UNIX; readv(2) and writev(2) supposedly came about during the development of 4.2 BSD. The read and write routines operate on streams of data: the callers and file descriptors have no inherent notion of the data being framed, and all framing has to be built into the consumer of the data. For a lot of use cases, that is the correct abstraction.

The readv(2) and writev(2) interfaces allowed that stream to be vectored. It’s hard to say whether these were the first vectored I/O abstraction in operating systems, but they are certainly among the most popular ones from early systems still around. Where read(2) and write(2) map a stream to a single buffer, these calls map a stream to a series of arbitrarily sized vectors. The act of vectorizing data is not uncommon and can be very useful; generally, it is done when combining multiple elements into one discrete stream for transfer. For example, if a program maintains one buffer for a protocol header and another for the payload, being able to specify a vector that includes both of these in one call can be quite useful.

When operating with framed data, these interfaces fall a bit short. The problem is that you’ve lost the information the system had regarding the framing. It may be that the protocol itself includes the delineations, but there’s no guarantee that data is correct. For example, if you had a buffer of size 1500, something like read(2) would only give you the total number of bytes returned; you wouldn’t get the number of frames. A return value of 1500 could be one large 1500-byte frame, multiple 300-byte frames, or anything in between.
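This loss of framing is easy to demonstrate with a pipe: write two distinct 750-byte "frames", read with a 1500-byte buffer, and the boundary is gone. The helper below is a self-contained demonstration, not anything from the bardiche code:

```c
#include <string.h>
#include <unistd.h>

/*
 * Write two 750-byte frames into a pipe, then read them back with a
 * single read(2). The return value reports bytes, not frames: the
 * reader cannot tell where frame one ends and frame two begins.
 */
ssize_t
framing_demo(void)
{
	int fds[2];
	char frame[750];
	char buf[1500];
	ssize_t n;

	if (pipe(fds) != 0)
		return (-1);

	memset(frame, 'a', sizeof (frame));
	(void) write(fds[1], frame, sizeof (frame));	/* frame one */
	memset(frame, 'b', sizeof (frame));
	(void) write(fds[1], frame, sizeof (frame));	/* frame two */

	/* One read returns both frames fused together. */
	n = read(fds[0], buf, sizeof (buf));
	(void) close(fds[0]);
	(void) close(fds[1]);
	return (n);
}
```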

getmsg(2) and putmsg(2)

The next set of APIs worth looking at are getmsg(2) and putmsg(2). These APIs are a bit different from the normal read(2) and write(2) APIs: they’re designed around framed messages. These routines use a struct strbuf, which has the following members:

    int    maxlen;      /* maximum buffer length */
    int    len;         /* length of data */
    char   *buf;        /* ptr to buffer */

These interfaces allow the consumer to properly express the maximum size of the frame they expect and the amount of data that a given frame actually includes. This is very useful for framed data. Unfortunately, the API has some deficiencies: it can’t break the data down into vectors, nor is there really a means of working with multiple messages at a time.

sendmsg(2) and recvmsg(2)

The next set of APIs I looked at were the sendmsg(2) and recvmsg(2) family, particularly the extensions introduced in the Linux kernel via sendmmsg(2) and recvmmsg(2). The general design of the msghdr structure is good, though it is understandably designed around the socket interface. Unfortunately, something like sendmsg(2) is not available to device drivers on most systems; it currently only works for sockets, and a lot of things don’t look like sockets. Things like ancillary data and the optional addresses are not as useful and don’t have meaning for other styles of messages, or if they do, they may not fit the abstraction defined there.

Framed I/O

Based on our evaluations of the above APIs, a few of us chatted around Joyent’s San Francisco office and tried to come up with something that had the properties we felt made more sense for something like KVM networking. To help distinguish it from traditional socket or STREAMS semantics, we named it after its basic building block, the frame. The general structure is called a frameio_t, which itself has a series of vector structures called framevec_t. The structures roughly look like:

typedef struct framevec {
    void    *fv_buf;        /* Buffer with data */
    size_t  fv_buflen;      /* Size of the buffer */
    size_t  fv_actlen;      /* Amount of buffer consumed, ignore on error */
} framevec_t;

typedef struct frameio {
    uint_t  fio_version;    /* Should always be FRAMEIO_CURRENT_VERSION */
    uint_t  fio_nvpf;       /* How many vectors make up one frame */
    uint_t  fio_nvecs;      /* The total number of vectors */
    framevec_t fio_vecs[];  /* C99 VLA */
} frameio_t;

The idea here is that, much like a struct strbuf, each vector has a notion of its maximum size and the actual size of the data consumed. These vectors can then be grouped into frames in multiple ways through the fio_nvpf and fio_nvecs members: fio_nvecs describes the total number of vectors, while fio_nvpf describes how many vectors make up a single frame. You might think of fio_nvpf as describing how many iovec structures constitute one frame.

Consider that you have four vectors to play with; you might rig them up in one of several ways. You might map each message to a single vector, meaning that you could read four messages at once. You might do the opposite and map a single message across all four vectors; in that case you’d only ever read one message at a time, broken into four components. You could also break each message into two vectors, which means you’d be able to read two messages at a time. The following ASCII art might help.

1:1 Vector to Frame mapping

 +-------+  +-------+  +-------+  +-------+
 | msg 0 |  | msg 1 |  | msg 2 |  | msg 3 |
 +-------+  +-------+  +-------+  +-------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+


4:1 Vector to Frame mapping

 +----------------------------------------+
 |                  msg 0                 |
 +----------------------------------------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

2:1 Vector to Frame Mapping

 +------------------+  +------------------+
 |       msg 0      |  |       msg 1      |
 +------------------+  +------------------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

Currently the maximum number of vectors allowed in a given call is limited to 32. As long as the number of vectors per frame evenly divides the total number of vectors, any combination is allowed.
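That divisibility rule can be expressed as a small helper. This is a sketch with our own names, not the actual frameio validation code:

```c
#define	FRAMEIO_MAX_VECS	32

/*
 * Given fio_nvecs (total vectors) and fio_nvpf (vectors per frame),
 * return the number of frames a call can move at once, or -1 if the
 * combination is invalid: zero values, more than 32 vectors, or a
 * vectors-per-frame count that doesn't evenly divide the total.
 */
int
frameio_nframes(unsigned int nvecs, unsigned int nvpf)
{
	if (nvpf == 0 || nvecs == 0 || nvecs > FRAMEIO_MAX_VECS)
		return (-1);
	if (nvecs % nvpf != 0)
		return (-1);	/* vectors must form whole frames */
	return ((int)(nvecs / nvpf));
}
```

The three ASCII diagrams above correspond to frameio_nframes(4, 1), frameio_nframes(4, 4), and frameio_nframes(4, 2) respectively.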

By combining these two dimensions, we believe this will be a useful abstraction, suitable for other parts of the system that operate on framed data, for example a USB stack. Another thing this design lets us do: because it doesn’t constrain the content of the vectors, it would be possible to replicate something like struct msghdr, with the protocol header data in the first vector.

Framed I/O now and in the Future

Today we’ve started plumbing this through QEMU to adapt its network device backend APIs, which operate on traditional iovecs. However, there’s a lot more that can be done with this. For example, one of the things on our minds is writing a vhost-net style driver for illumos that can easily map data between the framed I/O representation and the virtio driver. With this, it’d even be possible to do something that’s mostly zero-copy. Alternatively, we may explore redoing a lot of QEMU’s internal networking paths to make them more friendly for sending and receiving multiple packets at once. That should certainly help with the overhead of networking I/O in virtual machines today.

We think this might fit other parts of the system as well; for example, it may make sense as the unit in which the illumos USB3 stack sends data. Whether it makes sense as anything more than the vnd device and this style of I/O, time will tell.

Today, framed I/O on vnd devices is exposed in libvnd through the vnd_frameio_read(3VND) and vnd_frameio_write(3VND) interfaces. These can also be used by someone who’s developing their own services using vnd, for example, user land switches, firewalls, etc.

Next in the Bardiche Series

Next in the bardiche series, we’ll be delving into some of the additional kernel subsystems and new DLPI abstractions that were created. Following those, we’ll end with a recap entry on bardiche as a whole and what may come next.

My previous entry introduced Project Bardiche, a project which revamps how we do networking for KVM guests. This entry focuses on the design and architecture of the vnd driver and how it fits into the broader networking stack.

The illumos networking stack

The illumos networking stack is broken into several discrete pieces, which are summarized in the following diagram:

             +---------+----------+----------+
             | libdlpi |  libvnd  | libsocket|
             +---------+----------+----------+
             |         ·          ·    VFS   |
             |   VFS   ·    VFS   +----------+
             |         ·          |  sockfs  |
             +---------+----------+----------+
             |         |    VND   |    IP    |
             |         +----------+----------+
             |            DLD/DLS            |
             +-------------------------------+
             |              MAC              |
             +-------------------------------+
             |             GLDv3             |
             +-------------------------------+

At the top of the diagram are the common interfaces to user land. The first and most familiar is libsocket. That contains all of the common interfaces that C programmers are used to seeing: connect(3SOCKET), accept(3SOCKET), etc. On the other hand, there’s libdlpi, which provides an alternate means for interacting with networking devices. It is often used by software like DHCP servers and for LLDP. libvnd is new and a part of project bardiche. For now, we’ll stick to describing the other two paths.

Next, operations transit through the virtual file system (VFS) to reach their next destination. For most folks, that’ll be the illumos file system sockfs. The sockfs file system provides a means of translating between the gory details of TCP, UDP, IP, and friends, and the actual sockets that they rely on. The next step for such sockets is what folks traditionally think of as the TCP/IP stack. This encompasses everything related to the actual protocol processing. For example, connecting a socket and going through the TCP dance of the SYN and SYN/ACK is all handled by the logic in the TCP/IP stack. In illumos, both TCP and IP are implemented in the same kernel module called IP.

The next layer is comprised of two kernel modules which work together called DLD and DLS. DLD is the data-link driver and DLS is the data-link services module. The two modules work together. Every data link in illumos, whether it’s a virtual nic or physical nic, is modeled as a dld device. When you open something like /dev/net/igb0, that’s an instance of a DLD device. These devices provide an implementation of all of the DLPI (Data-link Provider Interface) STREAMS messages and are used to negotiate the fast path. We’ll go into more detail about that in a future entry.

Everything transits out of DLD and DLS and enters the MAC layer. The MAC layer handles taking care of interfacing with the actual device drivers, programming unicast and multicast addresses into devices, controlling whether or not the devices are in promiscuous mode, etc. The final layer is the Generic Lan Device version three (GLDv3) Device Driver. GLDv3 is a standard interface for networking device drivers and represents a set of entry points that the Operating System expects to use with them.

vnd devices

A vnd device is created on top of a data link similar to how an IP interface is created on top of a data link. Once a vnd device has been created, it can be used to read and write layer two frames. In addition, a vnd device can optionally be linked into the file system name space allowing others to open the device.

Similar to /dev/net, vnd devices show up under /dev/vnd. A control node is always created at /dev/vnd/ctl. This control node is referred to as a self-cloning device. That means that any time the device is opened, a new instance of the device is created. Once the control node has been opened, it is associated with a data link and then it is bound into the file system name space with some name that usually is identical to the name of the data link. After the device has been bound, it then shows up in /dev/vnd. If a vnd device was named net0 then it would show up as /dev/vnd/net0. Just as /dev/net displays all of the data links in the various zones under /dev/net/zone, the same is true for vnd. The vnd devices in any zone are all located under /dev/vnd/zone and follow the pattern /dev/vnd/zone/%zonename/%vnddevice. These devices are never directly manipulated. Instead, they are used by libvnd and vndadm.

Once a vnd device has been created and bound into the name space, it will persist until it is removed with either vndadm or libvnd, or the zone it is present in is halted. The removal of vnd devices from the name space is similar to calling unlink(2) on a file. If any process has the vnd device open after it has been removed from the name space, it will persist until all open handles have been closed.
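The unlink(2)-style lifetime described above can be sketched in a few lines of C. The names here are illustrative, not the actual kernel structures: the backing state goes away only once the device is both removed from the name space and has no open handles.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state tracking visibility and open handles. */
typedef struct vnd_dev {
	unsigned int	vdd_ref;	/* open handles */
	bool		vdd_linked;	/* still visible in /dev/vnd? */
	bool		vdd_freed;	/* backing state released */
} vnd_dev_t;

static void
vnd_dev_free_if_idle(vnd_dev_t *vdp)
{
	/* State persists while the name or any open handle remains. */
	if (!vdp->vdd_linked && vdp->vdd_ref == 0)
		vdp->vdd_freed = true;
}

void
vnd_dev_hold(vnd_dev_t *vdp)
{
	vdp->vdd_ref++;
}

void
vnd_dev_rele(vnd_dev_t *vdp)
{
	vdp->vdd_ref--;
	vnd_dev_free_if_idle(vdp);
}

/* Analogous to removing the device's name with vndadm destroy. */
void
vnd_dev_unlink(vnd_dev_t *vdp)
{
	vdp->vdd_linked = false;
	vnd_dev_free_if_idle(vdp);
}
```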

If a data link already has an IP interface or is being actively used for any other purpose, a vnd device cannot be created on top of it, and vice versa. Because vnd devices operate at layer two, if someone is already consuming the link at layer three, it doesn’t make sense to create a vnd device on top of it. The opposite also holds.

The command vndadm was written to manipulate vnd devices. It’s worth stepping through some basic examples of using the command; even more examples can be found in its manual page. With that, let’s get started and create a vnic and then a device. Substitute whatever physical link you prefer.

# dladm create-vnic -l e1000g0 vnd0
# vndadm create vnd0
# ls /dev/vnd
ctl   vnd0  zone

With that, we’ve created a device. Next, we can use vndadm to list devices as well as get and set properties.

# vndadm list
NAME             DATALINK         ZONENAME
vnd0             vnd0             global
# vndadm get vnd0
LINK          PROPERTY         PERM  VALUE
vnd0          rxbuf            rw    65536
vnd0          txbuf            rw    65536
vnd0          maxsize          r-    4194304
vnd0          mintu            r-    0
vnd0          maxtu            r-    1518
# vndadm set vnd0 txbuf=2M
# vndadm get vnd0 txbuf
LINK          PROPERTY         PERM  VALUE
vnd0          txbuf            rw    2097152

You’ll note that there are two properties that we can set, rxbuf and txbuf. These are the sizes of the buffers that an instance of a vnd device maintains. As frames come in, they are put into the receive buffer, where they sit until they are read by someone, usually a KVM guest. If an incoming frame would exceed the size of that buffer, it is dropped instead. The transmit buffer controls the total amount of outstanding data that can exist at any given time in the vnd subsystem. The vnd device has to keep track of this to deal with cases like flow control.
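As a rough sketch of that accounting, consider the following C fragment. The field and function names are hypothetical, and the real vnd_data_queue_t logic is more involved:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-direction data queue with a byte-size cap. */
typedef struct vnd_queue {
	size_t	vq_cur;		/* bytes currently buffered */
	size_t	vq_max;		/* rxbuf/txbuf property, e.g. 65536 */
	size_t	vq_drops;	/* frames dropped for lack of space */
} vnd_queue_t;

/* Returns true if the frame was buffered, false if it was dropped. */
bool
vnd_queue_enqueue(vnd_queue_t *vqp, size_t framelen)
{
	if (vqp->vq_cur + framelen > vqp->vq_max) {
		vqp->vq_drops++;
		return (false);
	}
	vqp->vq_cur += framelen;
	return (true);
}

/* A reader, e.g. a KVM guest, consuming a frame frees up space. */
void
vnd_queue_dequeue(vnd_queue_t *vqp, size_t framelen)
{
	vqp->vq_cur -= framelen;
}
```

Raising txbuf, as in the example above, simply raises the cap before frames start being held back or dropped.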

Finally, we can go ahead and remove the device via:

# vndadm destroy vnd0

While not shown here, all of these commands can operate on devices that are in another zone, provided the user is in the global zone. To get statistics about device throughput and packet drops, you can use the command vndstat. Here’s a brief example:

$ vndstat 1 5
 name |   rx B/s |   tx B/s | drops txfc | zone
 net0 | 1.45MB/s | 14.1KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.50MB/s | 19.5KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 2.83MB/s | 30.8KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.08MB/s | 30.6KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.21MB/s | 30.6KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525

The drops column sums up the total number of drops while the txfc column shows the number of times that the device has been flow controlled during that period.

Programmatic Use

So far, I’ve demonstrated the use of the user commands. For most applications, you’ll want to use the fully featured C library, libvnd. The introductory manual page is the place to get started; it will point you to the rest of the functions, all of which can be found in manual section 3VND. Please keep in mind that until the library makes its way up into illumos, portions of the API may still change and should not be considered stable.

Peeking under the hood

So far we’ve talked about how you can use these devices; now let’s go under the hood and talk about how this is constructed. For the full gory details, you should turn to the vnd big theory statement. There are multiple components that make up the general architecture of the vnd sub-system, though only the character devices are user-visible. The following bit of ascii art from the big theory statement describes the general architecture:

+----------------+     +-----------------+
| global         |     | global          |
| device list    |     | netstack list   |
| vnd_dev_list   |     | vnd_nsd_list    |
+----------------+     +-----------------+
    |                    |
    |                    v
    |    +-------------------+      +-------------------+
    |    | per-netstack data | ---> | per-netstack data | --> ...
    |    | vnd_pnsd_t        |      | vnd_pnsd_t        |
    |    |                   |      +-------------------+
    |    |                   |
    |    | nestackid_t    ---+----> Netstack ID
    |    | vnd_pnsd_flags_t -+----> Status flags
    |    | zoneid_t       ---+----> Zone ID for this netstack
    |    | hook_family_t  ---+----> VND IPv4 Hooks
    |    | hook_family_t  ---+----> VND IPv6 Hooks
    |    | list_t ----+      |
    |    +------------+------+
    |                 |
    |                 v
    |           +------------------+       +------------------+
    |           | character device |  ---> | character device | -> ...
    +---------->| vnd_dev_t        |       | vnd_dev_t        |
                |                  |       +------------------+
                |                  |
                | minor_t       ---+--> device minor number
                | ldi_handle_t  ---+--> handle to /dev/net/%datalink
                | vnd_dev_flags_t -+--> device flags, non blocking, etc.
                | char[]        ---+--> name if linked
                | vnd_str_t * -+   |
                +--------------+---+
                               |
                               v
        +-------------------------+
        | STREAMS device          |
        | vnd_str_t               |
        |                         |
        | vnd_str_state_t      ---+---> State machine state
        | gsqueue_t *          ---+---> mblk_t Serialization queue
        | vnd_str_stat_t       ---+---> per-device kstats
        | vnd_str_capab_t      ---+----------------------------+
        | vnd_data_queue_t ---+   |                            |
        | vnd_data_queue_t -+ |   |                            v
        +-------------------+-+---+                  +---------------------+
                            | |                      | Stream capabilities |
                            | |                      | vnd_str_capab_t     |
                            | |                      |                     |
                            | |    supported caps <--+-- vnd_capab_flags_t |
                            | |    dld cap handle <--+-- void *            |
                            | |    direct tx func <--+-- vnd_dld_tx_t      |
                            | |                      +---------------------+
                            | |
           +----------------+ +-------------+
           |                                |
           v                                v
+-------------------+                  +-------------------+
| Read data queue   |                  | Write data queue  |
| vnd_data_queue_t  |                  | vnd_data_queue_t  |
|                   |                  |                   |
| size_t        ----+--> Current size  | size_t        ----+--> Current size
| size_t        ----+--> Max size      | size_t        ----+--> Max size
| mblk_t *      ----+--> Queue head    | mblk_t *      ----+--> Queue head
| mblk_t *      ----+--> Queue tail    | mblk_t *      ----+--> Queue tail
+-------------------+                  +-------------------+

At a high level there are three different core components. There is a per-netstack data structure, there is a character device and there is a STREAMS device.

A netstack, or networking stack, is a concept in illumos that contains an independent set of networking information. This includes TCP/IP state, routing tables, tunables, etc. Every zone in SmartOS has its own netstack, which allows zones to more fully control and interface with networking. In addition, the system has a series of IP hooks which are used by things like ipfilter and ipd to manipulate packets. When the vnd kernel module is first loaded, it registers with the netstack sub-system, which allows the vnd kernel module to create its per-netstack data. In addition to hooking, the per-netstack data is used to make sure that when a zone is halted, all of the associated vnd devices are torn down.

The character device is the interface between consumers and the system. The vnd module is actually a self-cloning device. Whenever a library handle is created it first opens the control node which is /dev/vnd/ctl. The act of opening that creates a clone of the device with a new minor number. When an existing vnd device is opened, then no cloning takes place, it opens one of the existing character devices.
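A minimal sketch of that self-cloning behavior, with invented names: each open of the control node simply hands back a fresh instance with its own minor number, while opening an existing device does not clone.

```c
#include <assert.h>

/*
 * Hypothetical clone-on-open logic for /dev/vnd/ctl. Minor 0 is
 * reserved for the control node itself; each open mints a new one.
 */
static unsigned int vnd_next_minor = 1;

typedef struct vnd_clone {
	unsigned int	vc_minor;	/* device minor number */
} vnd_clone_t;

/* Called on each open(2) of the control node. */
vnd_clone_t
vnd_ctl_open(void)
{
	vnd_clone_t vc = { .vc_minor = vnd_next_minor++ };
	return (vc);
}
```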

The major magic happens when a vnd character device is asked to associate with a data link. This happens through an ioctl that the library wraps up and takes care of. When the device is associated, the kernel itself does what we call a layered open – it opens and holds another character or block device. In this case the vnd module does a layered open of the data link. However, the devices that back data links are still STREAMS devices that speak DLPI. To take care of dealing with all of the DLPI messages and set up the normal fast path, we use the third core component: the vnd STREAMS device.

The vnd STREAMS device is fairly special: it cannot be used outside of the kernel and is an implementation detail of the vnd driver. After doing the layered open, the vnd STREAMS device is pushed onto the stream head and it begins to exchange DLPI messages to set up and configure the data link. Once it has successfully walked through its state machine, the device is fully set up and ready to go. As part of doing that, it asks for exclusive access to the device, enables us to receive all the packets that are originally destined for the device, and enables direct function calls through what’s commonly referred to as the fastpath. Once that’s set up, the character device and STREAMS device wire up with one another. Once that’s all finished successfully, the character device can be fully initialized.

At this point in time, the device can be fully used for reading and writing packets. It can optionally be bound into the file system name space. That binding is facilitated by the sdev file system and its new plugin interface. We’ll go into more detail about that in a future entry.

The STREAMS device contains a lot of the meat for dealing with data. It contains the data queues and it controls all the interactions with DLD/DLS and the fastpath. In addition, it also knows about its gsqueue (generic serialization queue). The gsqueue is used to ensure that we properly handle the order of transmitted packets, especially when subject to flow control.
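The ordering property the gsqueue provides can be illustrated with a trivial single-threaded FIFO. The real gsqueue is a multi-threaded kernel facility and these names are made up; the point is simply that work appended to one queue is drained strictly in the order it arrived, which is what keeps transmitted packets in order under flow control.

```c
#include <assert.h>
#include <stddef.h>

#define	SQ_DEPTH	64

/* Hypothetical serialization queue holding packet identifiers. */
typedef struct gsq {
	int	sq_items[SQ_DEPTH];
	size_t	sq_head;	/* next item to drain */
	size_t	sq_tail;	/* next free slot */
} gsq_t;

/* Append one unit of work, e.g. an mblk chain to transmit. */
void
gsq_enter(gsq_t *sqp, int pkt)
{
	sqp->sq_items[sqp->sq_tail++ % SQ_DEPTH] = pkt;
}

/* Drain returns items strictly in the order they were appended. */
int
gsq_drain_one(gsq_t *sqp)
{
	if (sqp->sq_head == sqp->sq_tail)
		return (-1);	/* queue empty */
	return (sqp->sq_items[sqp->sq_head++ % SQ_DEPTH]);
}
```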

The following two diagrams (from the big theory statement) describe the path that data takes when received and when transmitted.

Receive path

                                 |
                                 * . . . packets from gld
                                 |
                                 v
                          +-------------+
                          |     mac     |
                          +-------------+
                                 |
                                 v
                          +-------------+
                          |     dld     |
                          +-------------+
                                 |
                                 * . . . dld direct callback
                                 |
                                 v
                         +---------------+
                         | vnd_mac_input |
                         +---------------+
                                 |
                                 v
  +---------+             +-------------+
  | dropped |<--*---------|  vnd_hooks  |
  |   by    |   .         +-------------+
  |  hooks  |   . drop probe     |
  +---------+     kstat bump     * . . . Do we have free
                                 |         buffer space?
                                 |
                           no .  |      . yes
                              .  +      .
                          +---*--+------*-------+
                          |                     |
                          * . . drop probe      * . . recv probe
                          |     kstat bump      |     kstat bump
                          v                     |
                       +---------+              * . . fire pollin
                       | freemsg |              v
                       +---------+   +-----------------------+
                                     | vnd_str_t`vns_dq_read |
                                     +-----------------------+
                                              ^ ^
                              +----------+    | |     +---------+
                              | read(9E) |-->-+ +--<--| frameio |
                              +----------+            +---------+

Transmit path

  +-----------+   +--------------+       +-------------------------+   +------+
  | write(9E) |-->| Space in the |--*--->| gsqueue_enter_one()     |-->| Done |
  | frameio   |   | write queue? |  .    | +->vnd_squeue_tx_append |   +------+
  +-----------+   +--------------+  .    +-------------------------+
                          |   ^     .
                          |   |     . reserve space           from gsqueue
                          |   |                                   |
             queue  . . . *   |       space                       v
              full        |   * . . . avail          +------------------------+
                          v   |                      | vnd_squeue_tx_append() |
  +--------+          +------------+                 +------------------------+
  | EAGAIN |<--*------| Non-block? |<-+                           |
  +--------+   .      +------------+  |                           v
               . yes             v    |     wait          +--------------+
                           no . .*    * . . for           | append chain |
                                 +----+     space         | to outgoing  |
                                                          |  mblk chain  |
    from gsqueue                                          +--------------+
        |                                                        |
        |      +-------------------------------------------------+
        |      |
        |      |                            yes . . .
        v      v                                    .
   +-----------------------+    +--------------+    .     +------+
   | vnd_squeue_tx_drain() |--->| mac blocked? |----*---->| Done |
   +-----------------------+    +--------------+          +------+
                                        |                     |
      +---------------------------------|---------------------+
      |                                 |           tx        |
      |                          no . . *           queue . . *
      | flow controlled .               |           empty     * . fire pollout
      |                 .               v                     |   if mblk_t's
    +-------------+     .      +---------------------+        |   sent
    | set blocked |<----*------| vnd_squeue_tx_one() |--------^-------+
    | flags       |            +---------------------+                |
    +-------------+    More data       |    |      |      More data   |
                       and limit       ^    v      * . .  and limit   ^
                       not reached . . *    |      |      reached     |
                                       +----+      |                  |
                                                   v                  |
    +----------+          +-------------+    +---------------------------+
    | mac flow |--------->| remove mac  |--->| gsqueue_enter_one() with  |
    | control  |          | block flags |    | vnd_squeue_tx_drain() and |
    | callback |          +-------------+    | GSQUEUE_FILL flag, iff    |
    +----------+                             | not already scheduled     |
                                             +---------------------------+

Wrapping up

This entry introduces the tooling around vnd and provides a high level overview of the different components that make up the vnd module. In the next entry in the series on bardiche, we’ll cover the new framed I/O abstraction. Entries following that will cover the new DLPI extensions, the sdev plugin interface, generalized squeues, and finally the road ahead.

I just recently landed Project Bardiche into SmartOS. The goal of Bardiche has been to create a more streamlined data path for layer two networking in illumos. While the primary motivator for this was KVM guests, it’s opened up a lot of room for more than just virtual machines. The bulk of this project is comprised of changes to illumos-joyent; however, there were some minor changes made to smartos-live, illumos-kvm, and illumos-kvm-cmd.

Project Highlights

Before we delve into a lot more of the specifics in the implementation, let’s take a high level view of what this project has brought to the system. Several of these topics will have their own follow up blog entries.

  • The global zone can now see data links for every zone in /dev/net/zone/%zonename/%datalink.

  • libdlpi(3LIB) was extended to be able to access those nics.

  • snoop(1M) now has a -z option to capture packets on a data link that belongs to a zone.

  • A new DLPI(7P) primitive DL_EXCLUSIVE_REQ was added.

  • A new DLPI(7P) promiscuous mode was added: DL_PROMISC_RX_ONLY.

  • ipfilter can now filter packets for KVM guests.

  • The IP squeue interface was generalized to allow for multiple consumers in a new module called gsqueue.

  • The sdev file system was enhanced with a new plugin interface.

  • A new driver vnd was added, an associated library libvnd, and
    commands vndadm and vndstat. This driver provides the new layer two data path.

  • A new abstraction for sending and receiving framed data called framed I/O.

There’s quite a bit there. The rest of this entry will go into detail on the motivation for this work and a bit more on the new /dev/net/zone, libdlpi, and snoop features. Subsequent entries will cover the new vnd architecture and the new DLPI primitives, the new gsqueue interface, shine a bit more light on what gets referred to as the fastpath, and cover the new sdev plugin architecture.

Motivation

Project bardiche started from someone asking what it would take to allow a hypervisor-based firewall to filter packets that were being sent from a KVM guest. We wanted to focus on allowing the hypervisor to provide the firewall because of the following challenges associated with managing a firewall running in the guest.

While it’s true that practically all the guests that you would run under hardware virtualization have their own firewall software, they’re rarely the same. If we wanted to leverage the firewall built into the guest, we’d need to build an agent that lived in each guest. Not only does that mean that we’d have to write one of these for every type of guest we wanted to manage, given that customers are generally the super-user in their virtual machine (VM), they’d be able to simply kill the agent or change these rules, defeating the API.

While dealing with this, there were several other deficiencies in how networking worked for KVM guests today based on how QEMU, the program that actually runs the VM, interacted with the host networking. For each Ethernet device that was in the guest, there was a corresponding virtual NIC in the host. The two were joined with the vnic back end in QEMU which originally used libdlpi to bind them. While this worked, there were some problems with it.

Because we had to put the device in promiscuous mode, there was no way to tell it not to send back traffic that came from ourselves. In addition to just being a waste of cycles, this causes duplicate address detection, often performed with IPv6, to fail for many systems.

In addition, the dlpi interfaces had no means of reading or writing multiple packets at a time. A lot of these issues stem from the history of the illumos networking stack. When it was first implemented, it was done using STREAMS. Over time, that has changed. In Solaris 10 the entire networking stack was revamped with a project called Fire Engine. That project, among many others, transitioned the stack from a message passing interface to one that used a series of direct calls and serialization queues (squeues). Unfortunately, the means by which we were using libdlpi left us still using STREAMS.

While exploring the different options and means to interface with the existing firewall, we eventually reached the point where we realized that we needed to go out and create a new interface that solved this, and the related problems that we had, as well as, lay the foundation for a lot of work that we’d still like to do.

First Stop: Observability Improvements

When I first started this project, I knew that I was going to have to spend a lot of time debugging. As such, I knew that I needed to solve one of the more frustrating aspects of working with KVM networking: the ability to snoop and capture traffic. At Joyent, we always run a KVM instance inside of a zone. This gives us all the benefits of zones: the associated security and resource controls.

However, before this project, data links that belonged to zones were not accessible from the global zone. Because of the design of the KVM branded zone, the only process running is QEMU and you cannot log in, which makes it very hard to pass the data link to snoop or tcpdump. This setup does not make it impossible to debug. One can use DTrace or snoop on a device in the global zone; however, both of those end up requiring a bit more work or filtering.

The solution to this is to allow the global zone to see the data links for all devices across all zones under /dev/net and then enhance the associated libraries and commands to support accessing the new devices. If you’re in the global zone, there is now a new directory called /dev/net/zone. Don’t worry, this new directory can’t break you, as all data links in the system need to end with a number. On my development virtual machine which has a single zone with a single vnic named net0, you’d see the following:

[root@00-0c-29-37-80-28 ~]# find /dev/net | sort
/dev/net
/dev/net/e1000g0
/dev/net/vmwarebr0
/dev/net/zone
/dev/net/zone/79809c3b-6c21-4eee-ba85-b524bcecfdb8
/dev/net/zone/79809c3b-6c21-4eee-ba85-b524bcecfdb8/net0
/dev/net/zone/global
/dev/net/zone/global/e1000g0
/dev/net/zone/global/vmwarebr0

Just as you always have in SmartOS, you’ll still see the data links for your zone at the top level in /dev/net, e.g. /dev/net/e1000g0 and /dev/net/vmwarebr0. Next, each of the zones on the system, in this case the global zone and the non-global zone named 79809c3b-6c21-4eee-ba85-b524bcecfdb8, shows up in /dev/net/zone. Inside of each of those directories are the data links that live in that zone.

The next part of this was exposing this functionality in libdlpi and then using that in snoop. For the moment, I added a private interface called dlpi_open_zone. It’s similar to dlpi_open except that it takes an extra argument for the zone name. Once this change gets up to illumos it’ll become a public interface that you can and should use. You can view the manual page online here, or if you’re on a newer SmartOS box you can run man dlpi_open_zone to see the documentation.

The use of this made its way into snoop in the form of a new option: -z zonename. Specifying a zone with -z will cause snoop to use dlpi_open_zone which will try to open the data link specified via its -d option from the zone. So if we wanted to watch all the icmp traffic over the data link that the KVM guest used we could run:

# snoop -z 79809c3b-6c21-4eee-ba85-b524bcecfdb8 -d net0 icmp

With this, it should now be easier, as an administrator of multiple zones, to observe what’s going on across multiple zones without having to log into them.

Thanks

There are numerous people who helped this project along the way. The entire Joyent engineering team helped from the early days of bouncing ideas about APIs and interfaces all the way through to the final pieces of review. Dan McDonald, Sebastien Roy, and Richard Lowe all helped review the various write ups and helped deal with various bits of arcana in the networking stack. Finally, the broader SmartOS community helped go through and provide additional alpha and beta testing.

What’s Next

In the following entries, we’ll take a tour of the new sub-systems ranging from the vnd architecture and framed I/O abstraction through the sdev interfaces.
