Project Bardiche: vnd Architecture

My previous entry introduced Project Bardiche, a project which revamps how we do networking for KVM guests. This entry focuses on the design and architecture of the vnd driver and how it fits into the broader networking stack.

The illumos networking stack

The illumos networking stack is broken into several discrete pieces which is summarized in the following diagram:

             | libdlpi |  libvnd  | libsocket|
             |         ·          ·    VFS   |
             |   VFS   ·    VFS   +----------+
             |         ·          |  sockfs  |
             |         |    VND   |    IP    |
             |         +----------+----------+
             |            DLD/DLS            |
             |              MAC              |
             |             GLDv3             |

At the top of the diagram are the common interfaces to user land. The first and most familiar is libsocket. That contains all of the common interfaces that C programmers are used to seeing: connect(3SOCKET), accept(3SOCKET), etc. On the other hand, there’s libdlpi, which provides an alternate means for interacting with networking devices. It is often used by software like DHCP servers and for LLDP. libvnd is new and a part of project bardiche. For now, we’ll stick to describing the other two paths.

Next, operations transit through the virtual file system (VFS) to reach their next destination. For most folks, that’ll be the illumos file system sockfs. The sockfs file system provides a means of translating between the gory details of TCP, UDP, IP, and friends, and the actual sockets that they rely on. The next step for such sockets is what folks traditionally think of as the TCP/IP stack. This encompasses everything related to the actual protocol processing. For example, connecting a socket and going through the TCP dance of the SYN and SYN/ACK is all handled by the logic in the TCP/IP stack. In illumos, both TCP and IP are implemented in the same kernel module called IP.

The next layer is comprised of two kernel modules which work together called DLD and DLS. DLD is the data-link driver and DLS is the data-link services module. The two modules work together. Every data link in illumos, whether it’s a virtual nic or physical nic, is modeled as a dld device. When you open something like /dev/net/igb0, that’s an instance of a DLD device. These devices provide an implementation of all of the DLPI (Data-link Provider Interface) STREAMS messages and are used to negotiate the fast path. We’ll go into more detail about that in a future entry.

Everything transits out of DLD and DLS and enters the MAC layer. The MAC layer handles taking care of interfacing with the actual device drivers, programming unicast and multicast addresses into devices, controlling whether or not the devices are in promiscuous mode, etc. The final layer is the Generic Lan Device version three (GLDv3) Device Driver. GLDv3 is a standard interface for networking device drivers and represents a set of entry points that the Operating System expects to use with them.

vnd devices

A vnd device is created on top of a data link similar to how an IP interface is created on top of a data link. Once a vnd device has been created, it can be used to read and write layer two frames. In addition, a vnd device can optionally be linked into the file system name space allowing others to open the device.

Similar to /dev/net, vnd devices show up under /dev/vnd. A control node is always created at /dev/vnd/ctl. This control node is referred to as a self-cloning device. That means that any time the device is opened, a new instance of the device is created. Once the control node has been opened, it is associated with a data link and then it is bound into the file system name space with some name that usually is identical to the name of the data link. After the device has been bound, it then shows up in /dev/vnd. If a vnd device was named net0 then it would show up as /dev/vnd/net0. Just as /dev/net displays all of the data links in th various zones under /dev/net/zone, the same is true for vnd. The vnd devices in any zone are all located under /dev/vnd/zone and follow the pattern /dev/vnd/zone/%zonename/%vnddevice. These devices are never directly manipulated. Instead, they are used by libvnd and vndadm.

Once a vnd device has been created and bound into the name space, it will persist until it is removed with either vndadm or libvnd or the zone it is present in is halted. The removal of vnd devices from the name space is similar to calling unlink(2) on a file. If any process has the vnd device open after it is has been removed from the name space, it will persist until all open handles have been closed.

If a data link already has an IP interface or is being actively used for any other purpose, a vnd device cannot be created on top of it, and vice versa. Because vnd devices operate at a layer two, if various folks are already consuming layer three, it doesn’t make sense to create a vnd device on top of it. The opposite also holds.

The command vndadm was written to manipulate vnd devices. It’s worth stepping through some basic examples of using the command. Even more examples can be found in its manual page. With that, let’s get started and create a vnic and then a device. Substitute your physical link for anything you prefer.

# dladm create-vnic -l e1000g0 vnd0
# vndadm create vnd0
# ls /dev/vnd
ctl   vnd0  zone

With that, we’ve created a device. Next, we can use vndadm to list devices as well as get and set properties.

# vndadm list
NAME             DATALINK         ZONENAME
vnd0             vnd0             global
# vndadm get vnd0
LINK          PROPERTY         PERM  VALUE
vnd0          rxbuf            rw    65536
vnd0          txbuf            rw    65536
vnd0          maxsize          r-    4194304
vnd0          mintu            r-    0
vnd0          maxtu            r-    1518
# vndadm set txbuf=2M
# vndadm get vnd0 txbuf
LINK          PROPERTY         PERM  VALUE
vnd0          txbuf            rw    2097152

You’ll note that there are two properties that we can set, rxbuf and txbuf. These are the sizes of buffers that an instance of a vnd device maintains. As frames come in, they are put into the receive buffer where they sit until they are read by someone, usually a KVM guest. If a frame would come in that would exceed the size of that buffer, then it will be dropped instead. The transmit buffer controls the total amount of outstanding data that can exist at any given time in the vnd subsystem. The vnd device has to keep track of this to deal with cases like flow control.

Finally, we can go ahead and remove the device via:

# vndadm destroy vnd0

While not shown here, all of these commands can operate on device that are in another zone, if the user is in the global zone. To get statistics about device throughput and packet drops, you can use the command vndstat. Here’s a brief example:

$ vndstat 1 5
 name |   rx B/s |   tx B/s | drops txfc | zone
 net0 | 1.45MB/s | 14.1KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.50MB/s | 19.5KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 2.83MB/s | 30.8KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.08MB/s | 30.6KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.21MB/s | 30.6KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525

The drops column sums up the total number of drops while the txfc column shows the number of times that the device has been flow controlled during that period.

Programmatic Use

So far, I’ve demonstrated the use of the user commands. For most applications, you’ll want to use the fully featured C library libvnd. The introductory manual is where you’ll want to get started for all information in using it. It will point you out to all of the rest of the functions which can all be found in the manual section 3VND. Please keep in mind, that until the library makes its way up into illumos, portions of the API may still end up changing and isn’t considered stable yet.

Peeking under the hood

So far we’ve talked about how you can use these devices, now let’s go under the hood and talk about how this is constructed. For all of the full gory details, you should turn to the vnd big theory statement. There are multiple components that make up the general architecture of the vnd sub-system, though only character devices are shown. The following bit of ascii art from the big theory statement describes the general architecture:

+----------------+     +-----------------+
| global         |     | global          |
| device list    |     | netstack list   |
| vnd_dev_list   |     | vnd_nsd_list    |
+----------------+     +-----------------+
    |                    |
    |                    v
    |    +-------------------+      +-------------------+
    |    | per-netstack data | ---> | per-netstack data | --> ...
    |    | vnd_pnsd_t        |      | vnd_pnsd_t        |
    |    |                   |      +-------------------+
    |    |                   |
    |    | nestackid_t    ---+----> Netstack ID
    |    | vnd_pnsd_flags_t -+----> Status flags
    |    | zoneid_t       ---+----> Zone ID for this netstack
    |    | hook_family_t  ---+----> VND IPv4 Hooks
    |    | hook_family_t  ---+----> VND IPv6 Hooks
    |    | list_t ----+      |
    |    +------------+------+
    |                 |
    |                 v
    |           +------------------+       +------------------+
    |           | character device |  ---> | character device | -> ...
    +---------->| vnd_dev_t        |       | vnd_dev_t        |
                |                  |       +------------------+
                |                  |
                | minor_t       ---+--> device minor number
                | ldi_handle_t  ---+--> handle to /dev/net/%datalink
                | vnd_dev_flags_t -+--> device flags, non blocking, etc.
                | char[]        ---+--> name if linked
                | vnd_str_t * -+   |
        | STREAMS device          |
        | vnd_str_t               |
        |                         |
        | vnd_str_state_t      ---+---> State machine state
        | gsqueue_t *          ---+---> mblk_t Serialization queue
        | vnd_str_stat_t       ---+---> per-device kstats
        | vnd_str_capab_t      ---+----------------------------+
        | vnd_data_queue_t ---+   |                            |
        | vnd_data_queue_t -+ |   |                            v
        +-------------------+-+---+                  +---------------------+
                            | |                      | Stream capabilities |
                            | |                      | vnd_str_capab_t     |
                            | |                      |                     |
                            | |    supported caps <--+-- vnd_capab_flags_t |
                            | |    dld cap handle <--+-- void *            |
                            | |    direct tx func <--+-- vnd_dld_tx_t      |
                            | |                      +---------------------+
                            | |
           +----------------+ +-------------+
           |                                |
           v                                v
+-------------------+                  +-------------------+
| Read data queue   |                  | Write data queue  |
| vnd_data_queue_t  |                  | vnd_data_queue_t  |
|                   |                  |                   |
| size_t        ----+--> Current size  | size_t        ----+--> Current size
| size_t        ----+--> Max size      | size_t        ----+--> Max size
| mblk_t *      ----+--> Queue head    | mblk_t *      ----+--> Queue head
| mblk_t *      ----+--> Queue tail    | mblk_t *      ----+--> Queue tail
+-------------------+                  +-------------------+

At a high level there are three different core components. There is a per-netstack data structure, there is a character device and there is a STREAMS device.

A netstack, or networking stack, is a concept in illumos that contains an independent set of networking information. This includes TCP/IP state, routing tables, tunables, etc. Every zone in SmartOS has its own netstack which allows zones to more fully control and interface with networking. In addition, the system has a series of IP hooks which are used by things like ipfilter and ipd to manipulate packets. When the vnd kernel module is first loaded, it registers with the netstack sub-system which ensures that allows the vnd kernel module to create its per-netstack data. In addition to hooking, the per-netstack data is used to make sure that when a zone is halted that all of the associated vnd devices are torn down.

The character device is the interface between consumers and the system. The vnd module is actually a self-cloning device. Whenever a library handle is created it first opens the control node which is /dev/vnd/ctl. The act of opening that creates a clone of the device with a new minor number. When an existing vnd device is opened, then no cloning takes place, it opens one of the existing character devices.

The major magic happens when a vnd character device is asked to associate with a data link. This happens through an ioctl that the library wraps up and takes care of. When the device is associated, the kernel itself does what we call a layered open – it opens and holds another character or block device. In this case the vnd module does a layered open of the data link. However, the devices that back data links are still STREAMS devices that speak DLPI. To take care of dealing with all of the DLPI messages and set up the normal fast path, we use the third core component: the vnd STREAMS device.

The vnd STREAMS device is fairly special, it cannot be used outside of the kernel and is an implementation detail of the vnd driver. After doing the layered open, the vnd STREAMS device is pushed onto the stream head and it begins to exchange DLPI messages to set up and configure the data link. Once it has successfully walked through its state machine, the device is full and ready to go. As part of doing that, it asks for exclusive access to the device, enables us to receive all the packets that are originally destined for the device, and enables direct function calls for this through what’s referred to commonly as the fastpath. Once that’s set up, the character device and STREAMS device wire up with one another. Once that’s all been finished successfully, the character device can be fully initialized.

At this point in time, the device can be fully used for reading and writing packets. It can optionally be bound into the file system name space. That binding is facilitated by the sdev file system and its new plugin interface. We’ll go into more detail about that in a future entry.

The STREAMS device contains a lot of the meat for dealing with data. It contains the data queues and it controls all the interactions with DLD/DLS and the fastpath. In addition, it also knows about its gsqueue (generic serialization queue). The gsqueue is used to ensure that we properly handle the order of transmitted packets, especially when subject to flow control.

The following two diagrams (from the big theory statement) describe the path that data takes when received and when transmitted.

Receive path

                                 * . . . packets from gld
                          |     mac     |
                          |     dld     |
                                 * . . . dld direct callback
                         | vnd_mac_input |
  +---------+             +-------------+
  | dropped |<--*---------|  vnd_hooks  |
  |   by    |   .         +-------------+
  |  hooks  |   . drop probe     |
  +---------+     kstat bump     * . . . Do we have free
                                 |         buffer space?
                           no .  |      . yes
                              .  +      .
                          |                     |
                          * . . drop probe      * . . recv probe
                          |     kstat bump      |     kstat bump
                          v                     |
                       +---------+              * . . fire pollin
                       | freemsg |              v
                       +---------+   +-----------------------+
                                     | vnd_str_t`vns_dq_read |
                                              ^ ^
                              +----------+    | |     +---------+
                              | read(9E) |-->-+ +--<--| frameio |
                              +----------+            +---------+

Transmit path

  +-----------+   +--------------+       +-------------------------+   +------+
  | write(9E) |-->| Space in the |--*--->| gsqueue_enter_one()     |-->| Done |
  | frameio   |   | write queue? |  .    | +->vnd_squeue_tx_append |   +------+
  +-----------+   +--------------+  .    +-------------------------+
                          |   ^     .
                          |   |     . reserve space           from gsqueue
                          |   |                                   |
             queue  . . . *   |       space                       v
              full        |   * . . . avail          +------------------------+
                          v   |                      | vnd_squeue_tx_append() |
  +--------+          +------------+                 +------------------------+
  | EAGAIN |<--*------| Non-block? |<-+                           |
  +--------+   .      +------------+  |                           v
               . yes             v    |     wait          +--------------+
                           no . .*    * . . for           | append chain |
                                 +----+     space         | to outgoing  |
                                                          |  mblk chain  |
    from gsqueue                                          +--------------+
        |                                                        |
        |      +-------------------------------------------------+
        |      |
        |      |                            yes . . .
        v      v                                    .
   +-----------------------+    +--------------+    .     +------+
   | vnd_squeue_tx_drain() |--->| mac blocked? |----*---->| Done |
   +-----------------------+    +--------------+          +------+
                                        |                     |
      |                                 |           tx        |
      |                          no . . *           queue . . *
      | flow controlled .               |           empty     * . fire pollout
      |                 .               v                     |   if mblk_t's
    +-------------+     .      +---------------------+        |   sent
    | set blocked |<----*------| vnd_squeue_tx_one() |--------^-------+
    | flags       |            +---------------------+                |
    +-------------+    More data       |    |      |      More data   |
                       and limit       ^    v      * . .  and limit   ^
                       not reached . . *    |      |      reached     |
                                       +----+      |                  |
                                                   v                  |
    +----------+          +-------------+    +---------------------------+
    | mac flow |--------->| remove mac  |--->| gsqueue_enter_one() with  |
    | control  |          | block flags |    | vnd_squeue_tx_drain() and |
    | callback |          +-------------+    | GSQUEUE_FILL flag, iff    |
    +----------+                             | not already scheduled     |

Wrapping up

This entry introduces the tooling around vnd and provides a high level overview of the different components that make up the vnd module. In the next entry in the series on baridche, we’ll cover the new framed I/O abstraction. Entries following that will cover the new DLPI extensions, the sdev plugin interface, generalized squeues, and finally the road ahead.

Posted on March 24, 2014 at 8:31 am by rm · Permalink
In: Bardiche, Networking · Tagged with: , , ,