USB Topology

USB devices have been a mainstay of extending x86 systems for some time now. At Joyent, we used USB keys that contained our own version of iPXE to boot systems. As part of discussions with Alex Wilson around RFD 77 Hardware-backed per-zone crypto tokens, we talked about knowing and restricting which USB devices were trusted based on whether they were plugged into an internal or an external USB port.

While this wasn’t the first time that this idea had come up, by the time I started working on ideas for improving data center management, having better USB topology ended up on the list of problems I wanted to solve in RFD 89 Project Tiresias. At that point, though, how it was going to work was still a bit of an unknown.

The rest of this blog entry will focus on giving a bit of background on how USB works, some of the building blocks used for topology, examples of how we use the topology information, and then how to flesh it out for a new system.

USB Background

While USB, the Universal Serial Bus, is rather ubiquitous, some of its underlying implementation may not be as well known. This section describes USBv1, v2, and v3 devices. The USBv4 spec is basically Thunderbolt, which is a different enough beast that I don’t want to lump it in here.

Every USB device is plugged into a device called a hub. Each hub consists of one or more ports and may itself be plugged into another hub. In this manner, you can think of USB like a tree. When you get to the root of this tree, you reach what is called the root hub. The root hub is often a bit different from the other hubs: it bridges USB to the rest of the system. Most USB root hubs are either built into the platform’s chipset or they are on PCI Express add-on cards. The operating system generally interfaces with these devices using standards like xHCI, the eXtensible Host Controller Interface (they already used the E for EHCI, the Enhanced Host Controller Interface).

USB 2.0 and USB 3.x Compatibility

There are many versions of USB devices out on the market; however, even newer devices work on older systems (if you can find the right kind of plug). The secret to this is that every USB 3.x port must support USB 2.0. The way this works is that a USB 3.x port has the wiring for both USB 3.x and USB 2.0 at the same time. In general, this has been a good thing. It means that older devices will work on newer systems and newer devices will work on older systems albeit not always at the maximum speed that they support. However, this does make our life a bit more complicated when it comes to topology.

While a single physical port can support both USB 3.x and USB 2.0, to the operating system the one physical port shows up as two different logical ports on the host controller. Generally, a device will select either USB 3.x or USB 2.0 signaling based on what it supports and therefore it will only show up on one of the two logical ports. However, when it comes to topology, the user cares only about the fact that it’s in a given physical port; they don’t (generally speaking) care about the fact that there are multiple logical ports.

USB hubs, which allow for more devices to exist, are an exception to the rule of only using USB 3.x or USB 2.0 signaling. A USB 3.x hub is actually two hubs in one! When a USB 3.x hub is plugged into a port that supports USB 3.x, it will enumerate as two different hubs: one on the USB 2.0 logical port and one on the USB 3.x logical port. This means that the OS will actually see two distinct USB Hubs that it will enumerate and manage independently.

Ultimately, these are all good properties for USB devices to have. It does mean that we have to do a bit more work to map everything together, but that’s fine — that’s our job.

Multiple Host Controllers

The picture we painted above is a nice one, but doesn’t reflect all systems. One of the challenges of USB 3.0 support was that it introduced a new host controller interface: xhci replaced ehci. Now, to help with the transition, Intel produced a number of chipsets that had both xhci and ehci controllers on them. When the system booted up, all of the USB ports would be directed towards the ehci controller. However, on these platforms an xhci device driver could write to a special register which would result in rerouting all of the ports from the ehci controller to the xhci controller.

This allowed operating systems which didn’t support xhci to still have working USB. On Intel platforms, this duality was removed with Intel’s Skylake chipsets.

From topology’s perspective, this means that the same physical port could show up not just as two different ports on the same controller, but actually as multiple, disjoint ports on different controllers!

Companion Controllers

With USB 3.x, a single host controller can support USB 3.x, 2.0, and 1.x devices. However, before USB 3.0 this wasn’t the case. Instead, platforms placed what was called a ‘companion controller’ on the motherboard. The basic idea was that the USB 2.0 ports were wired up to one controller and the other ports were wired up to a companion USB 1.0/USB 1.1 controller (ohci or uhci).

The companion controller model required the various drivers to be aware of this reality and trade things back and forth between them. Folding them together in xhci made things simpler. From a topology perspective, this can result in the same problem that we have in the pre-Skylake USB 3.0 supporting systems: a given physical port can show up under multiple distinct devices.

USB Descriptors and Capabilities

Information about USB devices is broken down into two different groups of information:

  1. Descriptors
  2. Capabilities

Descriptors are used to identify information about the device such as the manufacturer ID, the device ID, the USB revision the device supports, etc. There are descriptors which identify characteristics about a shared class of devices and others which identify information about different configurations that the device supports. For the purposes of USB topology, we primarily care about the device descriptor.

USB capabilities are stored in what’s called the binary object store. Capabilities first showed up in the USB 3.0 specification (though they appeared first in the briefly used Wireless USB specification). These capabilities are required for devices and generally describe USB-wide aspects of the device.

Topology Building Blocks

USB Topology is complicated for a few different reasons. The first is the fact that a single physical port can show up as two different logical ports to the operating system. The second challenge is actually figuring out what all the ports are used for, if they’re used at all.

This second problem deserves a bit more explanation. Most systems have USB support from their platform chipset. The platform chipset implements a number of USB ports. For example, let’s look at one of the Intel 300 series chipsets, the Z390. If you look at the I/O specifications, it lists that it supports 14 USB ports, all of which can support USB 2.0 and some of which can support different forms of USB 3.1. Now, the standard system doesn’t actually have 14 USB ports all wired up, only a subset of them. Even the mobile chipsets are the same and I certainly don’t have 14 USB ports all over my laptop. This means that there are ports that the OS can see, but may not be used or wired up at all. Or they may be wired up to an internal hub.

While this is a challenge, it is surmountable. We’ll talk about a few of the things we can use to map ports together and a few things that also don’t work for us.

ACPI

ACPI, the Advanced Configuration and Power Interface, provides a multitude of different capabilities to the system. However, there are a few that are specific to USB that are useful for us.

In ACPI, there is a notion of a tree of devices. Every device in the tree has properties and methods, provided by the platform firmware, that the operating system can read and invoke. When the operating system is looking for devices, it searches for ACPI devices in the same way it might search for PCI devices. In this tree, we’ll find three different relevant items: PCI devices, USB hubs, and USB ports. A USB host controller will be represented as a PCI device and it will have a child device, which is a USB hub, representing the root hub that the operating system sees. The hub will then have a port entry that corresponds to each logical port that the hub has. If the platform has other hubs built into it (not ones that a user plugs in), then they might also be represented in the ACPI tree.

For each port in the tree, there are three attributes that we care about. The first is the _ADR method. It is a generic ACPI method that determines the address of a given object. The type of address will vary based on the type of the device. A PCI device would have its device and function number while a SATA device would have the port number. In the case of a USB port, it gives us the port number on the hub which corresponds to the logical view that the operating system will see. This gives us a way of correlating the ACPI port objects with the ports that the operating system sees.

The next thing that we use from ACPI is the optional _UPC method, which is used to return the USB port capabilities. It tells us two different types of information:

  1. Whether a device may be connected to the port or not.
  2. The type of USB port. For example, whether it’s a Type A or Type C connector.
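
The two pieces of information above can be pulled out of a _UPC result with very little code. The following is a minimal sketch, not the illumos implementation: it assumes the _UPC package has already been evaluated into a list of four integers (connectable, type, and two reserved values), and the table of type names is deliberately partial. The function and table names are made up for illustration.

```python
# Partial table of _UPC port types. Only a few well-known codes are listed;
# anything else is reported as unknown rather than guessed at.
UPC_PORT_TYPES = {
    0x00: "Type A connector",
    0x03: "USB 3 Standard-A connector",
    0xFF: "Proprietary connector",
}


def decode_upc(package):
    """Decode an already-evaluated _UPC package (a list of four integers)."""
    connectable, port_type = package[0], package[1]
    return {
        # Any non-zero value means a device may be connected to the port.
        "connectable": connectable != 0,
        "port_type": UPC_PORT_TYPES.get(
            port_type, f"unknown (0x{port_type:02x})"
        ),
    }


print(decode_upc([0xFF, 0x03, 0, 0]))
```

Keeping unknown codes visible, instead of mapping them to a default, makes it easier to spot firmware that reports something unexpected.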

The next piece that we use is the _PLD method, which is the physical location of device. This method returns a binary description of the physical device information. It includes some information like the panel, orientation, and more. While theoretically useful, in practice, the binary payload makes it hard to really deterministically say something useful about the layout of the ports.

Now, you may ask: if it’s hard to say something sensible about it, then why bother using it at all? The answer lies in the xHCI specification. The xHCI specification says that if you want to map two logical ports on an xhci controller together, such as a USB 2 port and its corresponding USB 3 port, then you can use the physical location of device information to do so. If two ports have the same panel, horizontal and vertical position, shape, group orientation, group position, and group token, then they are the same physical port. This only works within a single controller. Unfortunately, you cannot map two ports together on different controllers this way.
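
To make the matching rule concrete, here is a small sketch of grouping logical ports by their location fields. It assumes the _PLD buffer has already been decoded into integers; the PldKey type, the field names, and the port names in the example are all hypothetical and are not taken from the illumos source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PldKey:
    """The _PLD fields that must all match for two ports to be the same."""
    panel: int
    vertical_position: int
    horizontal_position: int
    shape: int
    group_orientation: int
    group_position: int
    group_token: int


def group_by_location(ports):
    """Group (controller, logical port name, PldKey) tuples.

    The controller is part of the grouping key because the rule only
    holds within a single controller.
    """
    groups = defaultdict(list)
    for controller, name, key in ports:
        groups[(controller, key)].append(name)
    return dict(groups)


key = PldKey(panel=5, vertical_position=0, horizontal_position=1,
             shape=7, group_orientation=0, group_position=2, group_token=1)
ports = [("xhci0", "HS11", key), ("xhci0", "SSP2", key), ("ehci1", "PR15", key)]
for (controller, _), names in group_by_location(ports).items():
    print(controller, names)
```

In this example the two xhci0 ports end up in one group, while the ehci1 port stays on its own even though its location fields are identical.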

Exposing Information

For each USB root controller and its corresponding ports, we end up creating a logical device node for it in the devices tree. ACPI devices are rooted under the fw node. Each USB root hub shows up under it with a way to map it to its corresponding PCI device. Under each hub is a port, and if there’s a hub under that, then another USB hub and its ports.

Each port has a series of properties that correspond to the various ACPI methods discussed above. More specifically, we have the following properties:

  1. acpi-address: The value of the address found through the _ADR method.
  2. acpi-physical-location: A byte array that corresponds to the raw ACPI values. The kernel does not try to interpret the data and instead that is done in user land.
  3. usb-port-connectable: A property that if present indicates that the port is connectable.
  4. usb-port-type: A property that indicates the ACPI USB port type.

These properties can all be read with the libdevinfo(3LIB) library, which allows software to take a snapshot of the tree, walk the various nodes, and read the properties of the different nodes.

A Building Block Not Taken: SMBIOS

SMBIOS is the System Management BIOS. It provides tables of static information about the system, such as lists of CPUs, memory devices, and more. One of the things I’ve enjoyed doing in illumos is keeping this information up to date and improving it as new releases of the specification come out. It’s proven to be invaluable for other efforts like labeling PCIe devices.

This time though, I mention SMBIOS because it’s something that one might normally think to use, but actually doesn’t work. One of the SMBIOS tables is a list of ports and what they connect to. Unfortunately, the SMBIOS tables usually refer to USB ports based on the headers on the motherboard. While this can be useful for some cases, it isn’t for what we care about — mapping ports that you plug devices into back to the corresponding ports that the operating systems see.

Enumerating USB Topology

With the different building blocks in place, let’s change direction and look at how we expose all of this information in the FMA topology trees. We’ll first look at what we expose and then we’ll come back and explain how that’s put together in FMA. We enumerate USB ports in FMA’s topology in three different groups:

  1. USB ports that we know correspond to the chassis.
  2. USB ports that come from a PCIe add on card.
  3. All the remaining USB ports, which we place off of the motherboard.

For each port, we list the following information:

  1. The USB revisions the port supports, such as 2.0, 3.0, etc.
  2. The type of the port if we have ACPI information or a metadata file to tell us about it.
  3. Whether we consider the port connectable, visible, or disconnected.
  4. Information about whether we consider the port internal or external, if we have explicit metadata.
  5. A list of all of the logical ports this physical port represents. For example, if a port is wired up to an ehci controller and two ports on an xhci controller (one for USB 2.0 and one for USB 3.x), then we’ll list three different children here.
  6. A label that describes the port, if metadata provides it. This is a string that a person uses to know how to identify a port. For example, ‘Rear Upper Left USB’.

The following is an example of what a single USB port node might look like in fmtopo:

hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740/chassis=0/port=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740/chassis=0/port=0
    FRU               fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740/chassis=0
    label             string    Rear Upper Left USB
  group: authority                      version: 1   stability: Private/Private
    product-id        string    Joyent-S10G5
    chassis-id        string    S287161X8300740
    server-id         string    magma
  group: port                           version: 1   stability: Private/Private
    type              string    usb
  group: usb-port                       version: 1   stability: Private/Private
    port-type         string    USB 3 Standard-A connector
    usb-versions      string[]  [ "2.0" "3.0" ]
    port-attributes   string[]  [ "user-visible" "external-port" ]
    logical-ports     string[]  [ "xhci0@2" "xhci0@18" ]
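
Tools that consume this output will usually want to split each logical-ports entry back into its pieces. The sketch below assumes, based only on the example above, that an entry is a driver name, a driver instance number, an '@', and a port number. The port number is kept as a string because the example does not say which base it is printed in. The function name is made up for illustration.

```python
import re

# driver name, instance number, '@', port number (base not assumed)
LOGICAL_PORT = re.compile(r"^([a-z]+)(\d+)@([0-9a-fA-F]+)$")


def parse_logical_port(entry):
    """Split an entry such as 'xhci0@18' into (driver, instance, port)."""
    match = LOGICAL_PORT.match(entry)
    if match is None:
        raise ValueError(f"unrecognized logical port: {entry!r}")
    driver, instance, port = match.groups()
    return driver, int(instance), port


for entry in ["xhci0@2", "xhci0@18"]:
    print(parse_logical_port(entry))
```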

Under a port, if a USB device is plugged in, we’ll list information about the device. This includes:

  1. The USB revision of the device. For example, this could be 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, etc.
  2. The numeric vendor and device identifiers, which are used to inform the system about the device so the right driver can be attached.
  3. The revision ID of the device. This is a vendor-specific name.
  4. The device’s USB vendor and product name strings, if it provides them.
  5. The USB device’s serial number, if it has one.
  6. The speed of the device, for example super-speed, full-speed, etc. This represents the protocol speed at which the device is operating.

Next, we’ll create a set of properties that describe the driver that’s attached to the device, if any. This is a standard property group that you’ll find on other nodes in the tree, such as a PCIe device. This includes:

  1. The name of the driver.
  2. The instance of the driver (a logical construct in the OS).
  3. The path of the driver in /devices.
  4. The module information for the driver, such as its FMRI (fault management resource identifier).

Here’s an example of the information for a device itself in fmtopo:

hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740:serial=00241D8CE563C1B1E94FEBB4:part=DataTraveler-2.0:revision=100/motherboard=0/port=5/usb-device=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740:serial=00241D8CE563C1B1E94FEBB4:part=DataTraveler-2.0:revision=100/motherboard=0/port=5/usb-device=0
    FRU               fmri      hc://:product-id=Joyent-S10G5:server-id=magma:chassis-id=S287161X8300740:serial=00241D8CE563C1B1E94FEBB4:part=DataTraveler-2.0:revision=100/motherboard=0/port=5/usb-device=0
    label             string    Internal USB
  group: authority                      version: 1   stability: Private/Private
    product-id        string    Joyent-S10G5
    chassis-id        string    S287161X8300740
    server-id         string    magma
  group: usb-properties                 version: 1   stability: Private/Private
    usb-port          uint32    0xa
    usb-vendor-id     int32     2352
    usb-product-id    int32     25924
    usb-revision-id   string    100
    usb-version       string    2.0
    usb-vendor-name   string    Kingston
    usb-product-name  string    DataTraveler 2.0
    usb-serialno      string    00241D8CE563C1B1E94FEBB4
    usb-speed         string    high-speed
  group: io                             version: 1   stability: Private/Private
    driver            string    scsa2usb
    instance          uint32    0x0
    devfs-path        string    /pci@0,0/pci15d9,981@14/storage@a
    module            fmri      mod:///mod-name=scsa2usb/mod-id=110
  group: binding                        version: 1   stability: Private/Private
    occupant-path     string    /pci@0,0/pci15d9,981@14/storage@a/disk@0,0
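
Note that fmtopo prints the vendor and product identifiers in decimal, while many USB tools and ID databases show them as four-digit hexadecimal values. A tiny helper, whose name is made up for illustration, converts between the two:

```python
def usb_id(vendor_id, product_id):
    """Format decimal USB IDs as the common 'vvvv:pppp' hexadecimal pair."""
    return f"{vendor_id:04x}:{product_id:04x}"


# The values from the usb-properties group above.
print(usb_id(2352, 25924))  # prints "0930:6544"
```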

Finally, based on the kind of device we encounter, we might enumerate child nodes. Right now there are two cases in which we’ll do so:

  1. If we encounter a hub and we have found its children, then we’ll enumerate them as we described above and any USB devices that we find under them.
  2. If we encounter a USB device that represents a disk, like an external hard drive or USB key, then we’ll set some additional properties on the node and call into the disk enumerator. This’ll create a disk node that can be used to map disk information back to the physical device.

Other Uses

While having the items in the tree makes it easier for us to see everything in the system, once we have location information and serial numbers, there are other ways we can use this. The tool diskinfo lists disks and, when we have it, the physical location of them and their serial number. For example, when using SATA and SAS drives, this can tell you which drive bay they’re in, whether it’s a front or rear drive, and more.

To get this information, the diskinfo program takes a snapshot of the system topology and then maps the discovered disks to the corresponding disk nodes in the system’s topology. We’ve done the same for these USB devices. So if we have topology information, we can tell you which USB device it is that’s plugged in. For example:

# diskinfo -P
DISK                    VID      PID              SERIAL               FLT LOC LOCATION
c1t0d0                  Kingston DataTraveler 2.0 00241D8CE563C1B1E94FEBB4 -   -   Internal USB
c2t5000CCA0496FCA6Dd0   HGST     HUSMH8010BSS204  0HWZGX6A             no  no  [0] Slot00
c2t5000CCA25319F125d0   HGST     HUH721212AL4200  8DGG88SZ             no  no  [0] Slot01
c2t5000CCA25318BE1Dd0   HGST     HUH721212AL4200  8DGELUWZ             no  no  [0] Slot02
c2t5000CCA2530F9CD5d0   HGST     HUH721212AL4200  8DG8L5JZ             no  no  [0] Slot03
c2t5000CCA25318BE15d0   HGST     HUH721212AL4200  8DGELUUZ             no  no  [0] Slot04
...

Constructing Topology

Now that we’ve talked about how we use topology and most of the operating system building blocks, it’s worth spending some time talking about how we actually build the USB topology itself.

We gather data from three different sources:

  1. A USB topology metadata file.
  2. Walking the devices tree looking for USB root hubs and their children (non-ACPI).
  3. Walking the ACPI firmware tree, looking for USB information.

Once we gather information from all three of these sources, we combine them all together to create a single, coherent map. We first map an ACPI node to its corresponding devcfg node. Then, if we’ve opted to map ports together based on ACPI (more on that in a little bit), then we’ll combine the different logical nodes together.

USB Metadata File

The topology USB metadata file allows us to create a per-vendor, per-product map of additional information. The file is a simple format that has keywords and arguments.

A file first identifies a given port. From a port, it will then provide additional metadata such as a label and whether or not it is internal or external. Next, if we need to override the ACPI port type either because it’s missing or incorrect, then we can do so here. Finally, a series of ACPI paths that describe the port are listed. This way, when a port has a USB 2.0 and a USB 3.x component, because we’ve listed both, we’ll be able to apply this metadata to either port.

Finally, there are a number of top-level directives. These describe the matching behavior that we’d like to use. We can do the following:

  1. Disable the use of ACPI entirely on this platform.
  2. Disable the use of ACPI matching. We’ve done this on platforms where we’ve determined that the ACPI information that the platform has is incorrect.
  3. Enable matching based on the metadata information. This is useful in tandem with the above. Here, we use the ACPI paths to perform matching the same way that we did elsewhere.

Here’s a portion of a USB metadata file:

port
        label
                Rear Lower Right USB
        chassis
        external
        port-type
                0x3
        acpi-path
                \_SB_.PCI0.XHCI.RHUB.HS11
        acpi-path
                \_SB_.PCI0.XHCI.RHUB.SSP2
        acpi-path
                \_SB_.PCI0.EHC2.HUBN.PR01.PR15
end-port

port
        label
                Internal USB
        internal
        port-type
                0x3
        acpi-path
                \_SB_.PCI0.XHCI.RHUB.HS07
        acpi-path
                \_SB_.PCI0.XHCI.RHUB.SSP4
        acpi-path
                \_SB_.PCI0.EHC2.HUBN.PR01.PR13
end-port

This example has two ports present. Each port has a label which is used to identify where the port is for a human. The ‘internal’ and ‘external’ keywords are used to indicate whether the port is internal to the system or external. In this case, the ‘internal’ port is found on the motherboard of the system, so it cannot be serviced or used without opening the system. The ‘chassis’ keyword indicates that this port is found on the chassis of the system itself. This is where most USB ports are that a user would find and use.

The port-type here indicates that they are USB 3 Type-A connectors, meaning that they support both USB 2.0 and USB 3.x. Finally, the various ‘acpi-path’ entries are used to indicate the ACPI path towards the port. Note how the ports are labeled based on the names of the ACPI device nodes. Each ‘.’ separates each node. The starting ‘\’ character is just part of the constructed path, it is not an escape character.
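
To show how little machinery the port blocks need, here is a minimal parsing sketch. It is not the illumos parser: it handles only the keywords that appear in the example above, ignores the top-level directives (whose spellings are not shown here), and assumes that indentation is cosmetic and that an argument always sits on the line after its keyword. All names in the code are made up for illustration.

```python
ARG_KEYWORDS = {"label", "port-type", "acpi-path"}   # argument on next line
FLAG_KEYWORDS = {"chassis", "internal", "external"}  # no argument


def parse_ports(text):
    """Parse 'port ... end-port' blocks into a list of dictionaries."""
    # Strip indentation and skip blank lines.
    lines = (line.strip() for line in text.splitlines() if line.strip())
    ports, current = [], None
    for word in lines:
        if word == "port":
            if current is not None:
                raise ValueError("nested port block")
            current = {"label": None, "port-type": None,
                       "flags": set(), "acpi-paths": []}
        elif current is None:
            raise ValueError(f"{word!r} outside of a port block")
        elif word == "end-port":
            ports.append(current)
            current = None
        elif word in FLAG_KEYWORDS:
            current["flags"].add(word)
        elif word in ARG_KEYWORDS:
            arg = next(lines, None)
            if arg is None:
                raise ValueError(f"{word!r} is missing its argument")
            if word == "acpi-path":
                # '.' separates ACPI nodes; the leading backslash is kept.
                current["acpi-paths"].append(arg.split("."))
            elif word == "port-type":
                current["port-type"] = int(arg, 16)
            else:
                current["label"] = arg
        else:
            raise ValueError(f"unknown keyword {word!r}")
    if current is not None:
        raise ValueError("missing end-port")
    return ports


SAMPLE = r"""
port
        label
                Internal USB
        internal
        port-type
                0x3
        acpi-path
                \_SB_.PCI0.XHCI.RHUB.HS07
        acpi-path
                \_SB_.PCI0.XHCI.RHUB.SSP4
end-port
"""

for port in parse_ports(SAMPLE):
    print(port["label"], sorted(port["flags"]), len(port["acpi-paths"]))
```

Rejecting unknown keywords outright, rather than skipping them, is a deliberate choice in this sketch: a typo in a map file would otherwise silently drop a port.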

Writing Your Own Map

The way I’ve done this for other platforms is to find a USB 2.0 device and a USB 3.0 device and plug each one into every port in turn. At each point, I look at the information in FMA and in the devices tree with prtconf. By looking at the port numbers and what exists in the devices tree, one can, with a bit of manual work, piece together what’s required.

One challenge with doing this is when you’re on a system that has both the ehci and xhci controllers. Generally this is on Intel platforms from Sandy Bridge through Broadwell. In that case, you need to go into the BIOS and do this with xhci enabled and disabled in the BIOS. This will make sure that you can get all the ports connected to the ehci controller.

Further Reading

If you’re interested in the illumos implementation:

For more on the specifications mentioned:

Looking Ahead

While we’ve done some work, there’s more that we can do to improve the situation here in the future. This section discusses some of those future directions.

Container ID Capability

The Container ID capability is the first tool at our disposal. The container ID is a 128-bit UUID — a universally unique identifier. All USB 3.x hubs are required to implement this capability and it can be read from the USB binary object store.

The idea behind this capability is that a device will have the same container ID value regardless of the type of bus that they’re on. So even though a USB 3.x hub will appear as two distinct hubs to the operating system, if they’re the same device, they’ll have the same container ID UUID. This gives the operating system a way to map such devices together.

When the USB Container ID capability is found in the binary object store, we add a property to the device node that indicates the UUID. This translates into a 16-byte byte array on the node whose value is the UUID. With this in mind, the USB topology plugin could go ahead and find hubs with matching container IDs.
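
The matching step itself could be quite small. The sketch below assumes each hub has already been read out as a name plus its 16-byte container ID property; the function name and hub names are hypothetical, and the raw bytes are used as-is without interpreting any byte ordering.

```python
import uuid
from collections import defaultdict


def pair_hubs(hubs):
    """Group hubs by container ID.

    hubs is an iterable of (name, 16-byte container ID) pairs. Only
    container IDs shared by more than one hub are returned.
    """
    by_container = defaultdict(list)
    for name, raw in hubs:
        by_container[uuid.UUID(bytes=bytes(raw))].append(name)
    return {cid: names for cid, names in by_container.items()
            if len(names) > 1}


shared = bytes(range(16))
other = bytes(range(16, 32))
hubs = [("hub-usb2", shared), ("hub-usb3", shared), ("hub-other", other)]
for cid, names in pair_hubs(hubs).items():
    print(cid, names)
```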

More Consumers of USB Topology

While we’ve enhanced FMA and some of the tools like diskinfo(1M), we can do more here. For example, tools like cfgadm(1M) could be enhanced to query topology information, when available, while listing devices in verbose mode.

If we have a mapping that we feel confident in, it could even make sense to add another alias under /dev to the device. Though these labels aren’t necessarily stable right now (as they’re meant for humans), so we’ll have to see what makes sense there.

Easier Tools to Build Topology Maps

Right now, it can take a bit of effort to build a topology map. It would be great if we had easy tooling for developing USB topology maps for different platforms that would walk someone through putting this together. It would also be useful if we had a way for a user to generate a topology for their system. That way, even if it’s something custom that’s been put together, it still isn’t too hard to put together a topology for their system.

What’s Next?

There’s a lot more to talk about with USB, topology, and hardware in general. If you’re interested in working on any of these aspects, reach out. I’m sure there’ll be more to do here as we have to deal with USB 3.2, Thunderbolt, and USB 4.0.

If you’d like to get involved, get in touch with the illumos community on IRC in #illumos on Freenode or a mailing list and I or someone else will help you out and see what we can do. As long as you’re willing to learn, receive feedback, and keep going despite difficulties, then it doesn’t matter what your experience is.

Posted on September 27, 2019 at 9:34 am by rm

Transceivers: The Device Between the NIC and the Network

One of the stories that has stuck with me over the years came from a support case that a former colleague, Ryan Nelson, had point on. At Joyent, we had third parties running our cloud orchestration software in their own data centers with hardware that they had acquired and assembled themselves. In this particular episode, Ryan was diagnosing a case where a customer was complaining about the fact that networking wasn’t working for them. The operating system saw the link as down, but the customer insisted it was plugged into the switch and that a transceiver was plugged in. Eventually, Ryan asked them to take a picture of the back of the server, which is where the NIC (Network Interface Card) would be visible. It turned out that the transceiver looked like it had been run over by a truck and had been jammed in — it didn’t matter what NIC it was plugged into, it was never going to work.

As part of a broader push on datacenter management, I was thinking about this story and some questions that had often come up in the field regarding why the NIC said the link was down. These were:

  1. Was there actually a transceiver plugged into the NIC?
  2. If so, did the NIC actually support using this transceiver?

Now, the second question is a bit of a funny one. The NIC obviously knows whether or not it can use what’s plugged in, but almost every time, the system doesn’t actually make it easy to find out. A lot of NIC drivers will emit a message that goes to a system log when the transceiver is plugged in or the NIC driver first attaches, but if you’re not looking for that message or just don’t happen to be on the system’s console when that happens, suddenly you’re out of luck. You might also ask why are there transceivers that aren’t supported by a NIC, but that’s a real can of worms.

Anyways, with that all in mind, I set out on a bit of a journey and put together some more concrete proposals for what to do here in terms of RFD 89: Project Tiresias. We’ll spend the rest of this entry going into a bit of background on transceivers and then discuss how we go from knowing whether or not they’re plugged in to actually determining who made them and where they are in the system.

What is a transceiver?

We’ve been using the term 'transceiver' quite a bit so far, but that’s a pretty generic term. Let’s spend a bit of time talking through that. First, we’re really focused on transceivers as used in the context of networking. When people think of wired networking, the most common thing that comes to mind is an Ethernet cable. Ethernet isn’t the only type of cable that’s been used. Before Ethernet was common, BNC coaxial cables were used on some NICs as well.

However, in the data center, Ethernet didn’t end up keeping up with the speeds and distances that connections were being used for (though 10 Gigabit Ethernet, 10GBASE-T, has started becoming more common). In this space, fiber-optic cables and copper twinaxial cables (twinax) are much more prominent. Note, twinaxial cables are rather different from their BNC coaxial relatives. Twinaxial cables are used more often when there are shorter distances to cover, such as between a top of rack switch and a server. Fiber-optic cables often cover longer distances or have higher throughputs.

Because different types of cables are used in different situations, several vendors got together and agreed upon a set of standards to use when manufacturing these cables. This allowed NIC manufacturers to design a single physical and electrical interface, but still support different types of transceivers. These standards (technically a multi-source agreement) are maintained by the Small Form Factor (SFF) Committee. The committee manages standards not only for networking, but also for SAS cables and other devices.

If you’ve worked in this space, you may have heard of what are called SFP and SFP+ cables. These cables generally support 1 and 10 Gigabit networking respectively. The transceiver is controlled over an i2c bus by the NIC. The addresses and their meanings are standardized. They were originally standardized in a standard called INF-8074, but the current active standard for these devices is called SFF-8472.

With faster networking speeds, there have been additional revisions and standards put out. Devices that support 40 Gigabit networking are called QSFP+ because they combine 4 SFP+ devices. To support 25 Gigabit networking, a variant of SFP+ was created called SFP28. Finally, to support 100 Gigabit networking, they combined 4 SFP28 devices together, giving us QSFP28. The 40 Gigabit devices are standardized in SFF-8436 and the 100 Gigabit devices have their management interface described in SFF-8636.

The standards for the various devices have somewhat similar layouts. They break data into a series of pages, and a specific offset into a page can then be accessed via the NIC’s i2c bus. These pages contain some of the following information:

The pages and addresses change from specification to specification, though a large amount of the data overlaps between them. The health information of the device is required when the connector is considered active (generally fiber-optic cables with lasers) and is optional when you have a passive device (such as a copper twinax cable).
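Because the layout changes between specifications, a consumer first has to work out which one applies. One way to do that is to look at the identifier byte at offset 0 of the first page. The sketch below assumes the identifier values from SFF-8024 (0x03 for SFP/SFP+/SFP28, 0x0d for QSFP+, and 0x11 for QSFP28). The enum and function names are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the identifier values are assumed from SFF-8024. */
typedef enum sff_spec {
	SFF_SPEC_UNKNOWN = 0,
	SFF_SPEC_8472,	/* SFP, SFP+, SFP28 */
	SFF_SPEC_8436,	/* QSFP+ */
	SFF_SPEC_8636	/* QSFP28 */
} sff_spec_t;

/* Byte 0 of the first page identifies the type of transceiver. */
static sff_spec_t
sff_spec_for_page(const uint8_t *page)
{
	assert(page != NULL);

	switch (page[0]) {
	case 0x03:
		return (SFF_SPEC_8472);
	case 0x0d:
		return (SFF_SPEC_8436);
	case 0x11:
		return (SFF_SPEC_8636);
	default:
		return (SFF_SPEC_UNKNOWN);
	}
}
```

Anything that isn’t recognized is reported as unknown rather than guessed at, which lets a caller fall back to dumping the raw bytes.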

The MAC Transceiver Capability

The first part of our adventure with getting to this data begins in the operating system kernel. Similar to the case of managing NIC LEDs, the networking device driver framework has an optional capability that a driver can implement to expose this information called MAC_CAPAB_TRANSCEIVER.

The transceiver capability has a structure that the device driver has to fill out which describes some basic information for dealing with transceivers. This includes the following fields:

The driver first indicates the number of transceivers that are present for it. In general, this is one. However some devices actually support combining multiple transceivers and ports into one logical device — though this isn’t commonly used. The next item, the mct_info() function, is used to answer the two questions that were posed at the beginning of this: Does the NIC think a transceiver is present and can the NIC use the transceiver? Finally, the mct_read() function allows us to go through and read specific regions of the memory map of the transceiver. Generally, user land reads an entire 256-byte page at any given time.
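To make the shape of this more concrete, here is a self-contained sketch of a capability structure along these lines, together with a fake driver that serves a single in-memory page. The mct_info() and mct_read() names come from the description above. The other field names, the argument lists, and the fake_* helpers are assumptions made for illustration, so check the mac headers for the real definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed shapes; only mct_info and mct_read are named in the text. */
typedef struct xcvr_info {
	bool xi_present;	/* does the NIC think one is plugged in? */
	bool xi_usable;		/* can the NIC actually use it? */
} xcvr_info_t;

typedef struct xcvr_capab {
	unsigned int xc_ntransceivers;
	int (*mct_info)(void *driver, unsigned int id, xcvr_info_t *infop);
	int (*mct_read)(void *driver, unsigned int id, unsigned int page,
	    void *buf, size_t nbytes, size_t offset, size_t *nread);
} xcvr_capab_t;

/* A fake driver: one transceiver whose page 0xa0 lives in memory. */
typedef struct fake_driver {
	uint8_t fd_page[256];
} fake_driver_t;

static int
fake_info(void *driver, unsigned int id, xcvr_info_t *infop)
{
	(void) driver;
	if (id != 0)
		return (EINVAL);
	infop->xi_present = true;
	infop->xi_usable = true;
	return (0);
}

static int
fake_read(void *driver, unsigned int id, unsigned int page, void *buf,
    size_t nbytes, size_t offset, size_t *nread)
{
	fake_driver_t *fd = driver;

	assert(buf != NULL && nread != NULL);
	if (id != 0 || page != 0xa0 || offset >= sizeof (fd->fd_page))
		return (EINVAL);
	/* Clamp the request so that it never runs off the end of the page. */
	if (nbytes > sizeof (fd->fd_page) - offset)
		nbytes = sizeof (fd->fd_page) - offset;
	memcpy(buf, fd->fd_page + offset, nbytes);
	*nread = nbytes;
	return (0);
}

static const xcvr_capab_t fake_capab = {
	.xc_ntransceivers = 1,
	.mct_info = fake_info,
	.mct_read = fake_read
};
```

A caller would ask mct_info() whether transceiver 0 is present and usable, and then use mct_read() with an offset of 0 and a 256-byte buffer to pull in a whole page at once.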

The kernel only facilitates reading information. It generally doesn’t try and make semantic sense of any of the data. That is purposefully left to user land — unless there’s a good reason to parse binary data in the kernel, you’re usually better off not doing that.

The following device families and drivers support the MAC_CAPAB_TRANSCEIVER capability. Some drivers that you may be more familiar with such as 1 Gigabit Ethernet devices aren’t on this list because they don’t support transceivers. Supported devices include:

The way that each driver gets access to this information varies from device to device. Some, like i40e and cxgbe, issue a firmware command to read information from the transceiver. Others have dedicated registers that can be programmed to read arbitrary data from the i2c device.

libsff and dltraninfo

Once we have the ability to read the actual data from the transceiver, we have to make logical sense of it. To handle that, the first thing I did was write a small library called libsff. The goal of libsff is to parse the various SFF binary data payloads and return structured data as a set of name-value pairs.

If you look at the header file, libsff.h, you’ll see a list of different keys that can be looked up. Some of these are rather straightforward, such as the string "vendor", which has the name of the manufacturer of the transceiver. Others are a bit more opaque and require referencing the actual SFF documents. Another useful feature of the library is that it tries to abstract out the differences between different versions of the specifications. The goal is that when there is similar data, it should always be found under the same key even if it lives in wildly different parts of the memory map or the way we have to parse the data is different. The goal of libraries (or really any interface and abstraction) should be to take something grotty and transform it into something more usable, as though reality were as simple as it presents.
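As a rough illustration of the kind of work this involves, the sketch below pulls the vendor name out of a raw SFP page. It assumes the SFF-8472 layout, where the vendor name is 16 bytes of space-padded ASCII starting at offset 20 of page 0xa0. The function and macro names are hypothetical and this is not libsff’s actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	SFP_VENDOR_OFF	20	/* assumed from SFF-8472 */
#define	SFP_VENDOR_LEN	16

/*
 * Copy a space-padded ASCII field out of a page and NUL-terminate it with
 * the trailing padding removed. out must have room for len + 1 bytes.
 */
static void
sff_copy_string(const uint8_t *page, size_t off, size_t len, char *out)
{
	assert(page != NULL && out != NULL);

	memcpy(out, page + off, len);
	while (len > 0 && out[len - 1] == ' ')
		len--;
	out[len] = '\0';
}
```

The same helper would work for the other fixed-width string fields, such as the part number and serial number, given their offsets and lengths from the relevant specification.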

The one thing that the library doesn’t generally do today is parse all of the sensor data that may be available on the transceiver. The main reason for this is that the vast majority of transceivers that I had access to did not implement it. On SFP, SFP+, and SFP28 devices, sensor information is optional for twinax-based devices. With a few devices to test with, it would be pretty straightforward to add support for it though.

On its own, a library isn’t useful unless it has a consumer. The first consumer that I’ll discuss is dltraninfo. This is an unstable, development program that I wrote to exercise this functionality and to try and get a sense of what interfaces might be useful. There are two forms of the dltraninfo command. The first answers the questions that we laid out in the beginning about whether the transceiver is present or usable. When run this way, you see something like:

# /usr/lib/dl/dltraninfo
ixgbe0: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe1: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe2: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes
ixgbe3: discovered 1 transceiver
        transceiver 0 present: yes
        transceiver 0 usable: yes

The next option is to read the information from the transceiver. Here’s an example of reading this on an Intel 10 Gbit fiber-optic transceiver:

# /usr/lib/dl/dltraninfo -v ixgbe1
ixgbe1: discovered 1 transceivers
        transceiver 0 present: yes
        transceiver 0 usable: yes
        Identifier: 'SFP/SFP+/SFP28'
        Extended Identifier: 4
        Connector: 'LC (Lucent Connector)'
        10G+ Ethernet Compliance Codes[0]: '10G Base-SR'
        Ethernet Compliance Codes[0]: '1000BASE-SX'
        Encoding: '64B/66B'
        BR, nominal: '10300 MBd'
        Length 50um OM2: '80 m'
        Length 62.5um OM1: '30 m'
        Length OM3: '300 m'
        Vendor: 'Intel Corp'
        OUI[0]: 0
        OUI[1]: 27
        OUI[2]: 33
        Part Number: 'FTLX8571D3BCV-IT'
        Revision: 'A'
        Laser Wavelength: '850 nm'
        Options[0]: 'Rx_LOS implemented'
        Options[1]: 'TX_FAULT implemented'
        Options[2]: 'TX_DISABLE implemented'
        Options[3]: 'RATE_SELECT implemented'
        Serial Number: 'AKR0EQ0'
        Date Code: '110618'
        Extended Options[0]: 'Soft Rate Select Control Implemented'
        Extended Options[1]: 'Soft RATE_SELECT implemented'
        Extended Options[2]: 'Soft RX_LOS implemented'
        Extended Options[3]: 'Soft TX_FAULT implemented'
        Extended Options[4]: 'Soft TX_DISABLE implemented'
        Extended Options[5]: 'Alarm/Warning flags implemented'
        8472 Compliance: 'Rev 10.2'

This allows us to interact with the information in a readable way. Effectively, this dumps out the entire name-value pair set that we construct when parsing data with libsff. There are two additional ways to print this data. The first one, -x, dumps out the data as hex data (kind of like if you run the program xxd). The second option -w writes out the first page, 0xa0, to a file. This allows you to take the raw data with you.

Seeing Transceivers in Topo

The next step with all this work is to expose the transceivers as part of the system topology in the fault management architecture (FMA). This is useful for a few reasons:

  1. It allows us to see what devices are present in the same snapshot as other devices like disks, CPUs, DIMMs, etc.
  2. FMA’s topology is a natural place for us to expose sensors.
  3. If a device is in topology, then we can generate error reports and faults against those devices.

Basically, being visible in the topology allows us to integrate it more fully into the system and makes it easy for various monitoring and inventory tools in the system to see these devices without having to make them aware of the underlying ways of getting data.

The topology information is organized as a tree. When we encounter hardware that we believe is a networking device (because its PCI class indicates it is), then we ask the kernel about how many transceivers it supports. For each transceiver, we create a port node under the NIC whose type indicates that it is intended for SFF devices.

When a transceiver is present, then we will place a transceiver node under the port. This node has two different groups of properties. The first is generic to all transceivers, which is where we indicate whether or not the hardware can use the transceiver. The second group are properties that we derive from the SFF specifications about the transceiver’s manufacturing data. This includes the vendor, part number, serial number, etc. The following block of text shows three different nodes: the NIC, the port, and the transceiver:

hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1
    label             string    MB
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0
    ASRU              fmri      dev:////pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: io                             version: 1   stability: Private/Private
    dev               string    /pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
    driver            string    ixgbe
    module            fmri      mod:///mod-name=ixgbe/mod-id=242
  group: pci                            version: 1   stability: Private/Private
    device-id         string    10fb
    extended-capabilities string    pciexdev
    class-code        string    20000
    vendor-id         string    8086
    assigned-addresses uint32[]  [ 2197946640 0 3750756352 0 1048576 2164392216 0 57344 0 32 2197946656 0 3753902080 0 16384 ]

hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: port                           version: 1   stability: Private/Private
    type              string    sff

hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
    FRU               fmri      hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X9SCL-X9SCM
    chassis-id        string    0123456789
    server-id         string    ivy
  group: transceiver                    version: 1   stability: Private/Private
    type              string    sff
    usable            string    true
  group: sff-transceiver                version: 1   stability: Private/Private
    vendor            string    Intel Corp
    part-number       string    FTLX8571D3BCV-IT
    revision          string    A
    serial-number     string    AKR0EQ0

If you plug in a transceiver dedicated to Fibre Channel, then we’ll properly note that we can’t use the transceiver by setting the usable property to false. The following is an example of the transceiver node in that case:

hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
    FRU               fmri      hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    X10SLM+-LN4F
    chassis-id        string    0123456789
    server-id         string    haswell
  group: transceiver                    version: 1   stability: Private/Private
    type              string    sff
    usable            string    false

Further Reading

If you’d like to read more on this, there are a couple of different places that I can send you.

For more on the SFF standards, there’s:

If you’re interested in the illumos implementation, there’s:

Looking Ahead

If you have a favorite NIC that uses SFP-based transceivers and it isn’t supported, reach out and we’ll see what we can do. If you’d find it interesting to work on exposing more of the sensor information present in the SFPs, then we’d be happy to further mentor someone there. Once these pieces are exposed in topology, it could also make sense to wire up the FRU monitor to watch for temperature thresholds, voltage drops, or device faults.

Up next, we’ll talk about understanding the topology of USB devices.

Posted on September 12, 2019 at 9:27 am by rm
In: Miscellaneous

A Tale of Two LEDs

It was the brightest of LEDs, it was the darkest of LEDs, it was the age of data links, it was the age of AHCI enclosure services, …

Today, I’d like to talk about two aspects of a project that I worked on a little while back under the aegis of RFD 89 Project Tiresias. This project covered improving the infrastructure for how we gathered and used information about various components in the system. So, let’s talk about LEDs for a moment.

LEDs are strewn across systems and show up on disks, networking cards, or just to tell us the system is powered on. In many cases, we rely on the blinking lights of a NIC, switch, hard drive, or another component to see that data is flowing. The activity LED is a mainstay of many devices. However, there’s another reason that we want to be able to control the LEDs: for identification purposes. If you have a rack of servers and you’re trying to make sure you pull the right networking cable, it can be helpful to be able to turn an LED on, off, or blink it with a regular pattern. So without further ado, let’s talk about how we control LEDs for network interface cards (NICs or data links) and a class of SATA hard drives.

NIC LEDs

The first class of devices that I’d like to talk about are networking cards. Almost every NIC that you can plug an Ethernet cable or a copper/fiber-optic transceiver into has at least two LEDs for each port: an activity LED, which indicates that traffic is flowing over the link, and a link LED, which indicates that the link is actually up. Sometimes different colors are used to indicate different speeds. For example, a 25 GbE capable device may use an orange color to indicate that the link is operating at 10 GbE and a green one to indicate that it is operating at 25 GbE.

The first challenge we have in the operating system is to figure out how to actually tell the device to cause its LEDs to behave in a specific way. Unfortunately, practically every NIC (or family of NICs) has its own unique way of doing this and well, just about everything else. This is the operating system’s curse and one of its primary responsibilities — designing abstractions for device drivers that make it easy to enable new hardware and take advantage of unique features that different devices offer while minimizing the amount of work that is required to do so.

In illumos, there is a framework for networking device drivers called mac. The mac framework is named after the device driver that implements it, mac(9E). Sometimes you’ll hear it called by an alternative name: ‘Generic LAN Driver version 3’ (GLDv3). The mac framework uses the concept of capabilities, which represent different features that hardware offers. For example, this includes such things as checksum offload, TCP segmentation offload (TSO), and more. Each capability can have arbitrary data associated with it that describes more information about the capability itself.

For controlling the LEDs, there’s a new capability called MAC_CAPAB_LED. With the capability, a driver has to fill in three pieces of information:

  1. A set of flags to indicate optional behavior. Currently none are defined, but this is here for future expansion.
  2. A set of supported modes that the LED can be set to. This includes:
    • MAC_LED_ON which indicates that the LED should be turned on solidly.
    • MAC_LED_OFF which indicates that the LED should be turned off.
    • MAC_LED_IDENT which indicates that the LED should be set in a way to identify the device. Generally this means that it should blink. When this can use a different color, that’s even better.
  3. A function that the operating system should call to set the LED state.

The structure looks something like:

typedef struct mac_capab_led {
    uint_t mcl_flags;
    mac_led_mode_t mcl_modes;
    int (*mcl_set)(void *driver, mac_led_mode_t mode, uint_t flags);
} mac_capab_led_t;

Basically, when the operating system wants to change the LED state, it’ll end up calling the mcl_set function to set the new mode. When the LED state should be changed back to normal, a fourth state is passed: MAC_LED_DEFAULT. The operating system guarantees that it will never call this function in parallel for the driver to try to simplify the programming model, though the driver will likely have I/O ongoing.
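To show how a driver might plug into this, here is a minimal, self-contained sketch of a set function and the capability that advertises it. The structure mirrors the one above, but the enum values, the demo_nic_t type, and the register values are stand-ins invented for this example so that it compiles on its own. A real driver would use the definitions from the mac framework and program its own hardware registers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles on its own; the values are assumed. */
typedef enum mac_led_mode {
	MAC_LED_DEFAULT = 0,
	MAC_LED_OFF = 1 << 0,
	MAC_LED_IDENT = 1 << 1,
	MAC_LED_ON = 1 << 2
} mac_led_mode_t;

typedef struct mac_capab_led {
	unsigned int mcl_flags;
	mac_led_mode_t mcl_modes;
	int (*mcl_set)(void *driver, mac_led_mode_t mode, unsigned int flags);
} mac_capab_led_t;

/* Hypothetical driver state: the last value written to an LED register. */
typedef struct demo_nic {
	unsigned int dn_led_reg;
} demo_nic_t;

static int
demo_led_set(void *driver, mac_led_mode_t mode, unsigned int flags)
{
	demo_nic_t *nic = driver;

	assert(nic != NULL);
	if (flags != 0)
		return (EINVAL);	/* no flags are defined yet */

	switch (mode) {
	case MAC_LED_DEFAULT:
		nic->dn_led_reg = 0x0;	/* hand control back to hardware */
		break;
	case MAC_LED_OFF:
		nic->dn_led_reg = 0x1;
		break;
	case MAC_LED_ON:
		nic->dn_led_reg = 0x2;
		break;
	case MAC_LED_IDENT:
		nic->dn_led_reg = 0x3;	/* blink */
		break;
	default:
		return (ENOTSUP);
	}
	return (0);
}

static const mac_capab_led_t demo_led_capab = {
	.mcl_flags = 0,
	.mcl_modes = MAC_LED_OFF | MAC_LED_ON | MAC_LED_IDENT,
	.mcl_set = demo_led_set
};
```

Because the framework never calls the set function in parallel, the sketch gets away without taking a lock around the register write. A real driver would still need to consider how this interacts with its own I/O paths.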

Some devices, such as older Intel client parts driven by the e1000g driver, don’t actually have blink functionality. As such, the driver today emulates that, though it would be good to move that into the broader mac framework when we need to do that for another driver.

The following devices and drivers currently have support for this functionality:

Plumbing it up in user land

With all of the above hardware support, there’s currently a private utility in illumos to control these called dlled that can be found at /usr/lib/dl/dlled. Now, you might ask: why is the program hiding in /usr/lib/dl? The main reason is that we’re still not sure what the right interface for controlling this should be. We should likely integrate it into the dladm(1M) command and allow the LEDs to be controlled through the Fault Management Architecture like we do with other LEDs. However, until we figure out exactly what we want, this gives us something to experiment with.

If you run it on a system, you’ll see something like:

# /usr/lib/dl/dlled
LINK                 ACTIVE       SUPPORTED
igb0                 default      default,off,on,ident
igb1                 default      default,off,on,ident
igb3                 default      default,off,on,ident
igb2                 default      default,off,on,ident

From here, we can use the -s option to then control and change the state. If you set the state, it should persist across anyone plugging in or removing a cable in the device, but nothing that’s set will persist across a reboot of the system. If we set igb0 to ident mode, then we’ll see that the current state is updated:

# /usr/lib/dl/dlled -s ident igb0
# /usr/lib/dl/dlled
LINK                 ACTIVE       SUPPORTED
igb0                 ident        default,off,on,ident
igb1                 default      default,off,on,ident
igb3                 default      default,off,on,ident
igb2                 default      default,off,on,ident

While we have a video to demo this in action, for the moment let’s instead talk about another class of LEDs.

AHCI LEDs

Many systems have an AHCI controller built into their primary chipset. The AHCI controller is used to attach SATA hard disk drives (HDDs) and solid state drives (SSDs). AHCI stands for ‘Advanced Host Controller Interface’. It is a specification that describes how to discover attached devices and send commands to HDDs and SSDs.

Now, you may be saying to yourself, I’ve seen a hard drive or an SSD, but I don’t recall seeing any LEDs on them. And that’s right. Unlike NICs, which have the LEDs built in, the LEDs aren’t built into the drives, but into enclosures — the bays on the system that house drives.

LED Modes

When dealing with hard drives, there have historically been four different things that the system wants to communicate:

  1. Whether the drive in the bay has a fault (it is no longer working) or is OK.
  2. That the drive in the bay is ‘OK to remove’.
  3. To identify a specific bay by blinking the LED.
  4. To show that there is activity to the device in the bay.

Typically, the first three items share a single LED, while a second LED is used to drive activity. A side effect of this is that somewhere in either hardware or firmware there is a hierarchy for the first three tasks. For example, if nothing else is going on then the drive’s LED may be a solid green color. If the drive has been faulted, it may instead turn to an amber color. Blinking the LED may be in the amber color or it may be something else entirely.
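To make the idea of a hierarchy concrete, here is a small sketch of how firmware might pick what to show on the shared LED. The precedence order (identify over fault over OK to remove) and the colors are assumptions chosen purely for illustration. Real enclosures vary and all of the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical colors and patterns; real enclosures vary. */
typedef enum bay_led {
	BAY_LED_GREEN_SOLID,	/* nothing else going on */
	BAY_LED_AMBER_SOLID,	/* drive has been faulted */
	BAY_LED_AMBER_BLINK,	/* identify this bay */
	BAY_LED_BLUE_SOLID	/* OK to remove */
} bay_led_t;

typedef struct bay_request {
	bool br_ident;
	bool br_fault;
	bool br_ok2rm;
} bay_request_t;

/* Assumed precedence: identify, then fault, then OK to remove. */
static bay_led_t
bay_led_resolve(const bay_request_t *br)
{
	assert(br != NULL);

	if (br->br_ident)
		return (BAY_LED_AMBER_BLINK);
	if (br->br_fault)
		return (BAY_LED_AMBER_SOLID);
	if (br->br_ok2rm)
		return (BAY_LED_BLUE_SOLID);
	return (BAY_LED_GREEN_SOLID);
}
```

The point is only that several requests collapse into a single visible state, so the order of the checks decides which request wins when more than one is active.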

In the majority of cases, the activity LED isn’t controllable by software; however, the other LED can be. This means that when say ZFS decides a drive is bad, it can eventually lead to the operating system turning on an amber LED to indicate where a problem is.

While we have mentioned the ‘OK to remove’ LED, that has become less and less commonly used and implemented. It’s listed here for completeness, but may not actually be something that you can control or even see on some systems.

Enclosure Management

As we mentioned earlier, the LEDs are part of the bays and not part of the drives. This has led to a suite of different standards and means for enclosure management. The one that applies will actually vary depending on the wiring of the system. For example, while the same 2.5 inch drive bay can be used for SATA, SAS, and NVMe devices, the way that the enclosure is controlled varies a lot based on the disk controller and the system’s broader design.

In systems that are using SAS controllers (regardless of whether one uses SATA or SAS drives), there is a specification that describes how to enumerate the different bays that devices can show up in, figure out which drives are in which bays, and control the LEDs and get other information. In a SAS world, this is called SCSI enclosure services or SES. When using an all NVMe system, SES won’t exist. Similarly, when you’re using SATA devices and an AHCI controller, then something different ends up happening. In fact, you’re not even guaranteed that there will be a SES device even in a SAS world!

The AHCI specification provides optional enclosure management features. If you want to follow along in the AHCI specification, open up to chapter 12, Enclosure Management. There are three primary items related to enclosure management:

This region of memory, which is used to send and receive messages, exists because the standard allows the system to participate in one of four different messaging schemes. Messages specific to the underlying scheme are placed in the region and, if the scheme supports replies, the replies are placed there as well.

Now, the specification supports capability bits for up to four different schemes:

  1. SAF-TE protocol
  2. SES-2 protocol
  3. SGPIO
  4. The AHCI specification’s LED format

Despite all of the different forms mentioned above, we’ve yet to discover a system that indicates support for something other than the LED format. Though we have found one or two systems that incorrectly say they support the LED format and don’t. As a result, the LED format is the only one that we currently support.

The message format is pretty straightforward. You indicate which port on the controller you care about and then indicate whether you want to enable the identification LED or the fault LED.
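As a sketch of what building such a message might look like, the code below packs the header and the LED message into 32-bit words. The bit positions are assumptions based on my reading of chapter 12 of the AHCI specification: a header with the message type in bits 27:24, the data size in bits 23:16, and the message size in bits 15:8, followed by a message with the controller port in byte 0, the port multiplier port in byte 1, and a 16-bit value with three-bit fields for activity (2:0), identify (5:3), and fault (8:6). Verify these against the specification before relying on them. The names are hypothetical and this is not the ahci driver’s code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All values below are assumed from the AHCI specification, chapter 12. */
#define	EM_MTYPE_LED		0x0u	/* message type: LED */
#define	EM_LED_MSG_SIZE		4u	/* an LED message is four bytes */
#define	EM_LED_OFF		0x0u
#define	EM_LED_ON		0x1u
#define	EM_LED_SHIFT_IDENT	3
#define	EM_LED_SHIFT_FAULT	6

/* Header: type in bits 27:24, data size in 23:16, message size in 15:8. */
static uint32_t
em_led_header(void)
{
	return ((EM_MTYPE_LED << 24) | (0u << 16) | (EM_LED_MSG_SIZE << 8));
}

/*
 * Message: port in byte 0, port multiplier port in byte 1 (left as zero),
 * LED value in the upper 16 bits. The activity field is left as zero here
 * to keep the sketch simple.
 */
static uint32_t
em_led_message(uint8_t port, bool ident, bool fault)
{
	uint32_t val = 0;

	assert(port < 32);	/* AHCI controllers have at most 32 ports */
	val |= (ident ? EM_LED_ON : EM_LED_OFF) << EM_LED_SHIFT_IDENT;
	val |= (fault ? EM_LED_ON : EM_LED_OFF) << EM_LED_SHIFT_FAULT;
	return ((uint32_t)port | (val << 16));
}
```

The two words would then be written to the enclosure management message buffer, header first, before telling the controller to transmit the message.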

Ultimately though, whether or not any of this works at all depends on the manufacturer of the system and what they wire up. Hopefully, if they advertise the LED format support, then we’ll be able to properly drive it.

Wiring it up

To make all of this work, the first thing we had to do was to add support to the ahci(7D) driver to make it detect whether or not the controller had enclosure services and provide a means to control it. The state and capabilities are all managed by a pair of private ioctl(2) commands. The first allows one to get information about what’s supported, the number of ports, and the current state of each port. The second sets the state of a specific port.

Hardware has the constraint that only a single message can be sent at any given time. Therefore the driver serializes these requests and centralizes all the activity in a task queue. In that taskq, we’ll then attempt to send and receive the messages that hardware expects. You can see the details of creating the message and writing it to hardware in the ahci_em_set_led() function.

Don’t worry. You don’t have to try and write ioctls yourself. Similar to the dlled private command, there’s a private command to manipulate this. Let’s take a look at it:

# /usr/lib/ahci/ahciem
PORT                 ACTIVE       SUPPORTED
ahci0/0              default      ident,fault,default
ahci0/1              default      ident,fault,default
ahci0/2              default      ident,fault,default
ahci0/3              default      ident,fault,default
ahci0/4              default      ident,fault,default
ahci0/5              default      ident,fault,default

You can set the state of an individual port here in the same way as with the dlled command. The ‘default’ state is the port’s normal behavior, with neither the ident nor the fault LED turned on.

Mapping LEDs to bays and disks

Now, there’s still a problem. We haven’t actually solved the problem of mapping up the ports described above to actual disks. Because this is being driven by the AHCI controller, it only provides us the means of toggling this on its logical ports. While we can know what disk is attached to what port, it doesn’t actually tell us where that disk physically is.

In illumos, we have the ability to construct a per-platform map that defines additional information about the topology of the system. It helps us answer the question of what is where. My colleague Jordan Hendricks was the one who solved this problem for us. She added support for us to relate specific bays that we declare in a per-platform topology map to the corresponding port. This allows us to answer the mapping question as well as control the LEDs through the topology like we do for other parts of the system. It’s great work that takes what was a building block and actually makes it useful in the broader system. You can find a lot of examples of it in the illumos bug tracker and you should read the code itself!

See it in action

When I had a working demo of the ability to control NIC LEDs, I ended up recording an impromptu demo with the help of my occasional partner in crime, Alex Wilson.

What’s Next?

If you’re interested in adding support for NIC LEDs to another device that you have, get in touch with the illumos community on IRC in #illumos on Freenode or a mailing list and I or someone else will help you out and see what we can do. Ultimately, both of these changes are building blocks that can be built on top of. Jordan’s work is an example of that in the AHCI case. There’s more that can be built on top of the basic NIC functionality as well.

Next time, we’ll touch on another piece of related work: understanding the state of copper and fiber-optic transceivers for different NICs and reading their EEPROMs.

Posted on September 6, 2019 at 10:44 am by rm
In: Miscellaneous

CPU and PCH Temperature Sensors in illumos

A while back, I did a bit of work that I’ve been meaning to come back to and write about. The first piece of it is all about making it easier to see the temperature that different parts of the system are working with. In particular, I wanted to make sure that I could understand the temperature of the following different things:

While on some servers this data is available via IPMI, that doesn’t help you if you’re running a desktop or a laptop. Also, if the OS can know about this as a first class item, why bother going through IPMI to get at it? This is especially true as IPMI sometimes summarizes all of the different readings into a single one.

Seeing is Believing

Now, with these present, you can ask fmtopo to see the sensor values. While fmtopo itself isn’t a great user interface, it’s a great building block to centralize all of the different sensor information in the system. From here, we can build tooling on top of the fault management architecture (FMA) to better see and understand the different sensors. FMA will abstract all of the different sources. Some of them may be delivered by the kernel while others may be delivered by user land. With that in mind, let’s look at what this looks like on a small Kaby Lake NUC:

[root@estel ~]# /usr/lib/fm/fmd/fmtopo -V *sensor=temp
TIME                 UUID
Aug 12 20:44:08 88c3752d-53c2-ea3a-c787-cbeff0578cd0

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    43.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    46.000000

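While it’s a little bit opaque, you might be able to see that we have four different temperature sensors here: one for each of the two cores (core=0 and core=1), one for the processor as a whole (chip=0), and one for the chipset (chipset=0).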

On an AMD system, the output is similar, but the sensor exists on a slightly different granularity than a per-core basis. We’ll come back to that a bit later.

Pieces of the Puzzle

To make all of this work, there are a few different pieces that we put together:

  1. Kernel device drivers to cover the Intel and AMD CPU sensors, and the Intel PCH
  2. A new evolving, standardized way to export simple temperature sensors
  3. Support in FMA’s topology code to look for such sensors and attach them

We’ll work through each of these different pieces in turn. The first part of this was to write three new device drivers, one to cover each of the different cases that we cared about.

coretemp driver

The first of the drivers is the coretemp driver. This uses the temperature interface that was introduced on Intel Core family processors. It allows an operating system to read an MSR (Model Specific Register) to determine what the temperature is. The support for this is indicated by a bit in one of the CPUID registers and exists on almost every Intel CPU that has come out since the Intel Core Duo.

Around the time of Intel’s Haswell processor (approximately), Intel added another CPUID bit and MSR that indicates what the temperature is on each socket.

The coretemp driver has two interesting dynamics and problems:

  1. Because of the use of the rdmsr instruction, which reads a model-specific register, you can only read the temperature for the CPU that you’re currently executing on. This isn’t too hard to arrange in the kernel, but it means that when we read the temperature we’ll need to organize what’s called a ‘cross-call’. You can think of a cross-call as a remote procedure call, except that the target is a specific CPU in the system and not a remote host.

  2. Intel doesn’t actually directly encode the temperature in the MSRs. Technically, the value we read represents an offset from the processor’s maximum junction temperature, often abbreviated Tj Max. Modern Intel processors provide a means for us to read this directly via an MSR. However, older ones, unfortunately, do not. On such older processors, the Tj Max actually varies based on not just the processor family, but also the brand, so different processors running at different frequencies have different values. Some of this information can be found in various datasheets, but for the moment, we’ve only enabled this driver for CPUs that we can guarantee the Tj Max value. If you have an older CPU and you’d like to see if we could manually enable it, please reach out.
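
To make the offset arithmetic concrete, here is a minimal sketch of turning the two raw MSR values into a temperature. It is not the coretemp driver’s code. The function name and macros are made up for illustration, and the bit positions are my reading of Intel’s documentation for IA32_THERM_STATUS and MSR_TEMPERATURE_TARGET, so check them against the manual before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed bit layouts (verify against Intel's documentation):
 *   IA32_THERM_STATUS:      bit 31 = reading valid, bits 22:16 = readout
 *   MSR_TEMPERATURE_TARGET: bits 23:16 = Tj Max in degrees Celsius
 */
#define THERM_STATUS_VALID(r)   (((r) >> 31) & 0x1)
#define THERM_STATUS_READOUT(r) (((r) >> 16) & 0x7f)
#define TEMP_TARGET_TJMAX(r)    (((r) >> 16) & 0xff)

/*
 * Convert raw MSR values into degrees Celsius. The readout is the number
 * of degrees below Tj Max, so the temperature is Tj Max minus the readout.
 * Returns false if the hardware did not mark the reading as valid.
 */
static bool
coretemp_decode(uint64_t therm_status, uint64_t temp_target, int *celsius)
{
    if (!THERM_STATUS_VALID(therm_status))
        return (false);

    *celsius = (int)TEMP_TARGET_TJMAX(temp_target) -
        (int)THERM_STATUS_READOUT(therm_status);
    return (true);
}
```

With a Tj Max of 100 and a readout of 57, this yields 43 degrees. On processors where Tj Max cannot be read from an MSR, the second argument would have to come from a table instead, which is the part described above as hard to get right.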

pchtemp driver

The pchtemp driver is a temperature sensor driver for the Intel platform controller hub (PCH). The driver supports most Intel CPUs since the Haswell generation, as the format of the sensor changed starting with the Haswell-era chipsets.

This driver is much simpler than the other ones. The PCH exposes a dedicated pseudo-PCI device for this purpose. The pchtemp driver simply attaches to that and reads the temperature when required. Unlike the coretemp driver, the offset in temperature is the same across all of the currently supported platforms so we don’t need to encode anything special there like we do for the coretemp driver.

amdf17nbdf driver

The amdf17nbdf driver is a bit of a mouthful. It stands for the AMD Family 17h North Bridge and Data Fabric driver. Let’s take that apart for a moment. On x86, CPUs are broken into different families to represent different generations. Currently, all of AMD’s Ryzen/EPYC processors that are based on the Zen microarchitecture are grouped under Family 17h. The North Bridge is a part of the CPU that is used to interface with various I/O components on the system. The Data Fabric is a part of AMD CPUs which connects CPUs, I/O devices, and DRAM.

On AMD Zen family processors, the temperature sensor exists on a per-die basis. Each die is a group of cores. The physical chip has a number of such units, each of which in the original AMD Naples/Zen 1 processor has up to 8 cores. See the illumos cpuid.c big theory statement for the details on how the CPU is constructed and this terminology. Effectively, you can think of it as there are a number of different temperature sensors, one for each discrete group of cores.

To talk to the temperature sensor, we need to send a message on what AMD calls the ‘system management network’ (SMN). The SMN is used to connect different management devices together and serves a number of purposes beyond just temperature sensors. The operating system has a dedicated means of communicating and making requests over the SMN by using the corresponding north bridge, which is exposed as a PCI device.

Just as the coretemp driver needs to issue a rdmsr instruction on the core whose temperature it wants, we need to target the right hardware here. Each die has its own north bridge and therefore we need to use the right instance to talk to the right group of CPUs.

The wrinkle with all of this is that the north bridge on its own doesn’t give us enough information to map it back to a group of cores that an operator sees. This is critical, since if you can’t tell which cores you’re getting the temperature reading for, it immediately becomes much less useful.

This is where the data fabric devices come into play. The data fabric devices exist at a rather specific PCI bus, device, and function. They all are always defined to be on PCI bus 0. The data fabric device for the first die is always defined to be at device 18h. The next one is at 19h, and so on. This means that we have a deterministic way to map between a data fabric device and a group of cores. Now, that’s not enough on its own. While we know the data fabric, we don’t know how to map that to the north bridge.

Each north bridge in the system is always defined to be on its own PCI bus. The device we care about is always going to be device and function 0. The data fabric happens to have a register which defines for us the starting PCI bus for its corresponding north bridge. This means that we have a path now to get to the temperature sensor. For a given die, we can find the corresponding data fabric. From the data fabric, we can find the corresponding north bridge. And finally, from the north bridge, we can find the corresponding SMN (system management network) that we can communicate with.
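
The chain from die to north bridge can be sketched as a small lookup. This is only an illustration of the mapping described above, not the amdf17nbdf driver’s code. All of the names are hypothetical, and fake_read_nb_bus() stands in for the real data fabric register read, returning invented bus numbers.

```c
#include <stdint.h>

#define DF_FIRST_DEVICE 0x18    /* data fabric device for die 0, on bus 0 */
#define NB_PCI_DEVICE   0x00    /* the north bridge is device 0 ... */
#define NB_PCI_FUNCTION 0x00    /* ... and function 0 on its own bus */

typedef struct pci_bdf {
    uint8_t bus;
    uint8_t dev;
    uint8_t func;
} pci_bdf_t;

/* Hypothetical hook: read the north bridge starting bus from a DF device. */
typedef uint8_t (*df_read_nb_bus_f)(uint8_t df_dev);

/*
 * Stand-in for the real register read. The bus numbers it returns are
 * invented for this example; real values come from the hardware.
 */
static uint8_t
fake_read_nb_bus(uint8_t df_dev)
{
    return ((uint8_t)((df_dev - DF_FIRST_DEVICE) * 0x20));
}

/* Die -> data fabric device -> north bridge bus/device/function. */
static pci_bdf_t
die_to_northbridge(unsigned int die, df_read_nb_bus_f read_nb_bus)
{
    uint8_t df_dev = (uint8_t)(DF_FIRST_DEVICE + die);
    pci_bdf_t nb;

    nb.bus = read_nb_bus(df_dev);
    nb.dev = NB_PCI_DEVICE;
    nb.func = NB_PCI_FUNCTION;
    return (nb);
}
```

Once the north bridge is known, SMN requests for that die’s temperature sensor go through it.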

With all that, there’s one more wrinkle. On some processors, most notably the Ryzen and ThreadRipper families, the temperature that we read has an offset encoded with it. Unfortunately, these offsets have only been documented in blog posts by AMD and not in the formal documentation. Still, it’s not too hard to take this into account once official documentation becomes available.

While our original focus was on adding support for AMD’s most recent processors, if you have an older AMD processor and are interested in wiring up the temperature sensors on it, please get in touch and we can work together to implement something.

sys/sensors.h

Now that we have drivers that know how to read this information, the next problem we need to tackle is how do we expose this information to user land. In illumos, the most common way is often some kind of structured data that can be retrieved by an ioctl on a character device, or some other mechanism like a kernel statistic.

After talking with some folks, we put together a starting point for a way for the kernel to expose sets of statistics and created a new header file in illumos called sys/sensors.h. This header file isn’t currently shipped and is only used by software in illumos itself. This makes it easy for us to experiment with the API and change it without worrying about breaking user software. Right now, each of the above drivers implements a specific type of character device that implements the same, generic interface.

The current interface supports two different types of commands. The first, SENSOR_IOCTL_TYPE, answers the question of what kind of sensor is this. Right now, the only valid answer is SENSOR_KIND_TEMPERATURE. The idea is that if we have other sensors, say voltage, current, or something else, we could return a different kind. Each kind, in turn, promises to implement the same interface and information.

For temperature devices, we need to fill in a singular structure which is used to retrieve the temperature. This structure currently looks something like:

typedef struct sensor_ioctl_temperature {
        uint32_t        sit_unit;
        int32_t         sit_gran;
        int64_t         sit_temp;
} sensor_ioctl_temperature_t;

This is kind of interesting and incorporates some ideas from Joshua Clulow and Alex Wilson. The sit_unit member is used to describe what unit the temperature is in. For example, it may be in Celsius, Kelvin, or Fahrenheit.

The next two members are a little more interesting. The sit_temp member contains a temperature reading, while the sit_gran member is what’s important in how we interpret that temperature. While many sensors end up communicating to digital consumers using a power of 2 based reading, that’s not always the case. Some sensors may report a reading in units such as 1/10th of a degree. Others may actually report something in a granularity of 2 degrees!

To try and deal with this menagerie, the sit_gran member indicates the number of increments per degree in the sit_temp member. If this was set to 10, then that would mean that sit_temp was in 10ths of a degree and to get the actual value in degrees, one would need to divide by 10. On the other hand, a negative value instead means that we would need to multiply. So, a value of -2 would mean that sit_temp was in units of 2 degrees. To get the actual temperature, you would need to multiply sit_temp by 2.

Now, you may ask why not just have the kernel compute this and have a ones digit and a decimal portion. The main goal is to avoid floating point math in the kernel. For various reasons, this historically has been avoided and we’d rather keep it that way. While this may seem a little weird, it does allow for the driver to do something simpler and lets user land figure out how to transform this into a value that makes semantic sense for it. This gets the kernel out of trying to play the how many digits after the decimal point would you like game.
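
As a sketch of what that user land transformation might look like, the following function applies the sit_gran rules described above to produce a floating point value. The structure is repeated so the example stands alone. The function name is hypothetical, and treating a sit_gran of zero as an error is my assumption rather than something the interface defines.

```c
#include <stdint.h>

typedef struct sensor_ioctl_temperature {
    uint32_t    sit_unit;
    int32_t     sit_gran;
    int64_t     sit_temp;
} sensor_ioctl_temperature_t;

/*
 * Convert a raw reading into degrees of whatever unit sit_unit names.
 * A positive sit_gran is the number of increments per degree, so we
 * divide. A negative sit_gran means each increment is that many degrees,
 * so we multiply. Returns 0 on success, or -1 if sit_gran is zero.
 */
static int
sensor_temp_to_degrees(const sensor_ioctl_temperature_t *sit, double *out)
{
    if (sit->sit_gran == 0)
        return (-1);

    if (sit->sit_gran > 0) {
        *out = (double)sit->sit_temp / (double)sit->sit_gran;
    } else {
        /* Widen before negating so INT32_MIN cannot overflow. */
        *out = (double)sit->sit_temp * (double)(-(int64_t)sit->sit_gran);
    }
    return (0);
}
```

A reading of 435 with a sit_gran of 10 becomes 43.5 degrees, while a reading of 21 with a sit_gran of -2 becomes 42 degrees.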

Exposing Devices

The second part of this bit of kernel support is to try and provide a uniform and easy way to see these different things under /dev in the file system. In illumos, when someone creates a minor node in a device driver, you have to specify a type of device and a name for the minor node. While most of the devices in the system use a standard type, we added a new class of types for sensors that translate roughly into where you’ll find them.

So, for example, the CPU drivers use the class that has the string "ddi_sensor:temperature:cpu" (usually as the macro DDI_NT_SENSOR_TEMP_CPU). This is used to indicate that it is a temperature sensor for CPUs. The coretemp driver then creates different nodes for each core and socket. For example, on a system with an Intel E3-1270v3 (Haswell), we see the following devices:

[root@haswell ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/chip0
/dev/sensors/temperature/cpu/chip0.core0
/dev/sensors/temperature/cpu/chip0.core1
/dev/sensors/temperature/cpu/chip0.core2
/dev/sensors/temperature/cpu/chip0.core3
/dev/sensors/temperature/pch
/dev/sensors/temperature/pch/ts.0

On the other hand on an AMD EPYC system with two AMD EPYC 7601 processors, we see:

[root@odyssey ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/procnode.0
/dev/sensors/temperature/cpu/procnode.1
/dev/sensors/temperature/cpu/procnode.2
/dev/sensors/temperature/cpu/procnode.3
/dev/sensors/temperature/cpu/procnode.4
/dev/sensors/temperature/cpu/procnode.5
/dev/sensors/temperature/cpu/procnode.6
/dev/sensors/temperature/cpu/procnode.7

The nice thing about the current scheme is that anything of type ddi_sensor will have a directory hierarchy created for it based on the different interspersed : characters. This makes it very easy for us to experiment with different kinds of sensors without having to go through too much effort. That said, this is all still evolving, so there’s no promise that this won’t change. Please don’t write code that relies on this. If you do, it’ll likely break!
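
To illustrate how a node type turns into a path, here is a rough sketch that maps a type string and minor node name onto the layout shown in the listings above. It only mimics the observable result. The function name is made up, and this is not how the system’s own /dev management code is written.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the /dev path for a sensor minor node: strip the "ddi_sensor:"
 * prefix, turn each remaining ':' in the type into a '/', and append the
 * minor node name. Returns 0 on success, or -1 if the type is not a
 * sensor type or the buffer is too small.
 */
static int
sensor_dev_path(const char *node_type, const char *minor, char *buf,
    size_t len)
{
    const char *prefix = "ddi_sensor:";
    const char *root = "/dev/sensors/";
    size_t plen = strlen(prefix);
    size_t start, end, i;
    int n;

    if (strncmp(node_type, prefix, plen) != 0)
        return (-1);

    n = snprintf(buf, len, "%s%s/%s", root, node_type + plen, minor);
    if (n < 0 || (size_t)n >= len)
        return (-1);

    /* Only rewrite the portion that came from the node type. */
    start = strlen(root);
    end = start + strlen(node_type + plen);
    for (i = start; i < end; i++) {
        if (buf[i] == ':')
            buf[i] = '/';
    }
    return (0);
}
```

Given "ddi_sensor:temperature:cpu" and a minor name of "chip0.core0", this produces /dev/sensors/temperature/cpu/chip0.core0, matching the Haswell listing.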

FMA Topology

The last piece of this was to wire it up in FMA’s topology. To do that, I did a few different pieces. The first was to make it easy to add a node to the topology that represents a sensor backed by this kernel interface. There’s one generic implementation of that which is parametrized by the path.

With that, I first modified the CPU enumerator. The logic will use a core sensor if available, but can also fall back to a processor-node sensor if it exists. Next, I added a new piece of enumeration, which was to represent the chipset. If we have a temperature sensor, then we’ll enumerate the chipset under the motherboard. While this is just the first item there, I suspect we’ll add more over time as we try to properly capture more information about what it’s doing, the firmware revisions that are a part of it, and more.

This piece is, in some ways, the simplest of them all. It just builds on everything else that was already built up. FMA already had a notion of a sensor (which is used for everything from disk temperature to the fan RPM), so this was just a simple matter of wiring things up.

Now, we have all of the different pieces that made the original example of the CPU and PCH temperature sensor work.

Further Reading

If you’re interested in learning more about this, you can find more information in the following resources:

In addition, you can find theory statements that describe the purpose of the different drivers and other pieces that were discussed earlier and their manual pages:

Finally, if you want to see the actual original commits that integrated these changes, then you can find the following from illumos-gate:

commit dc90e12310982077796c5117ebfe92ee04b370a3
Author: Robert Mustacchi <rm@joyent.com>
Date:   Wed Apr 24 03:05:13 2019 +0000

    11273 Want Intel PCH temperature sensor
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Mike Zeller <mike.zeller@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Gergő Doma <domag02@gmail.com>
    Reviewed by: Paul Winder <Paul.Winder@wdc.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

commit f2dbfd322ec9cd157a6e2cd8a53569e718a4b0af
Author: Robert Mustacchi <rm@joyent.com>
Date:   Sun Jun 2 14:55:56 2019 +0000

    11184 Want CPU Temperature Sensors
    11185 i86pc chip module should be smatch clean
    Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
    Reviewed by: Jordan Hendricks <jordan.hendricks@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Garrett D'Amore <garrett@damore.org>

Looking Ahead

While this is focused on a few useful temperature sensors, there are more that we’d like to add. If you have a favorite sensor that you’d like to see, please reach out to me or the broader illumos community and we’d be happy to take a look at it.

Another avenue that we’re looking to explore is having a standardized sensor driver such that a device driver doesn’t necessarily have to have its own character device or implementation of the ioctl handling.

Finally, I’d like to thank all those who helped me as we discussed different aspects of the API, reviewed the work, and tested it. None of this happens in a vacuum.

Posted on August 14, 2019 at 7:44 am by rm
In: Miscellaneous

Turtles on the Wire: Understanding how the OS uses the Modern NIC

The modern networking card (NIC) has evolved quite a bit from the simple Ethernet cards of yesteryear. As such, the way that the operating system uses them has had to evolve in tandem. Gone are the simple 10 Mbit/s copper or (BNC) devices. Instead, 1 Gb/s is commonplace in the home, 10 Gb/s rules the server, and you can buy cards that come in speeds like 25 Gb/s, 40 Gb/s, and even 100 Gb/s! Speed isn’t the only thing that’s changed; there’s also been a big push to virtualization. What used to be one app on one server transformed into a couple of apps on a few Hardware Virtual Machines, and these days can be hundreds of containers all on a single server.

For this entry, we’re going to be focusing on how the Operating System sees NICs, what abstractions they provide together, how things have changed to deal with the increased tenancy and performance demands, and then finally where we’re going next with all this. We’re going to focus on where scalability problems have come about and talk about how they’ve been solved.

The Simple NIC

While this is a broad generalization, the simplest way to think of a networking card is that it has five primary pieces:

  1. A MAC Address that it can use to filter incoming packets.
  2. A ring, or circular buffer, that packets are received into from the network.
  3. A ring, or circular buffer, that packets are sent from to the network.
  4. The ability to generate interrupts.
  5. A way to program all of the above, generally done with PCI memory-mapped registers.

First, let’s talk about rings. Both of these rings are considered a type of circular buffer. With a circular buffer, the valid region of the buffer changes over time. Circular buffers are often used because of the property that they consume a fixed amount of memory and they handle the model of a single producer and a single consumer rather well. One end, the producer, places data in the buffer and moves a head pointer, while another end, the consumer, removes data from the buffer, moving the tail pointer. For the rest of this article, we won’t be using the term circular buffer, but instead ring, which is commonly used in both networking card programming manuals and operating systems to describe a circular buffer used for packet descriptors.
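
As a minimal sketch of the single producer, single consumer model, here is a generic fixed-size ring. It is not tied to any particular NIC. The names and the capacity of 8 are arbitrary choices for the example, and a count field is used to tell a full ring from an empty one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8     /* fixed capacity; arbitrary for this sketch */

typedef struct ring {
    uint32_t r_slots[RING_SIZE];
    size_t   r_head;    /* next slot the producer will write */
    size_t   r_tail;    /* next slot the consumer will read */
    size_t   r_count;   /* number of valid entries */
} ring_t;

/* Producer side: place a value in the ring and advance the head. */
static bool
ring_put(ring_t *r, uint32_t val)
{
    if (r->r_count == RING_SIZE)
        return (false);     /* ring is full */

    r->r_slots[r->r_head] = val;
    r->r_head = (r->r_head + 1) % RING_SIZE;
    r->r_count++;
    return (true);
}

/* Consumer side: remove the oldest value and advance the tail. */
static bool
ring_get(ring_t *r, uint32_t *val)
{
    if (r->r_count == 0)
        return (false);     /* ring is empty */

    *val = r->r_slots[r->r_tail];
    r->r_tail = (r->r_tail + 1) % RING_SIZE;
    r->r_count--;
    return (true);
}
```

The memory footprint never changes. Only the head and tail move, wrapping back to the start when they reach the end of the array.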

Let’s go into a bit more detail about how this works. Rings are the lifeblood of a networking card. They occupy a fixed chunk of memory in normal system memory and when the hardware wants to access it, it will perform DMA, direct memory access. All the data that comes and goes through it is described by an entry in a ring, called a descriptor. The ring is made of a series of these descriptors. A descriptor is generally made up of three parts:

  1. A buffer address
  2. A buffer length
  3. Metadata about the packet

Consider the receive ring. Each entry in it describes a place that the hardware can place an incoming packet. The buffer in the descriptor says where in main memory to place the packet and the length says how big that buffer is. When a packet arrives, a networking card will generally generate an interrupt, thus letting the operating system know that it should check the ring.

The transmit side is similar, but slightly different in orientation. The OS places descriptors into the ring and then indicates to the networking card that it should transmit those packets. The descriptor says where in memory to find the packet and the length says how long the packet is. The metadata may include information such as whether the packet is broken up into one or more descriptors, the type of data in the packet, or optional features to perform.

To make this more concrete, let’s think of this in the context of receiving packets. Initially, the ring is empty. The OS then fills all of the descriptors with pointers to where the hardware can put packets it receives. The OS doesn’t have to fill the entire ring, but it’s standard practice in most operating systems to do so. For this example, we’ll assume that the OS has filled the entire ring.

Because it’s written to the entire ring, the OS will set its pointer to the start of the buffer. Then, as the hardware receives packets, the hardware will bump its pointer and send the OS an interrupt.

The following ASCII art image describes how the ring might look after the hardware has received two packets and delivered the OS an interrupt:

        +--------------+ <----- OS Pointer
        | Descriptor 0 |
        +--------------+
        | Descriptor 1 |
        +--------------+ <----- Hardware Pointer
        | Descriptor 2 |
        +--------------+
        |     ...      |
        +--------------+
        | Descriptor n |
        +--------------+

When the OS receives the interrupt, it reads where the hardware pointer is and processes those packets between its pointer and the hardware’s. Once it’s done, it doesn’t have to do anything until it prepares those descriptors with fresh buffers. Once it does, it’ll update its pointer by writing to the hardware. For example, if the OS has processed those first two descriptors and then updates the hardware, the ring will look something like:

        +--------------+
        | Descriptor 0 |
        +--------------+
        | Descriptor 1 |
        +--------------+ <----- Hardware Pointer, OS Pointer
        | Descriptor 2 |
        +--------------+
        |     ...      |
        +--------------+
        | Descriptor n |
        +--------------+
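
The receive walk-through above can be sketched in a few lines. The descriptor layout and names here are invented for illustration, since every card defines its own format, and the hardware pointer is simply passed in rather than read from a device register.

```c
#include <stddef.h>
#include <stdint.h>

#define RX_RING_SIZE 4  /* tiny ring so wrap-around is easy to see */

/* The three parts of a descriptor; field widths are made up. */
typedef struct rx_desc {
    uint64_t rd_addr;   /* buffer address the hardware may write to */
    uint16_t rd_len;    /* length of that buffer */
    uint16_t rd_flags;  /* metadata about the packet */
} rx_desc_t;

typedef struct rx_ring {
    rx_desc_t rr_descs[RX_RING_SIZE];
    size_t    rr_os;    /* OS pointer: next descriptor to process */
} rx_ring_t;

/*
 * Called after an interrupt. Walks every descriptor between the OS
 * pointer and the hardware pointer, wrapping at the end of the ring, and
 * returns how many packets were handled. A real driver would pass each
 * packet up the stack and attach a fresh buffer before moving on. Note
 * that equal pointers are treated as "nothing to do" in this sketch.
 */
static size_t
rx_ring_process(rx_ring_t *ring, size_t hw_ptr)
{
    size_t handled = 0;

    while (ring->rr_os != hw_ptr) {
        ring->rr_os = (ring->rr_os + 1) % RX_RING_SIZE;
        handled++;
    }
    return (handled);
}
```

Starting with the OS pointer at descriptor 0 and the hardware pointer at descriptor 2, this handles two packets and leaves both pointers at descriptor 2, which is the state shown in the second diagram.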

When you send packets, it’s similar. The OS fills in descriptors and then notifies the hardware. Once the hardware has sent them out on the wire, it then injects an interrupt and indicates which descriptors it’s written to the network, allowing the OS to free the associated memory.

MAC Address Filters and Promiscuous Mode

The missing piece we haven’t talked about is the filter. Just about every networking card has support for filtering incoming packets, which is done based on the destination MAC address of the packet. By default, this filter is programmed with the MAC address of the card: if the destination MAC address in the Ethernet header doesn’t match the networking card’s MAC address, and it isn’t a broadcast or multicast packet, then the packet will be dropped.

When networking cards want to receive more traffic than that which they’ve been programmed to receive, they’re put into what’s called promiscuous mode, which means that the card will place every packet it receives into the receive ring for the operating system to process, regardless of the destination MAC address.
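
The decision the card makes for each incoming frame can be sketched as a single predicate. This is a simplified model with hypothetical names. It accepts all multicast traffic, whereas real cards typically filter multicast addresses as well.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETHERADDRL 6    /* length of an Ethernet MAC address in bytes */

/*
 * Decide whether a frame with the given destination address should be
 * placed into the receive ring.
 */
static bool
nic_accepts(const uint8_t dst[ETHERADDRL], const uint8_t nic_mac[ETHERADDRL],
    bool promisc)
{
    /* Promiscuous mode bypasses the filter entirely. */
    if (promisc)
        return (true);

    /*
     * The low bit of the first byte marks a group address. It is set for
     * every multicast address and for broadcast (ff:ff:ff:ff:ff:ff).
     */
    if ((dst[0] & 0x01) != 0)
        return (true);

    /* Otherwise the destination must match the card's own address. */
    return (memcmp(dst, nic_mac, ETHERADDRL) == 0);
}
```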

Now, the question that comes to mind, is why do cards have this filtering capability at all? Why does it matter? Why would we only care about a subset of packets on the network?

To answer this, we first have to go back to the world before network switches, when there were only hubs. When a packet came into a hub, it was replicated to every port. This meant that every NIC would end up seeing everything that went out. If they hadn’t filtered by default, then the sheer amount of traffic and interrupts might have overwhelmed many early systems, particularly on larger networks. Nowadays, this isn’t quite as problematic as we use switches, which learn which MAC addresses are behind which switch ports. Of course, there are limits to this, which have motivated a lot of the work around network virtualization, but more on that in a future blog post.

Problem: The Single Busy CPU

In the previous section, we talked about how we had an interrupt that fired whenever packets were successfully received or transmitted. By default, there was only ever a single interrupt used and that interrupt vector usually mapped to exactly one CPU, in other words only one CPU could process the initial stream of packets. In many systems, that CPU would then process that incoming packet all the way through the TCP/IP stack until it reached a socket buffer for an application to read.

This has led to many problems. The biggest are that:

  1. You have a single CPU that’s spending all its time handling interrupts, unable to do user work
  2. You might often have increased latency for incoming packets due to the processing time

These problems were especially prevalent on single CPU systems. While it may be hard to remember, for a long time, most systems only had a single socket with a single CPU. There were no hardware threads and there were no cores. As multi-processing became more common (particularly on x86), the question became how do networking cards take advantage of horizontal scalability.

A Swing and a Miss

The first solution that springs to mind is to ask: why don’t we just have multiple threads process the same ring? If we have multiple CPUs in the system, why not just have a thread we can schedule to help do some of the heavy lifting along with the thread that’s processing the interrupt?

Unfortunately, this is where Amdahl starts to rear his head, glaring at us and reminding us of the harsh truths of multi-processing. Multiple threads processing the same queue doesn’t really do us much good. Instead, we’ll end up changing where our contention is. Instead of having a single CPU 100% busy, we’ll have several CPUs, quite busy, and spending a lot of their time blocked on locks. The problem is that we still have shared state here — the ring itself.

There’s actually another somewhat nastier and much more subtle problem here: packet ordering. While TCP has a lot of logic to deal with out of order packets and UDP users are left on their own, no one likes out of order packets. In many TCP stacks, whenever a packet is detected to have arrived out of order, that sends up red flags in the stack. This will cause TCP to believe there is something wrong in the network and often throttle traffic or require data to be retransmitted, injecting noticeable latency and performance impacts.

Now, if the packets arrive out of order on the networking card, then there’s not a lot we can do here. However, if they arrive in order and we have multiple threads processing the same ring, then due to lock ordering and the scheduler, it all of a sudden becomes very easy to have packets arrive out of order, leading to TCP throwing its hands up, and performance being left behind in the gutter.

So we now actually have two different problems:

  1. We need to make sure that we don’t allow packets to be delivered out of order
  2. We want to avoid sharing data structures protected by the same lock

Nine Rings for Packets Doomed to be Hashed

The answer to this problem is two-fold. Earlier we said that a ring is the lifeblood of a NIC, so what if we just added a bunch more rings to the card? If each ring has its own interrupt associated with it, then we’ve solved our contention problem. Each ring is still processed by only a single thread. The fastest way to deal with shared resources is not to share at all!

So now we’re left with the question of how do we deal with the out of order packet problem. If we simply assigned each incoming packet to a ring in a round-robin fashion, we’d only be making the problem of out of order delivery a certainty. So this means that we need to do something a little bit smarter.

The important observation is that what we care about is that a given TCP or UDP connection always goes to the same place. It’s a TCP connection that can become out of order. If there are two different connections ongoing, the order that their packets are received in doesn’t matter. Only the order within a single connection matters. Now all we need to do is assign a given connection to a ring. For a given TCP connection, the source and destination IP addresses, and the source and destination ports will be the same throughout the lifetime of the connection. You might sometimes hear this called a flow, a series of identifying information that identifies some set of traffic.

The way the assignments are made is by hashing. Networking cards have a hash function that takes into account various fields in the header that are constant and use them to produce a hash value. Then that hash value is used to assign something to a ring. For example, if there were four rings in use, a simple way to assign traffic is to simply take the hash value and compute its modulus by the number of rings, giving a constant ring index.

Different cards use different strategies for this. You also don’t necessarily need to use all of the members of the header. For example, while you could use both the source and destination ports and IP addresses for a TCP connection, if the card ignored the ports, the right thing would still happen. Traffic wouldn’t end up out of order; however, it might not be spread quite as evenly amongst the rings.
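
As a sketch of the idea, the following hashes the fields of a flow and takes the modulus to pick a ring. The hash used here is FNV-1a, chosen only because it is short. It is not what networking cards implement, and the structure and function names are hypothetical.

```c
#include <stdint.h>

/* The fields that stay constant for the life of a connection. */
typedef struct flow {
    uint32_t f_src_ip;
    uint32_t f_dst_ip;
    uint16_t f_src_port;
    uint16_t f_dst_port;
} flow_t;

/* Fold the four bytes of a 32-bit value into a running FNV-1a hash. */
static uint32_t
fnv1a_u32(uint32_t hash, uint32_t val)
{
    int i;

    for (i = 0; i < 4; i++) {
        hash ^= (val >> (i * 8)) & 0xff;
        hash *= 16777619u;      /* 32-bit FNV prime */
    }
    return (hash);
}

/*
 * Map a flow to a ring index in the range [0, nrings). Every packet of
 * the same flow hashes to the same value, so it always lands on the same
 * ring. The caller must pass a non-zero nrings.
 */
static unsigned int
flow_to_ring(const flow_t *f, unsigned int nrings)
{
    uint32_t hash = 2166136261u;    /* 32-bit FNV offset basis */

    hash = fnv1a_u32(hash, f->f_src_ip);
    hash = fnv1a_u32(hash, f->f_dst_ip);
    hash = fnv1a_u32(hash, ((uint32_t)f->f_src_port << 16) | f->f_dst_port);
    return (hash % nrings);
}
```

Dropping the port fields from the hash would still keep each connection on one ring. It would just mean that all connections between the same pair of hosts share that ring.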

This technology is often referred to as receive side scaling (RSS). On SmartOS and other illumos derived systems, RSS is enabled automatically if the device and its driver support it.

Problem: Density, Density, Density

As Steve Ballmer famously once said, “Developers, Developers, Developers…”. For many companies today, it isn’t developers that are the problem, but density. The density and utilization of machines is one of the most important factors for their efficiency and their capex and opex budgets. The first major entry into the enterprise for tackling this density was VMware and the Hardware Virtual Machine. However, operating system virtualization had also kicked off. For more on the history, listen to Bryan Cantrill’s talk at Papers we Love.

Just like airlines don’t want to fly with empty seats on planes, when companies are buying servers, they want to make sure that they are fully utilized. Due to the increasing size of machines, that means that the number of things running on it has to increase. With rare exception, gone are the days of the single app on a server. This means that the number of IP addresses and MAC addresses per server has jumped dramatically. We cannot just load up a box with NICs and assign each physical device to a guest. That’s a lot of cards, ports, cables, and top of rack switches.

However, increasing density isn’t necessarily free. We now have new problems that come up as a result and new scalability problems.

A Brief Aside: The Virtual NIC

Once you start driving up the density and tenancy of a single machine, then you immediately have the question of how do you represent these different devices. Each of these instances, whether it is hardware virtualized or a container, has its own networking information. They not only have their own IP address, but they also have their own unique MAC address and different properties on them. So how do you represent these along with the physical devices?

Enter the Virtual NIC or VNIC. Each VNIC has its own MAC address and its own link properties. For example, they have their own MTU and may have a VLAN tag. Each VNIC is created over a physical device and can be given to any kind of instance. This allows a physical device to be shared between multiple different instances, all of which may not trust one another. VNICs allow the administrator or operator to describe logically what they want to exist, without worrying about the mess of the many different possible abstractions that have since shown up in the form of bridges, virtual switches, etc. In addition, VNICs have capabilities that allow us to prevent MAC address spoofing, IP address spoofing, DHCP spoofing, and more, making it very nice for a multi-tenant environment where you don’t trust your neighbor, especially when your neighbor sits next to you.

Always Promiscuous?

We talked earlier about how devices don’t receive every frame and instead have filters. On the flip side, as density demands have increased, so has the number of MAC addresses that exist on one server. When we talked about the simple NIC, it only had one MAC address filter. If the OS wanted to receive traffic for more than one MAC address, it would need to put the device into promiscuous mode.

However, here’s another way that devices have changed substantially. Rather than just having a single MAC address filter, they have added several. If you consider SmartOS and illumos, every time a virtual NIC is created, it tries to use an available MAC address filter. The number of filters present determines how many devices we can support before we have to put a NIC in promiscuous mode. On some devices there can be hundreds of these filters, some of which also take the VLAN tag into account.

The Classification Challenge

So, we’ve managed to improve things a bit here. We’ve got a couple hundred devices here and we’ve been able to program those devices into our MAC address filters. Our devices are employing RSS so we’re able to better spread the load across every device; however, we still have a problem: when a packet comes in we need to now figure out what virtual or physical device it corresponds to so we deliver it to the right instance.

By pushing all of this density into a single server, that server needs its own logical, virtual switches. At a high level, implementing this is straightforward: we simply need to look at the destination MAC address, find the corresponding device, and then deliver the packet in the context of that device.

NIC manufacturers paid attention to this need and the fact that operating systems were spending a lot of time dealing with this and so they came up with some additional features to help. We’ve already mentioned how devices can support RSS and how they can have MAC address filters. So, what happens if we combine the two ideas and give a piece of hardware multiple rings, each of which can be programmed with its own MAC address filter?

In this world, we assemble rings into what we call a group. A group consists of one or more MAC address filters and one or more rings. Consider the case where each group has one ring and one filter. So each VNIC in the system will be assigned to a group while they’re available. If a group can be assigned then all the traffic coming to the corresponding ring is guaranteed to be for that device. If we know that, then we can bypass any and all classification in software. When we process the ring, we know exactly what device in the OS this corresponds to and we can skip the classification step entirely.

We mentioned that some devices can assign multiple rings to a given group. If multiple rings are assigned to a group, then the NIC will perform RSS across that set of rings. That means that after the traffic gets filtered and assigned to a group, we then hash each incoming packet and assign it to one of those rings.

Now, you might be asking what about the case where we run out of groups? Well, at that point we try and leverage the fact that some groups can support multiple MAC addresses. This default group will be programmed with all the remaining MAC addresses. If those are exceeded, then we can put that default group and only that group into promiscuous mode.
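The assignment policy described above can be sketched as follows. This is an illustrative model only, not the illumos MAC layer: the `Nic` class, its fields, and the convention that group 0 is the default group are assumptions made for the example.

```python
class Nic:
    """Toy model of VNIC-to-group assignment (hypothetical, for illustration)."""

    def __init__(self, ngroups, default_filters):
        # Assumption: group 0 is the shared default group; the rest are dedicated.
        self.free_groups = list(range(1, ngroups))
        self.default_filters = default_filters  # MAC filters on the default group
        self.default_macs = []                  # MACs sharing the default group
        self.dedicated = {}                     # MAC -> dedicated group number
        self.promisc = False                    # default group in promiscuous mode?

    def add_vnic(self, mac):
        """Return how traffic for this MAC ends up being classified."""
        if self.free_groups:
            # A dedicated group means its ring only ever carries this MAC's
            # traffic, so no software classification is needed.
            self.dedicated[mac] = self.free_groups.pop(0)
            return "hardware"
        # Out of groups: fall back to the default group's filters.
        self.default_macs.append(mac)
        if len(self.default_macs) > self.default_filters:
            # Out of filters too: only the default group goes promiscuous.
            self.promisc = True
        return "software"

nic = Nic(ngroups=3, default_filters=1)
results = [nic.add_vnic(m) for m in ("02:00:00:00:00:01", "02:00:00:00:00:02",
                                     "02:00:00:00:00:03", "02:00:00:00:00:04")]
```

In this run the first two VNICs get dedicated groups, the third fits in the default group's single filter, and the fourth forces the default group into promiscuous mode.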

What we’ve done now is taken the basic building blocks of both rings and MAC address filters and combined them in a way to tackle the density problem. This lets a single network card scale up much more dramatically.

Problem: CPUs are too ‘slow’

One of the other prime features of modern NICs is what various NIC manufacturers call hardware offloads. TCP, UDP, and other networking protocols often have checksums that need to be calculated. The reason this came up is that many CPUs were considered too slow. What this really means is that there was a bit more latency and CPU time spent processing these checksums and verifying them. NIC vendors decided to add additional logic to their silicon (or firmware) to calculate those checksums themselves.

The general idea is that when the OS needs to transmit a packet, it can ask the NIC to fill in the checksums, and when it is receiving a packet, it can ask the NIC to verify the checksum. If the NIC verifies the checksum, then oftentimes the OS will trust the NIC and not verify it itself.

In general, these offload technologies are fairly common and generally enabled by default. They do add a nice little advantage; however, it often comes at the cost of debuggability and may leave you at the mercy of hardware and firmware bugs in the devices. Historically, some devices have had bugs in this logic or had various edge conditions that will lead them to corrupt the data or incorrectly calculate the checksum. Debugging these kinds of problems is often very difficult, because everything that the OS generates itself looks fine.

There are other offload technologies that have also been introduced, such as TCP Segmentation Offload, which seek to offload parts of the TCP stack processing to networking cards. Whenever looking at or considering hardware offloads, it’s always important to understand the trade offs in performance, debuggability, and value. Not every hardware offload is worth it. Always measure.

Problem: The Interrupts are Coming in too Hot

Let’s set the stage. Originally devices could handle 10 Mbits of traffic in a single second. If you assume a default MTU of 1500 bytes and that every packet was that size (depending on your workload, this can easily be a bad assumption), then that means that a device could in theory receive about 833 packets in a given second (10 Mbit/s * 1,000,000 bits/Mbit / 8 bits/byte / 1500 bytes/packet). When you start accounting for the Ethernet header and the VLAN tag, this number falls a bit more.

So if we had 833 packets per second and then we assume that each interrupt only has a single packet (the worst case), we have 833 interrupts per second and we have over 1 ms to process each packet. Okay, that’s easy.

Now, we’re not using 10 Mbit/s devices, we’re often using 10 Gbit/s devices. That means we have 1000x more packets per second! So that means that if we have a single packet in every interrupt that’s 833,000 interrupts per second. All of a sudden the time we have to just do all of the accounting for the interrupt becomes ridiculous and starts to add a lot of overhead.
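The arithmetic above is easy to reproduce. The sketch below uses the same simplifying assumptions as the text (every packet is exactly 1500 bytes, headers are ignored, one packet per interrupt); the function name is just for illustration.

```python
def packets_per_second(link_bits_per_sec, packet_bytes=1500):
    """Whole packets per second on a link, ignoring Ethernet framing overhead."""
    return link_bits_per_sec // (8 * packet_bytes)

pps_10m = packets_per_second(10_000_000)        # 10 Mbit/s link
pps_10g = packets_per_second(10_000_000_000)    # 10 Gbit/s link

# Time available per packet, in microseconds, with one packet per interrupt.
budget_10m_us = 1_000_000 / pps_10m
budget_10g_us = 1_000_000 / pps_10g

print(pps_10m, pps_10g)
print(round(budget_10m_us, 1), round(budget_10g_us, 1))
```

At 10 Mbit/s the budget is a little over a millisecond per packet; at 10 Gbit/s it shrinks to a little over a microsecond.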

Solution One: Do Less Work

The first way that this is often handled is to simply do less. Modern devices have interrupt throttling. This allows device drivers to limit the number of interrupts that occur per second per ring. The rates and schemes are different on every device, but a simple way to think about it is if you set an upper bound on interrupts per second, then the device will enforce a minimum amount of time between interrupts. Say you wanted to allow 8000 interrupts per second, then this would mean that an interrupt could fire at most once every 125 microseconds.
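To make the effect concrete, here is a small simulation of throttling under a simplified model: the device fires an interrupt as soon as a packet is waiting and the minimum gap since the previous interrupt has elapsed, and the handler drains everything that has arrived by then. Real devices differ in their exact schemes, and `throttled_interrupts` is a hypothetical name.

```python
def throttled_interrupts(arrivals_us, max_per_sec):
    """Return the number of packets handled by each interrupt."""
    min_gap = 1_000_000 / max_per_sec   # 8000/s -> 125 microseconds
    batches = []
    next_allowed = 0.0
    i, n = 0, len(arrivals_us)
    while i < n:
        # Fire when the oldest pending packet has arrived and the gap has passed.
        fire_at = max(arrivals_us[i], next_allowed)
        j = i
        while j < n and arrivals_us[j] <= fire_at:
            j += 1                      # everything already in the ring is drained
        batches.append(j - i)
        next_allowed = fire_at + min_gap
        i = j
    return batches

# One packet every 10 microseconds for one millisecond: 100 packets.
arrivals = list(range(0, 1000, 10))
batches = throttled_interrupts(arrivals, max_per_sec=8000)
```

In this model the 100 packets are handled in 9 interrupts rather than 100, at the cost of some packets waiting up to 125 microseconds in the ring.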

When an interrupt comes in, the operating system can process more than one packet per interrupt. If there are several packets available in the ring, then there’s nothing that stops the system from processing multiple in a single interrupt and in fact, if you want to perform well at higher speeds, you need to.

While most operating systems enable this by default, there is a trade off. You can imagine that there is a small latency hit. For most uses this isn’t very important; however, if you’re in the finance world where every microsecond counts, then you may not care about the CPU cost and would rather avoid the latency. At the end of the day though, most workloads will end up benefiting from using interrupt throttling, especially as it can be used to help reduce the CPU load.

Solution Two: Turn Off Interrupts

Here we’re going to go and do something entirely different. We’re going to stop bothering with interrupts. Modern PCI Express devices all support multiple interrupts. We’ve talked about how there are multiple rings, well each ring has its own interrupt identified by a vector. These interrupts are called MSI-X. These devices allow you to mask the interrupt and turn it off on a per-ring basis.

Regardless of whether interrupts are turned on or off on a given ring, packets will still accumulate in the ring. This means that if the operating system were to look at the ring, it could see that there were packets available to be processed. If the OS marks the received entries processed, then the hardware will continue delivering packets into the ring. When the OS decides to turn off interrupts and process the ring with a dedicated thread, we call this polling.

Polling works in conjunction with dedicated rings being used for classification. In essence, when the OS notices that there’s a large number of packets coming in on the ring, it will then transition the ring to polling mode, disabling interrupts. The dedicated thread will then continue to poll the ring for work, consuming up to a fixed number of packets in any one poll. After that, if there is still a large backlog, it will keep polling until a low watermark is reached, at which point it will disable polling and transition back to interrupt based processing.

Each time a poll occurs, the packets will be delivered in bulk to the networking stack. So if 50 packets came in in-between polls, then they would all be delivered at once.
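The transition between interrupt mode and polling mode can be sketched as a small state machine. The watermarks, the per-poll budget, and the `Ring` class itself are illustrative assumptions, not values or structures taken from illumos.

```python
from collections import deque

class Ring:
    """Toy receive ring that switches between interrupt and polling mode."""

    def __init__(self, high=32, low=4, budget=16):
        self.queue = deque()
        self.polling = False
        self.high, self.low, self.budget = high, low, budget

    def receive(self, pkt):
        # Hardware keeps filling the ring whether or not interrupts are enabled.
        self.queue.append(pkt)

    def _take(self, n):
        return [self.queue.popleft() for _ in range(min(n, len(self.queue)))]

    def interrupt(self):
        """Interrupt handler: with a large backlog, hand off to the poll thread."""
        if len(self.queue) >= self.high:
            self.polling = True         # interrupts are masked from here on
            return []
        return self._take(len(self.queue))

    def poll(self):
        """One pass of the poll thread: deliver up to `budget` packets in bulk."""
        batch = self._take(self.budget)
        if len(self.queue) <= self.low:
            self.polling = False        # low watermark reached: back to interrupts
        return batch

ring = Ring()
for n in range(40):
    ring.receive(n)
ring.interrupt()                        # backlog of 40 >= 32, so start polling
sizes = []
while ring.polling:
    sizes.append(len(ring.poll()))
```

Each poll hands a whole batch to the stack at once, and the ring drops back to interrupt mode once the backlog falls to the low watermark.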

As with everything we’ve seen, there is a trade off of sorts. When you’re in polling mode, there can be an additional latency hit to processing some of these packets; however, polling can make it much easier to saturate 10 Gbit/s and faster devices with very little CPU.

Recapping

We started off with introducing the concept of rings in a networking card. Rings form the basic building block of the NIC and are the basis for a lot of the more interesting trends in hardware. From there, we talked about how you can use rings for RSS and when you combine rings with MAC address filters you can drive up tenancy with hardware classification, which helps enable polling, among other features.

One important thing here is that all of the features we’ve talked about are completely transparent to applications. All of the software built upon the BSD/POSIX sockets APIs, functions like connect(), accept(), sendmsg(), and recvmsg(), automatically get all of the benefits of these developments and enhancements in the operating system and hardware, without having to change a single thing in the application.

Future Directions and More Reading

This is only the tip of the iceberg both in terms of detail and in terms of what we can do. For example, while we’ve primarily talked about the hardware and software classification of flows for the purpose of delivering them to the right device, there are other things we can do such as throttle bandwidth and more with tools like flowadm(1M).

At Joyent, we’re continuing to explore these areas and push ahead in new and interesting ways. For example, as cards have been adding more and more functionality to support things like VXLAN and Geneve, we’re exploring how we can perform hardware classification of those protocols, leverage newer checksum offloading for them, and come up with novel and different ways to improve the performance. If any of this sounds interesting, make sure to reach out.

If you found this topic interesting, but find yourself looking for more, you may want to read some of the following:

Posted on September 15, 2016 at 10:06 am by rm · Permalink · Comments Closed
In: Miscellaneous

illumos day 2014

Saturday, September 27th was illumos day 2014, hosted as a follow-on to Surge 2014. illumos day was really quite nice and it was a good gathering of both folks who have been in the community for some time, and those who were just getting started. I was able to record the talks and so you can find them all in the following YouTube playlist:

The following folks gave talks:

Thanks to everyone who attended!

Posted on October 1, 2014 at 10:47 am by rm · Permalink · Comments Closed
In: Miscellaneous

illumos Overlay Networks Development Preview 02

I’m happy to announce the second development preview of my network virtualization or, if you like to use buzzwords, software defined networking, for illumos. Like the previous entry, the goal of this is to give folks something to play around with and get a sense of what this looks like for a user and administrator.

The dev-overlay branch of illumos-joyent has all the source code and has been merged up to illumos and illumos-joyent from September 22nd.

This is a development preview, so it’s using a debug build of illumos. This is not suitable for production use. There are bugs; expect panics.

How we got here

It’s worth taking a little bit of time to understand the class of problems that we’re trying to solve. At the core of this work is a desire to have multiple logical layer two networks be able to all use one physical, or underlay, network. This means that you can run multiple virtual machines that each have their own independent set of VLANs and private address space, so both Alice and Bob can have their own VMs using the same private IP addresses, like 10.1.2.3/24, and be confident that they will not see each other’s traffic.

What’s in this Release

This release builds on the last release, which had simple point to point tunnels. This release adds support for the following:

This release also has a similar set of known issues:

Dynamic Plugins

In the first release, overlay devices only supported the direct plugin which always sent all traffic to a single destination. While useful, it meant that a given tunnel was limited to being point to point. The notion of a dynamic plugin changes this entirely. In this world, traffic can be encapsulated and sent to different hosts based on its destination MAC address. Instead of getting a single destination from userland at device creation, the kernel goes and asks userland to supply it with the destination on demand.

Allowing an answer to be supplied this way makes it much easier to write different ways of answering the question in userland. As individuals and organizations figure out their own strategy here, it makes it much easier to interface with existing centralized databases or extant distributed systems.

In addition, as part of writing a simple files backend, I wrote several routines that can be used to inject proxy ARP, proxy NDP, and proxy DHCPv4 requests. Having these primitives in the common library makes it much easier for different backends which don’t support multicast or broadcast traffic to have something to use.

The files plugin format

In the next section we’ll show an example of getting started and having three different VMs use the same file for understanding our virtual network’s layout. Here’s a copy of the file /var/tmp/hosts.json that I’ve been using:

# cat /var/tmp/hosts.json
{

        "de:ad:be:ef:00:00": {
                "arp": "10.55.55.2",
                "ip": "10.88.88.69",
                "ndp": "fe80::3",
                "port": 4789
        }, "de:ad:be:ef:00:01": {
                "arp": "10.55.55.3",
                "dhcp-proxy": "de:ad:be:ef:00:00",
                "ip": "10.88.88.70",
                "ndp": "fe80::4",
                "port": 4789
        }, "de:ad:be:ef:00:02": {
                "arp": "10.55.55.4",
                "ip": "10.88.88.71",
                "ndp": "fe80::5",
                "port": 4789
        }
}

In this JSON blob, the key is the MAC address of a VNIC. Each key must have members named ip and port. These are used by the plugin to answer the question: where should a packet with this MAC address be sent? The ip member may be either an IPv4 or IPv6 address.

Machines send packets to a specific MAC address. They look up the mappings between a MAC address and an IP address using different mechanisms for IPv4 and IPv6. IPv4 uses ARP to get this information which devolves into using broadcast frames, while NDP is built into IPv6 and uses ICMPv6. Those messages are generally sent using specific multicast addresses. However, because this backend does not support broadcast or multicast traffic, we need to do something a little different.

When the kernel encounters a destination MAC address that it doesn’t recognize, it asks userland where it should send it. Userland in turn looks at the layer 2 header and determines what it should do. When it sees something that gives the sign that it might be an ARP or NDP packet, it pulls down a copy of the entire packet and if it confirms that it is in fact an ARP or NDP packet, it will generate a response on its own using information encoded in the JSON file above and that will be injected into the overlay device for delivery.

The system determines the mapping between an IPv4 address and its MAC address from the IP address supplied in the arp field. It determines the mapping between an IPv6 address and its MAC address from the ndp field.

Finally, to better explore this prototype, I implemented a DHCP proxy capability. While DHCP has a defined system of relaying, the relay expects to be able to receive layer 2 broadcast packets. Instead, if we see a UDP broadcast packet that’s doing a DHCP query, we rewrite the frame to send it explicitly to the destination MAC address listed in the dhcp-proxy member. In this case, if I run a DHCPv4 server on the host listed in the first entry, it will properly serve a DHCP address to the MAC address that has the dhcp-proxy entry. However, an important thing to remember is that even though DHCP was able to assign an address, one still needs to be able to perform ARP, and therefore if the address doesn’t match the one in the files entry, it will not work. To be able to do that properly, you need to write a plugin that’s a bit more complicated than the files backend.
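To show how the fields in the file relate to one another, here is a minimal sketch of the two lookups described above: finding the underlay destination for a MAC address, and answering an ARP query from the arp member. This is not the plugin’s actual code; the function names are hypothetical and only two entries from the example file are reproduced.

```python
import json

HOSTS = json.loads("""
{
    "de:ad:be:ef:00:01": {
        "arp": "10.55.55.3",
        "dhcp-proxy": "de:ad:be:ef:00:00",
        "ip": "10.88.88.70",
        "ndp": "fe80::4",
        "port": 4789
    },
    "de:ad:be:ef:00:02": {
        "arp": "10.55.55.4",
        "ip": "10.88.88.71",
        "ndp": "fe80::5",
        "port": 4789
    }
}
""")

def lookup_dest(hosts, mac):
    """Where should an encapsulated frame for this MAC be sent on the underlay?"""
    entry = hosts.get(mac.lower())
    if entry is None:
        return None
    return entry["ip"], entry["port"]

def proxy_arp(hosts, target_ip):
    """Which MAC answers for this overlay IPv4 address, per the arp member?"""
    for mac, entry in hosts.items():
        if entry.get("arp") == target_ip:
            return mac
    return None

dest = lookup_dest(HOSTS, "de:ad:be:ef:00:01")
owner = proxy_arp(HOSTS, "10.55.55.4")
```

An NDP lookup would follow the same pattern as `proxy_arp`, matching on the ndp member instead.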

Getting Started

This development release of SmartOS comes in three flavors:

Once you boot this version of SmartOS, you should be good to go. As an example, I’ll show how I set up three individual hosts, which we’ll call zlan, zlan2, and zlan3. I put the JSON file shown above in the file /var/tmp/hosts.json. On the host zlan I ran the following commands:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip=10.88.88.69 -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:00 -l overlay0 foo0
# ifconfig foo0 plumb up 10.55.55.2/24
# ifconfig foo0 inet6 plumb up
# ifconfig foo0 inet6 fe80::3

On the host zlan2 I ran the following:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip=10.88.88.70 -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:01 -l overlay0 bar0
# ifconfig bar0 plumb up 10.55.55.3/24
# ifconfig bar0 inet6 plumb up
# ifconfig bar0 inet6 fe80::4

And finally on the host zlan3, I ran the following:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip=10.88.88.71 -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:02 -l overlay0 baz0
# ifconfig baz0 plumb up 10.55.55.4/24
# ifconfig baz0 inet6 plumb up
# ifconfig baz0 inet6 fe80::5

With all that done, all three hosts could ping and access network services on each other.

Concluding

The dynamic plugins allow us to start building and experimenting with something a bit more interesting than the point to point tunnel. From here, there isn’t as much core functionality that’s necessary to add, but there’s a lot more stability work and many improvements to make throughout the stack. In addition, from here, I’ll be experimenting with some more distributed systems to make the next dynamic plugin much more dynamic.

If you have any feedback, suggestions, or anything else, please let me know. You can find me on IRC (rmustacc in #smartos and #illumos on irc.freenode.net) or on the smartos-discuss mailing list. If you’d like to work on support for other encapsulation methods such as NVGRE or want to see how implementing a dynamic mapping service might be, reach out to me.

Posted on September 23, 2014 at 11:58 am by rm · Permalink · Comments Closed
In: Networking

illumos Overlay Networks Development Preview 01

At Joyent I’ve been spending my time designing and building support for network virtualization in the form of protocols like VXLAN. I’ve gotten far enough along that I’m happy to announce the first SmartOS developmental preview of this work. The goal of this is to just give something for folks to play around with and start getting a sense of what this looks like. If you have any feedback, please send it my way!

All the development of this is going on in its own branch of illumos-joyent: dev-overlay. You can see all of the developments, including a README that gives a bit of an introduction and background, on that branch here.

The development preview below is a debug build of illumos. This is not suitable for production use. There are bugs. Expect panics.

What’s in this release

This release adds the foundation for overlay devices and their management in user land. With this you can create and list point-to-point VXLAN tunnels and create vnics on top of them. This is all done through dladm. This release also includes the preliminary version of the varpd daemon which manages user land lookups and will be used for custom lookup mechanisms in the future.

However, there are known things that don’t work:

Getting Started

This development release comes in the standard SmartOS flavors:

Once you boot this version of the platform, you’ll find that most things look the same. You’ll find a new service has been created and should be online — varpd. You can verify this with the svcs command. Next, I’ll walk through an example of starting everything up, creating an overlay device, and a VNIC on top of that.

[root@00-0c-29-ca-c7-23 ~]# svcs varpd
STATE          STIME    FMRI
online         21:43:00 svc:/network/varpd:default
[root@00-0c-29-ca-c7-23 ~]# dladm create-overlay -e vxlan -s direct \
    -p vxlan/listen_ip=10.88.88.69 -p direct/dest_ip=10.88.88.70 \
    -p direct/dest_port=4789 -v 23 demo0
[root@00-0c-29-ca-c7-23 ~]# dladm show-overlay
LINK         PROPERTY            PERM REQ VALUE       DEFAULT     POSSIBLE
demo0        mtu                 rw   -   0           --          --
demo0        vnetid              rw   -   23          --          --
demo0        encap               r-   -   vxlan       --          vxlan
demo0        varpd/id            r-   -   1           --          --
demo0        vxlan/listen_ip     rw   y   10.88.88.69 --          --
demo0        vxlan/listen_port   rw   y   4789        4789        1-65535
demo0        direct/dest_ip      rw   y   10.88.88.70 --          --
demo0        direct/dest_port    rw   y   4789        --          1-65535
[root@00-0c-29-ca-c7-23 ~]# dladm create-vnic -l demo0 foo0
[root@00-0c-29-ca-c7-23 ~]# ifconfig foo0 plumb up 10.55.55.2/24

Let’s take this apart. The first thing that we did is create an overlay device. The -e vxlan option tells us that we should use vxlan for encapsulation. Currently only VXLAN is supported. The -s direct specifies that we should use the direct or point-to-point module for determining where packets flow. In other words, there’s only a single destination.

Following this we set three required properties: vxlan/listen_ip, which tells us what IP address to listen on; direct/dest_ip, which tells us which IP to send the results to; and finally direct/dest_port, which says what port to use. We didn’t end up setting the property vxlan/listen_port because VXLAN specifies a default port, which is 4789.

Finally, we specified a virtual network id with -v, in this case 23. And then we ended it all with a name.

After that, it became visible in the dladm show-overlay which displayed everything that we wanted. You’ll want to take similar steps on another machine, just make sure to swap the IP addresses around.
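For a sense of what the virtual network id turns into on the wire, the sketch below builds the 8-byte VXLAN header defined in RFC 7348 for the id 23 used above. It only illustrates the header layout and is not code from the overlay driver; `vxlan_header` is a hypothetical helper name.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08   # the "I" flag from RFC 7348

def vxlan_header(vni):
    """Return the 8-byte VXLAN header for a 24-bit virtual network id."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VXLAN network ids are 24 bits wide")
    # Word 1: 8 bits of flags followed by 24 reserved bits.
    # Word 2: 24 bits of VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

hdr = vxlan_header(23)
print(hdr.hex())
```

This header sits in front of the original Ethernet frame, and the whole thing is carried in a UDP datagram to the destination IP and port configured on the overlay device.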

Concluding

This is just the tip of the iceberg here. There’s going to be a lot more functionality and a lot more improvements down the road. I’ll be doing additional development previews along the way.

If you have any feedback, suggestions, or anything else, please let me know. You can find me on IRC (rmustacc in #smartos and #illumos on irc.freenode.net) or on the smartos-discuss mailing list. If you’d like to work on support for other encapsulation methods such as NVGRE or want to see how implementing a dynamic mapping service might be, reach out to me.

Posted on July 25, 2014 at 6:08 pm by rm · Permalink · Comments Closed
In: Networking

DLPI and the IP Fastpath

The series so far

If you’re getting started you’ll want to see the previous entries on Project Bardiche:

The illumos Networking Stack

This blog post is going to dive into more detail about what the ‘fastpath’ is in illumos for networking, what it means, and a bit more about how it works. We’ll also go through and cover a bit more information about some of the additions we made as part of this project. Before we go too much further, let’s take another look at the picture of the networking stack from the entry on architecture of vnd:

             +---------+----------+----------+
             | libdlpi |  libvnd  | libsocket|
             +---------+----------+----------+
             |         ·          ·    VFS   |
             |   VFS   ·    VFS   +----------+
             |         ·          |  sockfs  |
             +---------+----------+----------+
             |         |    VND   |    IP    |
             |         +----------+----------+
             |            DLD/DLS            |
             +-------------------------------+
             |              MAC              |
             +-------------------------------+
             |             GLDv3             |
             +-------------------------------+

If you don’t remember what some of these components are, you might want to refresh your memory with the vnd architecture entry. Importantly, almost everything is layered on top of the DLD and DLS modules.

The illumos networking stack comes from a long lineage of technical work done at Sun Microsystems. Initially the networking stack was implemented using STREAMS. STREAMS is a message passing interface where message blocks (mblk_t) are sent around from one module to the next. For example, there are modules for things like arp, tcp/ip, udp, etc. These are chained together and can be seen in mdb using the ::stream dcmd. Here’s an example from my development zone:

> ::walk dld_str_cache | ::print dld_str_t ds_rq | ::q2stream | ::stream

+-----------------------+-----------------------+
| 0xffffff0251050690    | 0xffffff0251050598    |
| udp                   | udp                   |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x20204022      | flg = 0x20244032      |
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffff02510523f8    | 0xffffff0251052300    | if: net0
| ip                    | ip                    |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x00004022      | flg = 0x00004032      |
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffff0250eda158    | 0xffffff0250eda060    |
| vnic                  | vnic                  |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x00244062      | flg = 0x00204032      |
+-----------------------+-----------------------+
...

If I sent a udp packet, it would first be processed by the udp streams module, then the ip streams module, and finally make its way to the DLD/DLS layer which is represented by the vnic entry here. The means of this communication was part of the DLPI. DLPI itself defines several different kinds of messages and responses which can be found in the illumos source code here. The general specification is available here, though there’s a lot more to it than is worth reading. In illumos, it’s been distilled down into libdlpi.

Recall from the vnd architecture entry that the way devices and drivers communicate with a datalink is by initially using STREAMS modules and by opening a device in /dev/net/. Each data link in the system is represented by a dls_link_t. When you open a device in /dev/net, you get a dld_str_t which is an instance of a STREAMS device.

The DLPI allows consumers to bind to what they call a SAP or a service attachment point. What this means depends on the kind of data link. In the case of Ethernet, this refers to the ethertype. In other words, a given dld_str_t can be bound to something like IP, ARP, LLDP, etc. If this were something other than Ethernet, then that name space would be different.

For a given data link, only one dld_str_t can be actively bound to a given SAP (ethertype) at a time. An active bind refers to something that is actively consuming and sending data. For example, when you create an IP interface using ifconfig or ipadm, that does an active bind. Another example of an active bind is a daemon used for LLDP. There are also passive binds, like in the case of something trying to capture packets like snoop or tcpdump. That allows something to capture the data without worrying about blocking someone from using that attachment point.

Speeding things up

While the fundamentals of the DLPI are sound, the implementation in STREAMS, particularly for sending data, left something to be desired. It greatly complicated the locking and it was hard to get it to perform in the way that was needed for saturating 10 GbE networks with TCP traffic. For all the details on what happened here and a good background, I’ll refer you to Sunay Tripathi’s Blog, where he covers a lot of what changed in Solaris 10 to fix this.

There are two parts to what folks generally end up calling the ‘IP fastpath’. We leverage one part for vnd, while the other part is still firmly used by IP. We’ll touch on the first part first, which eliminates the use of STREAMS messages for sending data. Instead it uses direct callbacks. Today this happens by negotiating with DLPI messages that discover capabilities of devices and then enabling them. The code for the vnd driver does this, as well as the ip driver. Specifically, you first send down a DL_CAPABILITY_REQ message. The response contains a list of capabilities that exist.

If the capability DL_CAPAB_DLD is returned, then you can enable direct function calls to the DLD and DLS layer. The returned values give you a function pointer, which you can then use to do several things, and ultimately use to request to enable DLD_CAPAB_DIRECT. When you make a call to enable, you specify a function pointer for DLD to call directly when a packet is received. It then will give you a series of functions to use for things like checking flow control and transmitting a packet. These functions allow the system to bypass the issues with STREAMS and transmit packets directly.

The second part of the ‘IP fastpath’ is something that primarily the IP module uses. In the IP module there is a notion of a neighbor cache entry, or nce. This nce describes how to reach another host. When that host is found, the nce asks the lower layers of the stack to generate a layer two header that’s appropriate for this traffic. In the case where you have an Ethernet device, this means that it generates the MAC header, including the source and destination MAC addresses, the ethertype, and a VLAN tag if there should be one. The IP stack then uses this pre-generated header each time rather than creating a new one from scratch for every packet. In addition, the IP module subscribes to the change events that get generated when something like a MAC address changes, so that it can regenerate these headers when the administrator makes a change to the system.
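To make the pre-generated header idea concrete, here is a minimal sketch in C. It is not the illumos implementation: demo_nce_t, demo_nce_genhdr(), and demo_nce_frame() are hypothetical names invented for illustration, and the sketch ignores VLAN tags entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	DEMO_ETHER_ADDR_LEN	6
#define	DEMO_ETHER_HDR_LEN	14

/* Hypothetical stand-in for a neighbor cache entry; not the illumos nce_t. */
typedef struct demo_nce {
	uint8_t	dn_hdr[DEMO_ETHER_HDR_LEN];	/* pre-generated L2 header */
	size_t	dn_hdrlen;			/* 0 until generated */
} demo_nce_t;

/*
 * Build the header once: destination, source, then the ethertype in
 * network byte order. Calling this again regenerates the header, which
 * is what a MAC address change event would trigger.
 */
void
demo_nce_genhdr(demo_nce_t *nce, const uint8_t *dst, const uint8_t *src,
    uint16_t ethertype)
{
	memcpy(nce->dn_hdr, dst, DEMO_ETHER_ADDR_LEN);
	memcpy(nce->dn_hdr + DEMO_ETHER_ADDR_LEN, src, DEMO_ETHER_ADDR_LEN);
	nce->dn_hdr[12] = (uint8_t)(ethertype >> 8);
	nce->dn_hdr[13] = (uint8_t)(ethertype & 0xff);
	nce->dn_hdrlen = DEMO_ETHER_HDR_LEN;
}

/*
 * Per-packet path: copy the cached header and then the payload into
 * out. Returns the frame length, or 0 if the frame does not fit.
 */
size_t
demo_nce_frame(const demo_nce_t *nce, const void *payload, size_t plen,
    uint8_t *out, size_t outlen)
{
	assert(nce->dn_hdrlen != 0);
	if (plen > outlen || nce->dn_hdrlen > outlen - plen)
		return (0);
	memcpy(out, nce->dn_hdr, nce->dn_hdrlen);
	memcpy(out + nce->dn_hdrlen, payload, plen);
	return (nce->dn_hdrlen + plen);
}
```

The per-packet path is just two copies. All of the work of figuring out addresses and the ethertype happens once in demo_nce_genhdr().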

New Additions

Finally, it’s worth taking a little bit of time to talk about the new DLPI additions that we added with project bardiche. We needed to solve two problems. Specifically:

To solve the first case, we added a new request called DL_EXCLUSIVE_REQ. This adds a new mode for the bind state of the dld_str_t. In addition to being active or passive, it can now be exclusive. This can only be requested if no one is actively using the device. If someone is, for example because an IP interface has already been created, then the DL_EXCLUSIVE_REQ will fail. The opposite is true as well: if someone is using the dld_str_t in exclusive mode, then a request to bind to the IP ethertype will also fail. This exclusive request lasts until the consumer closes the dld_str_t.

When a vnd device is created, it makes an explicit request for exclusive access to the device, because it needs to send and receive on all of the different ethertypes. If an IP interface is already active, it doesn’t make sense for a vnd device to be created there. Once the vnd device is destroyed, then anything can use the data link.
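The interaction between active binds and exclusive access can be sketched as a small model in C. This is only an illustration of the rules described above, not the DLD code: demo_link_t and its functions are hypothetical names, passive binds are left out, and the limit of eight SAPs is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	DEMO_MAX_SAPS	8	/* arbitrary limit for this sketch */

/* Hypothetical model of a data link's bind state. */
typedef struct demo_link {
	uint16_t	dl_saps[DEMO_MAX_SAPS];	/* actively bound SAPs */
	unsigned int	dl_nsaps;		/* number of active binds */
	bool		dl_exclusive;		/* held exclusively? */
} demo_link_t;

/* An active bind fails if the SAP is taken or the link is exclusive. */
bool
demo_bind_active(demo_link_t *dl, uint16_t sap)
{
	unsigned int i;

	if (dl->dl_exclusive || dl->dl_nsaps == DEMO_MAX_SAPS)
		return (false);
	for (i = 0; i < dl->dl_nsaps; i++) {
		if (dl->dl_saps[i] == sap)
			return (false);
	}
	dl->dl_saps[dl->dl_nsaps++] = sap;
	return (true);
}

/* Exclusive access is only granted when nothing is actively bound. */
bool
demo_request_exclusive(demo_link_t *dl)
{
	if (dl->dl_exclusive || dl->dl_nsaps != 0)
		return (false);
	dl->dl_exclusive = true;
	return (true);
}

/* Exclusive access ends when the consumer closes its handle. */
void
demo_close_exclusive(demo_link_t *dl)
{
	assert(dl->dl_exclusive);
	dl->dl_exclusive = false;
}
```

In this model, a link with an IP bind (SAP 0x0800) refuses demo_request_exclusive(), and a link held exclusively refuses any demo_bind_active() until demo_close_exclusive() is called.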

Solving our second problem was actually quite simple. The core logic to not loop back packets that were transmitted was already present in the MAC layer. To make use of it, we created a new promiscuous option that can be specified in the DLPI DL_PROMISCON_REQ, called DL_PROMISC_RX_ONLY. Enabling this passes the flag MAC_PROMISC_FLAGS_NO_TX_LOOP down to the MAC layer, which actually does the heavy lifting of duplicating the necessary number of packets.

Conclusion

This gives a rather rough introduction to the fastpath in the illumos networking stack. The devil, as always, is in the details.

In the next entries, we’ll go over the other new extensions that were added as part of this work: the sdev plugin interface and generalized serialization queues. Finally, we’ll finish with a recap and go over what’s next.

Posted on April 3, 2014 at 8:22 am by rm
In: Bardiche, Networking

Project Bardiche: Framed I/O

The series so far

If you’re getting started you’ll want to see the previous entries on Project Bardiche:

Background

Framed I/O is a new abstraction that we’re currently experimenting with through Project Bardiche. We call this framed I/O, because the core concept is what we call a frame: a variable amount of discrete data that has a maximum size. In this article, we’ll call data that fits this framed. For example, Ethernet devices work exactly this way. They have a maximum size based on their MTU, but there may very well be less data available than the maximum. There are a few overarching goals that led us down this path:

The primary use case of framed I/O is for vnd and virtual machines. However, we believe that the properties here make it desirable to other portions of the stack which operate in terms of frames. To understand why we’re evaluating this abstraction, it’s worth talking about the other existing abstractions in the system.

read(2) and write(2)

Let’s start with the traditional and most familiar series of I/O interfaces: read(2), write(2), readv(2), and writev(2). These are the standard I/O system calls that most C programmers are familiar with. read(2) and write(2) originated in first edition UNIX. readv(2) and writev(2) supposedly came about during the development of 4.2 BSD. The read and write routines operate on streams of data. The callers and file descriptors have no inherent notion of data being framed and all framing has to be built into consumption of the data. For a lot of use cases, that is the correct abstraction.

The readv(2) and writev(2) interfaces allowed that stream to be vectored. It’s hard to say if these were the first vectored I/O abstraction in operating systems, but they are certainly among the most popular ones from early systems that are still around. Where read(2) and write(2) map a stream to a single buffer, these calls map a stream to a series of arbitrarily sized vectors. The act of vectorizing data is not uncommon and can be very useful. Generally, this is done when combining what may be multiple elements into one discrete stream for transferring. For example, if a program maintains one buffer for a protocol header and another buffer for the payload, then being able to specify a vector that includes both of these in one call can be quite useful.
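As a small illustration of the header-plus-payload case, the following sketch sends two separate buffers with a single writev(2) call. The function name demo_send_hdr_payload() is made up for this example; writev(2) and struct iovec are the standard interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write a protocol header and its payload with one writev(2) call,
 * without first copying them into a single contiguous buffer.
 * Returns whatever writev(2) returns.
 */
ssize_t
demo_send_hdr_payload(int fd, const void *hdr, size_t hdrlen,
    const void *payload, size_t plen)
{
	struct iovec iov[2];

	assert(hdr != NULL && payload != NULL);

	/* iov_base is not const-qualified, hence the casts. */
	iov[0].iov_base = (void *)hdr;
	iov[0].iov_len = hdrlen;
	iov[1].iov_base = (void *)payload;
	iov[1].iov_len = plen;

	return (writev(fd, iov, 2));
}
```

The reader on the other end of the file descriptor sees one contiguous run of bytes: the header immediately followed by the payload.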

When operating with framed data, these interfaces fall a bit short. The problem is that you’ve lost information that the system had regarding the framing. It may be that the protocol itself includes the delineations, but there’s no guarantee that that data is correct. For example, if you had a buffer of size 1500, something like read(2) would only give you the total number of bytes returned; you wouldn’t be able to get the number of frames. A return value of 1500 could be one large 1500 byte frame, it could be multiple 300 byte frames, or anything in between.
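The loss of framing is easy to demonstrate with a pipe standing in for a stream-oriented descriptor. This is a sketch, assuming an ordinary POSIX pipe in its default byte-stream mode; demo_read_after_frames() is a hypothetical helper written for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define	DEMO_FRAME_LEN	300

/*
 * Write nframes separate 300 byte "frames" into a pipe, then issue a
 * single read(2) with a buflen sized buffer. Returns what read(2)
 * returned, or -1 on failure.
 */
ssize_t
demo_read_after_frames(int nframes, char *buf, size_t buflen)
{
	int fds[2], i;
	char frame[DEMO_FRAME_LEN];
	ssize_t ret = -1;

	assert(nframes > 0);
	if (pipe(fds) != 0)
		return (-1);

	memset(frame, 'x', sizeof (frame));
	for (i = 0; i < nframes; i++) {
		if (write(fds[1], frame, sizeof (frame)) !=
		    (ssize_t)sizeof (frame))
			goto out;
	}

	/* The reader only learns a byte count, not a frame count. */
	ret = read(fds[0], buf, buflen);
out:
	(void) close(fds[0]);
	(void) close(fds[1]);
	return (ret);
}
```

With a 1500 byte buffer, writing two frames should make read(2) return 600 and writing five frames should make it return 1500. In neither case does the return value say how many frames were written or where one ends and the next begins.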

getmsg(2) and putmsg(2)

The next set of APIs worth looking at are getmsg(2) and putmsg(2). These APIs are a bit different from the normal read(2) and write(2) APIs: they’re designed around framed messages. These routines use a struct strbuf which has the following members:

    int    maxlen;      /* maximum buffer length */
    int    len;         /* length of data */
    char   *buf;        /* ptr to buffer */

These interfaces allow the consumer to properly express the maximum size of the frame that they expect and the amount of data that the given frame actually includes. This is very useful for framed data. Unfortunately, this API has some deficiencies. It doesn’t have the ability to break down the data into vectors, nor do systems really have a means of working with multiple vectors at a time.

sendmsg(2) and recvmsg(2)

The next set of APIs that I explored was the recvmsg(2) family, particularly the extensions that were introduced into the Linux kernel via sendmmsg(2) and recvmmsg(2). The general design of the msghdr structure is good, though it understandably is designed around the socket interface. Unfortunately, something like sendmsg(2) is not something that device drivers in most systems get; it currently only works for socket file systems, and a lot of things don't look like sockets. Things like ancillary data and the optional addresses are not as useful and don't have meaning for other styles of messages, or if they do, they may not fit the abstraction that's been defined there.

Framed I/O

Based on our evaluations of the above APIs, a few of us chatted around Joyent's San Francisco office and tried to come up with something that might have the properties we felt made more sense for something like KVM networking. To help distinguish it from traditional socket semantics or STREAMS semantics, we named it after its basic building block, the frame. The general structure is called a frameio_t, which itself has a series of vector structures called a framevec_t. The structures roughly look like:

typedef struct framevec {
    void    *fv_buf;        /* Buffer with data */
    size_t  fv_buflen;      /* Size of the buffer */
    size_t  fv_actlen;      /* Amount of buffer consumed, ignore on error */
} framevec_t;

typedef struct frameio {
    uint_t  fio_version;    /* Should always be FRAMEIO_CURRENT_VERSION */
    uint_t  fio_nvpf;       /* How many vectors make up one frame */
    uint_t  fio_nvecs;      /* The total number of vectors */
    framevec_t fio_vecs[];  /* C99 flexible array member */
} frameio_t;

The idea here is that, much like a struct strbuf, each vector component has a notion of what its maximum size is and then the actual size of the data consumed. These vectors can then be constructed into a series of frames in multiple ways through the fio_nvpf and fio_nvecs members. The fio_nvecs field describes the total number of vectors and the fio_nvpf field describes how many vectors are in a frame. You might think of the fio_nvpf member as basically describing how many iovec structures make up a single frame.

Suppose you have four vectors to play with. You might want to rig them up in one of several ways. You might want to map each message to a single vector, meaning that you could read four messages at once. You might want the opposite and map a single message to all four vectors. In that case you'd only ever read one message at a time, broken into four components. You could also always break a message down into two vectors, which means that you'd be able to read two messages at a time. The following ASCII art might help.

1:1 Vector to Frame mapping

 +-------+  +-------+  +-------+  +-------+
 | msg 0 |  | msg 1 |  | msg 2 |  | msg 3 |
 +-------+  +-------+  +-------+  +-------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

4:1 Vector to Frame mapping

 +----------------------------------------+
 |                  msg 0                 |
 +----------------------------------------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

2:1 Vector to Frame Mapping

 +------------------+  +------------------+
 |       msg 0      |  |       msg 1      |
 +------------------+  +------------------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

Currently the maximum number of vectors allowed in a given call is limited to 32. As long as the number of vectors per frame evenly divides the total number of vectors, any value is alright.
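The three mappings and the divisibility rule can be sketched in C as follows. This is not the illumos code: the demo_ prefixed types mirror the structures shown earlier, but the helper functions are hypothetical, and DEMO_FRAMEIO_VERSION is a placeholder because the real value of FRAMEIO_CURRENT_VERSION isn't given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define	DEMO_FRAMEIO_VERSION	1	/* placeholder value (assumption) */
#define	DEMO_FRAMEIO_NVECS_MAX	32	/* limit described in the text */

typedef struct demo_framevec {
	void	*fv_buf;	/* Buffer with data */
	size_t	fv_buflen;	/* Size of the buffer */
	size_t	fv_actlen;	/* Amount of buffer consumed */
} demo_framevec_t;

typedef struct demo_frameio {
	unsigned int	fio_version;
	unsigned int	fio_nvpf;	/* vectors per frame */
	unsigned int	fio_nvecs;	/* total number of vectors */
	demo_framevec_t	fio_vecs[];	/* flexible array member */
} demo_frameio_t;

/* Number of frames a layout describes, or 0 if the layout is invalid. */
unsigned int
demo_frameio_nframes(unsigned int nvecs, unsigned int nvpf)
{
	if (nvecs == 0 || nvecs > DEMO_FRAMEIO_NVECS_MAX || nvpf == 0 ||
	    nvecs % nvpf != 0)
		return (0);
	return (nvecs / nvpf);
}

void
demo_frameio_free(demo_frameio_t *fio)
{
	unsigned int i;

	if (fio == NULL)
		return;
	for (i = 0; i < fio->fio_nvecs; i++)
		free(fio->fio_vecs[i].fv_buf);
	free(fio);
}

/* Allocate a frameio whose nvecs vectors each hold buflen bytes. */
demo_frameio_t *
demo_frameio_alloc(unsigned int nvecs, unsigned int nvpf, size_t buflen)
{
	demo_frameio_t *fio;
	unsigned int i;

	if (demo_frameio_nframes(nvecs, nvpf) == 0 || buflen == 0)
		return (NULL);
	fio = calloc(1, sizeof (*fio) + nvecs * sizeof (demo_framevec_t));
	if (fio == NULL)
		return (NULL);
	fio->fio_version = DEMO_FRAMEIO_VERSION;
	fio->fio_nvpf = nvpf;
	fio->fio_nvecs = nvecs;
	for (i = 0; i < nvecs; i++) {
		if ((fio->fio_vecs[i].fv_buf = malloc(buflen)) == NULL) {
			demo_frameio_free(fio);
			return (NULL);
		}
		fio->fio_vecs[i].fv_buflen = buflen;
	}
	return (fio);
}

/* First vector of the given frame; the frame spans fio_nvpf vectors. */
demo_framevec_t *
demo_frameio_frame(demo_frameio_t *fio, unsigned int frame)
{
	assert(frame < fio->fio_nvecs / fio->fio_nvpf);
	return (&fio->fio_vecs[frame * fio->fio_nvpf]);
}
```

With four vectors, demo_frameio_nframes(4, 1), demo_frameio_nframes(4, 4), and demo_frameio_nframes(4, 2) correspond to the 1:1, 4:1, and 2:1 diagrams and yield four, one, and two frames respectively. A layout such as four vectors with three vectors per frame is rejected because three does not evenly divide four.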

By combining these two different directions, we believe that this'll be a useful abstraction and suitable for other parts of the system that operate on framed data, for example a USB stack. Another thing that this design lets us do is that by not constraining the content of the vectors, it would be possible to replicate something like the struct msghdr where the protocol header data was actually in the first vector.

Framed I/O now and in the Future

Today we've started plumbing this through in QEMU to account for its network device backend APIs that allow one to operate on traditional iovec. However, there's a lot more that can be done with this. For example, one of the things that is on our minds is writing a vhost-net style driver for illumos that can easily map data between the framed I/O representation and the virtio driver. With this, it'd even be possible to do something that's mostly zero-copy as well. Alternatively, we may also explore just redoing a lot of QEMU's internal networking paths to make it more friendly for sending and receiving multiple packets at once. That should certainly help with the overhead today of networking I/O in virtual machines.

We think that this might fit in other parts of the system as well. For example, it may make sense to use it as part of the illumos USB3 stack's design, as the unit that we send data in. Whether it makes sense as anything more than just for the vnd device and this style of I/O, time will tell.

Today vnd devices are exposed in libvnd through the vnd_frameio_read(3VND) and vnd_frameio_write(3VND) interfaces. These can also be used by someone who's trying to develop their own services using vnd, for example userland switches, firewalls, etc.

Next in the Bardiche Series

Next in the bardiche series, we'll be delving into some of the additional kernel subsystems and new DLPI abstractions that were created. Following those, we'll end with a recap entry on bardiche as a whole and what may come next.

Posted on March 25, 2014 at 8:57 am by rm
In: Bardiche, Networking