Turtles on the Wire: Understanding how the OS uses the Modern NIC

The modern networking card (NIC) has evolved quite a bit from the simple Ethernet cards of yesteryear. As such, the way the operating system uses them has had to evolve in tandem. Gone are the simple 10 Mbit/s copper or BNC devices. Instead, 1 Gb/s is commonplace in the home, 10 Gb/s rules the server, and you can buy cards that come in speeds like 25 Gb/s, 40 Gb/s, and even 100 Gb/s! Speed isn’t the only thing that’s changed; there’s also been a big push toward virtualization. What used to be one app on one server transformed into a couple of apps on a few Hardware Virtual Machines, and these days can be hundreds of containers, all on a single server.

For this entry, we’re going to focus on how the operating system sees NICs, what abstractions they provide, how things have changed to deal with the increased tenancy and performance demands, and then finally where we’re going next with all of this. Along the way, we’ll look at where scalability problems have come about and how they’ve been solved.

The Simple NIC

While this is a broad generalization, the simplest way to think of a networking card is that it has five primary pieces:

  1. A MAC Address that it can use to filter incoming packets.
  2. A ring, or circular buffer, that packets are received into from the network.
  3. A ring, or circular buffer, that packets are sent from to the network.
  4. The ability to generate interrupts.
  5. A way to program all of the above, generally done with PCI memory-mapped registers.

First, let’s talk about rings. Both of these rings are considered a type of circular buffer. With a circular buffer, the valid region of the buffer changes over time. Circular buffers are often used because of the property that they consume a fixed amount of memory and they handle the model of a single producer and a single consumer rather well. One end, the producer, places data in the buffer and moves a head pointer, while another end, the consumer, removes data from the buffer, moving the tail pointer. For the rest of this article, we won’t be using the term circular buffer, but instead ring, which is commonly used in both networking card programming manuals and operating systems to describe a circular buffer used for packet descriptors.

Let’s go into a bit more detail about how this works. Rings are the lifeblood of a networking card. They occupy a fixed chunk of normal system memory, and when the hardware wants to access them, it performs DMA (direct memory access). All of the data that comes and goes through the card is described by entries in a ring, called descriptors. The ring is made up of a series of these descriptors. A descriptor is generally made up of three parts:

  1. A buffer address
  2. A buffer length
  3. Metadata about the packet
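
To make the layout concrete, here’s a minimal sketch of a descriptor and a ring of them. This is illustrative Python; real descriptors are packed, device-defined structures living in DMA-able memory, not objects:

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    buffer_addr: int = 0   # physical address of the packet buffer
    buffer_len: int = 0    # size of that buffer in bytes
    metadata: int = 0      # device-specific flags (checksum status, end-of-packet, ...)

@dataclass
class Ring:
    # A fixed number of descriptors: the ring consumes a fixed amount of memory.
    descriptors: list = field(default_factory=lambda: [Descriptor() for _ in range(512)])

    def __len__(self):
        return len(self.descriptors)

ring = Ring()
print(len(ring))  # 512
```

The ring size here (512) is just a plausible example; real devices support a range of sizes, programmed via their registers.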

Consider the receive ring. Each entry in it describes a place that the hardware can place an incoming packet. The buffer in the descriptor says where in main memory to place the packet and the length says how big that buffer is. When a packet arrives, a networking card will generally generate an interrupt, thus letting the operating system know that it should check the ring.

The transmit side is similar, but slightly different in orientation. The OS places descriptors into the ring and then indicates to the networking card that it should transmit those packets. The descriptor says where in memory to find the packet and the length says how long the packet is. The metadata may include information such as whether the packet is broken up into one or more descriptors, the type of data in the packet, or optional features to perform.

To make this more concrete, let’s think of this in the context of receiving packets. Initially, the ring is empty. The OS then fills all of the descriptors with pointers to where the hardware can put packets it receives. The OS doesn’t have to fill the entire ring, but it’s standard practice in most operating systems to do so. For this example, we’ll assume that the OS has filled the entire ring.

Because it has written to the entire ring, the OS will set its pointer to the start of the ring. Then, as the hardware receives packets, it will bump its own pointer and send the OS an interrupt.

The following ASCII art image describes how the ring might look after the hardware has received two packets and delivered the OS an interrupt:

        +--------------+ <----- OS Pointer
        | Descriptor 0 |
        | Descriptor 1 |
        +--------------+ <----- Hardware Pointer
        | Descriptor 2 |
        |     ...      |
        | Descriptor n |

When the OS receives the interrupt, it reads where the hardware pointer is and processes the packets between its pointer and the hardware’s. Once it’s done, it doesn’t have to do anything else until it has prepared those descriptors with fresh buffers. Once it has, it updates its pointer by writing to the hardware. For example, if the OS has processed those first two descriptors and then updated the hardware, the ring will look something like:

        | Descriptor 0 |
        | Descriptor 1 |
        +--------------+ <----- Hardware Pointer, OS Pointer
        | Descriptor 2 |
        |     ...      |
        | Descriptor n |
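
The pointer chasing shown in the diagrams can be modeled in a few lines. This is a sketch: the names os_ptr and hw_ptr are mine, and a real driver reads and writes device registers rather than Python variables:

```python
RING_SIZE = 8  # illustrative; real rings are typically hundreds of entries

def process_rx(os_ptr, hw_ptr, ring_size=RING_SIZE):
    """Process the descriptors between the OS pointer and the hardware
    pointer, returning (indices processed, new OS pointer)."""
    processed = []
    while os_ptr != hw_ptr:
        processed.append(os_ptr)           # hand this descriptor's buffer to the stack
        os_ptr = (os_ptr + 1) % ring_size  # refill with a fresh buffer and advance
    # At this point the OS would write os_ptr back to a device register.
    return processed, os_ptr

# Hardware has filled descriptors 0 and 1 and fired an interrupt:
done, os_ptr = process_rx(os_ptr=0, hw_ptr=2)
print(done, os_ptr)  # [0, 1] 2 -- the pointers now meet, as in the second diagram
```

Note that the modulo arithmetic is what makes the buffer circular: after descriptor n, processing wraps back around to descriptor 0.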

When you send packets, it’s similar. The OS fills in descriptors and then notifies the hardware. Once the hardware has sent them out on the wire, it then injects an interrupt and indicates which descriptors it’s written to the network, allowing the OS to free the associated memory.

MAC Address Filters and Promiscuous Mode

The missing piece we haven’t talked about is the filter. Just about every networking card has support for filtering incoming packets, which is done based on the destination MAC address of the packet. By default, this filter is programmed with the MAC address of the card: if the destination MAC address in the Ethernet header doesn’t match the networking card’s MAC address, and it isn’t a broadcast or multicast packet, then the packet will be dropped.

When networking cards want to receive more traffic than that which they’ve been programmed to receive, they’re put into what’s called promiscuous mode, which means that the card will place every packet it receives into the receive ring for the operating system to process, regardless of the destination MAC address.
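
As a sketch, the card’s default accept/drop decision looks something like this (illustrative Python, not a real driver or device):

```python
BROADCAST = b"\xff\xff\xff\xff\xff\xff"

def accept_frame(dst_mac: bytes, nic_mac: bytes, promiscuous: bool) -> bool:
    """Model of the card's default filtering decision."""
    if promiscuous:
        return True               # deliver everything, regardless of destination
    if dst_mac == BROADCAST:
        return True               # broadcast frames are always accepted
    if dst_mac[0] & 0x01:
        return True               # multicast: low bit of the first octet is set
    return dst_mac == nic_mac     # unicast: must match our programmed address
```

The multicast test (the low bit of the first octet) is how Ethernet itself distinguishes group addresses from individual ones, which is why broadcast, being all ones, would also pass it.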

Now, the question that comes to mind is: why do cards have this filtering capability at all? Why does it matter? Why would we only care about a subset of packets on the network?

To answer this, we first have to go back to the world before network switches, when there were only hubs. When a packet came into a hub, it was replicated to every port. This meant that every NIC would end up seeing everything on the network. If NICs hadn’t filtered by default, then the sheer amount of traffic and interrupts might have overwhelmed many early systems, particularly on larger networks. Nowadays, this isn’t quite as problematic, as we use switches, which learn which MAC addresses are behind which switch ports. Of course, there are limits to this, which have motivated a lot of the work around network virtualization, but more on that in a future blog post.

Problem: The Single Busy CPU

In the previous section, we talked about how we had an interrupt that fired whenever packets were successfully received or transmitted. By default, there was only ever a single interrupt used, and that interrupt vector usually mapped to exactly one CPU; in other words, only one CPU could process the initial stream of packets. In many systems, that CPU would then process the incoming packet all the way through the TCP/IP stack until it reached a socket buffer for an application to read.

This has led to many problems. The biggest are that:

  1. You have a single CPU that’s spending all its time handling interrupts, unable to do user work
  2. You might often have increased latency for incoming packets due to the processing time

These problems were especially prevalent on single-CPU systems. While it may be hard to remember, for a long time most systems only had a single socket with a single CPU. There were no hardware threads and there were no cores. As multi-processing became more common (particularly on x86), the question became: how do networking cards take advantage of horizontal scalability?

A Swing and a Miss

The first solution that springs to mind is: why don’t we just have multiple threads process the same ring? If we have multiple CPUs in the system, why not schedule a thread to help do some of the heavy lifting along with the thread that’s processing the interrupt?

Unfortunately, this is where Amdahl starts to rear his head, glaring at us and reminding us of the harsh truths of multi-processing. Multiple threads processing the same queue doesn’t really do us much good. Instead, we’ll end up changing where our contention is. Instead of having a single CPU 100% busy, we’ll have several CPUs, quite busy, and spending a lot of their time blocked on locks. The problem is that we still have shared state here — the ring itself.

There’s actually another somewhat nastier and much more subtle problem here: packet ordering. While TCP has a lot of logic to deal with out of order packets and UDP users are left on their own, no one likes out of order packets. In many TCP stacks, whenever a packet is detected to have arrived out of order, that sends up red flags in the stack. This will cause TCP to believe there is something wrong in the network and often throttle traffic or require data to be retransmitted, injecting noticeable latency and performance impacts.

Now, if the packets arrive out of order on the networking card, then there’s not a lot we can do here. However, if they arrive in order and we have multiple threads processing the same ring, then due to lock ordering and the scheduler, it all of a sudden becomes very easy to have packets arrive out of order, leading to TCP throwing its hands up, and performance being left behind in the gutter.

So we now actually have two different problems:

  1. We need to make sure that we don’t allow packets to be delivered out of order
  2. We want to avoid sharing data structures protected by the same lock

Nine Rings for Packets Doomed to be Hashed

The answer to this problem is two-fold. Earlier we said that a ring is the lifeblood of a NIC, so what if we just added a bunch more rings to the card? If each ring has its own interrupt associated with it, then we’ve solved our contention problem. Each ring is still processed by only a single thread. The fastest way to deal with shared resources is not to share at all!

So now we’re left with the question of how do we deal with the out of order packet problem. If we simply assigned each incoming packet to a ring in a round-robin fashion, we’d only be making the problem of out of order delivery a certainty. So this means that we need to do something a little bit smarter.

The important observation is that what we care about is that a given TCP or UDP connection always goes to the same place. It’s a single TCP connection that can become out of order. If there are two different connections ongoing, the order in which their packets are received doesn’t matter; only the order within a single connection matters. Now all we need to do is assign a given connection to a ring. For a given TCP connection, the source and destination IP addresses and the source and destination ports will be the same throughout the lifetime of the connection. You might sometimes hear this called a flow: a set of identifying information that describes some stream of traffic.

The way the assignments are made is by hashing. Networking cards have a hash function that takes various header fields that are constant for a connection and uses them to produce a hash value. That hash value is then used to assign the packet to a ring. For example, if there were four rings in use, a simple way to assign traffic is to take the hash value and compute its modulus by the number of rings, giving a constant ring index.
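
A sketch of that scheme follows. Real cards typically use a Toeplitz hash with a programmable key; any stable hash over the flow’s constant header fields provides the property we need, so this example just uses a simple one:

```python
def rss_ring(src_ip, dst_ip, src_port, dst_port, nrings):
    """Assign a flow to a ring index in [0, nrings)."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}"
    h = 0
    for byte in flow.encode():
        h = (h * 31 + byte) & 0xFFFFFFFF  # simple stable hash, illustration only
    return h % nrings

# The same connection always lands on the same ring:
a = rss_ring("10.0.0.1", "10.0.0.2", 50000, 80, nrings=4)
b = rss_ring("10.0.0.1", "10.0.0.2", 50000, 80, nrings=4)
assert a == b
```

Because the hash depends only on fields that never change over the connection’s lifetime, every packet of the flow is steered to the same ring, which preserves ordering.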

Different cards use different strategies for this. You also don’t necessarily need to use all of the members of the header. For example, while you could use both the source and destination ports and IP addresses for a TCP connection, if the card ignored the ports, the right thing would still happen. Traffic wouldn’t end up out of order; however, it might not be spread quite as evenly amongst the rings.

This technology is often referred to as receive side scaling (RSS). On SmartOS and other illumos derived systems, RSS is enabled automatically if the device and its driver support it.

Problem: Density, Density, Density

As Steve Ballmer famously once said, “Developers, Developers, Developers…”. For many companies today, it isn’t developers that are the problem, but density. The density and utilization of machines is one of the most important factors in their efficiency and their capex and opex budgets. The first major enterprise entry for tackling this density was VMware and the Hardware Virtual Machine. However, operating system virtualization had also kicked off. For more on the history, listen to Bryan Cantrill’s talk at Papers We Love.

Just like airlines don’t want to fly with empty seats on planes, when companies are buying servers, they want to make sure that they are fully utilized. Due to the increasing size of machines, that means that the number of things running on it has to increase. With rare exception, gone are the days of the single app on a server. This means that the number of IP addresses and MAC addresses per server has jumped dramatically. We cannot just load up a box with NICs and assign each physical device to a guest. That’s a lot of cards, ports, cables, and top of rack switches.

However, increasing density isn’t free. New problems, and new scalability challenges, come up as a result.

A Brief Aside: The Virtual NIC

Once you start driving up the density and tenancy of a single machine, you immediately have the question of how to represent these different devices. Each of these instances, whether it’s hardware-virtualized or a container, has its own networking information. Not only does it have its own IP address, but it also has its own unique MAC address and its own link properties. So how do you represent these along with the physical devices?

Enter the Virtual NIC or VNIC. Each VNIC has its own MAC address and its own link properties. For example, they have their own MTU and may have a VLAN tag. Each VNIC is created over a physical device and can be given to any kind of instance. This allows a physical device to be shared between multiple different instances, all of which may not trust one another. VNICs allow the administrator or operator to describe logically what they want to exist, without worrying about the mess of the many different possible abstractions that have since shown up in the form of bridges, virtual switches, etc. In addition, VNICs have capabilities that allow us to prevent MAC address spoofing, IP address spoofing, DHCP spoofing, and more, making it very nice for a multi-tenant environment where you don’t trust your neighbor, especially when your neighbor sits next to you.

Always Promiscuous?

We talked earlier about how devices don’t receive every frame and instead have filters. On the flip side, as density demands increased, so did the number of MAC addresses that exist on one server. When we talked about the simple NIC, it only had one MAC address filter. If the OS wanted to receive traffic for more than one MAC address, it would need to put the device into promiscuous mode.

However, here’s another way that devices have changed substantially: rather than having just a single MAC address filter, they have added several. Consider SmartOS and illumos: every time a virtual NIC is created, it tries to use an available MAC address filter. The number of filters present determines how many devices we can support before we have to put a NIC into promiscuous mode. Some devices have hundreds of these filters, some of which take the VLAN tag into account as well.

The Classification Challenge

So, we’ve managed to improve things a bit. We’ve got a couple hundred devices and we’ve been able to program their addresses into our MAC address filters. Our devices are employing RSS, so we’re able to better spread the load; however, we still have a problem: when a packet comes in, we need to figure out which virtual or physical device it corresponds to so we can deliver it to the right instance.

By pushing all of this density into a single server, that server needs its own logical, virtual switches. At a high level, implementing this is straightforward, we simply need to look at the destination MAC address, find the corresponding device, and then deliver the packet in the context of that device.

NIC manufacturers paid attention to this need, and to the fact that operating systems were spending a lot of time dealing with it, so they came up with some additional features to help. We’ve already mentioned how devices can support RSS and how they can have MAC address filters. So, what happens if we combine the two ideas: give a piece of hardware multiple rings, each of which can be programmed with its own MAC address filter?

In this world, we assemble rings into what we call a group. A group consists of one or more MAC address filters and one or more rings. Consider the case where each group has one ring and one filter. Each VNIC in the system will be assigned to a group while groups are available. If a group can be assigned, then all of the traffic coming to the corresponding ring is guaranteed to be for that device. If we know that, then we can bypass any and all classification in software: when we process the ring, we know exactly which device in the OS it corresponds to and can skip the classification step entirely.

We mentioned that some devices can assign multiple rings to a given group. If multiple rings are assigned to a group, then the NIC will perform RSS across that set of rings. That means that after the traffic gets filtered and assigned to a group, the incoming packets are hashed and each is assigned to one of those rings.

Now, you might be asking: what about the case where we run out of groups? At that point, we fall back to a default group, leveraging the fact that some groups can support multiple MAC addresses. This default group is programmed with all of the remaining MAC addresses. If even its filters are exceeded, then we can put that default group, and only that group, into promiscuous mode.
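
Putting filters, groups, and RSS together, classification might be modeled like this. The data layout and names are mine, purely for illustration:

```python
def classify(dst_mac, flow_hash, groups, default_rings):
    """Pick the ring for an incoming frame: the MAC filter chooses the
    group, then RSS (the flow hash) spreads traffic across that group's
    rings. 'groups' maps a programmed MAC filter to a list of ring IDs."""
    rings = groups.get(dst_mac, default_rings)  # filter miss -> default group
    return rings[flow_hash % len(rings)]

groups = {
    "de:ad:be:ef:00:00": [0],        # one filter, one ring: no software lookup
    "de:ad:be:ef:00:01": [1, 2, 3],  # RSS spreads flows across a three-ring group
}
print(classify("de:ad:be:ef:00:00", 0xabcd, groups, default_rings=[7]))  # 0
```

Frames that miss every filter land in the default group’s rings, which is exactly where software classification still has to happen.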

What we’ve done now is taken the basic building blocks of both rings and MAC address filters and combined them in a way to tackle the density problem. This lets a single network card scale up much more dramatically.

Problem: CPUs are too ‘slow’

One of the other prime features of modern NICs is what various NIC manufacturers call hardware offloads. TCP, UDP, and other networking protocols often have checksums that need to be calculated. The reason this came up is that many CPUs were considered too slow. What this really means is that there was a bit more latency and CPU time spent processing these checksums and verifying them. NIC vendors decided to add additional logic to their silicon (or firmware) to calculate those checksums themselves.

The general idea is that when the OS needs to transmit a packet, it can ask the NIC to fill in the checksums, and when it is receiving a packet, it can ask the NIC to verify the checksum. If the NIC verifies the checksum, then often the OS will trust the NIC and not verify it itself.
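
The receive-side contract can be sketched as follows. The flag name here is illustrative; every device defines its own descriptor metadata bits:

```python
HW_CSUM_VERIFIED = 0x1  # hypothetical "hardware verified the checksum" flag

def need_software_checksum(descriptor_flags, trust_hardware=True):
    """If the NIC says it already verified the checksum, most stacks
    skip recomputing it in software."""
    if trust_hardware and (descriptor_flags & HW_CSUM_VERIFIED):
        return False
    return True

print(need_software_checksum(HW_CSUM_VERIFIED))  # False
print(need_software_checksum(0))                 # True
```

The trust_hardware knob is the escape hatch: when you suspect a firmware bug of the kind described below, turning the offload off forces every checksum back through software.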

In general, these offload technologies are fairly common and usually enabled by default. They do add a nice little advantage; however, it often comes at the cost of debuggability and may leave you at the mercy of hardware and firmware bugs in the devices. Historically, some devices have had bugs in this logic or various edge conditions that lead them to corrupt the data or incorrectly calculate the checksum. Debugging these kinds of problems is often very difficult, because everything the OS generates itself looks fine.

There are other offload technologies that have also been introduced, such as TCP Segmentation Offload, which seeks to offload parts of the TCP stack processing to networking cards. Whenever looking at or considering hardware offloads, it’s always important to understand the trade-offs in performance, debuggability, and value. Not every hardware offload is worth it. Always measure.

Problem: The Interrupts are Coming in too Hot

Let’s set the stage. Originally devices could handle 10 Mbits of traffic in a single second. If you assume a default MTU of 1500 bytes and that every packet was that size (depending on your workload, this can easily be a bad assumption), then that means that a device could in theory receive about 833 packets in a given second (10 Mbit/s * 1,000,000 bits/Mbit / 8 bits/byte / 1500 bytes/packet). When you start accounting for the Ethernet header and the VLAN tag, this number falls a bit more.

So if we had 833 packets per second and then we assume that each interrupt only has a single packet (the worst case), we have 833 interrupts per second and we have over 1 ms to process each packet. Okay, that’s easy.

Now, we’re not using 10 Mbit/s devices; we’re often using 10 Gbit/s devices. That means we have 1000x more packets per second! If we have a single packet in every interrupt, that’s 833,000 interrupts per second. All of a sudden, just the accounting for each interrupt becomes ridiculous and starts to add a lot of overhead.
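
The arithmetic above is easy to check. A sketch, assuming every frame is a full 1500-byte MTU:

```python
def packets_per_second(link_bits_per_sec, frame_bytes=1500):
    # Each full-sized frame is frame_bytes * 8 bits on the wire (ignoring
    # Ethernet headers, preamble, and the inter-frame gap, all of which
    # only lower the number further).
    return link_bits_per_sec // (frame_bytes * 8)

pps_10m = packets_per_second(10_000_000)       # 10 Mbit/s link
pps_10g = packets_per_second(10_000_000_000)   # 10 Gbit/s link
print(pps_10m, pps_10g)  # 833 833333
```

At 833,333 packets per second, the per-packet time budget is roughly 1.2 microseconds, versus the comfortable 1.2 milliseconds at 10 Mbit/s.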

Solution One: Do Less Work

The first way this is often handled is simply to do less. Modern devices have interrupt throttling, which allows device drivers to limit the number of interrupts that occur per second, per ring. The rates and schemes are different on every device, but a simple way to think about it is that if you set an upper bound on interrupts per second, the device will enforce a minimum amount of time between interrupts. Say you wanted to allow 8,000 interrupts per second; this would mean that an interrupt could fire at most every 125 microseconds.
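
The relationship between an interrupt cap and the spacing it implies is just a reciprocal, which the following sketch spells out:

```python
def min_interrupt_gap_us(max_interrupts_per_sec):
    """Capping interrupts/second is equivalent to enforcing a minimum
    spacing between successive interrupts on the ring."""
    return 1_000_000 / max_interrupts_per_sec

print(min_interrupt_gap_us(8000))  # 125.0 microseconds between interrupts
```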

When an interrupt comes in, the operating system can process more than one packet per interrupt. If there are several packets available in the ring, then there’s nothing that stops the system from processing multiple in a single interrupt and in fact, if you want to perform well at higher speeds, you need to.

While most operating systems enable this by default, there is a trade off. You can imagine that there is a small latency hit. For most uses this isn’t very important; however, if you’re in the finance world where every microsecond counts, then you may not care about the CPU cost and would rather avoid the latency. At the end of the day though, most workloads will end up benefiting from using interrupt throttling, especially as it can be used to help reduce the CPU load.

Solution Two: Turn Off Interrupts

Here we’re going to do something entirely different: stop bothering with interrupts. Modern PCI Express devices all support multiple interrupts, delivered via a mechanism called MSI-X. We’ve talked about how there are multiple rings; each ring has its own interrupt, identified by a vector. These devices allow you to mask an interrupt, turning it off on a per-ring basis.

Regardless of whether interrupts are turned on or off on a given ring, packets will still accumulate in the ring. This means that if the operating system were to look at the ring, it could see that there were packets available to be processed. If the OS marks the received entries processed, then the hardware will continue delivering packets into the ring. When the OS decides to turn off interrupts and process the ring with a dedicated thread, we call this polling.

Polling works in conjunction with dedicated rings being used for classification. In essence, when the OS notices that there’s a large number of packets coming in on the ring, it will then transition the ring to polling mode, disabling interrupts. The dedicated thread will then continue to poll the ring for work, consuming up to a fixed number of packets in any one poll. After that, if there is still a large backlog, it will keep polling until a low watermark is reached, at which point it will disable polling and transition back to interrupt based processing.
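
The watermark-driven transition described above can be sketched as a tiny state machine. The threshold values below are illustrative numbers I chose, not any operating system’s defaults:

```python
HIGH_WATERMARK = 64   # backlog that triggers a switch to polling
LOW_WATERMARK = 8     # backlog at which we re-enable interrupts
POLL_BURST = 32       # max packets consumed in any one poll

def next_mode(mode, backlog):
    """One step of the interrupt/polling transition logic for a ring."""
    if mode == "interrupt" and backlog >= HIGH_WATERMARK:
        return "polling"        # mask the ring's MSI-X vector, start the poll thread
    if mode == "polling" and backlog <= LOW_WATERMARK:
        return "interrupt"      # unmask the vector, go back to interrupt-driven
    return mode

print(next_mode("interrupt", 100))  # polling
print(next_mode("polling", 4))      # interrupt
```

The gap between the two watermarks provides hysteresis, so a ring hovering near one threshold doesn’t flap between the two modes.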

Each time a poll occurs, the packets are delivered in bulk to the networking stack. So if 50 packets came in between polls, they would all be delivered at once.

As with everything we’ve seen, there is a trade off of sorts. When you’re in polling mode, there can be an additional latency hit to processing some of these packets; however, polling can make it much easier to saturate 10 Gbit/s and faster devices with very little CPU.


We started off by introducing the concept of rings in a networking card. Rings form the basic building block of the NIC and are the basis for a lot of the more interesting trends in hardware. From there, we talked about how you can use rings for RSS, and how, when you combine rings with MAC address filters, you can drive up tenancy with hardware classification, which helps enable polling, among other features.

One important thing here is that all of the features we’ve talked about are completely transparent to applications. All of the software built upon the BSD/POSIX sockets APIs, functions like connect(), accept(), sendmsg(), and recvmsg(), automatically get all of the benefits of these developments and enhancements in the operating system and hardware, without having to change a single thing in the application.

Future Directions and More Reading

This is only the tip of the iceberg both in terms of detail and in terms of what we can do. For example, while we’ve primarily talked about the hardware and software classification of flows for the purpose of delivering them to the right device, there are other things we can do such as throttle bandwidth and more with tools like flowadm(1M).

At Joyent, we’re continuing to explore these areas and push ahead in new and interesting ways. For example, as cards have been adding more and more functionality to support things like VXLAN and Geneve, we’re exploring how we can perform hardware classification of those protocols, leverage newer checksum offloading for them, and come up with novel and different ways to improve performance. If any of this sounds interesting, make sure to reach out.

If you found this topic interesting, but find yourself looking for more, you may want to read some of the following:

Posted on September 15, 2016 at 10:06 am by rm · Permalink · Comments Closed
In: Miscellaneous

illumos day 2014

Saturday, September 27th was illumos day 2014, hosted as a follow-on to Surge 2014. illumos day was really quite nice, and it was a good gathering of both folks who have been in the community for some time and those who were just getting started. I was able to record the talks, so you can find them all in the following YouTube playlist:

The following folks gave talks:

Thanks to everyone who attended!

Posted on October 1, 2014 at 10:47 am by rm · Permalink · Comments Closed
In: Miscellaneous

illumos Overlay Networks Development Preview 02

I’m happy to announce the second development preview of my network virtualization (or, if you like buzzwords, software-defined networking) work for illumos. Like the previous entry, the goal of this is to give folks something to play around with and get a sense of what this looks like for a user and administrator.

The dev-overlay branch of illumos-joyent has all the source code and has been merged up to illumos and illumos-joyent from September 22nd.

This is a development preview, so it’s using a debug build of illumos. This is not suitable for production use. There are bugs; expect panics.

How we got here

It’s worth taking a little bit of time to understand the class of problems that we’re trying to solve. At the core of this work is a desire to have multiple logical layer-two networks all able to use one physical, or underlay, network. This means that you can run multiple virtual machines that each have their own independent set of VLANs and private address space, so both Alice and Bob can have their own VMs using the same private IP addresses and be confident that they will not see each other’s traffic.

What’s in this Release

This release builds on the last release, which had simple point-to-point tunnels. This release adds support for the following:

This release also has a similar set of known issues:

Dynamic Plugins

In the first release, overlay devices only supported the direct plugin, which always sent all traffic to a single destination. While useful, it meant that a given tunnel was limited to being point-to-point. The notion of a dynamic plugin changes this entirely. In this world, traffic can be encapsulated and sent to different hosts based on its destination MAC address. Instead of getting a single destination from userland at device creation, the kernel asks userland to supply it with the destination on demand.

Allowing an answer to be supplied this way makes it much easier to write different ways of answering the question in userland. As individuals and organizations figure out their own strategy here, it makes it much easier to interface with existing centralized databases or extant distributed systems.

In addition, as part of writing a simple files backend, I wrote several routines that can be used to inject proxy ARP, proxy NDP, and proxy DHCPv4 requests. Having these primitives in the common library makes it much easier for different backends which don’t support multicast or broadcast traffic to have something to use.

The files plugin format

In the next section we’ll show an example of getting started and having three different VMs use the same file for understanding our virtual network’s layout. Here’s a copy of the file /var/tmp/hosts.json that I’ve been using:

# cat /var/tmp/hosts.json

{
        "de:ad:be:ef:00:00": {
                "arp": "",
                "ip": "",
                "ndp": "fe80::3",
                "port": 4789
        }, "de:ad:be:ef:00:01": {
                "arp": "",
                "dhcp-proxy": "de:ad:be:ef:00:00",
                "ip": "",
                "ndp": "fe80::4",
                "port": 4789
        }, "de:ad:be:ef:00:02": {
                "arp": "",
                "ip": "",
                "ndp": "fe80::5",
                "port": 4789
        }
}
In this JSON blob, each key is the MAC address of a VNIC. Each entry must have members named ip and port. These are used by the plugin to answer the question: where should a packet with this MAC address be sent? The ip member may be either an IPv4 or IPv6 address.

Machines send packets to a specific MAC address. They look up the mappings between a MAC address and an IP address using different mechanisms for IPv4 and IPv6. IPv4 uses ARP to get this information which devolves into using broadcast frames, while NDP is built into IPv6 and uses ICMPv6. Those messages are generally sent using specific multicast addresses. However, because this backend does not support broadcast or multicast traffic, we need to do something a little different.

When the kernel encounters a destination MAC address that it doesn’t recognize, it asks userland where it should send it. Userland in turn looks at the layer-two header and determines what to do. When it sees signs that the packet might be ARP or NDP, it pulls down a copy of the entire packet; if it confirms that it is in fact an ARP or NDP packet, it generates a response on its own, using the information encoded in the JSON file above, and injects that response into the overlay device for delivery.

The system determines the mapping between an IPv4 address and its MAC address from the arp field, and between an IPv6 address and its MAC address from the ndp field.

Finally, to better explore this prototype, I implemented a DHCP proxy capability. While DHCP has a defined system of relaying, a relay expects to be able to receive layer two broadcast packets. Instead, if we see a UDP broadcast packet that’s doing a DHCP query, we rewrite the frame to send it explicitly to the destination MAC address listed in the dhcp-proxy member. In this case, if I run a DHCPv4 server on the host listed in the first entry, it will properly serve a DHCP address to the MAC address that has the dhcp-proxy entry. An important thing to remember, though, is that even though DHCP was able to assign an address, one still needs to be able to perform ARP; if the assigned address doesn’t match the one in the file’s entry, it will not work. To do that properly, you need to write a plugin that’s a bit more complicated than the files backend.
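
The rewrite itself is simple: swap the broadcast destination MAC for the proxy's MAC so the frame can be delivered over a unicast-only backend. Here is a minimal sketch of that step, with an invented function name (this is not the actual varpd code, and it skips the UDP/DHCP inspection the real plugin performs first):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * A minimal sketch of the DHCP proxy rewrite: when a DHCP query arrives
 * with a broadcast destination MAC, overwrite the destination with the MAC
 * listed in the dhcp-proxy member. The function name is hypothetical.
 */
static const uint8_t bcast_mac[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/*
 * frame points at the start of the Ethernet header, whose first six bytes
 * are the destination MAC. Returns 1 if the frame was rewritten.
 */
int
dhcp_proxy_rewrite(uint8_t *frame, size_t len, const uint8_t proxy_mac[6])
{
	if (len < 14)	/* too short to hold an Ethernet header */
		return (0);
	if (memcmp(frame, bcast_mac, 6) != 0)
		return (0);	/* not a broadcast frame; leave it alone */
	memcpy(frame, proxy_mac, 6);
	return (1);
}
```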

Getting Started

This development release of SmartOS comes in three flavors:

Once you boot this version of SmartOS, you should be good to go. As an example, I’ll show how I set up three individual hosts, which we’ll call zlan, zlan2, and zlan3. I put the JSON file shown above at /var/tmp/hosts.json on each host. On the host zlan I ran the following:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip= -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:00 -l overlay0 foo0
# ifconfig foo0 plumb up
# ifconfig foo0 inet6 plumb up
# ifconfig foo0 inet6 fe80::3

On the host zlan2 I ran the following:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip= -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:01 -l overlay0 bar0
# ifconfig bar0 plumb up
# ifconfig bar0 inet6 plumb up
# ifconfig bar0 inet6 fe80::4

And finally on the host zlan3, I ran the following:

# dladm create-overlay -v 23 -e vxlan -s files -p vxlan/listen_ip= -p files/config=/var/tmp/hosts.json overlay0
# dladm create-vnic -m de:ad:be:ef:00:02 -l overlay0 baz0
# ifconfig baz0 plumb up
# ifconfig baz0 inet6 plumb up
# ifconfig baz0 inet6 fe80::5

With all that done, all three hosts could ping and access network services on each other.


The dynamic plugins allow us to start building and experimenting with something a bit more interesting than the point-to-point tunnel. From here, there isn’t much core functionality left to add, but there’s a lot of stability work and improvement needed throughout the stack. In addition, I’ll be experimenting with some more distributed systems to make the next dynamic plugin much more dynamic.

If you have any feedback, suggestions, or anything else, please let me know. You can find me on IRC (rmustacc in #smartos and #illumos on irc.freenode.net) or on the smartos-discuss mailing list. If you’d like to work on support for other encapsulation methods such as NVGRE or want to see how implementing a dynamic mapping service might be, reach out to me.

Posted on September 23, 2014 at 11:58 am by rm · Permalink · Comments Closed
In: Networking

illumos Overlay Networks Development Preview 01

At Joyent I’ve been spending my time designing and building support for network virtualization in the form of protocols like VXLAN. I’ve gotten far enough along that I’m happy to announce the first SmartOS developmental preview of this work. The goal of this is to just give something for folks to play around with and start getting a sense of what this looks like. If you have any feedback, please send it my way!

All the development of this is going on in its own branch of illumos-joyent: dev-overlay. You can see all of the developments, including a README that gives a bit of an introduction and background, on that branch here.

The development preview below is a debug build of illumos. This is not suitable for production use. There are bugs. Expect panics.

What’s in this release

This release adds the foundation for overlay devices and their management in user land. With this you can create and list point-to-point VXLAN tunnels and create vnics on top of them. This is all done through dladm. This release also includes the preliminary version of the varpd daemon which manages user land lookups and will be used for custom lookup mechanisms in the future.

However, there are known things that don’t work:

Getting Started

This development release comes in the standard SmartOS flavors:

Once you boot this version of the platform, you’ll find that most things look the same. You’ll find a new service has been created and should be online — varpd. You can verify this with the svcs command. Next, I’ll walk through an example of starting everything up, creating an overlay device, and a VNIC on top of that.

[root@00-0c-29-ca-c7-23 ~]# svcs varpd
STATE          STIME    FMRI
online         21:43:00 svc:/network/varpd:default
[root@00-0c-29-ca-c7-23 ~]# dladm create-overlay -e vxlan -s direct \
    -p vxlan/listen_ip= -p direct/dest_ip= \
    -p direct/dest_port=4789 -v 23 demo0
[root@00-0c-29-ca-c7-23 ~]# dladm show-overlay
demo0        mtu                 rw   -   0           --          --
demo0        vnetid              rw   -   23          --          --
demo0        encap               r-   -   vxlan       --          vxlan
demo0        varpd/id            r-   -   1           --          --
demo0        vxlan/listen_ip     rw   y --          --
demo0        vxlan/listen_port   rw   y   4789        4789        1-65535
demo0        direct/dest_ip      rw   y --          --
demo0        direct/dest_port    rw   y   4789        --          1-65535
[root@00-0c-29-ca-c7-23 ~]# dladm create-vnic -l demo0 foo0
[root@00-0c-29-ca-c7-23 ~]# ifconfig foo0 plumb up

Let’s take this apart. The first thing that we did is create an overlay device. The -e vxlan option tells us that we should use vxlan for encapsulation. Currently only VXLAN is supported. The -s direct specifies that we should use the direct or point-to-point module for determining where packets flow. In other words, there’s only a single destination.

Following this we set three required properties: vxlan/listen_ip, which tells the device what IP address to listen on; direct/dest_ip, which tells it which IP to send encapsulated packets to; and direct/dest_port, which says what UDP port to use. We didn’t set vxlan/listen_port because VXLAN defines a default port, 4789.

Finally, we specified a virtual network id with -v, in this case 23. And then we ended it all with a name.

After that, the overlay became visible in the dladm show-overlay output, which displayed everything that we wanted. You’ll want to take similar steps on another machine; just make sure to swap the IP addresses around.


This is just the tip of the iceberg here. There’s going to be a lot more functionality and a lot more improvements down the road. I’ll be doing additional development previews along the way.

If you have any feedback, suggestions, or anything else, please let me know. You can find me on IRC (rmustacc in #smartos and #illumos on irc.freenode.net) or on the smartos-discuss mailing list. If you’d like to work on support for other encapsulation methods such as NVGRE or want to see how implementing a dynamic mapping service might be, reach out to me.

Posted on July 25, 2014 at 6:08 pm by rm · Permalink · Comments Closed
In: Networking

DLPI and the IP Fastpath

The series so far

If you’re getting started you’ll want to see the previous entries on Project Bardiche:

The illumos Networking Stack

This blog post is going to dive into more detail about what the ‘fastpath’ is in illumos for networking, what it means, and a bit more about how it works. We’ll also go through and cover a bit more information about some of the additions we made as part of this project. Before we go too much further, let’s take another look at the picture of the networking stack from the entry on architecture of vnd:

             | libdlpi |  libvnd  | libsocket|
             |         ·          ·    VFS   |
             |   VFS   ·    VFS   +----------+
             |         ·          |  sockfs  |
             |         |    VND   |    IP    |
             |         +----------+----------+
             |            DLD/DLS            |
             |              MAC              |
             |             GLDv3             |

If you don’t remember what some of these components are, you might want to refresh your memory with the vnd architecture entry. Importantly, almost everything is layered on top of the DLD and DLS modules.

The illumos networking stack comes from a long lineage of technical work done at Sun Microsystems. Initially the networking stack was implemented using STREAMS. STREAMS is a message passing interface where message blocks (mblk_t) are sent from one module to the next. For example, there are modules for things like arp, tcp/ip, udp, etc. These are chained together and can be seen in mdb using the ::stream dcmd. Here’s an example from my development zone:

> ::walk dld_str_cache | ::print dld_str_t ds_rq | ::q2stream | ::stream

| 0xffffff0251050690    | 0xffffff0251050598    |
| udp                   | udp                   |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x20204022      | flg = 0x20244032      |
            |                       ^
            v                       |
| 0xffffff02510523f8    | 0xffffff0251052300    | if: net0
| ip                    | ip                    |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x00004022      | flg = 0x00004032      |
            |                       ^
            v                       |
| 0xffffff0250eda158    | 0xffffff0250eda060    |
| vnic                  | vnic                  |
|                       |                       |
| cnt = 0t0             | cnt = 0t0             |
| flg = 0x00244062      | flg = 0x00204032      |

If I sent a udp packet, it would first be processed by the udp streams module, then the ip streams module, and finally make its way to the DLD/DLS layer which is represented by the vnic entry here. The means of this communication was part of the DLPI. DLPI itself defines several different kinds of messages and responses which can be found in the illumos source code here. The general specification is available here, though there’s a lot more to it than is worth reading. In illumos, it’s been distilled down into libdlpi.

Recall from the vnd architecture entry that the way devices and drivers communicate with a datalink is by initially using STREAMS modules and by opening a device in /dev/net/. Each data link in the system is represented by a dls_link_t. When you open a device in /dev/net, you get a dld_str_t which is an instance of a STREAMS device.

The DLPI allows consumers to bind to what they call a SAP or a service attachment point. What this means depends on the kind of data link. In the case of Ethernet, this refers to the ethertype. In other words, a given dld_str_t can be bound to something like IP, ARP, LLDP, etc. If this were something other than Ethernet, then that name space would be different.

For a given data link, only one dld_str_t can be actively bound to a given SAP (ethertype) at a time. An active bind refers to something that is actively consuming and sending data. For example, when you create an IP interface using ifconfig or ipadm, that does an active bind. Another example of an active bind is a daemon used for LLDP. There are also passive binds, like in the case of something trying to capture packets like snoop or tcpdump. That allows something to capture the data without worrying about blocking someone from using that attachment point.
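
The one-active-bind-per-SAP rule can be sketched in a few lines of C. The dlink_t type and function names below are invented for illustration (this is not the actual dld_str_t machinery), and passive binds are not modeled:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * An illustrative sketch of SAP binding on Ethernet: the SAP name space is
 * the ethertype, and only one consumer may hold an active bind on a given
 * ethertype per data link at a time. All names here are hypothetical.
 */
#define	ETHERTYPE_IP	0x0800
#define	ETHERTYPE_ARP	0x0806
#define	MAX_ACTIVE_SAPS	8

typedef struct dlink {
	uint16_t	dl_saps[MAX_ACTIVE_SAPS];	/* active binds */
	size_t		dl_nsaps;
} dlink_t;

/* Extract the SAP (ethertype) from an untagged Ethernet header. */
uint16_t
ether_sap(const uint8_t *frame)
{
	/* bytes 12 and 13 hold the ethertype, in network byte order */
	return ((uint16_t)((frame[12] << 8) | frame[13]));
}

/*
 * Attempt an active bind. It fails if another consumer already holds an
 * active bind on the same SAP; passive consumers (snoop, tcpdump) would
 * bypass this check.
 */
int
sap_bind_active(dlink_t *dl, uint16_t sap)
{
	size_t i;

	for (i = 0; i < dl->dl_nsaps; i++) {
		if (dl->dl_saps[i] == sap)
			return (-1);	/* SAP already actively bound */
	}
	if (dl->dl_nsaps == MAX_ACTIVE_SAPS)
		return (-1);		/* no free slot */
	dl->dl_saps[dl->dl_nsaps++] = sap;
	return (0);
}
```

In this model, a second active bind to the IP ethertype fails while a bind to ARP still succeeds, which mirrors how plumbing a second IP consumer on the same link is refused.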

Speeding things up

While the fundamentals of the DLPI are sound, the implementation in STREAMS, particularly for sending data, left something to be desired. It greatly complicated locking and made it hard to get the performance needed for saturating 10 GbE networks with TCP traffic. For all the details on what happened here and a good background, I’ll refer you to Sunay Tripathi’s blog, where he covers a lot of what changed in Solaris 10 to fix this.

There are two parts to what folks generally call the ‘IP fastpath’: one part which we leverage for vnd, and another which is still firmly used by IP. We’ll touch on the first part, which eliminates the sending of STREAMS messages in favor of direct callbacks. Today this happens by negotiating with DLPI messages that discover the capabilities of devices and then enable them. The vnd driver does this, as does the ip driver. Specifically, you first send down a DL_CAPABILITY_REQ message. The response contains a list of capabilities that exist.

If the capability DL_CAPAB_DLD is returned, then you can enable direct function calls to the DLD and DLS layer. The returned values give you a function pointer, which you can use to do several things, and ultimately to request enabling DLD_CAPAB_DIRECT. When you make that call, you specify a function pointer for DLD to call directly when a packet is received. In return, you are given a series of functions to use for things like checking flow control and transmitting a packet. These functions allow the system to bypass the issues with STREAMS and transmit packets directly.
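
The shape of the result is easier to see in code. The following is a conceptual sketch only; the struct and function names are invented, and the real interface is negotiated via the DLPI messages named above. What matters is the shape: instead of passing a STREAMS message down the stack, each layer holds function pointers into the other and calls them directly.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical direct-call capability: the lower layer hands the consumer
 * function pointers (plus an opaque handle) instead of exchanging STREAMS
 * messages per packet.
 */
typedef struct direct_caps {
	/* called by the consumer to transmit a packet */
	int	(*dc_tx)(void *hdl, const void *pkt, size_t len);
	/* called by the consumer to check for flow control */
	int	(*dc_flow_blocked)(void *hdl);
	void	*dc_hdl;	/* opaque handle for the lower layer */
} direct_caps_t;

/* A toy lower layer that counts transmitted packets. */
typedef struct toy_dld {
	int	td_txcount;
	int	td_blocked;	/* nonzero while flow controlled */
} toy_dld_t;

static int
toy_tx(void *hdl, const void *pkt, size_t len)
{
	toy_dld_t *td = hdl;
	(void) pkt;
	(void) len;
	if (td->td_blocked)
		return (-1);	/* flow controlled: caller must back off */
	td->td_txcount++;
	return (0);
}

static int
toy_flow_blocked(void *hdl)
{
	return (((toy_dld_t *)hdl)->td_blocked);
}

/* "Enabling the capability" amounts to handing back the pointers. */
void
toy_enable_direct(toy_dld_t *td, direct_caps_t *caps)
{
	caps->dc_tx = toy_tx;
	caps->dc_flow_blocked = toy_flow_blocked;
	caps->dc_hdl = td;
}
```

Once enabled, the per-packet cost is a plain function call, with no message allocation or queueing between the layers.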

The second part of the ‘IP fastpath’ is something that primarily the IP module uses. In the IP module there is a notion of a neighbor cache entry or nce. This nce describes how to reach another host. When that host is found, the nce asks the lower layers of the stack to generate a layer two header that’s appropriate for this traffic. In the case where you have an Ethernet device, this means that it generates the MAC header including the source and destination mac addresses, ethertype, and vlan tags if there should be one. The IP stack then uses this pre-generated header each time rather than trying to create a new one from scratch for every packet. In addition, the IP module is subscribed to change events that get generated when something like a mac address changes, so that it can regenerate these headers when the administrator makes a change to the system.
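
The header-caching half can be sketched as well. This is an illustrative model, not the nce code: the header is built once when the neighbor is resolved, and per-packet work shrinks to a single copy of the cached bytes (VLAN tags are omitted here for brevity).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * A sketch of the pre-generated layer two header. The l2_cache_t type and
 * function names are invented for illustration.
 */
typedef struct l2_cache {
	uint8_t	lc_hdr[14];	/* untagged Ethernet header */
} l2_cache_t;

/* Built once, when the destination host is resolved. */
void
l2_cache_build(l2_cache_t *lc, const uint8_t dst[6], const uint8_t src[6],
    uint16_t ethertype)
{
	memcpy(&lc->lc_hdr[0], dst, 6);
	memcpy(&lc->lc_hdr[6], src, 6);
	lc->lc_hdr[12] = (uint8_t)(ethertype >> 8);	/* network byte order */
	lc->lc_hdr[13] = (uint8_t)(ethertype & 0xff);
}

/* Per-packet work is now just prepending the cached header. */
size_t
l2_cache_prepend(const l2_cache_t *lc, uint8_t *frame,
    const uint8_t *payload, size_t len)
{
	memcpy(frame, lc->lc_hdr, 14);
	memcpy(frame + 14, payload, len);
	return (14 + len);
}
```

If the administrator changes a MAC address, the cached header becomes stale, which is exactly why the IP module subscribes to change events and rebuilds these headers.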

New Additions

Finally, it’s worth taking a little bit of time to talk about the new DLPI additions that we added with project bardiche. We needed to solve two problems. Specifically:

  1. A consumer like vnd needs exclusive access to a data link across every SAP, not just a single ethertype.
  2. A consumer needs a way to receive all incoming packets without having its own transmitted packets looped back to it.

To solve the first case, we added a new request called a DL_EXCLUSIVE_REQ. This adds a new mode for the bind state of the dld_str_t. In addition to being active or passive, it can now be exclusive. This can only be requested if no one is actively using the device. If someone is, for example, an IP interface has already been created, then the DL_EXCLUSIVE_REQ will fail. The opposite is true as well, if someone is using the dld_str_t in exclusive mode, then the request to bind to the IP ethertype will also fail. This exclusive request lasts until the consumer closes the dld_str_t.

When a vnd device is created, it makes an explicit request for exclusive access to the device, because it needs to send and receive on all of the different ethertypes. If an IP interface is already active, it doesn’t make sense for a vnd device to be created there. Once the vnd device is destroyed, then anything can use the data link.

Solving our second problem was actually quite simple. The core logic to not loop back packets that were transmitted was already present in the MAC layer. To do that, we created a new promiscuous option that could be specified in the DLPI DL_PROMISCON_REQ called DL_PROMISC_RX_ONLY. Enabling this would pass along the flag MAC_PROMISC_FLAGS_NO_TX_LOOP down to the mac layer which actually does the heavy lifting of duplicating the necessary amount of packets.


This gives a rather rough introduction to the fastpath in the illumos networking stack. The devil, as always, is in the details.

In the next entries, we’ll go over the other new extensions that were added as part of this work: the sdev plugin interface and generalized serialization queues. Finally, we’ll finish with a recap and go over what’s next.

Posted on April 3, 2014 at 8:22 am by rm · Permalink · Comments Closed
In: Bardiche, Networking · Tagged with: ,

Project Bardiche: Framed I/O

The series so far

If you’re getting started you’ll want to see the previous entries on Project Bardiche:


Framed I/O is a new abstraction that we're currently experimenting with through Project Bardiche. We call it framed I/O because the core concept is the frame: a discrete, variable amount of data with a maximum size. In this article, we'll call data that fits this model framed. Ethernet devices, for example, work exactly this way: they have a maximum size based on their MTU, but there may well be less data available than the maximum. There are a few overarching goals that led us down this path:

The primary use case of framed I/O is for vnd and virtual machines. However, we believe that the properties here make it desirable to other portions of the stack which operate in terms of frames. To understand why we’re evaluating this abstraction, it’s worth talking about the other existing abstractions in the system.

read(2) and write(2)

Let’s start with the traditional and most familiar series of I/O interfaces: read(2), write(2), readv(2), and writev(2). These are the standard I/O system calls that most C programmers are familiar with. read(2) and write(2) originated in first edition UNIX. readv(2) and writev(2) supposedly came about during the development of 4.2 BSD. The read and write routines operate on streams of data. The callers and file descriptors have no inherent notion of data being framed and all framing has to be built into consumption of the data. For a lot of use cases, that is the correct abstraction.

The readv(2) and writev(2) interfaces allowed that stream to be vectored. It's hard to say whether these were the first vectored I/O abstractions in operating systems, but they are certainly among the most popular ones from early systems that are still around. Where read(2) and write(2) map a stream to a single buffer, these calls map a stream to a series of arbitrarily sized vectors. The act of vectorizing data is not uncommon and can be very useful. Generally, this is done when combining multiple elements into one discrete stream for transfer. For example, if a program maintains one buffer for a protocol header and another for the payload, being able to specify a vector that includes both of them in one call can be quite useful.

When operating with framed data, these interfaces fall a bit short. The problem is that you've lost information that the system had regarding the framing. It may be that the protocol itself includes the delineations, but there's no guarantee that that data is correct. For example, if you had a buffer of size 1500, read(2) would only give you the total number of bytes returned, not the number of frames. A return value of 1500 could be one large 1500-byte frame, five 300-byte frames, or anything in between.

getmsg(2) and putmsg(2)

The next set of APIs worth looking at are getmsg(2) and putmsg(2). These APIs are a bit different from the normal read(2) and write(2) APIs: they're designed around framed messages. These routines use a struct strbuf which has the following members:

    int    maxlen;      /* maximum buffer length */
    int    len;         /* length of data */
    char   *buf;        /* ptr to buffer */

These interfaces allow the consumer to properly express the maximum size of the frame that they expect and the amount of data that the given frame actually includes. This is very useful for framed data. Unfortunately, this API has some deficiencies: it has no ability to break a frame down into vectors, nor any means of working with multiple messages at a time.

sendmsg(2) and recvmsg(2)

The next set of APIs that I explored were the sendmsg(2) and recvmsg(2) family, particularly the extensions that were introduced into the Linux kernel via sendmmsg(2) and recvmmsg(2). The general design of the msghdr structure is good, though it is understandably designed around the socket interface. Unfortunately, something like sendmsg(2) is not available to device drivers in most systems; it currently only works on sockets, and a lot of things don't look like sockets. Things like ancillary data and the optional addresses are not as useful and don't have meaning for other styles of messages, or if they do, they may not fit the abstraction that's been defined there.

Framed I/O

Based on our evaluations with the above APIs, a few of us chatted around Joyent's San Francisco office and tried to come up with something that might have the properties we felt made more sense for something like KVM networking. To help distinguish it from traditional Socket semantics or STREAMS semantics, we named it after the basic building block of the frame. The general structure is called a frameio_t which itself has a series of vector structures called a framevec_t. The structures roughly look like:

typedef struct framevec {
    void    *fv_buf;        /* Buffer with data */
    size_t  fv_buflen;      /* Size of the buffer */
    size_t  fv_actlen;      /* Amount of buffer consumed, ignore on error */
} framevec_t;

typedef struct frameio {
    uint_t  fio_version;    /* Should always be FRAMEIO_CURRENT_VERSION */
    uint_t  fio_nvpf;       /* How many vectors make up one frame */
    uint_t  fio_nvecs;      /* The total number of vectors */
    framevec_t fio_vecs[];  /* C99 VLA */
} frameio_t;

The idea here is that, much like a struct strbuf, each vector component has a notion of its maximum size and the actual amount of data consumed. These vectors can then be assembled into series of frames in multiple ways through the fio_nvpf and fio_nvecs members. The fio_nvecs field describes the total number of vectors and fio_nvpf describes how many vectors make up a frame. You might think of the fio_nvpf member as describing how many iovec structures make up a single frame.

Consider that you have four vectors to play with; you might rig them up in one of several ways. You might map each message to a single vector, meaning that you could read four messages at once. You might want the opposite and map a single message across all four vectors; in that case you'd only ever read one message at a time, broken into four components. You could also always break a message into two vectors, which means you'd be able to read two messages at a time. The following ASCII art might help.

1:1 Vector to Frame mapping

 +-------+  +-------+  +-------+  +-------+
 | msg 0 |  | msg 1 |  | msg 2 |  | msg 3 |
 +-------+  +-------+  +-------+  +-------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

4:1 Vector to Frame mapping

 +----------------------------------------+
 |                  msg 0                 |
 +----------------------------------------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

2:1 Vector to Frame Mapping

 +------------------+  +------------------+
 |       msg 0      |  |       msg 1      |
 +------------------+  +------------------+
    ||         ||         ||         ||
 +-------+  +-------+  +-------+  +-------+
 | vec 0 |  | vec 1 |  | vec 2 |  | vec 3 |
 +-------+  +-------+  +-------+  +-------+

Currently the maximum number of vectors allowed in a given call is limited to 32. As long as the number of vectors per frame evenly divides the total number of vectors, any combination is allowed.
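
The sizing rule above is small enough to state directly in code. This is a sketch with an invented function name, not the kernel's validation routine:

```c
#include <assert.h>

#define	FRAMEIO_MAX_VECS	32	/* current per-call vector limit */

/*
 * Validate a (fio_nvecs, fio_nvpf) combination: fio_nvpf vectors make up
 * one frame, so it must evenly divide fio_nvecs, and at most 32 vectors
 * may be passed per call. Returns the number of frames the call can carry,
 * or -1 if the combination is invalid. The function name is hypothetical.
 */
int
frameio_nframes(unsigned int nvecs, unsigned int nvpf)
{
	if (nvecs == 0 || nvpf == 0 || nvecs > FRAMEIO_MAX_VECS)
		return (-1);
	if (nvecs % nvpf != 0)
		return (-1);
	return ((int)(nvecs / nvpf));
}
```

For the four-vector examples above, this gives four frames for a 1:1 mapping, two for 2:1, and one for 4:1, while a 3:1 mapping of four vectors is rejected.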

By combining these two degrees of freedom, we believe this will be a useful abstraction, suitable for other parts of the system that operate on framed data, for example a USB stack. Because the design does not constrain the content of the vectors, it would also be possible to replicate something like struct msghdr, with the protocol header data placed in the first vector.

Framed I/O now and in the Future

Today we've started plumbing this through in QEMU, adapting its network device backend APIs, which operate on traditional iovecs. However, there's a lot more that can be done with this. For example, one of the things on our minds is writing a vhost-net style driver for illumos that can easily map data between the framed I/O representation and the virtio driver. With this, it'd even be possible to do something that's mostly zero-copy as well. Alternatively, we may also explore redoing a lot of QEMU's internal networking paths to make them friendlier for sending and receiving multiple packets at once. That should certainly help with the overhead of networking I/O in virtual machines today.

We think that this might fit in other parts of the system as well; for example, it may make sense as the unit in which the illumos USB3 stack sends data. Whether it ends up as anything more than the vnd device's style of I/O, time will tell.

Today vnd devices are exposed through libvnd via the vnd_frameio_read(3VND) and vnd_frameio_write(3VND) interfaces, so they can also be used by anyone developing their own services on top of vnd, for example user land switches, firewalls, etc.

Next in the Bardiche Series

Next in the bardiche series, we'll be delving into some of the additional kernel subsystems and new DLPI abstractions that were created. Following those, we'll end with a recap entry on bardiche as a whole and what may come next.

Posted on March 25, 2014 at 8:57 am by rm · Permalink · Comments Closed
In: Bardiche, Networking · Tagged with: ,

Project Bardiche: vnd Architecture

My previous entry introduced Project Bardiche, a project which revamps how we do networking for KVM guests. This entry focuses on the design and architecture of the vnd driver and how it fits into the broader networking stack.

The illumos networking stack

The illumos networking stack is broken into several discrete pieces, which are summarized in the following diagram:

             | libdlpi |  libvnd  | libsocket|
             |         ·          ·    VFS   |
             |   VFS   ·    VFS   +----------+
             |         ·          |  sockfs  |
             |         |    VND   |    IP    |
             |         +----------+----------+
             |            DLD/DLS            |
             |              MAC              |
             |             GLDv3             |

At the top of the diagram are the common interfaces to user land. The first and most familiar is libsocket. That contains all of the common interfaces that C programmers are used to seeing: connect(3SOCKET), accept(3SOCKET), etc. On the other hand, there’s libdlpi, which provides an alternate means for interacting with networking devices. It is often used by software like DHCP servers and for LLDP. libvnd is new and a part of project bardiche. For now, we’ll stick to describing the other two paths.

Next, operations transit through the virtual file system (VFS) to reach their next destination. For most folks, that’ll be the illumos file system sockfs. The sockfs file system provides a means of translating between the gory details of TCP, UDP, IP, and friends, and the actual sockets that they rely on. The next step for such sockets is what folks traditionally think of as the TCP/IP stack. This encompasses everything related to the actual protocol processing. For example, connecting a socket and going through the TCP dance of the SYN and SYN/ACK is all handled by the logic in the TCP/IP stack. In illumos, both TCP and IP are implemented in the same kernel module called IP.

The next layer is comprised of two kernel modules which work together called DLD and DLS. DLD is the data-link driver and DLS is the data-link services module. The two modules work together. Every data link in illumos, whether it’s a virtual nic or physical nic, is modeled as a dld device. When you open something like /dev/net/igb0, that’s an instance of a DLD device. These devices provide an implementation of all of the DLPI (Data-link Provider Interface) STREAMS messages and are used to negotiate the fast path. We’ll go into more detail about that in a future entry.

Everything transits out of DLD and DLS and enters the MAC layer. The MAC layer handles interfacing with the actual device drivers, programming unicast and multicast addresses into devices, controlling whether or not the devices are in promiscuous mode, etc. The final layer is the Generic LAN Device version three (GLDv3) device driver interface. GLDv3 is a standard interface for networking device drivers and represents the set of entry points that the operating system expects to use with them.

vnd devices

A vnd device is created on top of a data link similar to how an IP interface is created on top of a data link. Once a vnd device has been created, it can be used to read and write layer two frames. In addition, a vnd device can optionally be linked into the file system name space allowing others to open the device.

Similar to /dev/net, vnd devices show up under /dev/vnd. A control node is always created at /dev/vnd/ctl. This control node is referred to as a self-cloning device. That means that any time the device is opened, a new instance of the device is created. Once the control node has been opened, it is associated with a data link and then it is bound into the file system name space with some name that usually is identical to the name of the data link. After the device has been bound, it then shows up in /dev/vnd. If a vnd device was named net0 then it would show up as /dev/vnd/net0. Just as /dev/net displays all of the data links in the various zones under /dev/net/zone, the same is true for vnd. The vnd devices in any zone are all located under /dev/vnd/zone and follow the pattern /dev/vnd/zone/%zonename/%vnddevice. These devices are never directly manipulated. Instead, they are used by libvnd and vndadm.

Once a vnd device has been created and bound into the name space, it will persist until it is removed with either vndadm or libvnd, or the zone it is present in is halted. The removal of vnd devices from the name space is similar to calling unlink(2) on a file. If any process has the vnd device open after it has been removed from the name space, it will persist until all open handles have been closed.

If a data link already has an IP interface or is being actively used for any other purpose, a vnd device cannot be created on top of it, and vice versa. Because vnd devices operate at layer two, it doesn’t make sense to create one on a link where various folks are already consuming layer three. The opposite also holds.

The command vndadm was written to manipulate vnd devices. It’s worth stepping through some basic examples of using the command. Even more examples can be found in its manual page. With that, let’s get started and create a vnic and then a device. Substitute your physical link for anything you prefer.

# dladm create-vnic -l e1000g0 vnd0
# vndadm create vnd0
# ls /dev/vnd
ctl   vnd0  zone

With that, we’ve created a device. Next, we can use vndadm to list devices as well as get and set properties.

# vndadm list
NAME             DATALINK         ZONENAME
vnd0             vnd0             global
# vndadm get vnd0
LINK          PROPERTY         PERM  VALUE
vnd0          rxbuf            rw    65536
vnd0          txbuf            rw    65536
vnd0          maxsize          r-    4194304
vnd0          mintu            r-    0
vnd0          maxtu            r-    1518
# vndadm set vnd0 txbuf=2M
# vndadm get vnd0 txbuf
LINK          PROPERTY         PERM  VALUE
vnd0          txbuf            rw    2097152

You’ll note that there are two properties that we can set: rxbuf and txbuf. These are the sizes of the buffers that an instance of a vnd device maintains. As frames come in, they are put into the receive buffer, where they sit until they are read by someone, usually a KVM guest. If an incoming frame would exceed the size of that buffer, it is dropped instead. The transmit buffer controls the total amount of outstanding data that can exist in the vnd subsystem at any given time. The vnd device has to keep track of this to deal with cases like flow control.

Finally, we can go ahead and remove the device via:

# vndadm destroy vnd0

While not shown here, all of these commands can operate on devices that are in another zone, provided the user is in the global zone. To get statistics about device throughput and packet drops, you can use the command vndstat. Here’s a brief example:

$ vndstat 1 5
 name |   rx B/s |   tx B/s | drops txfc | zone
 net0 | 1.45MB/s | 14.1KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.50MB/s | 19.5KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 2.83MB/s | 30.8KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.08MB/s | 30.6KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525
 net0 | 3.21MB/s | 30.6KB/s |     0    0 | 1b7155a4-aef9-e7f0-d33c-9705e4b8b525

The drops column sums up the total number of drops while the txfc column shows the number of times that the device has been flow controlled during that period.

Programmatic Use

So far, I’ve demonstrated the use of the user commands. For most applications, you’ll want to use the fully featured C library, libvnd. The introductory manual page is the place to get started; it points to the rest of the functions, which can all be found in manual section 3VND. Please keep in mind that until the library makes its way into illumos, portions of the API may still change and should not be considered stable.

Peeking under the hood

So far we’ve talked about how you can use these devices; now let’s go under the hood and talk about how this is constructed. For the full gory details, you should turn to the vnd big theory statement. There are multiple components that make up the general architecture of the vnd sub-system, though only the character devices are directly visible to consumers. The following bit of ASCII art from the big theory statement describes the general architecture:

+----------------+     +-----------------+
| global         |     | global          |
| device list    |     | netstack list   |
| vnd_dev_list   |     | vnd_nsd_list    |
+----------------+     +-----------------+
    |                    |
    |                    v
    |    +-------------------+      +-------------------+
    |    | per-netstack data | ---> | per-netstack data | --> ...
    |    | vnd_pnsd_t        |      | vnd_pnsd_t        |
    |    |                   |      +-------------------+
    |    |                   |
    |    | nestackid_t    ---+----> Netstack ID
    |    | vnd_pnsd_flags_t -+----> Status flags
    |    | zoneid_t       ---+----> Zone ID for this netstack
    |    | hook_family_t  ---+----> VND IPv4 Hooks
    |    | hook_family_t  ---+----> VND IPv6 Hooks
    |    | list_t ----+      |
    |    +------------+------+
    |                 |
    |                 v
    |           +------------------+       +------------------+
    |           | character device |  ---> | character device | -> ...
    +---------->| vnd_dev_t        |       | vnd_dev_t        |
                |                  |       +------------------+
                |                  |
                | minor_t       ---+--> device minor number
                | ldi_handle_t  ---+--> handle to /dev/net/%datalink
                | vnd_dev_flags_t -+--> device flags, non blocking, etc.
                | char[]        ---+--> name if linked
                | vnd_str_t * -+   |
                +--------------+---+
                               |
                               v
        +-------------------------+
        | STREAMS device          |
        | vnd_str_t               |
        |                         |
        | vnd_str_state_t      ---+---> State machine state
        | gsqueue_t *          ---+---> mblk_t Serialization queue
        | vnd_str_stat_t       ---+---> per-device kstats
        | vnd_str_capab_t      ---+----------------------------+
        | vnd_data_queue_t ---+   |                            |
        | vnd_data_queue_t -+ |   |                            v
        +-------------------+-+---+                  +---------------------+
                            | |                      | Stream capabilities |
                            | |                      | vnd_str_capab_t     |
                            | |                      |                     |
                            | |    supported caps <--+-- vnd_capab_flags_t |
                            | |    dld cap handle <--+-- void *            |
                            | |    direct tx func <--+-- vnd_dld_tx_t      |
                            | |                      +---------------------+
                            | |
           +----------------+ +-------------+
           |                                |
           v                                v
+-------------------+                  +-------------------+
| Read data queue   |                  | Write data queue  |
| vnd_data_queue_t  |                  | vnd_data_queue_t  |
|                   |                  |                   |
| size_t        ----+--> Current size  | size_t        ----+--> Current size
| size_t        ----+--> Max size      | size_t        ----+--> Max size
| mblk_t *      ----+--> Queue head    | mblk_t *      ----+--> Queue head
| mblk_t *      ----+--> Queue tail    | mblk_t *      ----+--> Queue tail
+-------------------+                  +-------------------+

At a high level, there are three core components: a per-netstack data structure, a character device, and a STREAMS device.

A netstack, or networking stack, is a concept in illumos that contains an independent set of networking information. This includes TCP/IP state, routing tables, tunables, etc. Every zone in SmartOS has its own netstack, which allows zones to more fully control and interface with networking. In addition, the system has a series of IP hooks which are used by things like ipfilter and ipd to manipulate packets. When the vnd kernel module is first loaded, it registers with the netstack sub-system, which allows it to create its per-netstack data. In addition to hooking, the per-netstack data is used to make sure that when a zone is halted, all of the associated vnd devices are torn down.

The character device is the interface between consumers and the system. The vnd module is actually a self-cloning device. Whenever a library handle is created, it first opens the control node, /dev/vnd/ctl. The act of opening it creates a clone of the device with a new minor number. When an existing vnd device is opened, no cloning takes place; one of the existing character devices is simply opened.

The major magic happens when a vnd character device is asked to associate with a data link. This happens through an ioctl that the library wraps up and takes care of. When the device is associated, the kernel itself does what we call a layered open – it opens and holds another character or block device. In this case the vnd module does a layered open of the data link. However, the devices that back data links are still STREAMS devices that speak DLPI. To take care of dealing with all of the DLPI messages and set up the normal fast path, we use the third core component: the vnd STREAMS device.

The vnd STREAMS device is fairly special: it cannot be used outside of the kernel and is an implementation detail of the vnd driver. After doing the layered open, the vnd STREAMS device is pushed onto the stream head and begins exchanging DLPI messages to set up and configure the data link. Once it has successfully walked through its state machine, the device is fully plumbed and ready to go. As part of doing that, it asks for exclusive access to the device, which enables it to receive all of the packets originally destined for the data link, and it enables direct function calls through what’s commonly referred to as the fastpath. Once that’s set up, the character device and STREAMS device wire up with one another, and once that’s all finished successfully, the character device can be fully initialized.

At this point in time, the device can be fully used for reading and writing packets. It can optionally be bound into the file system name space. That binding is facilitated by the sdev file system and its new plugin interface. We’ll go into more detail about that in a future entry.

The STREAMS device contains a lot of the meat for dealing with data. It contains the data queues and it controls all the interactions with DLD/DLS and the fastpath. In addition, it also knows about its gsqueue (generic serialization queue). The gsqueue is used to ensure that we properly handle the order of transmitted packets, especially when subject to flow control.

The following two diagrams (from the big theory statement) describe the path that data takes when received and when transmitted.

Receive path

                                 |
                                 * . . . packets from gld
                                 |
                                 v
                          +-------------+
                          |     mac     |
                          +-------------+
                                 |
                                 v
                          +-------------+
                          |     dld     |
                          +-------------+
                                 |
                                 * . . . dld direct callback
                                 |
                                 v
                         +---------------+
                         | vnd_mac_input |
                         +---------------+
                                 |
                                 v
  +---------+             +-------------+
  | dropped |<--*---------|  vnd_hooks  |
  |   by    |   .         +-------------+
  |  hooks  |   . drop probe     |
  +---------+     kstat bump     * . . . Do we have free
                                 |         buffer space?
                           no .  |      . yes
                              .  +      .
                          |                     |
                          * . . drop probe      * . . recv probe
                          |     kstat bump      |     kstat bump
                          v                     |
                       +---------+              * . . fire pollin
                       | freemsg |              v
                       +---------+   +-----------------------+
                                     | vnd_str_t`vns_dq_read |
                                              ^ ^
                              +----------+    | |     +---------+
                              | read(9E) |-->-+ +--<--| frameio |
                              +----------+            +---------+

Transmit path

  +-----------+   +--------------+       +-------------------------+   +------+
  | write(9E) |-->| Space in the |--*--->| gsqueue_enter_one()     |-->| Done |
  | frameio   |   | write queue? |  .    | +->vnd_squeue_tx_append |   +------+
  +-----------+   +--------------+  .    +-------------------------+
                          |   ^     .
                          |   |     . reserve space           from gsqueue
                          |   |                                   |
             queue  . . . *   |       space                       v
              full        |   * . . . avail          +------------------------+
                          v   |                      | vnd_squeue_tx_append() |
  +--------+          +------------+                 +------------------------+
  | EAGAIN |<--*------| Non-block? |<-+                           |
  +--------+   .      +------------+  |                           v
               . yes             v    |     wait          +--------------+
                           no . .*    * . . for           | append chain |
                                 +----+     space         | to outgoing  |
                                                          |  mblk chain  |
    from gsqueue                                          +--------------+
        |                                                        |
        |      +-------------------------------------------------+
        |      |
        |      |                            yes . . .
        v      v                                    .
   +-----------------------+    +--------------+    .     +------+
   | vnd_squeue_tx_drain() |--->| mac blocked? |----*---->| Done |
   +-----------------------+    +--------------+          +------+
                                        |                     |
      |                                 |           tx        |
      |                          no . . *           queue . . *
      | flow controlled .               |           empty     * . fire pollout
      |                 .               v                     |   if mblk_t's
    +-------------+     .      +---------------------+        |   sent
    | set blocked |<----*------| vnd_squeue_tx_one() |--------^-------+
    | flags       |            +---------------------+                |
    +-------------+    More data       |    |      |      More data   |
                       and limit       ^    v      * . .  and limit   ^
                       not reached . . *    |      |      reached     |
                                       +----+      |                  |
                                                   v                  |
    +----------+          +-------------+    +---------------------------+
    | mac flow |--------->| remove mac  |--->| gsqueue_enter_one() with  |
    | control  |          | block flags |    | vnd_squeue_tx_drain() and |
    | callback |          +-------------+    | GSQUEUE_FILL flag, iff    |
    +----------+                             | not already scheduled     |

Wrapping up

This entry introduces the tooling around vnd and provides a high level overview of the different components that make up the vnd module. In the next entry in the series on Bardiche, we’ll cover the new framed I/O abstraction. Entries following that will cover the new DLPI extensions, the sdev plugin interface, generalized squeues, and finally the road ahead.

Posted on March 24, 2014 at 8:31 am by rm · Permalink · Comments Closed
In: Bardiche, Networking

Project Bardiche: Introduction

I just recently landed Project Bardiche in SmartOS. The goal of Bardiche has been to create a more streamlined data path for layer two networking in illumos. While the primary motivator for this was KVM guests, it’s opened up a lot of room for more than just virtual machines. The bulk of this project is comprised of changes to illumos-joyent; however, there were some minor changes made to smartos-live, illumos-kvm, and illumos-kvm-cmd.

Project Highlights

Before we delve into a lot more of the specifics in the implementation, let’s take a high level view of what this project has brought to the system. Several of these topics will have their own follow up blog entries.

There’s quite a bit there. The rest of this entry will go into detail on the motivation for this work and a bit more on the new /dev/net/zone, libdlpi, and snoop features. Subsequent entries will cover the new vnd architecture and the new DLPI primitives, the new gsqueue interface, shine a bit more light on what gets referred to as the fastpath, and cover the new sdev plugin architecture.


Project Bardiche started from someone asking what it would take to allow a hypervisor-based firewall to filter packets being sent from a KVM guest. We wanted the hypervisor to provide the firewall because of the following challenges associated with managing a firewall running in the guest.

While it’s true that practically all the guests you would run under hardware virtualization have their own firewall software, they’re rarely the same. If we wanted to leverage the firewall built into the guest, we’d need to build an agent that lived in each guest. Not only would we have to write one of these for every type of guest we wanted to manage, but given that customers are generally the super-user in their virtual machine (VM), they’d be able to simply kill the agent or change the rules, defeating the purpose of the API.

While dealing with this, there were several other deficiencies in how networking worked for KVM guests, based on how QEMU, the program that actually runs the VM, interacted with the host networking. For each Ethernet device in the guest, there was a corresponding virtual NIC in the host. The two were joined with the vnic back end in QEMU, which originally used libdlpi to bind them. While this worked, there were some problems with it.

Because we had to put the device in promiscuous mode, there was no way to tell it not to send back traffic that came from ourselves. In addition to just being a waste of cycles, this causes duplicate address detection, often performed with IPv6, to fail for many systems.

In addition, the dlpi interfaces had no means of reading or writing multiple packets at a time. A lot of these issues stem from the history of the illumos networking stack. When it was first implemented, it was done using STREAMS. Over time, that has changed. In Solaris 10 the entire networking stack was revamped by a project called Fire Engine. That project, among many others, transitioned the stack from a message passing interface to one that used a series of direct calls and serialization queues (squeues). Unfortunately, the means by which we were using libdlpi left us still using STREAMS.

While exploring the different options and means to interface with the existing firewall, we eventually realized that we needed to create a new interface that solved this and the related problems we had, as well as laid the foundation for a lot of the work that we’d still like to do.

First Stop: Observability Improvements

When I first started this project, I knew that I was going to have to spend a lot of time debugging. As such, I knew that I needed to solve one of the more frustrating aspects of working with KVM networking: the ability to snoop and capture traffic. At Joyent, we always run a KVM instance inside of a zone. This gives us all the benefits of zones: the associated security and resource controls.

However, before this project, data links that belonged to zones were not accessible from the global zone. Because of the design of the KVM branded zone, the only process running is QEMU and you cannot log in, which makes it very hard to pass the data link to snoop or tcpdump. This setup does not make it impossible to debug. One can use DTrace or snoop on a device in the global zone; however, both of those end up requiring a bit more work or filtering.

The solution to this is to allow the global zone to see the data links for all devices across all zones under /dev/net and then enhance the associated libraries and commands to support accessing the new devices. If you’re in the global zone, there is now a new directory called /dev/net/zone. Don’t worry, this new directory can’t break you, as all data links in the system need to end with a number. On my development virtual machine which has a single zone with a single vnic named net0, you’d see the following:

[root@00-0c-29-37-80-28 ~]# find /dev/net | sort

Just as you always have in SmartOS, you’ll still see the data links for your zone at the top level in /dev/net, e.g. /dev/net/e1000g0 and /dev/net/vmwarebr0. Next, each of the zones on the system, in this case the global zone and the non-global zone named 79809c3b-6c21-4eee-ba85-b524bcecfdb8, show up in /dev/net/zone. Inside of each of those directories are the data links that live in that zone.

The next part of this was exposing this functionality in libdlpi and then using that in snoop. For the moment, I added a private interface called dlpi_open_zone. It’s similar to dlpi_open except that it takes an extra argument for the zone name. Once this change gets up to illumos it’ll then become a public interface that you can and should use. You can view the manual page online here or if you’re on a newer SmartOS box you can run man dlpi_open_zone to see the documentation.

The use of this made its way into snoop in the form of a new option: -z zonename. Specifying a zone with -z causes snoop to use dlpi_open_zone, which will try to open the data link specified via the -d option from that zone. So if we wanted to watch all the ICMP traffic over the data link that the KVM guest used, we could run:

# snoop -z 79809c3b-6c21-4eee-ba85-b524bcecfdb8 -d net0 icmp

With this, it should now be easier, as an administrator of multiple zones, to observe what’s going on across multiple zones without having to log into them.


There are numerous people who helped this project along the way. The entire Joyent engineering team helped, from the early days of bouncing ideas about APIs and interfaces all the way through to the final pieces of review. Dan McDonald, Sebastien Roy, and Richard Lowe all helped review the various write ups and helped deal with various bits of arcana in the networking stack. Finally, the broader SmartOS community helped go through and provide additional alpha and beta testing.

What’s Next

In the following entries, we’ll take a tour of the new sub-systems ranging from the vnd architecture and framed I/O abstraction through the sdev interfaces.

Posted on March 20, 2014 at 4:01 pm by rm · Permalink · Comments Closed
In: Bardiche, Networking

Userland CTF in DTrace

We at Joyent use DTrace for understanding and debugging userland applications just as often as we do for the kernel. That is part of the reason why we’ve worked on things like flamegraphs, the Node.js ustack helper, and the integration of libusdt in node modules like restify and bunyan.

I’ve just put back some work that makes observing userland with DTrace a lot simpler and much more powerful. Before we meet the devil in the details, let’s start with an example:

$ cat startd.d
pid$target::start_instance:entry,
pid$target::stop_instance:entry
{
        printf("%s: %s\n", probefunc == "start_instance" ?
            "starting" : "stopping", stringof(args[1]->ri_i.i_fmri));
}
$ dtrace -qs startd.d -p $(pgrep -o startd)
stopping: svc:/system/intrd:default
stopping: svc:/system/cron:default
starting: svc:/system/cron:default

If you’re familiar with DTrace you might realize that this doesn’t really look like what you’re expecting! Hey Robert, won’t that script blow up without the copyin? Also where did the args[] come from with the pid provider?!

The answer to this minor mystery is that DTrace can now leverage type information in userland programs. Not only does the compiler know the size and layout of types, it’ll also take care of adding calls to copyin for you so you can dereference members without fear. To explain how we’ve managed all of this, we need to go into a bit of back story.

The Compiler is the Enemy

Since the beginning of programming, we’ve needed to be able to debug the programs that we’ve written. A large chunk of time has been spent on tooling to be able to understand and root cause these bugs whether working on a live system or doing a post-mortem analysis on something like a core dump.

Unfortunately, the compiler is in many ways our enemy. Its goal is to take whatever well commented and understandable code we might have and transform it not only into something that a processor can understand, but often times transforming it through optimization passes into something that no longer resembles what we originally wrote.

This problem isn’t limited to languages like C and C++. In fact, many of the same problems apply when you use any compiler, be it your current compile to JavaScript language of the day (emscripten and coffeescript) or something like lex and yacc.

Fortunately, both the compiler and the linker are just software. Shortly after we first hit this problem, they were modified to produce debugging information that could be encoded into the binaries they produced. Folks were even able to encode this kind of information in the original ‘a.out’ executable format that came around in first edition UNIX.

One of the first popular formats that was used was called stabs. It was used on many operating systems for many years and you can still convince compilers like gcc, clang, and suncc to generate it. Since then, DWARF has become the most popular and commonly used format. The initial origin of DWARF came from Bell Labs, but it was rather unpopular because the debugging data that it created was just too large. Since then DWARF has become more standardized and more compact than the first version of DWARF. However, it is a rather complicated format.

With all of these formats there is a trade-off between expressibility and size. If the debugging information takes up too much space, then people stop including it. Most available OS and package distributions do not incorporate debugging information. If you’re lucky, they separate that information into different packages. This means that when you’re debugging a problem in production, you very often don’t have the information you need. Even more frustrating, when this information is in a separate file and you’re trying to do post-mortem analysis, you need to track that file down and make sure you have the version of the debug information that corresponds to what you were running in production.

This situation is unsatisfying, but we have other options! Sun developed CTF in Solaris 9 with the intent of using it with mdb and eventually DTrace. In illumos, we put CTF data in every kernel module, library, and a majority of applications. We don’t store all the information that you might get in, say, DWARF, but we store what we’ve found over the years to be the most useful information.

CTF data includes the following pieces of information:

o The definitions of all types and structures
o The arguments and types of each function
o The types of function return values
o The types of global variables

All of the CTF data for a given library, binary, or kernel module is found inside of what we call a CTF Container. The CTF container is found as its own section in an ELF object. A simple way to see if something in question has CTF is to run elfdump(1). Here’s an example:

$ elfdump /lib/libc.so | grep SUNW_ctf
Section Header[37]:  sh_name: .SUNW_ctf

If a library or program does not have CTF data then the section won’t show up in the list and the grep should turn up empty.

CTF and DTrace

If you’ve ever wondered how it is that DTrace knows types when you use various providers, like args[] in fbt, the answer to that is CTF. When you run DTrace, it loads relevant CTF containers for the kernel. In fact, even the basic types that D provides, such as an int or types that you define in a D script, end up in a CTF container. Consider the following dtrace invocation:

# dtrace -l -v -n 'fbt::ip_input:entry'
   ID   PROVIDER            MODULE                          FUNCTION
37546        fbt                ip                          ip_input

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: ISA

        Argument Types
                args[0]: struct ill_s *
                args[1]: ill_rx_ring_t *
                args[2]: mblk_t *
                args[3]: struct mac_header_info_s *

The D compiler used its CTF data for the ip module to determine the arguments and their types. We can then run something like:

# dtrace -qn 'fbt::ip_input:entry{ print(*args[0]); exit(0) }'
struct ill_s {
    pfillinput_t ill_inputfn = ip`ill_input_short_v4
    ill_if_t *ill_ifptr = 0xffffff0148a27ab8
    queue_t *ill_rq = 0xffffff014b65ba60
    queue_t *ill_wq = 0xffffff014b65bb58
    int ill_error = 0
    ipif_t *ill_ipif = 0xffffff014b6fc460
    uint_t ill_ipif_up_count = 0x1
    uint_t ill_max_frag = 0x5dc
    uint_t ill_current_frag = 0x5dc
    uint_t ill_mtu = 0x5dc
    uint_t ill_mc_mtu = 0x5dc
    uint_t ill_metric = 0
    char *ill_name = 0xffffff014bc642c8

DTrace uses the CTF data for the struct ill_s to interpret all of the data and correlate it with the appropriate members.

Bringing it to Userland

While DTrace happily consumes all of the CTF data for the various kernel modules, up to now it simply ignored all of the CTF data in userland applications. With my changes, DTrace will now consume that CTF data for referenced processes. Let’s go back to the example that we opened this blog post with. If we list those probes verbosely we see:

# dtrace -l -v -n 'pid$target::stop_instance:entry,pid$target::start_instance:entry' -p $(pgrep -o startd)
   ID   PROVIDER            MODULE                          FUNCTION
62420       pid8        svc.startd                     stop_instance

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Types
                args[0]: userland scf_handle_t *
                args[1]: userland restarter_inst_t *
                args[2]: userland stop_cause_t

62419       pid8        svc.startd                    start_instance

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Types
                args[0]: userland scf_handle_t *
                args[1]: userland restarter_inst_t *
                args[2]: userland int32_t

Before this change, the Argument Types section would be empty. Because svc.startd has CTF data, DTrace was able to figure out the types of startd’s functions. Without these changes, you’d have to manually define the types of a scf_handle_t and a restarter_inst_t and cast the raw arguments to the correct type. If you ended up with a structure that has a lot of nested structures, defining all of them in D can quickly become turtles all the way down.

Look Ma, no copyin!

DTrace runs in a special context in the kernel, and often times DTrace requires you to think about what’s going on. Just as the kernel can’t access arbitrary user memory without copying it in, neither can DTrace. Consider the following classic one liner:

dtrace -n 'syscall::open:entry{ trace(copyinstr(arg0)); }'

You’ll note that we have to use copyinstr. That tells DTrace that we need to copy in the string from userland into the kernel in order to do something with it, whether that be an aggregation, printing it out, or saving it for some later action. This copyin isn’t limited to just strings. If you wanted to dereference some member of a structure, you’d have to either copy in the full structure, or manually determine the offset location of the member you care about.

At the previous illumos hackathon, Adam Leventhal had the idea of introducing a keyword into D, the DTrace language, that would tell the D compiler that a type is from userland. The D compiler would then take care of copying in the data automatically. Together we built a working prototype, with the keyword userland.

While certainly useful on its own, it really shines when combined with CTF data, as in the pid provider. The pid provider automatically applies the userland keyword to all of the types that are found in args[]. This allowed us to skip the copyin of intermediate structures and write a simple expression, e.g. in our initial example we are able to do something that looks like a normal dereference in C: args[1]->ri_i.i_fmri. Before this change, you would have had to do three copyins: one for args[1], one for ri_i, and a final one for the string i_fmri.

As an example of the kinds of scripts that motivated this, here’s a portion of a D script that I used to help debug part of an issue inside of QEMU and KVM:

$ cat event.d
/* After its mfence */
    self->arg = arg1;
    this->data = (char *)(copyin(self->arg + 0x28, 8));
    self->sign = *(uint16_t *)(this->data+0x2);


/self->trace && self->arg/
    this->data =  (char *)(copyin(self->arg + 0x28, 8));
    printf("%d notify signal index: 0x%04x notify? %d\n", timestamp,
        *(uint16_t *)(this->data + 0x2), arg1);

There are many parts of this script where I’ve had to manually encode structure offsets and structure sizes. In the larger script, I had to play games with looking at registers and the pid provider’s ability to instrument arbitrary assembly instructions. I for one am glad that I’ll have to write far fewer of these.

When you have no CTF

While no binary should be left behind, not everything has CTF data today. But the userland keyword can still be incredibly useful even without it. Whenever you’re making a cast, you can note that the type is a userland type with the userland keyword, and the D compiler will do all the heavy lifting from there.

Here’s an example from a program that has a traditional linked list, but doesn’t have any CTF data:

struct foo;

typedef struct foo {
        struct foo *foo_next;
        char *foo_value;
} foo_t;

        this->p = (userland foo_t *)arg0;

The nice thing with the userland keyword is that you don’t have to do any copyin or worry about figuring out structure sizes. The goal with all of this is to make it simpler and more intuitive to write D scripts.

Referring to types

As a part of all this, you can refer to types in arbitrary processes that are running on the system, as well as in the target. The syntax is designed to be flexible enough to allow you to specify not just the pid, but also the link map, library, and type name; you can also just specify the pid and the type that you want. While you can’t refer to macros inside these definitions, you can use the shorthand pid` to refer to the value of $target.

For example, say you wanted to refer to a glob_t, which on illumos is defined via typedef struct glob_t { ... } glob_t;. There are a lot of different ways that you can do that. The following are all equivalent:

dtrace -n 'BEGIN{ trace((pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((pid8`LM0`libc.so.1`glob_t *)0); }'

dtrace -n 'BEGIN{ trace((struct pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct pid8`LM0`libc.so.1`glob_t *)0); }'

All of these would also work with the userland keyword. The userland keyword interacts with structs a bit differently than one might expect, so let’s show all of our above examples with the userland keyword:

dtrace -n 'BEGIN{ trace((userland pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((userland pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((userland pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((userland pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((userland pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((userland pid8`LM0`libc.so.1`glob_t *)0); }'

dtrace -n 'BEGIN{ trace((struct userland pid`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct userland pid`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct userland pid`LM0`libc.so.1`glob_t *)0); }' -p 8
dtrace -n 'BEGIN{ trace((struct userland pid8`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct userland pid8`libc.so.1`glob_t *)0); }'
dtrace -n 'BEGIN{ trace((struct userland pid8`LM0`libc.so.1`glob_t *)0); }'

What’s next?

From here you can get started with userland CTF and the userland keyword in DTrace in current releases of SmartOS. They’ll be making their way to an illumos distribution near you some time soon.

Now that we have this, we’re starting to make plans for the future. One idea that Adam had is to make it easy to scope a structure definition to a target process’s data model.

Another big task that we have is to make it easy to get CTF into programs and ideally, make it invisible to anyone using any compiler tool chain on illumos!

With that, happy tracing!

Posted on November 14, 2013 at 11:54 am by rm · Permalink · Comments Closed
In: Miscellaneous

Per-thread caching in libumem

libumem was developed in 2001 by Jeff Bonwick and Jonathan Adams. While the Solaris implementation of malloc(3C) and free(3C) performed adequately for single-threaded applications, it did not scale. Drawing on the work that was done to extend the original kernel slab allocator, Jeff and Jonathan brought it to userland in the form of libumem. Since then, libumem has even been brought to other operating systems. libumem offers great performance for multi-threaded applications, though there are a few cases where it doesn’t quite measure up to libc and the allocators found in other operating systems, such as eglibc. The most common case for this is when you have short-lived small allocations, often less than 200 bytes.

What’s happening?

As part of our work with Voxer, Brendan and I were digging into some general performance problems they had uncovered. We distilled this down to a small Node.js benchmark that was calculating a lot of md5 sums. As part of narrowing down the problem, I eventually broke out one of Brendan’s flame graphs. Since we had a CPU-bound problem, this allowed us to easily visualize and understand what was going on. When you throw that under DTrace with jstack(), you get something that looks pretty similar to the following flame graph:

libc flamegraph

There are two main pieces to this flame graph. The first is performing the actual md5 operations. The second is creating and destroying the md5 objects and the initialization associated with that. In drilling down, we found that we were spending comparatively more time trying to handle the allocations. If you look at the flame graph in detail, you’ll see that when calling malloc and free we’re spending a lot of that time in the lock portions of libc. libc’s malloc has a global mutex. Using a simple DTrace script like dtrace -n 'pid$target::malloc:entry{ @[tid] = count(); }', we can verify that only one thread is calling into malloc, so we’re grabbing an uncontended lock. One’s next guess might be to try and run this with libumem to see if there is any difference. This gives us the following flame graph:

libumem flamegraph

You can easily spot all of the libumem-related allocations because they are a bit more like towers that consist of a series of three function calls: first to malloc(3C), then umem_alloc(3MALLOC), and finally umem_cache_alloc(3MALLOC). On top of that are additional stacks related to grabbing locks. In umem_cache_alloc there is still a lock that a thread has to grab. Unlike libc’s malloc, this lock is not a global lock. Each cache has a per-CPU lock, which, when combined with magazines, allows for highly parallel allocations. However, we’re only doing mallocs from one thread so this is an uncontested mutex lock. The key takeaway from this is that the uncontested mutex lock can still be problematic. This is also much trickier in userland where there is a lot more to deal with when grabbing a lock. Compare the kernel implementation with the userland implementation. One conclusion that you reach from this is that we should do something to get rid of the lock.

When getting rid of mutexes, one might first think of using atomics and trying to rewrite this to be lock-free. But, aside from the additional complexity that rewriting portions of this to be lock-free might induce, that doesn’t solve the fundamental problem: we don’t want to be synchronizing at all. Instead this suggests a different idea that other memory allocators have taken: adding a thread-local cache.

A brief aside: libumem and malloc design

As part of libumem’s design it creates a series of fixed-size caches which it uses to handle allocations. These caches are sized from 8 bytes to 128KB, with the difference between caches growing larger with the cache size. If a malloc comes in that is within the range of one of these caches then we use the cache. If the allocation is larger than 128KB then libumem uses a different means to allocate that. For the rest of this entry we’ll only talk about the allocations that are handled by one of these caches. For the full details of libumem, I strongly suggest you read the two papers on libumem and the original slab allocator.

Keeping track of your allocations

When you call malloc(3C) you pass in a size, but you don’t need to pass that size back into free(3C). You only need to pass in the buffer. malloc ends up doing work to handle this for you. malloc will actually allocate an extra eight-byte tag and prepend that to the buffer. So if you request 36 bytes, malloc will actually allocate 44 bytes from the system and return you a pointer that starts right after that tag. This tag encodes two pieces of information. The first piece is the size and the second piece is a tag that is encoded with the size. It uses the second field to help detect programs that erroneously write to memory. The structure that it prepends looks like:

typedef struct malloc_data {
	uint32_t malloc_size;
	uint32_t malloc_stat;
} malloc_data_t;

When you call free, libumem grabs that structure, reads the buffer size, and validates the tag. If everything checks out, it releases the entire buffer back to the appropriate cache. If it doesn’t check out, libumem aborts the program.
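As a sketch, the whole round trip looks something like the following C. This is a minimal illustration of the tag scheme, not libumem’s actual implementation, and the MALLOC_MAGIC constant is made up for the example:

```c
#include <stdlib.h>
#include <stdint.h>

typedef struct malloc_data {
	uint32_t malloc_size;
	uint32_t malloc_stat;	/* integrity tag derived from the size */
} malloc_data_t;

#define	MALLOC_MAGIC	0x3a10c000	/* hypothetical constant */

void *
tagged_malloc(size_t size)
{
	/* allocate room for the tag plus the caller's request */
	malloc_data_t *tag = malloc(sizeof (malloc_data_t) + size);

	if (tag == NULL)
		return (NULL);
	tag->malloc_size = (uint32_t)size;
	tag->malloc_stat = MALLOC_MAGIC ^ (uint32_t)size;
	return (tag + 1);	/* hand back the bytes just past the tag */
}

void
tagged_free(void *buf)
{
	malloc_data_t *tag = (malloc_data_t *)buf - 1;

	/* a corrupt tag means someone scribbled in front of the buffer */
	if ((tag->malloc_stat ^ tag->malloc_size) != MALLOC_MAGIC)
		abort();
	free(tag);
}
```

Note how free never needs a size argument: the size rides along in front of the buffer, at the cost of eight extra bytes per allocation.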

Per-Thread Caching: High-level

To speed up the allocation and free process, we’re going to change how malloc and free work. When a given thread calls free, instead of releasing the buffer directly back to the cache, we will instead store it with the thread. That way if the thread comes around and requests a buffer that would be satisfied by that cache, it can just take the one that it stored. By creating this per-thread cache, we have a lock-free and contention-free means of servicing allocations. We store these allocations as a linked list and use one list per cache. When we want to add a buffer to the list, we make it the new head. When we remove an entry from the list, we remove the head. If the head is set to NULL then we know that the list is empty. When the list is empty, we simply go ahead and use the normal allocation path. When a thread exits, all of the buffers in that thread’s cache are freed back to the underlying caches.
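The list manipulation described above can be sketched in a few lines of C (hypothetical names, assuming a single cache size; the real code lives in libumem and libc). The trick is that a free buffer’s own first word stores the next pointer, so the list needs no extra memory:

```c
#include <stddef.h>

typedef struct tcache {
	void *tc_head;		/* head of the linked list of free buffers */
	size_t tc_size;		/* bytes currently cached by this thread */
} tcache_t;

/* free path: push the buffer onto the front of the list */
static void
tcache_free(tcache_t *tc, void *buf, size_t bufsize)
{
	*(void **)buf = tc->tc_head;	/* buffer's first word is the link */
	tc->tc_head = buf;
	tc->tc_size += bufsize;
}

/* alloc path: pop the head, or return NULL to use the normal path */
static void *
tcache_alloc(tcache_t *tc, size_t bufsize)
{
	void *buf = tc->tc_head;

	if (buf == NULL)
		return (NULL);
	tc->tc_head = *(void **)buf;
	tc->tc_size -= bufsize;
	return (buf);
}
```

Because the tcache lives in per-thread storage, neither path takes a lock.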

We don’t want the per-thread cache to end up storing an unbounded amount of memory. That would end up appearing no different from a memory leak. Instead, we have two mechanisms in place to control this.

  1. A cap on the amount of memory each thread may cache.
  2. We only enable this for a subset of umem caches.

By default, each thread is capped at 1 MB of cache. This can be configured on a per-process basis using the UMEM_OPTIONS environment variable. Simply set perthread_cache=[size]. The size is in bytes and you can use the common K, M, G, and even T suffixes. We only support doing this for sixteen caches at this time, and we opt to make these the first sixteen caches. If you don’t tune the cache sizes, allocations up to 256 bytes for 32-bit applications and up to 448 bytes for 64-bit applications will be cached.

Finally, when a thread exits, all of the memory in that thread’s cache is released back to the underlying general purpose umem caches.

Another aside: Position Independent Code

Modern libraries are commonly built with Position Independent Code (PIC). The goal of building something PIC is that it can be loaded anywhere in the address space and no additional relocations will need to be performed. This means that all the offsets and addresses within a given library that are for the library itself are relative addresses. The means for doing this for amd64 programs is relatively straightforward. amd64 offers an addressing mode known as RIP-relative, where you specify an offset relative to the current instruction pointer, which is stored in the register %rip. 32-bit i386 programs don’t have RIP-relative addressing, so compilers have to use different tricks for relative addressing. One of the more common techniques is to use a call +0 instruction to establish a known address. Here are the disassemblies of a simple function which happens to call a global function pointer in a library, first as amd64 code and then as i386 code.

> testfunc::dis
testfunc:                         movq   +0x1bb39(%rip),%rax      <0x86230>
testfunc+7:                       pushq  %rbp
testfunc+8:                       movq   %rsp,%rbp
testfunc+0xb:                     call   *(%rax)
testfunc+0xd:                     leave
testfunc+0xe:                     ret
> testfunc::dis
testfunc:                         pushl  %ebp
testfunc+1:                       movl   %esp,%ebp
testfunc+3:                       pushl  %ebx
testfunc+4:                       subl   $0x10,%esp
testfunc+7:                       call   +0x0     <testfunc+0xc>
testfunc+0xc:                     popl   %ebx
testfunc+0xd:                     addl   $0x1a990,%ebx
testfunc+0x13:                    pushl  0x8(%ebp)
testfunc+0x16:                    movl   0x128(%ebx),%eax
testfunc+0x1c:                    call   *(%eax)
testfunc+0x1e:                    addl   $0x10,%esp
testfunc+0x21:                    movl   -0x4(%ebp),%ebx
testfunc+0x24:                    leave
testfunc+0x25:                    ret

Position independent code is still really quite useful; one just has to be aware that there is a small price to pay for it. In this case, we’re doing several more loads and stores. When working in intensely performance-sensitive code, those can really add up.

Per-Thread Caching: Implementation

The first problem that we needed to solve for per-thread caching was to figure out how we would store the data for the per-thread caches. While we could have gone with some of the functionality provided by the threading libraries (see threads(5)), that would end up sending us through the Procedure Linkage Table (PLT). Because we are cycle-bumming here, our goal is to minimize the number of such calls that we have to make. Instead, we’ve added some storage to the ulwp_t. The ulwp_t is libc’s per-thread data structure. It is the userland equivalent of the kthread_t. We extended the ulwp_t of each thread with the following structure:

typedef struct {
	size_t tm_size;
	void *tm_roots[16];
} tmem_t;

Each entry in the tm_roots array is the head of one of the linked lists that we use to store a set of similarly sized allocations. The tm_size field keeps track of how much data has been stored across all of the linked lists for this thread. Because these structures exist per-thread, there is no need for any synchronization.

Why we need to know about the size of the caches

The set of caches that libumem uses for allocations only exists once umem_init() has finished processing all of the UMEM_OPTIONS environment variables. One of the options can add additional caches and another one can remove caches. It is impossible to know what these caches are at compile time. Given that this is the case, you might ask why do we want to know the size of the caches that libumem creates? Why not just create our own set of sizes that we’re going to use for honoring other allocations?

Failing to use the size of the umem caches would not only cause us to use extra space, but it would also cause us to incur additional fragmentation. Our goal is to be able to reuse these allocations. We can’t have a bucket for every possible allocation size; that would grow quite unwieldy. Let’s say that we used the same bucketing scheme that libumem uses by default. Because we have no way of knowing what cache libumem honored something from, we instead have to assume that the returned allocation is the largest possible size we can use the buffer for. If we make a 65-byte allocation that actually comes from the 80-byte cache, we would instead bucket it in the thread’s 64-byte cache. Effectively, we have to always round an allocation down to the next cache. Items that could satisfy a 64-byte allocation would end up being items that were originally 64-79 bytes large. This is clearly wasteful of our memory.

If you look at the signature of umem_free(3MALLOC) you’ll see that it takes a size argument. This means that it is our responsibility to keep track of what the original size of the allocation was. Normally malloc and free wrap this up in the malloc tags, but since we are reusing buffers, we’d want to keep track of both the original size and the currently requested size when we reuse it. To do this, we would have to extend the malloc tag structure that we talked about above. While there are some allocations that have extra space for something like this for 64-bit programs, that is not the case for 32-bit programs. Implementing it this way would require that another eight-byte tag be prepended to every 32-bit malloc and some 64-bit allocations as well.

Obviously this is not an ideal way to approach the problem. Instead, if we use the cache sizes we don’t have to suffer from either of the above problems. We know that when a umem_alloc comes in, it rounds the allocation size up to the nearest cache to satisfy the request. We leverage that same idea so that when a buffer is freed we put it into the per-thread list that corresponds to the cache that it originally came from. When new requests come in we can use that buffer to satisfy anything that the underlying cache would be able to. Because of this strategy we don’t have to have a second round of fragmentation for our buffers. Similarly, because we know what the cache size is, we don’t have to keep track of the original request size. We know exactly what cache the buffer came from initially.
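Concretely, rounding a request up to its cache is just a lookup. Here is a sketch using the sixteen default small-cache sizes mentioned in the text; the function name is invented for the example:

```c
#include <stddef.h>

static const size_t cache_sizes[] = {
	8, 16, 32, 48, 64, 80, 96, 112,
	128, 160, 192, 224, 256, 320, 384, 448
};
#define	NCACHES	(sizeof (cache_sizes) / sizeof (cache_sizes[0]))

/* return the index of the cache that satisfies a request, or -1 */
static int
cache_index(size_t size)
{
	size_t i;

	for (i = 0; i < NCACHES; i++) {
		if (size <= cache_sizes[i])
			return ((int)i);
	}
	return (-1);	/* too big: not handled by the per-thread layer */
}
```

A 65-byte request lands in the 80-byte cache, and freeing that buffer puts it back on the 80-byte list, so it can later satisfy any request up to 80 bytes.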

Following in the footsteps of trapstat and fasttrap

Now that we have opted to learn the cache sizes at run time, we have a few approaches we can take to generate the function that handles this per-thread layer. Remember that we’re here because of performance. We need to cycle-bum and minimize our use of cycles. Loads and stores to and from main memory are much more expensive than simple register arithmetic, comparisons, and small branches. While there is an array of allocation sizes, we don’t want to have to always load each entry from that array. We also have another challenge. We need to avoid doing anything that causes us to use the PLT. We don’t want to end up needing a call +0 instruction just to access data from 32-bit PIC code. There is one fortunate aspect of the umem caches: once they are determined at runtime, they never change.

Armed with this information we end up down a different path: we are going to dynamically generate the assembly for our functions. This isn’t the first time this has been done in illumos. For an idea of what this looks like, see trapstat. Our code functions in a very similar way. We have blocks of assembly with holes for addresses and constants. These blocks get copied into executable memory and the appropriate constants get inserted in the correct places. One of the important pieces of this is that we don’t end up calling any other functions. If we detect an error or we can’t satisfy the allocation in the function itself, we end up jumping out to the original malloc or free, reusing the same stack frame.
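As a toy illustration of the template-with-holes idea (not the actual libumem generator), here is a two-instruction x86 template, movl $imm32, %eax followed by ret, where the 32-bit immediate is the hole that gets patched at run time. This assumes a little-endian target, as on x86:

```c
#include <stdint.h>
#include <string.h>

/* b8 xx xx xx xx: movl $imm32, %eax; c3: ret */
static const uint8_t tmpl[6] = { 0xb8, 0x00, 0x00, 0x00, 0x00, 0xc3 };
#define	TMPL_IMM_OFF	1	/* offset of the 32-bit hole */

/* copy the template into buf and patch the constant into the hole */
static void
gen_return_const(uint8_t *buf, uint32_t val)
{
	memcpy(buf, tmpl, sizeof (tmpl));
	memcpy(buf + TMPL_IMM_OFF, &val, sizeof (val));
}
```

The real thing does the same copy-and-patch, but into executable memory, with the cache sizes and fallback addresses as the constants.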

Reaching the assembly generated functions

Once we have generated the machine code, we have the challenge of making it what applications reach when they call malloc(). This is complicated by the fact that calls to malloc can come in before we create and initialize the new functions as part of the libumem initialization. The standard way you might solve this is with a function pointer that you swap out at some point. However, having that global function pointer causes us to need to address it in a position-independent way and adds noticeable overhead. Instead we utilized a small feature of the illumos linker, local auditing, to create a trampoline. Before we get into the details of the auditing library, here’s the data we used to support the decision. We made a small and simple benchmark that just does a fixed number of small mallocs and frees in a loop and compared the differences.

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

#define MAX     1024 * 1024 * 512

int
main(int argc, char *argv[])
{
        int ii;
        void *f;
        int size = 4;
        hrtime_t start;

        if (argc == 2)
                size = atoi(argv[1]);

        start = gethrtime();
        for (ii = 0; ii < MAX; ii++) {
                f = malloc(size);
                free(f);
        }
        printf("%lld\n", (hrtime_t)(gethrtime() - start));
        return (0);
}

arch    libc (ns)     libumem (ns)  indirect call (ns)  audit library (ns)
i386    39833278751   57784034737   14522070966         9215801785
amd64   32470572792   47828105321   9654626131          8980269581

From this data we can see that the audit library technique ended up being a small win on amd64, but for i386, it was a much more substantial win. This all comes down to how the compiler generated PIC code.

Audit libraries are a feature of the illumos linker that allow you to interpose on all the relocations that are being made to and from a given library. For the full details on how audit libraries work, consult the Linkers and Libraries Guide (one of the best pieces of Sun documentation) or the linker source itself. We created a local auditing library that allows us to audit only libumem. As part of auditing the relocations for libumem's malloc and free symbols, the audit library gives us an opportunity to replace each symbol with one of our own choosing. The audit library instead returns the address of a local buffer which contains a jump instruction to either the actual malloc or free. This installs our simple trampoline.

Later, when umem_init() runs, we end up generating the assembly versions of our functions. libumem uses symbols, which the auditing library interposes upon, to learn where the buffers for the generated functions should go. After both the malloc and free implementations have been successfully generated, it removes the initial jump instruction and atomically replaces it with a five-byte nop instruction. We looked at using a multi-byte nop, five single-byte nops, and just zeroing out the jump offset so it would become a jmp +0. Using the same microbenchmark we used earlier, we saw that the multi-byte nop made a noticeable difference.
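In byte terms, the trampoline patch looks something like this sketch (illustrative, not libumem’s actual code): a five-byte e9 rel32 jump out to the real function, later overwritten in place with the recommended five-byte multi-byte nop, 0f 1f 44 00 00:

```c
#include <stdint.h>
#include <string.h>

/* 0f 1f 44 00 00: the five-byte multi-byte nop */
static const uint8_t nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* e9 <rel32>: jump relative to the end of this five-byte instruction */
static void
emit_jmp(uint8_t *tramp, uintptr_t target)
{
	int32_t rel = (int32_t)(target - ((uintptr_t)tramp + 5));

	tramp[0] = 0xe9;
	memcpy(tramp + 1, &rel, sizeof (rel));
}

/* once the generated functions are live, neuter the jump */
static void
emit_nop(uint8_t *tramp)
{
	memcpy(tramp, nop5, sizeof (nop5));
}
```

Both sequences are exactly five bytes, which is what makes an atomic in-place replacement possible.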

arch    jmp +0 (ns)   single-byte nop (ns)  multi-byte nop (ns)
i386    9215801785    9434344776            9091563309
amd64   8980269581    8989382774            8562676893

For more details on how this all works and fits together, take a look at the updated libumem big theory statement and pay particular attention to section 8. You may also want to look at the i386 and amd64 specific files which generate the assembly.

Needed linker fixes

There are two changes that are necessary for local auditing to work correctly. Thanks to Bryan who went and made those changes and figured out the way to create this trampoline with the local auditing library.

Understanding per-thread cache utilization

Bryan worked to supply not only the necessary enhancements to the linker, but also enhancements to various mdb dcmds to better understand the behavior of the per-thread caching in libumem. ::umastat was enhanced to show both the amount that each thread has used and how many allocations each cache has in the per-thread cache (ptc).

> ::umastat
     memory   %   %   %   %   %   %   %   %   %   %   %   %   %   %   %   %   %
tid  cached cap   8  16  32  48  64  80  96 112 128 160 192 224 256 320 384 448
--- ------- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
  1    174K  16   0   6   6   1   4   0   0  18   2  50   0   4   0   1   0   2
  2       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  3    201K  19   0   6   6   2   4   8   1  16   2  43   0   3   0   1   0   1
  4       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  5   62.1K   6   0   8   7   3   9   0   0  13   5  38   0   6   0   2   1   0
  6       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 

cache                        buf     buf     buf     buf  memory     alloc alloc
name                        size  in use  in ptc   total  in use   succeed  fail
------------------------- ------ ------- ------- ------- ------- --------- -----
umem_magazine_1               16      33       -     252      4K        35     0
umem_magazine_3               32      36       -     126      4K        39     0
umem_magazine_7               64       0       -       0       0         0     0
umem_magazine_15             128       4       -      31      4K         4     0
umem_magazine_31             256       0       -       0       0         0     0
umem_magazine_47             384       0       -       0       0         0     0
umem_magazine_63             512       0       -       0       0         0     0
umem_magazine_95             768       0       -       0       0         0     0
umem_magazine_143           1152       0       -       0       0         0     0
umem_slab_cache               56      46       -      63      4K        48     0
umem_bufctl_cache             24     153       -     252      8K       155     0
umem_bufctl_audit_cache      192       0       -       0       0         0     0
umem_alloc_8                   8       0       0       0       0         0     0
umem_alloc_16                 16    2192    1827    2268     36K      2192     0
umem_alloc_32                 32    1082     921    1134     36K      1082     0
umem_alloc_48                 48     275     202     336     16K       275     0
umem_alloc_64                 64     487     359     504     32K       487     0
umem_alloc_80                 80     246     234     250     20K       246     0
umem_alloc_96                 96      42      41      42      4K        42     0
umem_alloc_112               112     741     676     756     20K       741     0
umem_alloc_128               128     133     109     155     20K       133     0
umem_alloc_160               160    1425    1274    1425     36K      1425     0
umem_alloc_192               192      11       9      21      4K        11     0
umem_alloc_224               224      83      82      90     20K        83     0
umem_alloc_256               256       8       8      15      4K         8     0
umem_alloc_320               320      24      22      24      8K        24     0
umem_alloc_384               384       7       6      10      4K         7     0
umem_alloc_448               448      20      19      27     12K        20     0
umem_alloc_512               512       1       -      16      8K       138     0
umem_alloc_640               640       0       -      18     12K       130     0
umem_alloc_768               768       0       -      10      8K        87     0
umem_alloc_896               896       0       -      18     16K       114     0

In addition, this work inspired Bryan to go and add %H to mdb_printf for human readable sizes. As a part of the support for the enhanced ::umastat, there are also new walkers for the various ptc caches.

Performance of the Per-Thread Caching

The ultimate goal of this was to improve our performance. As part of that we did different testing to make sure that we understood what the impact and effects of this would be. We primarily compared ourselves to the libc malloc implementation and a libumem without our new bits. We also did some comparison to other allocators, notably eglibc. We chose eglibc because that is what the majority of customers coming to us from other systems are using and because it is a good allocator, particularly for small objects.

Tight 4 byte allocation loop

One of the basic things that we wanted to test, inspired in part by some of the behavior we had seen in applications, was to measure what a tight malloc and free loop of a small allocation looked like as we varied the number of threads. Below we include a test where we did this with one thread and one where we did it with sixteen threads. The main takeaway is that libumem has historically been slower at this than a single-threaded libc program. The sixteen-thread graph shows why we really want to use libumem compared to libc. The graph shows the time per thread. As we can see, libc’s single mutex for malloc is rather painful.

4 byte allocations with one thread
4 byte allocations with sixteen threads

Time for all cached allocations

Another thing that we wanted to measure was how our allocation time scaled with the size of the cache. While our assembly is straightforward, it could probably be further optimized. We ran the test with both 32-bit and 64-bit code and the results are below. From the graphs you can see that allocation times scale fairly linearly across the caches.

32-bit small allocations
64-bit small allocations

The effects of the per-thread cache on uncacheable allocations

One of the things that we wanted to verify was that the presence of the per-thread caching did not unreasonably degrade the performance of other allocations. To look at this we compared what happened if you used libumem and what happened if you did not. We used pbind to lock the program to a single CPU, measured the time it took to do 1 KB-sized allocations, and compared the differences. We took that value and divided it by the total number of allocations we had performed, 512 M in this case. The end result was that for a given loop of malloc and free, the overhead was 8-10ns. That was within reason for our acceptable overhead.

umem_init time

Another one of the areas where we wanted to make sure that we didn’t seriously regress was the time umem_init takes. I’ve included a coarse graph that was created using DTrace. I simply rigged up something that traced the amount of wall and CPU time umem_init took. We repeated that 100 times and graphed the results. The graph below shows a roughly 50 microsecond increase in both wall and CPU time. In this case, a reasonable increase.

umem_init time

Our Original Flame Graph

The last thing that I want to look at is what the original flame graph now looks like using per-thread-caching. We increased the per-thread cache to 64MB because that allows us to cache the majority of the malloc activity which primarily comes from only one thread. The new flame graph is different from the previous two. The amount of time that we've spent in malloc and free has been minimized and compared to libumem previously, we are no longer three layers deep. In terms of raw improvement, while this normally took around 110 seconds with libc, with per-thread-caching we're down to around 78 seconds. Remember, this is a pure node.js benchmark. To have an improvement to malloc() result in a ~30% win was pretty surprising. Even in dynamic garbage collected languages, the performance of malloc is still important.

ptcumem flamegraph

Wrapping Up

In this entry I've described the high-level design, implementation, and some of the results we've seen from our work on per-thread caching libumem. For some workloads there can be a substantial performance improvement by using per-thread caching. To try it out, grab the upcoming release of SmartOS and either add -lumem to your Makefile or simply try it out by running your application with LD_PRELOAD=libumem.so.

When you link with libumem, per-thread caching is enabled by default with a 1 MB per-thread cache. This value can be tuned via the UMEM_OPTIONS environment variable: set UMEM_OPTIONS=perthread_cache=[size]. For example, to set it to 64 MB, you would use UMEM_OPTIONS=perthread_cache=64M. If you enable any of the umem_debug(3MALLOC) facilities then this will be disabled. Similarly, if you request nomagazines, this will be disabled.

If you have questions, feel free to ask here or join us on IRC.

Posted on July 16, 2012 at 10:50 am by rm · Permalink · Comments Closed
In: SunOS