Tales from a Core File


I just recently landed Project Bardiche into SmartOS. The goal of Bardiche has been to create a more streamlined data path for layer two networking in illumos. While the primary motivator for this was KVM guests, it’s opened up a lot of room for more than just virtual machines. The bulk of this project consists of changes to illumos-joyent; however, there were some minor changes made to smartos-live, illumos-kvm, and illumos-kvm-cmd.

Project Highlights

Before we delve into the specifics of the implementation, let’s take a high-level view of what this project has brought to the system. Several of these topics will have their own follow-up blog entries.

  • The global zone can now see data links for every zone in /dev/net/zone/%zonename/%datalink.

  • libdlpi(3LIB) was extended to be able to access those NICs.

  • snoop(1M) now has a -z option to capture packets on a data link that belongs to a zone.

  • A new DLPI(7P) primitive DL_EXCLUSIVE_REQ was added.

  • A new DLPI(7P) promiscuous mode was added: DL_PROMISC_RX_ONLY.

  • ipfilter can now filter packets for KVM guests.

  • The IP squeue interface was generalized to allow for multiple consumers in a new module called gsqueue.

  • The sdev file system was enhanced with a new plugin interface.

  • A new driver, vnd, was added, along with an associated library, libvnd, and
    the commands vndadm and vndstat. This driver provides the new layer two data path.

  • A new abstraction, called framed I/O, was added for sending and receiving framed data.

There’s quite a bit there. The rest of this entry will go into detail on the motivation for this work and a bit more on the new /dev/net/zone, libdlpi, and snoop features. Subsequent entries will cover the new vnd architecture and the new DLPI primitives, describe the new gsqueue interface, shine a bit more light on what gets referred to as the fastpath, and cover the new sdev plugin architecture.

Motivation

Project Bardiche started from someone asking what it would take to allow a hypervisor-based firewall to filter packets being sent from a KVM guest. We wanted to focus on allowing the hypervisor to provide the firewall because of the following challenges associated with managing a firewall running in the guest.

While it’s true that practically all the guests that you would run under hardware virtualization have their own firewall software, they’re rarely the same. If we wanted to leverage the firewall built into the guest, we’d need to build an agent that lived in each guest. Not only would that mean writing one of these for every type of guest we wanted to manage, but given that customers are generally the super-user in their virtual machine (VM), they’d be able to simply kill the agent or change the rules, defeating the API.

While dealing with this, we found several other deficiencies in how networking worked for KVM guests, based on how QEMU, the program that actually runs the VM, interacted with the host networking. For each Ethernet device in the guest, there was a corresponding virtual NIC in the host. The two were joined with the vnic back end in QEMU, which originally used libdlpi to bind them. While this worked, there were some problems with it.

Because we had to put the device in promiscuous mode, there was no way to tell it not to send back traffic that came from ourselves. In addition to just being a waste of cycles, this caused duplicate address detection, often performed with IPv6, to fail for many systems.

In addition, the dlpi interfaces had no means of reading or writing multiple packets at a time. A lot of these issues stem from the history of the illumos networking stack. When it was first implemented, it was done using STREAMS. Over time, that has changed. In Solaris 10 the entire networking stack was revamped with a project called Fire Engine. That project, among many others, transitioned the stack from a message passing interface to one that used a series of direct calls and serialization queues (squeues). Unfortunately, the means by which we were using libdlpi left us still using STREAMS.

While exploring the different options and means to interface with the existing firewall, we eventually reached the point where we realized that we needed to go out and create a new interface, one that solved this and the related problems we had, as well as laid the foundation for a lot of work that we’d still like to do.

First Stop: Observability Improvements

When I first started this project, I knew that I was going to have to spend a lot of time debugging. As such, I knew that I needed to solve one of the more frustrating aspects of working with KVM networking: the difficulty of snooping and capturing traffic. At Joyent, we always run a KVM instance inside of a zone. This gives us all the benefits of zones: the associated security and resource controls.

However, before this project, data links that belonged to zones were not accessible from the global zone. Because of the design of the KVM branded zone, the only process running is QEMU and you cannot log in, which makes it very hard to pass the data link to snoop or tcpdump. This setup does not make it impossible to debug. One can use DTrace or snoop on a device in the global zone; however, both of those end up requiring a bit more work or filtering.

The solution to this is to allow the global zone to see the data links for all devices across all zones under /dev/net and then enhance the associated libraries and commands to support accessing the new devices. If you’re in the global zone, there is now a new directory called /dev/net/zone. Don’t worry, this new directory can’t break you: since all data link names in the system have to end with a number, the name zone can never collide with an existing data link. On my development virtual machine, which has a single zone with a single vnic named net0, you’d see the following:

[root@00-0c-29-37-80-28 ~]# find /dev/net | sort
/dev/net
/dev/net/e1000g0
/dev/net/vmwarebr0
/dev/net/zone
/dev/net/zone/79809c3b-6c21-4eee-ba85-b524bcecfdb8
/dev/net/zone/79809c3b-6c21-4eee-ba85-b524bcecfdb8/net0
/dev/net/zone/global
/dev/net/zone/global/e1000g0
/dev/net/zone/global/vmwarebr0

Just as you always have in SmartOS, you’ll still see the data links for your zone at the top level in /dev/net, e.g. /dev/net/e1000g0 and /dev/net/vmwarebr0. Next, each of the zones on the system, in this case the global zone and the non-global zone named 79809c3b-6c21-4eee-ba85-b524bcecfdb8, shows up in /dev/net/zone. Inside of each of those directories are the data links that live in that zone.

The next part of this was exposing this functionality in libdlpi and then using that in snoop. For the moment, I added a private interface called dlpi_open_zone. It’s similar to dlpi_open except that it takes an extra argument for the zone name. Once this change gets up to illumos it’ll then become a public interface that you can and should use. You can view the dlpi_open_zone manual page online, or if you’re on a newer SmartOS box you can run man dlpi_open_zone to see the documentation.
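
To make this a bit more concrete, here’s a minimal sketch of what a consumer might look like, assuming the interface described in that manual page, that is, dlpi_open with an additional zone name argument, and linking against -ldlpi. The link and zone names are the ones from the example above.

#include <libdlpi.h>
#include <stdio.h>

int
main(void)
{
    dlpi_handle_t dh;
    int ret;

    /* Open net0, which belongs to the named zone, from the global zone. */
    ret = dlpi_open_zone("net0", "79809c3b-6c21-4eee-ba85-b524bcecfdb8",
        &dh, 0);
    if (ret != DLPI_SUCCESS) {
        (void) fprintf(stderr, "failed to open link: %s\n",
            dlpi_strerror(ret));
        return (1);
    }

    /* From here the handle behaves like one returned by dlpi_open(3DLPI). */
    dlpi_close(dh);
    return (0);
}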

The use of this made its way into snoop in the form of a new option: -z zonename. Specifying a zone with -z causes snoop to use dlpi_open_zone, which will try to open the data link specified via the -d option in that zone. So if we wanted to watch all of the ICMP traffic over the data link that the KVM guest uses, we could run:

# snoop -z 79809c3b-6c21-4eee-ba85-b524bcecfdb8 -d net0 icmp

With this, it should now be easier, as an administrator of multiple zones, to observe what’s going on across multiple zones without having to log into them.

Thanks

There are numerous people who helped this project along the way. The entire Joyent engineering team helped from the early days of bouncing ideas about APIs and interfaces all the way through to the final pieces of review. Dan McDonald, Sebastien Roy, and Richard Lowe all helped review the various write-ups and helped deal with various bits of arcana in the networking stack. Finally, the broader SmartOS community helped go through and provide additional alpha and beta testing.

What’s Next

In the following entries, we’ll take a tour of the new sub-systems ranging from the vnd architecture and framed I/O abstraction through the sdev interfaces.

Last March, Bryan Cantrill and I joined Max Bruning in working towards bringing KVM to illumos. Six months ago we found ourselves looping in x86 real mode and today we’re booting everything from Linux to Plan 9 to Weenix! For a bit more background on how we got there, take a gander at Bryan’s entry on KVM on illumos.

For the rest of this entry I’m going to talk about the exciting new analytics we get by integrating DTrace and kstats into KVM. We’ve only scratched the surface of what we can see, but already we’ve integrated several metrics into Cloud Analytics and have gained insight into different areas of guest behavior that the guests themselves haven’t really seen before. While we can never gain the same amount of insight into Virtual Machines (VMs) that we can with a zone, we easily have insight into three main resources of a VM: CPU, disk, and network. Cloud Operators can use these metrics to determine if there is a problem with a VM, which VMs are having issues, and which areas of the system are suffering. In addition, we’ve pushed the boundaries of observability by taking advantage of the fact that several components of the hardware stack are virtualized. All in all, we’ve added metrics in the following areas:

  • Virtual NIC Throughput
  • Virtual Disk IOps
  • Virtual Disk Throughput
  • Hardware Interrupts
  • Virtual Machine Exits
  • vCPU Samples

NICs and Disks

One of the things that we had to determine early on was how the guests’ virtual devices would interface with the host. For NICs, this was simple: rather than trying to map a guest’s NIC to a host’s TUN or TAP device, we just used a VNIC, which was introduced into the OS by the Crossbow project. Each guest NIC corresponds directly to a Crossbow VNIC. This allows us to leverage all of the benefits of using a VNIC, including anti-spoof and the analytics that already exist. This lets us see the throughput, in terms of either bytes or packets, that the guest is sending and receiving on a per guest NIC basis.

The story with disks is quite similar. In this case each disk that the guest sees is backed by a zvol from ZFS. This means that guests unknowingly get the benefits of ZFS: data checksums, snapshots and clones, the ease of transfer via zfs send and zfs receive, redundant pooled storage, and proven reliability. What is more powerful here is the insight that we can provide into the disk operations. We provide two different views of disk activity. The first is based on throughput and the second is based on I/O operations.

The throughput based analytics are a great way to understand the actual disk activity that the VM is doing. Yet the operations view gives us several interesting avenues to drill down into VM activity. The main decompositions are operation type, latency, offset, and size. This gives us insight into how guest filesystems are actually sending activity to disk. As an example, we generated the following screenshot from a guest running Ubuntu on an ext3 filesystem. The guest would loop creating a 3 GB file, sleeping for a few seconds, reading the file, and deleting the file before beginning again. In the image below we see operations decomposed by operation type and offset. This allows us to see where on disk ext3 is choosing to lay out blocks on the filesystem. The x-axis represents time; each unit is one second. The y-axis shows the virtual disk block number.

ext3 offsets

Hardware Interrupts

Brendan Gregg has been helping us out by diligently measuring our performance, comparing us to both a bare metal Linux system and KVM under Linux. While trying to understand the performance characteristics and ensuring that KVM on illumos didn’t have too many performance pathologies, he stumbled across an interesting function in the kvm source code: vmx_inject_irq. This function is called any time a hardware interrupt is injected into the guest. We combined this information with an incredibly valuable idea for heatmaps that Brendan thought up. A heatmap based upon subsecond offset allows us to see when, across a given second, some action occurred. The x-axis is the same as in the previous graph: time, in one-second units. The y-axis, though, represents when in the second the item occurred, i.e. in which of the 1,000,000 microseconds of that second the action took place. Take a look at the following image:

subsecond offset by irqs

Here we are visualizing which interrupts occurred in a given second and looking at them based upon when in the second they occur. Each interrupt vector is colored differently. The red represents interrupts caused by the disk controller and the yellow by the network controller. The blue is the most interesting: these are timer based interrupts generated by the APIC. Lines that are constant across the horizontal mean that these are events happening at the same time every second. These represent actions caused by an interval timer, something that always fires every second. However, there are also lines that look like a miniature staircase, ones that go up at an angle. These represent an application that does work, calls sleep(3C) for an interval, does a little bit of work, and sleeps again.
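
As an aside, the coordinates for these subsecond-offset heatmaps are simple to compute from an event’s timestamp. The following is a minimal sketch, using a made-up nanosecond timestamp, of how an event lands in a one-second bucket on the x-axis and a microsecond offset within that second on the y-axis.

#include <stdio.h>
#include <stdint.h>

#define NANOSEC 1000000000ULL

int
main(void)
{
    /* A hypothetical event timestamp, in nanoseconds since some origin. */
    uint64_t ts = 1316217600123456789ULL;
    uint64_t x_second = ts / NANOSEC;          /* which second: the x-axis */
    uint64_t y_offset = (ts % NANOSEC) / 1000; /* offset within that second, in microseconds: the y-axis */

    (void) printf("second %llu, subsecond offset %llu us\n",
        (unsigned long long)x_second, (unsigned long long)y_offset);
    return (0);
}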

VM Exits

A VM exit occurs when the processor ceases running the guest code and returns to the host kernel to handle something which must be emulated, such as memory mapped I/O or accessing one of the emulated devices. One of the ways to increase VM performance is to minimize the number of exits. Early on during our work porting KVM we saw that there were various statistics being gathered and exported via debugfs in the Linux KVM code. Here we leveraged kstats, and Bryan quickly wrote up kvmstat, which became an incredibly useful tool for us to easily understand VM behavior. We’ve now taken those kstats, which tell us which VM, which vCPU, and which of a multitude of reasons caused the guest to exit, and added that insight into Cloud Analytics.
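
If you want to poke at these counters yourself, the ordinary libkstat interfaces work (compile with -lkstat). Below is a minimal sketch that reads a single exit counter; the module, name, and statistic strings ("kvm", "vcpu-0", "exits") are illustrative assumptions on my part, and kvmstat remains the easy way to watch them.

#include <kstat.h>
#include <stdio.h>

int
main(void)
{
    kstat_ctl_t *kc;
    kstat_t *ksp;
    kstat_named_t *kn;

    if ((kc = kstat_open()) == NULL) {
        perror("kstat_open");
        return (1);
    }

    /* The module, name, and statistic below are assumed for illustration. */
    if ((ksp = kstat_lookup(kc, "kvm", -1, "vcpu-0")) == NULL ||
        kstat_read(kc, ksp, NULL) == -1) {
        (void) fprintf(stderr, "failed to find or read the kvm kstat\n");
        (void) kstat_close(kc);
        return (1);
    }

    if ((kn = kstat_data_lookup(ksp, "exits")) != NULL)
        (void) printf("total exits: %llu\n",
            (unsigned long long)kn->value.ui64);

    (void) kstat_close(kc);
    return (0);
}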

vCPU Samples

While working on KVM and reading through the Intel Architecture Manuals I reminded myself of a portion of the x86 architecture that is quite important, namely that the root of the page table is always going to be stored in cr3. Unique values in cr3 represent unique Memory Management Unit (MMU) contexts. Most modern operating systems that run on x86 use a different MMU context for each process running on the system and for the kernel. Thus if we look at the value of cr3 we get an opaque token that represents something running in the guest.

Brendan had recently been adding metrics to Cloud Analytics based upon using DTrace’s profile provider and combining the gathered data with the subsecond offset heatmaps that we previously discussed. Bryan had added a new variable to D that allows us to look at the register state of a given running vCPU. To get the value of cr3 that we wanted, we could use something along the lines of vmregs[VMX_GUEST_CR3]. When you combine these two, you get a heatmap that shows us what is running in the guest. Check out the image below:

vCPU samples by cr3 and subsecond offset

Here, we’ve sampled at a frequency of 99 Hz. We avoid sampling at 100 Hz because that would catch a lot of periodic system activity. We’re looking at a one-CPU guest running Ubuntu. The guest is initially idle. Next we start two CPU bound activities, highlighted in blue and yellow. What we can visualize are the scheduling decisions being made by the Linux kernel. To further see what happened, we used renice on one of the processes, setting it to a value of 19. You can see the effect in the first image below, as the blue process rarely runs in comparison to the yellow. In the second image we experimented with the effects of setting different nice values.

vCPU samples by cr3 and subsecond offset

vCPU samples by cr3 and subsecond offset

These visualizations are quite useful. They let us give someone an idea of what is running in their VM. While it can’t pinpoint the exact process, it does let the user understand what the characteristics of their workload are and whether it is a few long-lived processes fighting for the CPU, lots of short-lived processes coming and going, or something in between. Like the rest of these metrics, this lets you understand where in your fleet of VMs the problem may be occurring and helps narrow things down to which few VMs should be looked at with native tools.

Conclusion

We’ve only begun to scratch the surface of what we can understand about a virtual machine running under KVM on illumos. Needless to say, this wouldn’t be possible without DTrace and its guarantees of safety for use on production systems, with only the overhead of a few NOPs when not in use. As time goes on we’ll be experimenting with what information can help us, operators, and end users better understand their VMs’ performance and adding those abilities to Cloud Analytics.
