Event ports and performance

So lots of people have been talking about event ports. They were designed to solve the problem with poll(2) and lots of file descriptors. The goal is to scale with the number of actual events of interest rather than the number of file descriptors one is listening on, since the former is often much less than the latter.

To make this a little more concrete, consider your giant new server which maintains a thousand connections, only about a hundred of which are active at any given time. It has its array with a thousand socket file descriptors, and sits there in a loop calling poll to figure out which of them can be read without blocking. poll waits until at least one of them is ready, returns, and then the program iterates over the whole array to see which descriptor(s) poll said was ready. Finally, it processes them one by one. You waste all this time iterating over the array, not to mention the time copying it between the kernel and userland every time you call poll.

It’s true that you could multi-thread this and save some time, but the overhead of thousands of threads (one for each fd) adds up as well.

This is the problem event ports are designed to solve. Interestingly enough, this was not the first attempt to speed up this scenario. /dev/poll was introduced in Solaris 7 for the same purpose, and works great – except that multi-threaded apps can’t use it without [lots of slow] synchronization. Linux has epoll and BSD has kqueue, about neither of which do I pretend to know much.

Event ports, as applied to file descriptors, are basically a way of telling the OS that you’re interested in certain events on certain descriptors, and to deliver (synchronous) notification of said events on said descriptors to a particular object (…which also happens to be a file descriptor). Then you can have one or more worker threads sleeping, waiting for the OS to wake them up and hand them an event, saying "here you go, this socket’s got something to read." You can actually have lots of worker threads hanging around the same event port and let the OS do your "synchronization" – they probably don’t need to communicate or share any common state (depending on your app, of course).

No expensive copies. No traversing huge arrays finding needles in haystacks.

At least, in theory. Until now, nobody’s really measured their performance with any microbenchmarks. My recent addition of event ports of libevent gave me an excuse to run libevent’s own bunchmarks and see how it performed. The benchmark basically just schedules lots of read/write events, and has a few parameters: the number of file descriptors, the number of writes, and the number of active connections (the higher this number, the more the benchmark spreads its writes over the many file descriptors).

Results

All benchmarks below were run on a uniprocessor x86 machine with 1G memory. Note the vertical axes – some of the graphs use logarithmic scales. And remember, these are the results of the libevent benchmark, so differences in performance can be a result of the underlying polling mechanism as well as the libevent implementation (though since the implementations don’t matter, it’s assumed that most affects are attributable to the underlying polling mechanism)

Let’s start with the simplest example. Consider small numbers of file descriptors, and various methods of polling them:

On this machine, event ports are fast, even for small numbers of file descriptors. In some circumstances, event ports are slightly more expensive for small numbers of file descriptors, particularly when many of them are active. This results because (a) when most of your descriptors are active, then the work done by poll is not such a waste (because you have to do it anyway), and (b) event ports require two system calls per descriptor event rather than one.

But let’s look at much larger numbers of file descriptors, where event ports really show their stuff. Note that /dev/poll has a maximum of about 250 fd’s (Update 7/12/06: Alan Bateman points out that you can use setrlimit to update RLIMIT_NOFILE to increase this to much higher values), and select(3c) can only handle up to about 500, so they’re omitted from the rest of these graphs. This graph has a logarithmic scale.

Wow. That’s about a 10x difference for 7,500 fd’s. ‘Nuff said. Finally, let’s examine what happens when the connections become more active (more of the descriptors are written to):

Event ports still win, but not by nearly as much (about 2x, for up to 15,000 fd’s). Less of poll’s array traversing is unnecessary, so the difference is smaller. But in the long run, event ports scale pretty much at a constant rate, while poll scales linearly. Select is even worse, since it’s implemented on top of poll (on Solaris). /dev/poll is good, but has the problem mentioned earlier of being difficult to use in a multi-threaded application.

For large numbers of descriptors, event ports are the clear winner.

Benchmarking can be difficult, and it’s easy to miss subtle effects. I invite your comments, especially if you think there’s something I’ve overlooked.


libevent and Solaris event ports

For those who dwell in subterranean shelters, event ports (or the “event completion framework,” if you want) are the slick new way to deal with events from various sources, including file descriptors, asynchronous i/o, other user processes, etc. Adam Leventhal even thinks they’re the 20th best thing about Solaris 10. Okay, that’s not huge, but competing with zfs, dtrace, and openness, that ain’t half bad. Though the interface is, of course, well-defined, the framework is still evolving. People are talking about adding signals and filesystem events to the event completion framework, though there’s definitely some debate about how to go about doing that. Now I hear you asking me: “Okay, that sounds pretty cool, but why would I port my sw33t @pp to Solaris just for event ports? Or port my existing poll(2) implementation on Solaris to use these newfangled dohickeys?” Well if amazing scalability isn’t an important enough reason, then I don’t know what is. (One of the great properties of event ports is they scale with the number of actual events, not the number of descriptors you’re listening to.) But reworking your code could come with added benefits. Enter libevent. Libevent is a portable library for dealing with file descriptor events. “But wait,” you ask, “I thought that’s what event ports are for!” Well, libevent provides a portable API for doing this. But the implementation varies from system to system. On BSD, it might use the kqueue facility, while on GNU/Linux it might choose epoll. This way, you can write your program on one platform, move to another, recompile, and all your file descriptor polling code works fine (albeit probably with a different implementation, which may affect performance substantially (possibly for the better, since the library chooses the best one for your platform)). Now libevent could use event ports as a backend on Solaris 10. It’s a project I’ve been working on for the last week or so. With that, you get the portability of libevent plus (some of[1]) the scalability of event ports. Sweet. Oh, and if you’re wondering more about event ports, check out the engine room, which includes an example and more technical information. [1]: the event port libevent backend isn’t quite as scalable as event ports alone, because a bit of per-descriptor bookkeeping has to be maintained. D’oh.

Dazed and confused

Today I started my summer internship on the Solaris Kernel group at Sun Microsystems.
I’ll figure out something interesting to put here.