Another detour: short-circuiting cat(1)

What do you think happens when you do this:

        # cat vmcore.4 > /dev/null

If you’ve used Unix systems before, you might expect this to read vmcore.4 into memory and do nothing with it, since cat(1) reads a file, and "> /dev/null" sends it to the null driver, which accepts data and does nothing. This appears pointless, but can actually be useful to bring a file into memory, for example, or to evict other files from memory (if this file is larger than total cache size).

But here’s a result I found surprising:

        # ls -l vmcore.1
-rw-r--r--   1 root     root     5083361280 Oct 30  2009 vmcore.1
# time cat vmcore.1 > /dev/null
real    0m0.007s
user    0m0.001s
sys     0m0.007s

That works out to 726GB/s. That’s way too fast, even reading from main memory. The obvious question is how does cat(1) know that I’m sending to /dev/null and not bother to read the file at all?

Of course, you can answer this by examining the cat source in the ON gate. There’s no special case for /dev/null (though that does exist elsewhere), but rather this behavior is a consequence of an optimization in which cat(1) maps the input file and writes the mapped buffer instead of using read(2) to fill a buffer and write that. With truss(1) it’s clear exactly what’s going on:

        # truss cat vmcore.1 > /dev/null
execve("/usr/bin/cat", 0x08046DC4, 0x08046DD0)  argc = 2
[ ... ]
write(1, ..., 8388608) = 8388608
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 8388608) = 0xFE600000
write(1, ..., 8388608) = 8388608
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x01000000) = 0xFE600000
[ ... ]
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E000000) = 0xFE600000
write(1, ..., 8388608) = 8388608
mmap64(0xFE600000, 8253440, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E800000) = 0xFE600000
write(1, ..., 8253440) = 8253440
llseek(3, 0x000000012EFDF000, SEEK_SET)         = 0x12EFDF000
munmap(0xFE600000, 8388608)                     = 0
llseek(3, 0, SEEK_CUR)                          = 0x12EFDF000
close(3)                                        = 0
close(1)                                        = 0

cat(1) really is issuing tons of writes from the mapped file, but the /dev/null device just returns immediately without doing anything. The file mapping is never even read. If you actually wanted to read the file (for the side effects mentioned above, for example), you can defeat this with an extra pipe:

        # time cat vmcore.1 | cat > /dev/null
real    0m32.661s
user    0m0.865s
sys     0m32.127s

That’s more like it: about 155MB/s streaming from a single disk. In this case the second cat invocation can’t use this optimization since stdin is actually a pipe, not the input file.

There’s another surprising result of the initial example: the file’s access time actually gets updated even though it was never read:

        # ls -lu vmcore.2
-rw-r--r--   1 root     root     6338052096 Nov  3  2009 vmcore.2
# time cat vmcore.2 > /dev/null
real    0m0.040s
user    0m0.001s
sys     0m0.008s
# ls -lu vmcore.2
-rw-r--r--   1 root     root     6338052096 Aug  6 15:55 vmcore.2

This wasn’t always the case, but it was fixed back in 1995 under bug 1193522, which is where this comment and code probably came from:

    363         /*
364          * NFS V2 will let root open a file it does not have permission
365          * to read. This read() is here to make sure that the access
366          * time on the input file will be updated. The VSC tests for
367          * cat do this:
368          *      cat file > /dev/null
369          * In this case the write()/mmap() pair will not read the file
370          * and the access time will not be updated.
371          */
373         if (read(fi_desc, &x, 1) == -1)
374                 read_error = 1;

I found this all rather surprising because I think of cat(1) as one of the basic primitives that’s dead-simple by design.  If you really want something simple to read a file into memory, you might be better off with dd(1M).

Don’t forget about /dev/poll

When I posted my first benchmark results, I did not realize that the limit on /dev/poll’s file descriptors is a soft one. Using setrlimit(2), you can change the limit (in my case from about 250 to over 64,000). With that, I’ve run some new benchmarks. These are the same as before, but now I’m running them with /dev/poll as well.

The bottom line is that, as expected, it scales very well (similarly to event ports). Its requirement for the use of synchronization between multiple threads is a remaining problem.

Without further adieu, here are the updated results:

Thanks to Alan Bateman for pointing out that /dev/poll can, in fact, handle a larger number of file descriptors.

Event ports and performance

So lots of people have been talking about event ports. They were designed to solve the problem with poll(2) and lots of file descriptors. The goal is to scale with the number of actual events of interest rather than the number of file descriptors one is listening on, since the former is often much less than the latter.

To make this a little more concrete, consider your giant new server which maintains a thousand connections, only about a hundred of which are active at any given time. It has its array with a thousand socket file descriptors, and sits there in a loop calling poll to figure out which of them can be read without blocking. poll waits until at least one of them is ready, returns, and then the program iterates over the whole array to see which descriptor(s) poll said was ready. Finally, it processes them one by one. You waste all this time iterating over the array, not to mention the time copying it between the kernel and userland every time you call poll.

It’s true that you could multi-thread this and save some time, but the overhead of thousands of threads (one for each fd) adds up as well.

This is the problem event ports are designed to solve. Interestingly enough, this was not the first attempt to speed up this scenario. /dev/poll was introduced in Solaris 7 for the same purpose, and works great – except that multi-threaded apps can’t use it without [lots of slow] synchronization. Linux has epoll and BSD has kqueue, about neither of which do I pretend to know much.

Event ports, as applied to file descriptors, are basically a way of telling the OS that you’re interested in certain events on certain descriptors, and to deliver (synchronous) notification of said events on said descriptors to a particular object (…which also happens to be a file descriptor). Then you can have one or more worker threads sleeping, waiting for the OS to wake them up and hand them an event, saying "here you go, this socket’s got something to read." You can actually have lots of worker threads hanging around the same event port and let the OS do your "synchronization" – they probably don’t need to communicate or share any common state (depending on your app, of course).

No expensive copies. No traversing huge arrays finding needles in haystacks.

At least, in theory. Until now, nobody’s really measured their performance with any microbenchmarks. My recent addition of event ports of libevent gave me an excuse to run libevent’s own bunchmarks and see how it performed. The benchmark basically just schedules lots of read/write events, and has a few parameters: the number of file descriptors, the number of writes, and the number of active connections (the higher this number, the more the benchmark spreads its writes over the many file descriptors).


All benchmarks below were run on a uniprocessor x86 machine with 1G memory. Note the vertical axes – some of the graphs use logarithmic scales. And remember, these are the results of the libevent benchmark, so differences in performance can be a result of the underlying polling mechanism as well as the libevent implementation (though since the implementations don’t matter, it’s assumed that most affects are attributable to the underlying polling mechanism)

Let’s start with the simplest example. Consider small numbers of file descriptors, and various methods of polling them:

On this machine, event ports are fast, even for small numbers of file descriptors. In some circumstances, event ports are slightly more expensive for small numbers of file descriptors, particularly when many of them are active. This results because (a) when most of your descriptors are active, then the work done by poll is not such a waste (because you have to do it anyway), and (b) event ports require two system calls per descriptor event rather than one.

But let’s look at much larger numbers of file descriptors, where event ports really show their stuff. Note that /dev/poll has a maximum of about 250 fd’s (Update 7/12/06: Alan Bateman points out that you can use setrlimit to update RLIMIT_NOFILE to increase this to much higher values), and select(3c) can only handle up to about 500, so they’re omitted from the rest of these graphs. This graph has a logarithmic scale.

Wow. That’s about a 10x difference for 7,500 fd’s. ‘Nuff said. Finally, let’s examine what happens when the connections become more active (more of the descriptors are written to):

Event ports still win, but not by nearly as much (about 2x, for up to 15,000 fd’s). Less of poll’s array traversing is unnecessary, so the difference is smaller. But in the long run, event ports scale pretty much at a constant rate, while poll scales linearly. Select is even worse, since it’s implemented on top of poll (on Solaris). /dev/poll is good, but has the problem mentioned earlier of being difficult to use in a multi-threaded application.

For large numbers of descriptors, event ports are the clear winner.

Benchmarking can be difficult, and it’s easy to miss subtle effects. I invite your comments, especially if you think there’s something I’ve overlooked.

libevent and Solaris event ports

For those who dwell in subterranean shelters, event ports (or the “event completion framework,” if you want) are the slick new way to deal with events from various sources, including file descriptors, asynchronous i/o, other user processes, etc. Adam Leventhal even thinks they’re the 20th best thing about Solaris 10. Okay, that’s not huge, but competing with zfs, dtrace, and openness, that ain’t half bad. Though the interface is, of course, well-defined, the framework is still evolving. People are talking about adding signals and filesystem events to the event completion framework, though there’s definitely some debate about how to go about doing that. Now I hear you asking me: “Okay, that sounds pretty cool, but why would I port my sw33t @pp to Solaris just for event ports? Or port my existing poll(2) implementation on Solaris to use these newfangled dohickeys?” Well if amazing scalability isn’t an important enough reason, then I don’t know what is. (One of the great properties of event ports is they scale with the number of actual events, not the number of descriptors you’re listening to.) But reworking your code could come with added benefits. Enter libevent. Libevent is a portable library for dealing with file descriptor events. “But wait,” you ask, “I thought that’s what event ports are for!” Well, libevent provides a portable API for doing this. But the implementation varies from system to system. On BSD, it might use the kqueue facility, while on GNU/Linux it might choose epoll. This way, you can write your program on one platform, move to another, recompile, and all your file descriptor polling code works fine (albeit probably with a different implementation, which may affect performance substantially (possibly for the better, since the library chooses the best one for your platform)). Now libevent could use event ports as a backend on Solaris 10. It’s a project I’ve been working on for the last week or so. With that, you get the portability of libevent plus (some of[1]) the scalability of event ports. Sweet. Oh, and if you’re wondering more about event ports, check out the engine room, which includes an example and more technical information. [1]: the event port libevent backend isn’t quite as scalable as event ports alone, because a bit of per-descriptor bookkeeping has to be maintained. D’oh.