Performance Puzzler: The Slow Server

In the spirit of the TCP Puzzlers, I want to present a small distributed systems performance puzzler: In a distributed system with a fixed-number of servers and a fixed concurrency, what happens when one of the backend servers gets slow? First, let’s pick apart what this means.

Suppose you have:

  • a service with a fixed number of backend servers — say, 100 servers — each exactly equivalent. You could imagine this as a load balancing service having 100 backends or a key-value store having 100 well-distributed shards.
  • a fixed number of clients — say, 10,000. Each client makes one request to the system at a time. When each request completes, the client makes another one. The 10,000 clients could be 1,000 clients each having a 10-way thread pool or 10 event-oriented clients each capable of 1,000 concurrent requests — either way, the total concurrency is 10,000. It doesn’t matter of the clients and server communicate via RPC, HTTP, or something else.

with the following assumptions (which we’ll revisit later):

  • All requests take about the same amount of work to process, both on the client-side and server-side.
  • The per-request cost outside the server is negligible. (We’re going to ignore client-to-server transit time as well as client processing time.)
  • The backend servers are infinitely scalable. That is, we’ll assume that if they take 1ms to process 1 request, they’ll also take 1ms to process 10 requests issued at the same time.
  • Each request is dispatched to one of the servers uniformly at random.

To start with, suppose each server takes 1ms to complete each request. What’s the throughput of the entire system? What will be the per-server throughput?

There are a few ways to analyze this:

  • From the client perspective, we know there will be 10,000 requests outstanding to the system at any given instant. (Remember, there are 10,000 clients, each making a single request at a time, and we’re assuming zero per-request costs on the client side and in transit.) After 1ms, those requests have all completed and another 10,000 requests are running. At this rate, the system will complete (10,000 requests per millisecond) times (1,000 milliseconds per second) = 10 million requests per second. Intuitively, with all servers equal, we’d expect each server to be handling 1% of these, or 100,000 requests per second.
  • From the server perspective: with 10,000 outstanding requests at all times and 100 servers, we’d expect about 100 requests outstanding per server at all times. Each takes 1ms, so there would be 100,000 completed per server per second (which matches what we said above).

Now, what happens when one of the servers becomes slow? Specifically, suppose all of a sudden one of the servers starts taking 1s to process every request. What happens to overall throughput? What about per-server throughput? (Make a guess before reading on!)

The analysis is a bit more complex than before. From the client perspective, there will still always be 10,000 requests outstanding. 99% of requests will land on a fast server and take 1ms, while the remaining 1% will land on the slow server and take 1s. The expected latency for any request is thus 0.99 * 1ms + 0.01 * 1000ms = 10.99ms. If each of the 10,000 clients completes (1 request / 10.99ms) times (1,000 milliseconds per second), we get an expected throughput of about 91 requests per client, or 909,918 requests per second. (That’s a degradation of 91% from the original 10 million requests per second!)

Above, we assumed that each server had the same throughput, but that’s not so obvious now that one of them is so slow. Let’s think about this another way on the client side: what’s the probability at any given time that a client has a request outstanding to the slow server? I find it easiest to imagine the assignment being round-robin instead of random. Suppose clients issued requests to each of the 99 fast servers, then the slow server, and then started again with the first fast server. In that case, the first 99 requests would take 99ms, and the last request would take 1s. The client would have a request outstanding to the slow server for 1,000 ms / 1,099 ms or about 91% of the time. There’s a more rigorous explanation below, but I think it’s intuitive that the random distribution of work would behave the same way on average.

In that case, we could also expect that 91% of clients (or 9100 clients) have a request outstanding to the slow server at any given time. Each of these takes 1 second. The result is a throughput of 9,100 requests per second on the slow server.

What about the fast server? We’d expect that each fast server has 1/99 of the remaining requests (9% of requests) at any given time, which works out to 9.1 requests per server. At 1ms per request, these servers are doing (drum roll) 9,100 requests per second. The same as the slow server! Is that what you guessed? I didn’t.

This isn’t an accident of the specific numbers I picked. If you run the same analysis algebraically instead of with concrete numbers, you’ll find that the general result holds: in this system, when one server becomes slow, the throughput of every server remains the same.

A (more) formal proof

Let’s define some notation:

  • C: the total client concurrency (10,000 in our example)
  • N: the number of servers (100 in our example)
  • Si: server i (where i ranges from 1 to N)
  • Li: the latency (in seconds) of requests to server Si. In our example, L1 would be 1 second, while Lj = 0.001 for j > 1.
  • Ci: the number of requests outstanding on server Si at any given time.
  • Ti: the throughput of server Si (in requests per second).
  • Pr(ci): the probability that a particular client has a request outstanding to server Si at any given time.
  • Pr(Ri): the probability that a newly-issued request will be issued to server Si.

We’re looking for a formula for Ti. For a server that can handle one request a time, the throughput is simply the inverse of the latency (1/Li). We said that our server was infinitely scalable, which means it can execute an arbitrary number of requests in parallel. Specifically, it would be executing Ci requests in parallel, so the throughput Ti is:

Now, how do we calculate Ci, the concurrency of requests at each server? Well, since each client has exactly one request outstanding at a time, Ci is exactly the number of clients that have a request outstanding to server Si. Since the clients operate independently, that’s just:

To calculate ci, we need the percentage of time that each client has a request outstanding to Si. We define a probability space of events (t, i) indicating that at millisecond timestamp t, the client has a request outstanding to server Si. By simply counting the events, we can say that the probability that a client has a request outstanding to server Si is:

Now, since it’s equally likely that we’ll assign any given request to any server:

that means:

All we’re saying here is that the fraction of time each client spends with a request outstanding to a particular server is exactly that server’s latency divided by the latency of all servers put together. This makes intuitive sense: if you have 20 servers that all take the same amount of time, each client would spend 5% (1/20) of its time on each server. If one of those servers takes twice as long as the others, it will spend 9.5% (2/21) of its time on that one (almost twice as much as before) and 4.8% (1/21) on the others.

With this, we can go back to Ci:

and finally back to Ti:

and we have our result: the throughput at each server does not depend at all on the latency of that server (or any other server, for that matter)!

We can plug in some numbers to sanity check. In our example, we started with:

and we get the results:

which matches what we said above.

Simulating the behavior

One might find this result hard to believe. I wrote a simple simulator to demonstrate it. The simulator maintains a virtual clock with a 1ms resolution. To start with, each client issues a request to a random server. Each virtual millisecond, the simulator determines which requests have completed and has the corresponding clients make another request. This runs for a fixed virtual time, after which we determine the total throughput and the per-server throughput.

The simulator as-is hardcodes the the scenario that I described above (1 server with 1s per request, 99 servers with 1ms per request) and a 10-minute simulation. For the specific numbers I gave earlier, the results are:

$ node sim.js
simulated time:           600000 ms
total client concurrency: 10000
     1 server that completes requests with latency 1000 ms (starting with "server_0")
    99 servers that complete requests with latency 1 ms (starting with "server_1")

server_0     5455039 requests (  9092 rps)
server_1     5462256 requests (  9104 rps)
server_2     5459316 requests (  9099 rps)
server_3     5463211 requests (  9105 rps)
server_4     5463885 requests (  9106 rps)
server_5     5456999 requests (  9095 rps)
server_95    5457743 requests (  9096 rps)
server_96    5459207 requests (  9099 rps)
server_97    5458421 requests (  9097 rps)
server_98    5458234 requests (  9097 rps)
server_99    5456471 requests (  9094 rps)
overall:   545829375 requests (909715 rps, expect 9097 rps per server)

In my simulations, server_0 is generally a bit slower than the others, but within 0.2% of the overall overage. The longer I run the simulation, the closer it gets to the mean. Given that the slow server is three orders of magnitude slower per request (1s vs. 1ms), I’d say it’s fair to conclude that the per-server throughput does not vary among servers when one server is slow.


The main result here is that in this system, the per-server throughput is the same for all servers, even when servers vary significantly in how fast they process requests. While the implementations of servers may be independent (i.e., they share no common hardware or software components), their behavior is not independent: the performance of one server depends on that of other servers! The servers have essentially been coupled to each other via the clients.

This result implies another useful one: in a system like this one, when overall throughput is degraded due to one or more poorly-performing servers, you cannot tell from throughput alone which server is slow — or even how many slow servers there are! You could determine which server was slow by looking at per-server latency.

Note too that even when this system is degraded by the slow server, the vast majority of requests complete very quickly. At the same time, clients spend the vast majority of their time waiting for the slow server. (We’ve come to refer to this at Joyent as “getting Amdahl’d”, after Amdahl’s Law.)

If you found this result surprising (as I did), then another takeaway is that the emergent behavior of even fairly simple systems can be surprising. This is important to keep in mind when making changes to the system (either as part of development or in production, as during incident response). Our intuition often leads us astray, and there’s no substitute for analysis.

What about those assumptions?

The assumptions we made earlier are critical to our result. Let’s take a closer look.

  • “There are a fixed number of clients.” This is true of some systems, but certainly not all of them. I expect the essential result holds as long as client count is not a function of server performance, which probably is true for many systems. However, if client software responds to poor performance by either reducing concurrency (figuring they might be overloading the server) or increasing it (to try to maintain a given throughput level), the behavior may well be different.
  • “All requests take about the same amount of work to process, both on the client-side and server-side.” This is true of some systems, but not others. When the costs differ between requests, I’d expect the actual per-client and per-request throughput to vary and make the result shown here harder to see, but it doesn’t change the underlying result.
  • “The per-request cost outside the server is negligible.” This isn’t true of real systems, but makes the math and simulation much simpler. I expect the results would hold as long as the per-request client and transit costs were fixed and small relative to the server cost (which is quite often true).
  • “The backend servers are infinitely scalable. That is, we’ll assume that if they take 1ms to process 1 request, they’ll take 1ms to process 10 requests issued at the same time.”. This is the most dubious of the assumptions, and it’s obviously not true in the limit. However, for compute-bound services, this is not an unreasonable approximation up to the concurrency limit of the service (e.g., the number of available threads or cores). For I/O bound services, this is also a reasonable approximation up to a certain point, since disks often work this way. (For disks, it may not take much longer to do several read or write I/Os — even to spinning disks — than to do one such I/O. The OS can issue them all to the disk, and the disk can schedule them so they happen in one rotation of the platter. It’s not always so perfect as that, but it remains true that you can increase the concurrent I/O for disks up to a point without significantly increasing latency.)
  • “Each request is dispatched to one of the servers uniformly at random.” Many client libraries (e.g., for RPC) use either a random-assignment policy like this one or a round-robin policy, which would have a similar result. For systems that don’t, this result likely won’t hold. In particular, a more sophisticated policy that keeps track of per-server latency might prefer the faster servers. That would send less work to the slower server, resulting in better overall throughput and an imbalance of work across servers.

As you can see, the assumptions we made were intended to highlight this particular effect — namely, the way a single slow server becomes a bottleneck in clients that affects throughput on other servers. If you have a real-world system that looks at all similar to this one and doesn’t have a more sophisticated request assignment policy, then most likely this effect is present to some degree but it maybe harder to isolate relative to other factors.

Visualizing PostgreSQL Vacuum Progress

As heavy users of PostgreSQL since 2012, we’ve learned quite a bit about operating PostgreSQL at scale. Our Manta object storage system uses a large fleet of sharded, highly-available, replicated PostgreSQL clusters at the heart of the metadata tier. When an end user requests their object, say, Manta winds up looking in this PostgreSQL cluster for the metadata for that object in order to find the storage servers hosting copies of the object (along with the size, checksum, and other useful information).

From 2012 right up through the present, one of our biggest challenges has been managing PostgreSQL’s vacuum process. I won’t rehash the details here, as there’s plenty of useful information about it. Indeed, managing vacuum is such well-known challenge in operating PostgreSQL that major efforts are underway to fundamentally change the on-disk storage system to eliminate the underlying problem. Since these efforts appear years away (especially for deployments where major PostgreSQL version upgrades will take time to test and orchestrate), it remains important to understand this problem today.

This post assumes that you’re familiar enough with PostgreSQL to understand:

  • The basics of a PostgreSQL database: tables, rows, indexes, and basic SQL statements.
  • The terms database “page” or “block” (which I believe are used interchangeably to refer to 8KiB chunks of files making up the table’s on-disk representation), “heap” (the sequence of pages, potentially stored across several files, that store the actual table data), “tuple” (internal representation of a row), and “index” (an on-disk data structured used for efficient access to rows in the table). For more on these structures, see “Introduction to PostgreSQL physical storage”.
  • Internally, row updates in PostgreSQL are processed similarly to a delete of the original row plus an insert of a new row. Since we’re not concerned about inserts in this post, we can generally just think about deletes.
  • When rows are deleted in PostgreSQL tables, they are not cleaned up right away because doing so while providing desirable ACID semantics (when other transactions might still be able to see the updated or deleted rows) would make the system much more complex (and possibly slow). Deleted or updated rows (tuples) are called “dead tuples”. For more on this, see “Routine Vacuuming” from PostgreSQL documentation.
  • The vacuum process is a long-running database operation that scans the heap and removes dead tuples (i.e., those invalidated by previous “update” or “delete” operations) from both the heap and indexes. Vacuum is also used to address other issues like XID wraparound, which has certainly been a major challenge for us and other providers, but it’s largely unrelated to this post.

In our production deployments, vacuum operations often take at least 5-10 days. For various reasons, it’s not uncommon for them to take 15-20 days, and we’ve seen some take upwards of 40 days. Vacuum has historically been a somewhat opaque process: it runs for a long time until it finishes, at which point it logs some basic information about what happened. Fortunately, recent versions of PostgreSQL include a view describing the progress of ongoing vacuum operations. This is fantastic! The docs are a bit light on interpreting this view, and while you can find posts explaining a bit about how to use the new view, I was left with a lot of questions. Which of the several sets of counters should we be looking at to determine overall progress? Are there even counters for all the parts we care about? Are all the counters expected to reach 100%? How do we know when certain phases will repeat and what can we do about this? This post documents how we’ve visualized the data in the view and what we’ve learned about the underlying process.

First cut: visualizing vacuum progress

We started putting together a PostgreSQL vacuum visualization by just plotting the values in the pg_stat_progress_vacuum view. We collect this information from PostgreSQL using a component we built called pgstatsmon, store it in Prometheus, and visualize it with Grafana.

First, we plot the phases:

The green line shows the phases of vacuum. These are described in the docs; the key phases here are “scanning heap” (2), “vacuuming indexes” (3), and “vacuuming heap” (4). The other phases finished quickly enough that we didn’t catch them with pgstatsmon. (Note that the phase numbers here might differ from what you see because pgstatsmon adds one to the phase number reported by PostgreSQL.)

We can also add the counts of blocks scanned and vacuumed, plus the total count of blocks (which is constant):

We see the count of blocks scanned (yellow line) rise up to the count of total blocks (blue line). Then we see nothing for a while, then we see the count of blocks vacuumed (orange line) rise up to total blocks (blue line again). This seems pretty useful! We have a better sense of what’s going on, where previously we just knew this was an opaque 4-day process. It’s useful to note that the lines aren’t quite linear (particularly the orange one). And there’s still a long period (about 50% of the vacuum) where we can’t see any meaningful change in progress.

The phases (the green line, now on the right y-axis) add a bit of information here: note that the count of blocks scanned increases during the “scanning heap” phase. Then we see a long, opaque “vacuuming indexes” phase. Finally, the count of blocks vacuumed increases during the “vacuuming heap” phase. (This might sound obvious, but notice that there are “scan” and “vacuum” phases for the heap and “vacuum” and “cleanup” phases for the index, so if you don’t really know what’s going on under the hood, it’s not obvious which counters increase during which phases.)

We also created a percentage view that’s a little less busy, but tells us essentially the same thing:

What’s going on here? Why do we seem to take two passes over the heap? What’s happening during the gap in between passes?

Diving a bit deeper: how vacuum operates

To understand PostgreSQL’s vacuum progress view, you really need to understand more about how vacuum operates. My understanding here is primarily thanks to Peter Geoghegan’s talk at PostgresOpen 2018 (slides, video). Any errors here are my own, and I’d appreciate correction from folks who know better!

First, tuples in a PostgreSQL table are identified by a ctid. This ctid is itself a two-tuple (page number, tuple number). (Pages here refer to sequential 8K pages in the heap, not operating system pages.) The page number is a type of physical address: that is, from the page number, PostgreSQL knows exactly which offset in which file contains that page. The tuple number is a logical address: it’s an ordinal number of the tuple within the page. In order to find information for that tuple, PostgreSQL looks in a sort of table-of-contents in the page’s header.

Most importantly for our discussion: this table-of-contents not only points PostgreSQL to the offset in the page where the tuple’s data is stored, but it also contains a bit indicating whether the tuple is known to be dead. If that bit is set, PostgreSQL behaves as if the tuple doesn’t exist — without even looking at the tuple data itself.

Why is all this important? Because tuples aren’t just referenced from the heap: they’re also referenced from indexes! Indexes are complex on-disk data structures that generally allow more rapid access to specific tuples — e.g., because the index is sorted by some field in the tuple. Rather than storing the entire contents of a tuple, indexes store the ctid. When a query accesses a tuple through an index, it first finds the ctid in the index, then accesses the tuple contents in the heap by ctid. (Again, if the “dead” bit is set in what I’m calling the page’s table of contents, PostgreSQL sees this when accessing the tuple before even looking at what data the tuple contains.)

In order to remove a particular tuple from the table, vacuum has to remove it not just from the heap, but also all of the indexes that reference the tuple. But there could be many indexes, and they could be huge, and they’re not necessarily structured to quickly find references to a given ctid. How can PostgreSQL find all references to a tuple — a handful of needles in a handful of haystacks — so that it can remove it from the heap without leaving references dangling from indexes?

PostgreSQL makes use of the indirection described above. Logically, the process of removing a tuple from the heap involves a few steps (note: this isn’t exactly what happens — it’s a conceptual approximation):

  1. Mark the tuple dead in the header of the page containing the tuple. Immediately, any attempt to reference the tuple from an index (i.e., by its ctid) will find the bit set and treat the tuple as though it’s already gone. (This is correct because vacuum has already established that this tuple really is dead. We didn’t discuss that, but vacuum can tell this because it records the id of the the oldest transaction id running when it started, and any tuple with a death transaction id older than that cannot be visible to any running transaction.)
  2. For each index: scan the index from start to finish for references to this ctid and remove them from the index.
  3. Go back to the page containing the tuple and mark the table-of-contents slot free (instead of just dead). Now this ctid can be reused for a new insert, knowing that there are no dangling references to the previous ctid.

To summarize: if PostgreSQL just reused a given ctid without first removing references from the indexes, then indexes would be left with pointers to who-knows-what in the heap. So it’s a three-step process for each tuple: mark it dead in the heap so that existing references stop working gracefully, remove references from the indexes, then actually free the slot in the heap.

PostgreSQL doesn’t scan all indexes from start to finish for each tuple it cleans up. Instead, as the vacuum process scans the heap and finds dead tuples, it marks the tuple dead as described above and adds it to a list of dead tuples it found. This list is buffered in memory, often called maintenance_work_mem after the tunable that controls how large this list can be. Later, the vacuum process scans each index from start to finish and removes references to all the tuples in the list. Then PostgreSQL goes back to the heap and completes step 3, actually freeing each ctid. In the ideal case, all of this happens once: PostgreSQL batches up the whole list of tuples and takes care of them in one whole pass over each index.

Back to the visualization

Here’s our graph of phases and blocks scanned again:

With our understanding of ctids, we can say a bit more about what’s going on:

  1. During the “scanning heap” phase (phase 2, on the right-Y axis), PostgreSQL is scanning through all blocks of the heap looking for dead tuples. I infer that during this phase, PostgreSQL is marking those tuples dead in the page header and adding them to the list of tuples being removed.
  2. During the “vacuuming indexes” phase (phase 3), PostgreSQL is scanning all of the indexes and removing tuples from each index that are also in the list of tuples being removed (accumulated during the previous phase).
  3. During the “vacuuming heap” phase (phase 4), PostgreSQL is going back through all the blocks of the heap and actually freeing ctids that were removed.

Great! We’ve got a decent understanding of what’s going on during a vacuum and a visualization to observe it in real-time. There are still some gaps mentioned above, like we can’t tell progress at all during the “vacuuming indexes” phase. However, there are a few things we can do with this:

  • If we’re in the “scanning heap” phase, we can at least estimate when that will finish, since progress appears roughly linear. We may also guess how long the “vacuuming heap” phase will take, since it seems to be about the same (which makes sense, since they’re doing similar work). This puts a lower bound on the total vacuum time, since we don’t know how long “vacuuming indexes” will take.
  • Once we reach the “vacuuming heap” phase, we could make plausible estimates about the vacuum completion time.

Is it always so simple?

Multiple index scans

Let’s take a look at more complex case:

What’s going on here? It looks like the vacuum went backwards. In the table describing the vacuum phases, the text mentions that “vacuuming indexes…may happen multiple times per vacuum if maintenance_work_mem is insufficient to store the number of dead tuples found.” It also says that during the “vacuuming heap” phase, “If heap_blks_scanned is less than heap_blks_total, the system will return to scanning the heap after this phase is completed; otherwise, it will begin cleaning up indexes after this phase is completed.”

I found it a little hard to understand this from the docs alone, but the deeper information above about how vacuum works helps explain what’s going on. Above, we said that as vacuum scans through the heap, it accumulates a list of tuples that are being removed so that it can remove them from the indexes as well. That list of tuples goes into a fixed-size block of memory whose size is determined by the maintenance_work_mem tunable. What happens when we fill up this memory buffer? PostgreSQL doesn’t have a lot of options. It proceeds to finish the vacuum of the dead tuples it has already found. That means scanning (and vacuuming) all of the indexes, removing the dead tuples it’s found so far. Then it vacuums those tuples from the heap as well (just like in a normal vacuum). Now, the maintenance work memory can be cleared and it can pick up scanning the heap where it left off. Here’s what it looks like in terms of the blocks scanned and vacuumed:

The graph is starting to get a little busy! But there’s a lot of information here.

  • The green line represents the current phase. This is the same as in the graph above. The phases are indexed on the right y-axis, and the main ones are the three we saw earlier: 2 = “scanning heap”, 3 = “vacuuming indexes”, and 4 = “vacuuming heap”.
  • Just like before, during the “scanning heap” phase, the count of blocks scanned increases. However, instead of increasing all the way to the total count of heap blocks, it stops at what looks like the halfway point. PostgreSQL jumps to the “vacuuming indexes” phase.
  • As before, we have no progress information for the “vacuuming indexes” phase.
  • Once that finishes, we move to “vacuuming heap” (again, just like the example we looked at before). Here, the count of blocks vacuumed rises up to the count of blocks scanned — and stops there.

At this point, something very interesting happens:

  • PostgreSQL moves back to the “scanning heap” phase.
  • The count of “index scans” increases. We didn’t talk about this before because it wasn’t that interesting. In the previous example, there was one index scan for the whole vacuum. This time, we’ve completed one index scan partway through, but the whole vacuum will require a second one.
  • The heap scan picks up where it had previously stopped. (The yellow line showing blocks scanned, which had been flat for a while, starts moving up again. The orange line, which shows blocks vacuumed, now flattens.)
  • This time, the heap scan makes it all the way to the end. We have another “vacuuming indexes” phase, followed by another “vacuuming heap” phase in which the count of blocks vacuumed rises to the total count of blocks.
  • When the count of blocks vacuumed reaches the total count of blocks, the vacuum is essentially finished.

Essentially, the entire table vacuum has been broken up into two pieces: one covering about the first half of the heap, and another covering the rest.

We said the reason for this second scan was that PostgreSQL had filled the maintenance memory with dead tuples and had to vacuum indexes before finishing the heap scan. Can we be sure of this? Can we visualize it? Fortunately, the pg_stat_progress_vacuum view reports both the number of dead tuples accumulated as well as the total number of tuples that can be stored. We plot this against the phases:

We see clearly the count of dead tuples rising up to the maximum count during the “scanning heap” phase. It remains flat for the next two phases. Having finished vacuuming those tuples, the list is emptied and PostgreSQL goes back to “scanning heap” phase. This time, we don’t quite fill up the buffer — the count of dead tuples flatlines at about half the total buffer size.

It might be clearer to look at this in terms of percentages:

Here, we see the dead tuple memory filling up (from 0% to 100%) as we scan the heap, filling up when the scan reaches about 55%. We vacuum indexes for a while, then starting vacuuming tuples from the heap again (the yellow line increasing from 0% up to the green line, about 55%). The dead tuple memory empties (goes back to 0%), starts filling up during the scan, and flattens when we finish scanning the heap.

This visualization allows us to clearly see that the vacuum is running out of maintenance memory before finishing the heap scan and we now understand that results in an extra whole index scan.

How many index scans?

In this example, we saw two index scans. The vacuum was broken up into two pieces about halfway through the heap scan. Just to be clear, it’s only a coincidence that this happened about halfway through the scan. Here’s an example where we just barely ran out of maintenance memory (i.e., the first round made it through most of the heap before having to scan indexes):

Specifically, we can see that the heap scan makes it over 80% of the way through the heap the first time. The second round is needed to finish the last bit of the heap. (Somehow, this case really emphasizes how much more expensive it turned out to be that we had to do a second index scan.)

As far as I know, it’s possible to have any number of index scans. The worst cases we’ve noticed in production involve about 5 index scans. It’s easy to tell how many have been needed for a running vacuum because it’s exposed in the pg_stat_progress_vacuum view.

Is this a big deal? All three of the vacuums that we looked at scanned the whole heap twice (once to scan, and once to vacuum each tuple) — it was just broken into two pieces for the second and third vacuums. However: these last two vacuums took two whole passes over the index files, compared to just one pass in the previous vacuum. That’s a big deal! In our case, the indexes alone are a few TiB, and in the second example, this phase took almost 3 days. Having to do a whole second scan likely added about an extra 3 days to this vacuum, nearly doubling how long it took. Yes, that’s a big deal.

Putting these graphs together, this is what our PostgreSQL vacuum dashboard looks like:

I think of this a bit like an X-Ray in that it’s quite useful in some cases but it requires pretty deep understanding to interpret. You really have to understand the multiple heap scans involved in each table scan, the index scan in between, and the use of maintenance memory and multiple index scans.

Armed with this understanding, we’ve been able to generate promising hypotheses to explain some major points of confusion. For example: we long wondered why some vacuums on the same shards take so much longer than others? Sometimes a shard will experience a 12-day vacuum, and the next one will be 20 days. We can now guess that the second vacuum required an extra index scan. If true, this is very actionable: in principle, we could tune up maintenance_work_mem. Tuning maintenance_work_mem can have an enormous impact on vacuum time, and the effect is highly non-linear. If you’ve got 520MiB worth of dead tuples, with maintenance_work_mem of 520 MiB the vacuum might take closer to 60% of the time required if maintenance_work_mem was just 512 MiB. That’s an extreme case, but the third example above isn’t far from that. (Unfortunately in our case, we’re already using a value for this larger than PostgreSQL can make use of, so there’s no low hanging fruit here.)

In terms of understanding progress:

  • Once the heap scan has begun (which is essentially right away), based on progress, we can estimate how long it will take. Anecdotally, this has been comparable to the time required for the vacuum pass as well. By doubling that estimate, we get a lower bound on how long the vacuum can take. While coarse, that’s still helpful.
  • If we can assume that the density of dead tuples in the heap is approximately uniform throughout the heap, we can also estimate how many index scans will be necessary for a given vacuum based on how quickly we’re filling the maintenance memory. We could even verify this assumption by seeing how quickly we fill the dead tuple memory as we scan the heap.
  • Once we’ve completed a round of vacuuming indexes, we can put together a reasonable estimate of the whole vacuum’s completion time. That’s especially useful when there will be many subsequent index scans, since if there’s only one, we’re likely close to the end of the vacuum at this point anyway.

This still feels pretty rough, though. We’ve got an awful lot of open questions, starting with:

  • How do we apply this to a system with several hundred shards? We can’t rely on human interpretation of a graphic.
  • Is it possible to get progress information during the “vacuum indexes” phase? This is still a pretty long, opaque phase.
  • Can we get information about the number of tuples frozen during the vacuum? How about the number of tuples frozen (or unfrozen) in the whole table? (This would help us understand if we should be tuning the system to freeze more aggressively during normal vacuums or if that’s causing the vacuums to take too long.)
  • For each block updated, was the block updated because a tuple was frozen, some tuples were dead, or some other reason? (This would help us understand why some vacuums require more scans than others.)

Final notes

Believe it or not, this is a grossly simplified explanation of what’s going on in PostgreSQL. Peter’s talk (below) explains much more detail, including opportunistic garbage collection, heap-only tuples, the fact that most space is actually freed during the heap scan pass, and much more. Other mechanisms like hint bits, the free space map, the visibility map affect all of this.

The above represents my understanding from putting together information from the official docs, various talks, and the source. I’ve tried to cite sources where possible, but in some cases I’ve made inferences that could be way off. I’d appreciate any feedback where it’s incorrect or unclear.

The pg_stat_progress_vacuum view has been a huge step forward for us. Many thanks to the PostgreSQL community. PostgreSQL has helped us scale Manta several orders of magnitude over just a few years. The documentation is quite extensive. Thanks to those members of the community who help others understand these implementation details that play an outsize role in operating PostgreSQL at scale. And thanks especially to Peter Geoghegan for his talk at PostgresOpen in SF in 2018, which helped us put these visuals together.

It’s hard to convey how significant the vacuum issue has been for us over the last few years. To be clear, that’s in large part because of our own missteps related to database configuration, system configuration, and application behavior. (PostgreSQL defaults didn’t help either — the default behavior of vacuuming when 20% of the heap is dead means that vacuums on linearly-growing tables become farther apart and longer.) Nevertheless, some of our mistakes are quite hard to correct because they involve time-consuming, downtime-inducing schema migrations on hundreds of clusters. A lot of folks at Joyent have worked hard over the years to better understand and improve the performance of PostgreSQL at Joyent — too many to name!

Further reading:

  • “Introduction to PostgreSQL physical storage” by Rachid Belaid. This post covers relevant PostgreSQL internal structures in much more detail than I did above, with more diagrams, and without assuming too much knowledge of PostgreSQL internals already.
  • “Bloat in PostgreSQL: A Taxonomy”. Peter Geoghegan. PostgresOpen 2018 (slides, video). This talk inspired and informed most of this work.
  • “Routine Vacuuming” from PostgreSQL documentation. This section is worth reading several times.
  • “Database Page Layout” from the PostgreSQL documentation.
  • “All the Dirt on Vacuum”. Jam Nasby. This link is to the latest instance of this popular talk on vacuum. You can find videos from previous versions elsewhere.

2019-07-01 Update: Grafana/Prometheus snippets

Several readers requested more details so they could more easily recreate these dashboards. Below I’ve listed out the Prometheus queries we use to draw the graphs in this post.

Our production dashboards are more complicated than I described above for reasons having nothing to do with the content of this post. We use several different Prometheus data sources, for one, and this particular dashboard is parametrized (using Grafana templates) by the PostgreSQL shard and relation that we’re looking at (among other things). With templates, Grafana shows drop-down selectors at the top of the page. These selectors fill in the $shard and $relation variables you’ll see in the queries below. These variables and their values are very specific to our deployment. If you try to use these queries, you’ll likely need to change the filters quite a bit. And remember, all of this data is collected by pgstatsmon.

The “Complete view” graph (which contains most of the interesting data) is driven by these queries:

  • “vacuum phase”: pg_stat_progress_vacuum_phase{relname="$relation", backend=~"$shard.postgres.*"} > 0
  • “heap_blks_scanned”: pg_stat_progress_vacuum_heap_blks_scanned{relname="$relation", backend=~"$shard.postgres.*"} > 0
  • “heap_blks_total”: pg_stat_progress_vacuum_heap_blks_total{relname="$relation", backend=~"$shard.postgres.*"} > 0
  • “heap_blks_vacuumed”: pg_stat_progress_vacuum_heap_blks_vacuumed{relname="$relation", backend=~"$shard.postgres.*"} > 0
  • “index_vacuum_count”: pg_stat_progress_vacuum_index_vacuum_count{relname="$relation", backend=~"$shard.postgres.*"} > 0

The “vacuum phase” and “index_vacuum_count” queries also feed the “Vacuum progress” graph. It doesn’t use any other queries.

The “Index scans” graph (which shows tuples processed) uses:

  • “vacuum phase”: pg_stat_progress_vacuum_phase{relname="$relation",backend=~"$shard.postgres.*"} > 0
  • “dead tuples processed in current index scan”: pg_stat_progress_vacuum_num_dead_tuples{relname="$relation",backend=~"$shard.postgres.*"} > 0
  • “max dead tuples that can be processed per index scan”: pg_stat_progress_vacuum_max_dead_tuples{relname="$relation",backend=~"$shard.postgres.*"} > 0

For the “Scan and vacuum progress” graph (which shows percentage progress), we use:

  • “% blocks scanned”: 100 * pg_stat_progress_vacuum_heap_blks_scanned{relname="$relation", backend=~"$shard.postgres.*"}/pg_stat_progress_vacuum_heap_blks_total{relname="$relation", backend=~"$shard.postgres.*"} > 0
  • “% blocks vacuumed”: 100 * pg_stat_progress_vacuum_heap_blks_vacuumed{relname="$relation", backend=~"$shard.postgres.*"}/pg_stat_progress_vacuum_heap_blks_total{relname="$relation", backend=~"$shard.postgres.*"} > 0
  • “% dead tuple memory used”: 100 * pg_stat_progress_vacuum_num_dead_tuples{relname="$relation",backend=~"$shard.postgres.*"} / pg_stat_progress_vacuum_max_dead_tuples{relname="$relation",backend=~"$shard.postgres.*"} > 0

I hope this is helpful! If there’s more information that’d be useful, let me know. (I’m not sure I can provide the raw dashboard JSON since it has details specific to our environment — but for that reason, I’m also not sure it would be very helpful.)

TCP puzzlers

This post was cross-posted to the Joyent blog.

It’s been said that we don’t really understand a system until we understand how it fails. Despite having written a (toy) TCP implementation in college and then working for several years in industry, I’m continuing to learn more deeply how TCP works — and how it fails. What’s been most surprising is how basic some of these failures are. They’re not at all obscure. I’m presenting them here as puzzlers, in the fashion of Car Talk and the old Java puzzlers. Like the best of those puzzlers, these are questions that are very simple to articulate, but the solutions are often surprising. And rather than focusing on arcane details, they hopefully elucidate some deep principles about how TCP works.


These puzzlers assume some basic knowledge about working with TCP on Unix-like systems, but you don’t have to have mastered any of this before diving in. As a refresher:

  • TCP states, the three-way handshake used to establish a connection, and the way connections are terminated are described pretty concisely on the TCP Wikipedia page.
  • Programs typically interact with sockets using read, write, connect, bind, listen, and accept. There’s also send and recv, but for our purposes, these work the same way as read and write.
  • I’ll be talking about programs that use poll. Although most systems use something more efficient like kqueue, event ports, or epoll, these are all equivalent for our purposes. As for applications that use blocking operations instead of any of these mechanisms: once you understand how TCP failure modes affect poll, it’s pretty easy to understand how it affects blocking operations as well.

You can try all of these examples yourself. I used two virtual machines running under VMware Fusion. The results match my experiences in our production systems. I’m testing using the nc(1) tool on SmartOS, and I don’t believe any of the behavior shown here is OS-specific. I’m using the illumos-specific truss(1) tool to trace system calls and to get some coarse timing information. You may be able to get similar information using dtruss(1m) on OS X or strace(1) on GNU/Linux.

nc(1) is a pretty simple tool. We’ll use it in two modes:

  • As a server. In this mode, nc will set up a listening socket, call accept, and block until a connection is received.
  • As a client. In this mode, nc will create a socket and establish a connection to a remote server.

In both modes, once connected, each side uses poll to wait for either stdin or the connected socket to have data ready to be read. Incoming data is printed to the terminal. Data you type into the terminal is sent over the socket. Upon CTRL-C, the socket is closed and the process exits.

In these examples, my client is called kang and my server is called kodos.

Warmup: Normal TCP teardown

This one demonstrates a very basic case just to get the ball rolling. Suppose we set up a server on kodos:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464310423.7650  [ Fri May 27 00:53:43 UTC 2016 ]
 0.0027 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0028 listen(3, 1, SOV_DEFAULT)                       = 0
accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) (sleeping...)

(Remember, in these examples, I’m using truss to print out the system calls that nc makes. The -d flag prints a relative timestamp and the -t flag selects which system calls we want to see.)

Now on kang, I establish a connection:


[root@kang ~]# truss -d -t connect,pollsys,read,write,close nc 8080
Base time stamp:  1464310447.6295  [ Fri May 27 00:54:07 UTC 2016 ]
 0.0062 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Over on kodos, we see:


23.8934 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)

At this point, the TCP connections are in state ESTABLISHED and both processes are in poll. We can see this using the netstat tool on each system:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------   1049792      0 1049800      0 ESTABLISHED

[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------    32806      0 1049800      0 ESTABLISHED

The question is, when we shut down one of these processes, what happens in the other process? Does it find out? How? Try to predict the specific syscall behavior and explain why each system call does what it does.

Let’s try it. I’ll CTRL-C the server on kodos:


pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)
^C127.6307          Received signal #2, SIGINT, in pollsys() [default]

Here’s what we see on kang:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
126.1771        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
126.1774        read(3, 0x08043670, 1024)                       = 0
126.1776        close(3)                                        = 0
[root@kang ~]#

What happened here? Let’s be specific:

  1. We sent SIGINT to the server, causing the process to exit. Upon exit, file descriptors are closed.
  2. When the last file descriptor for an ESTABLISHED socket is closed, the TCP stack on kodos sends a FIN over the connection and enters the FIN_WAIT_1 state.
  3. The TCP stack on kang receives the FIN packet, transitions its own connection to CLOSE_WAIT, and sends an ACK back. Since the nc client is blocked on this socket being ready to read, the kernel wakes this thread with the event POLLIN.
  4. The nc client sees POLLIN for the socket and calls read, which returns 0 immediately. This indicates end-of-stream. nc presumes that we’re done with this socket and closes it.
  5. Meanwhile, the TCP stack on kodos receives the ACK and enters FIN_WAIT_2.
  6. Since the nc client on kang closed its socket, the TCP stack on kang sent a FIN to kodos. The connection on kang enters LAST_ACK.
  7. The TCP stack on kodos receives the FIN, the connection enters TIME_WAIT, and the stack on kodos acknowledges the FIN.
  8. The TCP stack on kang receives the ACK for the FIN and removes the connection entirely.
  9. Two minutes later, the TCP connection on kodos is closed and the stack removes the connection entirely.

(It’s possible for steps to be slightly reordered and for kang to transition through CLOSING instead of FIN_WAIT_2.)

Here’s the final state according to netstat:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------   1049792      0 1049800      0 TIME_WAIT

There’s no output for this connection at all on kang.

The intermediate states transition very quickly, but you could see them using the DTrace TCP provider. You can see the packet flow using snoop(1m) or tcpdump(1).

Conclusions: We’ve seen the normal path of system calls for connection establishment and teardown. Note that kang immediately found out when kodos’s connection was closed — it was woken up out of poll and read returned 0 to indicate end-of-stream. At that point, kang elected to close its socket, which cleaned up the connection state on kodos. We’ll revisit this later to see how the results can differ if kang doesn’t close its socket here.

Puzzler 1: Power cycling

What happens to an established, idle TCP connection if one system is power cycled?

Since most processes go through normal exit as part of a graceful reboot (e.g., if you use the “reboot” command), you get basically the same result if you type “reboot” on kodos instead of killing the server with CTRL-C. But what would happen if we’d power-cycled kodos in the previous example? Surely kang will eventually find out, right?

Let’s try it. Set up the connection:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464312528.4308  [ Fri May 27 01:28:48 UTC 2016 ]
 0.0036 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0036 listen(3, 1, SOV_DEFAULT)                       = 0
 0.2518 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)

[root@kang ~]# truss -d -t open,connect,pollsys,read,write,close nc 8080
Base time stamp:  1464312535.7634  [ Fri May 27 01:28:55 UTC 2016 ]
 0.0055 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Now I’ll use VMware’s “reboot” function to power-cycle the system. It’s important for this test that this be an actual power cycle (or OS panic) — anything that causes a graceful shut down will look more like the first case above.

20 minutes later, kang is still sitting in the same place:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

We tend to believe that TCP’s job is to maintain a consistent abstraction (namely, the TCP connection) across multiple systems, so it’s surprising to discover cases where the abstraction is broken like this. And lest you think this is some nc(1) issue, it’s not. “netstat” on kodos shows no connection to kang, but kang shows a perfectly healthy connection to kodos:


[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------    32806      0 1049800      0 ESTABLISHED

We can leave this forever without touching it and kang will never figure out that kodos rebooted.

Now, suppose at this point, kang tries to send data to kodos. What happens?

Let’s try it:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
kodos, are you there?
3872.6918       pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
3872.6920       read(0, " k o d o s ,   a r e   y".., 1024)     = 22
3872.6924       write(3, " k o d o s ,   a r e   y".., 22)      = 22
3872.6932       pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
3872.6932       read(3, 0x08043670, 1024)                       Err#131 ECONNRESET
3872.6933       close(3)                                        = 0
[root@kang ~]#

When I type the message and hit enter, kodos gets woken up, reads the message from stdin, and sends it over the socket. The write call completes successfully! nc goes back to poll to wait for another event, eventually finds out that the socket can be read without blocking, and then calls read. This time, read fails with ECONNRESET. What does this mean? POSIX’s definition of read(2) says that this means:

A read was attempted on a socket and the connection was forcibly closed by its peer.

The illumos read(2) man page provides a little more context:

              The filedes argument refers to a connection oriented
              socket and the connection was forcibly closed by the peer
              and is no longer valid.  I/O can no longer be performed to

This error doesn’t mean that something specific to the read call went wrong, but rather that the socket itself is disconnected. Most other socket operations would fail the same way at that point.

So what happened here? At the point when the nc on kang tried to send data, the TCP stack still didn’t know the connection was dead. kang sent a data packet over to kodos, which responded with an RST because it didn’t know anything about the connection. kang saw the RST and tore down its connection. It cannot close the socket file descriptor — that’s not how file descriptors work — but subsequent operations fail with ECONNRESET until nc does close the fd.


  1. A hard power-cycle is very different than a graceful shutdown. When testing distributed systems, this case needs to be specifically tested. You can’t just kill a process and expect it to test the same thing.
  2. There are situations when one side believes a TCP connection is established, but the other side does not, and where this situation will never be automatically resolved. It’s possible to manage this problem using TCP or application-level keep-alive.
  3. The only reason that kang eventually found out that the remote side had disappeared was that it took action: it sent data and received a response indicating the connection was gone.

That raises the question: what if kodos had not responded to the data message for some reason?

Puzzler 2: Power off

What happens if one endpoint of a TCP connection is powered off for a while? Does the other endpoint ever discover this? If so, how? And when?

Once again, I’ll set up the connection with nc:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464385399.1661  [ Fri May 27 21:43:19 UTC 2016 ]
 0.0030 bind(3, 0x08065790, 32, SOV_SOCKBSD)        = 0
 0.0031 listen(3, 1, SOV_DEFAULT)           = 0
accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) (sleeping...)
 6.5491 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)

[root@kang ~]# truss -d -t open,connect,pollsys,read,write,close nc 8080
Base time stamp:  1464330881.0984  [ Fri May 27 06:34:41 UTC 2016 ]
 0.0057 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Now, I’ll cut power to kodos abruptly and attempt to send data from kang:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

114.4971        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
114.4974        read(0, "\n", 1024)                             = 1
114.4975        write(3, "\n", 1)                               = 1
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

The write completes normally, and I don’t see anything for quite a while. Just over 5 minutes later, I see this:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
425.5664        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
425.5665        read(3, 0x08043670, 1024)                       Err#145 ETIMEDOUT
425.5666        close(3)                                        = 0

This looks similar to the case where the system was power-cycled instead of powered off, except for two things: it took 5 minutes for the system to notice, and the error reported was ETIMEDOUT. Note that again, it’s not that the read timed out per se. We would have seen the same error from other socket operations, and subsequent socket operations would likely fail immediately with the same ETIMEDOUT error. That’s because it’s the socket that has entered a state where the underlying connection has timed out. The specific reason for this is that the remote side failed to acknowledge a data packet for too long — 5 minutes, as configured on this system.


  1. When the remote system is powered off instead of power cycled, again, the first system only finds out if it’s trying to send data. If it doesn’t send packets, it will never find out about the terminated connections.
  2. When the system has been attempting to send data for long enough without receiving acknowledgments from the remote side, the TCP connection will be terminated and a socket operation on them will fail with ETIMEDOUT.

Puzzler 3: Broken connection with no failure

This time, instead of giving you a specific case and asking what happens, I’ll flip it around: here’s an observation, and see if you can figure out how it could happen. We’ve discussed several cases where kang can believe it has a connection to kodos, but kodos doesn’t know about it. Is it possible for kang to have a connection to kodos without kodos knowing about it — indefinitely (i.e., where this will not resolve itself) and even if there’s been no power off, no power cycle, and no other failure of the operating system on kodos and no failure of networking equipment?

Here’s a hint: consider the case above where a connection is stuck in ESTABLISHED. In that case, it’s fair to say that the application has the responsibility to deal with the problem, since by definition it still has the socket open and could find out by sending data, at which point the connection will eventually be terminated. But what if the application didn’t have the socket open any more?

In the Warmup, we looked at the case where kodos’s nc closed its socket, and we said that kang’s nc read 0 (indicating end-of-stream) and then closed its socket. What if it didn’t close the socket, but kept it open? Obviously, it couldn’t read from it. But there’s nothing about TCP that says you can’t send more data to the other side that has sent you a FIN. FIN only means end-of-data-stream in the direction that the FIN was sent.

In order to demonstrate this, we can’t use nc on kang because it automatically closes its socket when it reads 0. So I’ve written a demo version of nc called dnc, which simply skips this behavior. It also prints out explicitly which system calls it’s making. This will give us a chance to observe the TCP states.

First, we’ll set up the connections as usual:


[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464392924.7841  [ Fri May 27 23:48:44 UTC 2016 ]
 0.0028 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0028 listen(3, 1, SOV_DEFAULT)                       = 0
accept(3, 0x08047B2C, 0x08047C2C, SOV_DEFAULT, 0) (sleeping...)
 1.9356 accept(3, 0x08047B2C, 0x08047C2C, SOV_DEFAULT, 0) = 4
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

[root@kang ~]# dnc 8080
2016-05-27T08:40:02Z: establishing connection
2016-05-27T08:40:02Z: connected
2016-05-27T08:40:02Z: entering poll()

Now let’s verify the connections are in the ESTABLISHED state we expect on both sides:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------   1049792      0 1049800      0 ESTABLISHED

[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------    32806      0 1049800      0 ESTABLISHED

Now, let’s CTRL-C the nc process on kodos:


pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
^C[root@kodos ~]#

We immediately see this on kang:


2016-05-27T08:40:12Z: poll returned events 0x0/0x1
2016-05-27T08:40:12Z: reading from socket
2016-05-27T08:40:12Z: read end-of-stream from socket
2016-05-27T08:40:12Z: read 0 bytes from socket
2016-05-27T08:40:12Z: entering poll()

Now let’s look at the TCP connection states:


[root@kodos ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------   1049792      0 1049800      0 FIN_WAIT_2

[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------    1049792      0 1049800      0 CLOSE_WAIT

This makes sense: kodos sent a FIN to kang. FIN_WAIT_2 indicates that kodos received an ACK from kang for the FIN that it sent, and CLOSE_WAIT indicates that kang received the FIN but has not sent a FIN back. This is a perfectly valid state for a TCP connection for an indefinite period of time. Imagine kodos had sent a request to kang and was not planning to send any more requests; kang could happily send data in response for hours, and this would work. Only in our case, kodos has actually closed the socket.

Let’s wait a minute and check the TCP connection states again. A minute later, the connection is completely missing from kodos, but it’s still present on kang:


[root@kang ~]# netstat -f inet -P tcp -n

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------    1049792      0 1049800      0 CLOSE_WAIT

What happened here? We hit a lesser-known special case in the TCP stack: when an application has closed a socket, the TCP stack has sent a FIN, and the remote TCP stack has acknowledged the FIN, the local TCP stack waits a fixed period of time and then closes the connection. The reason? In case the remote side has rebooted. This case is actually an analog to the case above where one side had an ESTABLISHED connection and the other doesn’t know about it. The difference is that in this specific case, the application has closed the socket, so there’s no other component that could deal with the problem. As a result, the TCP stack waits a fixed period of time and then tears down the connection (without sending anything to the other side).

Follow-up question: what happens if kang sends data to kodos at this point? Remember, kang still thinks the connection is open, but it’s been torn down on kodos.


2016-05-27T08:40:12Z: entering poll()
kodos, are you there?
2016-05-27T08:41:34Z: poll returned events 0x1/0x0
2016-05-27T08:41:34Z: reading from stdin
2016-05-27T08:41:34Z: writing 22 bytes read from stdin to socket
2016-05-27T08:41:34Z: entering poll()
2016-05-27T08:41:34Z: poll returned events 0x0/0x10
2016-05-27T08:41:34Z: reading from socket
dnc: read: Connection reset by peer

This is the same as we saw in case Puzzler 1: the write() actually succeeds, since the TCP stack doesn’t know that the connection is closed yet. But it does get a RST, which wakes up the thread in poll(), and the subsequent read() returns ECONNRESET.


  • It’s possible for two sides to disagree about the state of the connection even when there’s been no OS, network, or hardware failure at all.
  • In a case like the one above, it’s not possible for kang to distinguish between the case where kodos is attentively waiting to receive data or kodos has closed the socket and isn’t listening (at least, not without sending a packet). For this reason, maybe it’s not a great idea to design a system that uses sockets in these half-open states for an extended period of time under normal operation.


TCP is typically presented as a protocol for maintaining a consistent abstraction — the “TCP connection” — across two systems. We know that in the face of certain types of network and software problems, connections will fail, but it’s not always obvious that there are cases where the abstraction fails, in that the two systems disagree about the state of the connection. Specifically:

  • It’s possible for one system to think it has a working, established TCP connection to a remote system, while the remote system knows nothing about that connection.
  • This is possible even when there has been no failure of the network, hardware, or operating system.

These behaviors do not reflect deficiencies in TCP. Quite the contrary, in all of these cases, the TCP implementation appears to behave as reasonably as it could given the situation. If we were tempted to implement our own communication mechanism instead of using TCP, the presence of these cases might well remind us how complex the underlying problems are. These are intrinsic problems with distributed systems, and a TCP connection is fundamentally a distributed system.

That said, the single most important lesson in all of this is that the notion of a TCP connection that spans multiple systems is a convenient fiction. When it goes wrong, it’s critical to think about it as two separate state machines that operate simultaneously to try to maintain a consistent view of the connection. It’s the responsibility of the application to handle the cases where these differ (often using a keep-alive mechanism).

Furthermore, there’s a disconnect between the application’s file descriptor and the underlying TCP connection. TCP connections exist (in various states related to closing) even after an application has closed the file descriptor, and a file descriptor can be open when the underlying TCP connection has been closed as a result of a failure.

Other lessons to keep in mind:

  • Ungraceful reboot of a system (as happens when the OS crashes) is not the same as a userland process exiting or closing its socket. It’s important to test this case specifically. Reboot, when the remote system comes back online before the connection times out, is also different than power-off.
  • There’s no proactive notification from the kernel when a TCP socket is torn down. You only find this out when you call read(), write(), or some other socket operation on the socket file descriptor. If your program doesn’t do this for some reason, you’ll never find out about the connection failure.

Some related notes that I’ve found aren’t so commonly known:

  • ECONNRESET is a socket error you can see from read(), write(), and other operations that indicates that the remote peer has sent a RST.
  • ETIMEDOUT is a socket error you can see from read(), write(), and other operations that indicates that some timeout associated with the connection has elapsed. The cases I’ve seen most are when the remote side did not acknowledge some packet for too long. These are usually either data packets, a FIN packet, or a KeepAlive probe.

Importantly, neither of these errors means that there was anything wrong with the read or write operation that you tried to do — just that the socket itself is closed.

ACM Turing Centenary Celebration

This past weekend, I was very fortunate to have a chance to attend the ACM‘s Turing Centenary Celebration in honor of the 100th anniversary of the birth of Alan Turing. The event brought together nearly every living Turing Award winner for a series of talks and panel discussions on subjects like AI, theory of computation, computer architecture, and the role of computation in other fields. A webcast of the entire event is already available online. Below are some of my own notes on the conference.

For many of us in the audience, this weekend was about seeing in person the people we’d learned so much from both academically and professionally. This awe-inspiring feeling was best articulated by Vint Cerf in closing yesterday: “it’s like the history books opened up and the people walked out”. For me, by far, the highlight was the panel on Computer Architecture, moderated by David Patterson (yes, that Patterson) and featuring Frederick Brooks (of OS/360 and Mythical Man-Month fame), Ivan Sutherland, and Chuck Thacker. More than the other panels, I found all of the speakers’ prepared remarks accessible (perhaps because my work and interests most closely align with theirs), but at the same time very instructional. Sutherland began with the “Tyranny of the Clock”, an eloquent articulation of an important barrier in modern chip design and a call to action for designers and researchers. Then, in a sort of reverential but thoughtful engineering-style postmortem, Brooks discussed why the machine that Turing actually built, unlike so much of his other work, was not very influential. Thacker discussed the nature of computer architecture research and the modern developments that have made it more accessible for students today. In the subsequent discussion, Patterson referenced a prophetic quote by Maurice Wilkes at the dawn of modern computing (that Bryan also cited in his QCon talk last year) in which Wilkes suddenly “realized that a large part of my life from then on was going to be spent in ļ¬nding mistakes in my own programs”.


Complexity was a prominent theme in several sessions. Ken Thompson expressed his disappointment at the complexity of modern Linux, pointing to its 1000+ syscalls compared to the original Unix’s 200. In his prepared remarks, he also shared some grim reflections on how Turing would feel about the state of computing today: Windows, phishing, botnets, C++, and so on. He compared his feeling to that of an early television pioneer visiting our time and seeing people watching Maury Povich. On a more positive note in the same session, Fernando Corbato (leader on Multics) gave a brief but fascinating account of what it was like to work on computers in the early days. He called Unix actually one the greatest results of Multics, Unix being a “distillation” of the better ideas of Multics without all the complexity. (It’s well worth watching Fernando’s and Ken’s remarks from the “Systems Architecture, Design, Engineering, and Verification” session.) Later, Thacker too called complexity “the enemy”, suggesting that we seriously reconsider many of the held-over assumptions in today’s systems that are costing us enormously today. (I’m sure that’s a good idea, and I’d have loved to hear some examples of these things that he had in mind.)

In the Programming Languages panel, Barbara Liskov lamented that the languages people use today for production systems are pretty complex for introducing new students to computer programming, but also admitted that building languages simple enough to be thoroughly understood in an introductory course and rich enough to support what professional engineers want is a serious challenge. She suggested starting from scratch with only the essential language features of modularity and encapsulation. In the same session, Nicklaus Wirth (in an entertaining, light presentation) explained how he sought to design languages that were both simpler and more powerful than their contemporaries — that these are not opposing goals. All of the participants agreed that in practice, most popular languages seem to accrete lots of cruft from small changes that seem good at the time, but contribute to an overly complex system.

Lucky or good?

Another theme that came up quite a bit was the role of luck in the speakers’ success. Many of them attributed their success to luck, leaving it there, but I liked Dahlia Malkhi’s reference to a golfer who hit a hole-in-one. He was asked if that was a result of luck or training, and replied that he was lucky, but he had to train a lot in order to get that lucky.

Beware elegance?

Several speakers (notably Butler Lampson and Joseph Sifakis) mentioned that they tend to be suspicious of clean, elegant solutions to problems, because such solutions often don’t work well in the real world. I’d never heard it put so generally, especially by leaders in the field, as that goes against a common intuition among mathy people that’s usually nurtured as part of our education. (That’s still a good thing — as Einstein famously said, it’s important to strive for simplicity, but be careful not to go to far.) In fact, Sifakis attributed the lack of serious work in rigorous system design to researchers’ preference for nice theories, even if those theories don’t match reality. (While probably a factor, this explanation seems to leave out the economic cost of such rigor as an important reason why many systems today aren’t built the way he suggests.)

System verification

In the Systems Architecture and Verification session, Vint Cerf noted that automatic verifiers don’t seem to work well for many types of systems we build and asked Sifakis and E. Allen Emerson whether there existed interactive tools that would help programmers test assertions about their systems, rather than automatically trying to verify the whole thing. Emerson pointed out that this is called semi-automatic verification, but still seemed more interested in the fully-automatic kind. Vint’s idea made me think of a sort of extensible lint, since lint is already an admittedly limited tool for checking a fixed set of assertions about a program. But despite its limits, lint is incredibly useful (at least in languages like C and JavaScript) for rooting out large classes of bugs, and it would be interesting to think about a more interactive workflow that would free the tool from having to report only things it knows are problems. (People generally won’t use an error-checking tool that reports many false positives, but they might use a tool that can evaluate static assertions about their code in a less rigid context.)

“What”-based programming

Alan Kay and others talked about the idea of “what”-based programming, rather than the “how”-based approaches we typically use today. The idea is that humans tell the computer what to do, and some engine under the hood figures out how to do it. Kay demonstrated a graphical environment based on this idea, and then wondered why we couldn’t build more complex systems (including operating systems) that way. Bill, Robert, and I tried for a while to imagine what this would look like. On the one hand, many classes of device drivers are similar enough that you could imagine modeling some pieces of them with a declarative “what”-based description, but interaction with physical devices often requires precise sequences of register reads and writes and it’s hard to imagine encoding that without essentially describing “how”. Achieving good performance may be challenging, since humans writing code for such systems today necessarily describe how to organize them to be fast. And if you could solve this for something as constrained as device drivers, how could you generalize it for the schedulers or VM system, without encoding detailed knowledge into the engine that actually translates the “what” to the “how”? You could also imagine that debugging such systems would be very difficult. Still, I found the idea compelling, because there are many cases where we do build “what”-based descriptions and the result is that it’s much easier to verify both that the description does what the human wants it to do and that the “what”-to-”how” system properly translates it (e.g., Meta-D, a Cloud Analytics module that describes a family of DTrace scripts declaratively, or even the D language itself). It would be interesting to hear from Alan Kay what he was thinking in posing this question.

Computation in biology and physics

I was especially intrigued by Leonard Adleman‘s remarks during the “Algorithmic View of the Universe” panel, in which he talked about vastly different notions of computation and how results based on the Turing model, plus the Church-Turing thesis, can inform physics and biology. He discussed protein folding in the cell as a form of computation, and what implications that has for biologists trying to understanding cellular processes. Later he wondered what implications the proven constraints of the Turing model, taken as physical laws, would have on quantum mechanics (e.g., that certain types of time travel allowed by QM must actually be impossible).


These were just a few of the bits I found most interesting, but the whole weekend was a humbling experience. Besides being able to see so many important figures in the field, it was a good opportunity to step outside the confines of day-to-day engineering, which for me tends toward a time horizon of a few years. And most of the talks provoked interesting discussions. So thanks to all the speakers, and to the ACM for putting together the event and making the video available.