Visualizing PostgreSQL Vacuum Progress

As heavy users of PostgreSQL since 2012, we’ve learned quite a bit about operating PostgreSQL at scale. Our Manta object storage system uses a large fleet of sharded, highly-available, replicated PostgreSQL clusters at the heart of the metadata tier. When an end user requests their object, say http://us-east.manta.joyent.com/dap/public/kartlytics/videos/2012-09-06_0000-00.mov, Manta winds up looking in this PostgreSQL cluster for the metadata for that object in order to find the storage servers hosting copies of the object (along with the size, checksum, and other useful information).

From 2012 right up through the present, one of our biggest challenges has been managing PostgreSQL’s vacuum process. I won’t rehash the details here, as there’s plenty of useful information about it. Indeed, managing vacuum is such a well-known challenge in operating PostgreSQL that major efforts are underway to fundamentally change the on-disk storage system to eliminate the underlying problem. Since these efforts appear years away (especially for deployments where major PostgreSQL version upgrades will take time to test and orchestrate), it remains important to understand this problem today.

This post assumes that you’re familiar enough with PostgreSQL to understand:

  • The basics of a PostgreSQL database: tables, rows, indexes, and basic SQL statements.
  • The terms database “page” or “block” (which I believe are used interchangeably to refer to 8KiB chunks of files making up the table’s on-disk representation), “heap” (the sequence of pages, potentially stored across several files, that stores the actual table data), “tuple” (the internal representation of a row), and “index” (an on-disk data structure used for efficient access to rows in the table). For more on these structures, see “Introduction to PostgreSQL physical storage”.
  • Internally, row updates in PostgreSQL are processed similarly to a delete of the original row plus an insert of a new row. Since we’re not concerned about inserts in this post, we can generally just think about deletes.
  • When rows are deleted in PostgreSQL tables, they are not cleaned up right away because doing so while providing desirable ACID semantics (when other transactions might still be able to see the updated or deleted rows) would make the system much more complex (and possibly slow). Deleted or updated rows (tuples) are called “dead tuples”. For more on this, see “Routine Vacuuming” from PostgreSQL documentation.
  • The vacuum process is a long-running database operation that scans the heap and removes dead tuples (i.e., those invalidated by previous “update” or “delete” operations) from both the heap and indexes. Vacuum is also used to address other issues like XID wraparound, which has certainly been a major challenge for us and other providers, but it’s largely unrelated to this post.

In our production deployments, vacuum operations often take at least 5-10 days. For various reasons, it’s not uncommon for them to take 15-20 days, and we’ve seen some take upwards of 40 days. Vacuum has historically been a somewhat opaque process: it runs for a long time until it finishes, at which point it logs some basic information about what happened. Fortunately, recent versions of PostgreSQL include a view describing the progress of ongoing vacuum operations. This is fantastic! The docs are a bit light on interpreting this view, and while you can find posts explaining a bit about how to use the new view, I was left with a lot of questions. Which of the several sets of counters should we be looking at to determine overall progress? Are there even counters for all the parts we care about? Are all the counters expected to reach 100%? How do we know when certain phases will repeat and what can we do about this? This post documents how we’ve visualized the data in the view and what we’ve learned about the underlying process.

First cut: visualizing vacuum progress

We started putting together a PostgreSQL vacuum visualization by just plotting the values in the pg_stat_progress_vacuum view. We collect this information from PostgreSQL using a component we built called pgstatsmon, store it in Prometheus, and visualize it with Grafana.
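
We collect these metrics with pgstatsmon, but you don’t need that whole pipeline just to watch a vacuum. Here’s a minimal sketch in Python (using psycopg2) that polls the view directly. This is not pgstatsmon; the connection string is a placeholder, and the column names are the ones documented for pg_stat_progress_vacuum:

# Minimal sketch: sample pg_stat_progress_vacuum periodically and print a
# progress summary. Not pgstatsmon; the DSN below is a placeholder.
import time

import psycopg2

DSN = "host=127.0.0.1 dbname=postgres user=postgres"  # placeholder

QUERY = """
    SELECT relid::regclass, phase, heap_blks_total, heap_blks_scanned,
           heap_blks_vacuumed, index_vacuum_count, max_dead_tuples,
           num_dead_tuples
      FROM pg_stat_progress_vacuum
"""

def pct(part, whole):
    return 100.0 * part / whole if whole else 0.0

def main(interval=10):
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()
    while True:
        cur.execute(QUERY)
        for (rel, phase, total, scanned, vacuumed,
             nindexscans, maxdead, ndead) in cur.fetchall():
            print("%s: phase=%s scanned=%.1f%% vacuumed=%.1f%% "
                  "index_scans=%d dead_tuple_mem=%.1f%%" % (
                      rel, phase, pct(scanned, total), pct(vacuumed, total),
                      nindexscans, pct(ndead, maxdead)))
        time.sleep(interval)

if __name__ == "__main__":
    main()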

First, we plot the phases:

The green line shows the phases of vacuum. These are described in the docs; the key phases here are “scanning heap” (2), “vacuuming indexes” (3), and “vacuuming heap” (4). The other phases finished quickly enough that we didn’t catch them with pgstatsmon. (Note that the phase numbers here might differ from what you see because pgstatsmon adds one to the phase number reported by PostgreSQL.)

We can also add the counts of blocks scanned and vacuumed, plus the total count of blocks (which is constant):

We see the count of blocks scanned (yellow line) rise up to the count of total blocks (blue line). Then we see nothing for a while, then we see the count of blocks vacuumed (orange line) rise up to total blocks (blue line again). This seems pretty useful! We have a better sense of what’s going on, where previously we just knew this was an opaque 4-day process. It’s useful to note that the lines aren’t quite linear (particularly the orange one). And there’s still a long period (about 50% of the vacuum) where we can’t see any meaningful change in progress.

The phases (the green line, now on the right y-axis) add a bit of information here: note that the count of blocks scanned increases during the “scanning heap” phase. Then we see a long, opaque “vacuuming indexes” phase. Finally, the count of blocks vacuumed increases during the “vacuuming heap” phase. (This might sound obvious, but notice that there are “scan” and “vacuum” phases for the heap and “vacuum” and “cleanup” phases for the index, so if you don’t really know what’s going on under the hood, it’s not obvious which counters increase during which phases.)

We also created a percentage view that’s a little less busy, but tells us essentially the same thing:

What’s going on here? Why do we seem to take two passes over the heap? What’s happening during the gap in between passes?

Diving a bit deeper: how vacuum operates

To understand PostgreSQL’s vacuum progress view, you really need to understand more about how vacuum operates. My understanding here is primarily thanks to Peter Geoghegan’s talk at PostgresOpen 2018 (slides, video). Any errors here are my own, and I’d appreciate correction from folks who know better!

First, tuples in a PostgreSQL table are identified by a ctid. This ctid is itself a two-tuple (page number, tuple number). (Pages here refer to sequential 8K pages in the heap, not operating system pages.) The page number is a type of physical address: that is, from the page number, PostgreSQL knows exactly which offset in which file contains that page. The tuple number is a logical address: it’s an ordinal number of the tuple within the page. In order to find information for that tuple, PostgreSQL looks in a sort of table-of-contents in the page’s header.
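
If you’re curious, you can see this structure for yourself by selecting the hidden ctid system column from any table. A quick sketch (the table name and connection string below are placeholders):

import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=postgres")  # placeholder DSN
cur = conn.cursor()
# ctid is a hidden system column; it comes back as text like '(0,3)',
# meaning page 0, tuple number 3 within that page.
cur.execute("SELECT ctid FROM some_table LIMIT 5")
for (ctid,) in cur.fetchall():
    print(ctid)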

Most importantly for our discussion: this table-of-contents not only points PostgreSQL to the offset in the page where the tuple’s data is stored, but it also contains a bit indicating whether the tuple is known to be dead. If that bit is set, PostgreSQL behaves as if the tuple doesn’t exist — without even looking at the tuple data itself.

Why is all this important? Because tuples aren’t just referenced from the heap: they’re also referenced from indexes! Indexes are complex on-disk data structures that generally allow more rapid access to specific tuples — e.g., because the index is sorted by some field in the tuple. Rather than storing the entire contents of a tuple, indexes store the ctid. When a query accesses a tuple through an index, it first finds the ctid in the index, then accesses the tuple contents in the heap by ctid. (Again, if the “dead” bit is set in what I’m calling the page’s table of contents, PostgreSQL sees this when accessing the tuple before even looking at what data the tuple contains.)

In order to remove a particular tuple from the table, vacuum has to remove it not just from the heap, but also all of the indexes that reference the tuple. But there could be many indexes, and they could be huge, and they’re not necessarily structured to quickly find references to a given ctid. How can PostgreSQL find all references to a tuple — a handful of needles in a handful of haystacks — so that it can remove it from the heap without leaving references dangling from indexes?

PostgreSQL makes use of the indirection described above. Logically, the process of removing a tuple from the heap involves a few steps (note: this isn’t exactly what happens — it’s a conceptual approximation):

  1. Mark the tuple dead in the header of the page containing the tuple. Immediately, any attempt to reference the tuple from an index (i.e., by its ctid) will find the bit set and treat the tuple as though it’s already gone. (This is correct because vacuum has already established that this tuple really is dead. We didn’t discuss that, but vacuum can tell this because it records the id of the oldest transaction running when it started, and any tuple with a death transaction id older than that cannot be visible to any running transaction.)
  2. For each index: scan the index from start to finish for references to this ctid and remove them from the index.
  3. Go back to the page containing the tuple and mark the table-of-contents slot free (instead of just dead). Now this ctid can be reused for a new insert, knowing that there are no dangling references to the previous ctid.

To summarize: if PostgreSQL just reused a given ctid without first removing references from the indexes, then indexes would be left with pointers to who-knows-what in the heap. So it’s a three-step process for each tuple: mark it dead in the heap so that existing references stop working gracefully, remove references from the indexes, then actually free the slot in the heap.

PostgreSQL doesn’t scan all indexes from start to finish for each tuple it cleans up. Instead, as the vacuum process scans the heap and finds dead tuples, it marks each tuple dead as described above and adds it to a list of dead tuples it has found. This list is buffered in memory; that buffer is often called maintenance_work_mem, after the tunable that controls how large it can be. Later, the vacuum process scans each index from start to finish and removes references to all the tuples in the list. Then PostgreSQL goes back to the heap and completes step 3, actually freeing each ctid. In the ideal case, all of this happens once: PostgreSQL batches up the whole list of tuples and takes care of them in one whole pass over each index.
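
Here’s a conceptual sketch of how these pieces fit together. Like the three-step description above, it’s an approximation of the structure rather than PostgreSQL’s actual implementation, and all of the names in it are mine:

def lazy_vacuum(heap, indexes, max_dead_tuples):
    """Heap: a list of pages, each page a dict mapping tuple number to state
    ("live", "deleted", "dead", or "free"). Each index: a set of ctids, i.e.,
    (page number, tuple number) pairs. Returns the number of index scans."""
    dead = []                  # the in-memory list bounded by maintenance_work_mem
    index_scans = 0

    def vacuum_indexes_and_heap():
        nonlocal index_scans
        for index in indexes:              # "vacuuming indexes": full pass per index
            index.difference_update(dead)  # step 2: drop references to these ctids
        index_scans += 1
        for pageno, slot in dead:          # "vacuuming heap": revisit marked pages
            heap[pageno][slot] = "free"    # step 3: the slot can now be reused
        dead.clear()

    for pageno, page in enumerate(heap):   # "scanning heap": one pass over the heap
        for slot, state in list(page.items()):
            if state == "deleted":         # invisible to every running transaction
                page[slot] = "dead"        # step 1: mark dead in the page header
                dead.append((pageno, slot))
        if len(dead) >= max_dead_tuples:
            vacuum_indexes_and_heap()      # buffer full: costs an extra index scan

    if dead:
        vacuum_indexes_and_heap()
    return index_scans

In the ideal case the dead-tuple buffer never fills, so this returns 1: one index scan for the whole vacuum. Each time the buffer does fill, we pay another full pass over every index, which is exactly the behavior we’ll see in the more complex examples below.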

Back to the visualization

Here’s our graph of phases and blocks scanned again:

With our understanding of ctids, we can say a bit more about what’s going on:

  1. During the “scanning heap” phase (phase 2, on the right-Y axis), PostgreSQL is scanning through all blocks of the heap looking for dead tuples. I infer that during this phase, PostgreSQL is marking those tuples dead in the page header and adding them to the list of tuples being removed.
  2. During the “vacuuming indexes” phase (phase 3), PostgreSQL is scanning all of the indexes and removing tuples from each index that are also in the list of tuples being removed (accumulated during the previous phase).
  3. During the “vacuuming heap” phase (phase 4), PostgreSQL is going back through all the blocks of the heap and actually freeing ctids that were removed.

Great! We’ve got a decent understanding of what’s going on during a vacuum and a visualization to observe it in real time. There are still some gaps mentioned above: for example, we can’t see progress at all during the “vacuuming indexes” phase. However, there are a few things we can do with this:

  • If we’re in the “scanning heap” phase, we can at least estimate when that will finish, since progress appears roughly linear. We may also guess how long the “vacuuming heap” phase will take, since it seems to be about the same (which makes sense, since they’re doing similar work). This puts a lower bound on the total vacuum time, since we don’t know how long “vacuuming indexes” will take.
  • Once we reach the “vacuuming heap” phase, we could make plausible estimates about the vacuum completion time.

Is it always so simple?

Multiple index scans

Let’s take a look at a more complex case:

What’s going on here? It looks like the vacuum went backwards. In the table describing the vacuum phases, the text mentions that “vacuuming indexes…may happen multiple times per vacuum if maintenance_work_mem is insufficient to store the number of dead tuples found.” It also says that during the “vacuuming heap” phase, “If heap_blks_scanned is less than heap_blks_total, the system will return to scanning the heap after this phase is completed; otherwise, it will begin cleaning up indexes after this phase is completed.”

I found it a little hard to understand this from the docs alone, but the deeper information above about how vacuum works helps explain what’s going on. Above, we said that as vacuum scans through the heap, it accumulates a list of tuples that are being removed so that it can remove them from the indexes as well. That list of tuples goes into a fixed-size block of memory whose size is determined by the maintenance_work_mem tunable. What happens when we fill up this memory buffer? PostgreSQL doesn’t have a lot of options. It proceeds to finish the vacuum of the dead tuples it has already found. That means scanning (and vacuuming) all of the indexes, removing the dead tuples it’s found so far. Then it vacuums those tuples from the heap as well (just like in a normal vacuum). Now, the maintenance work memory can be cleared and it can pick up scanning the heap where it left off. Here’s what it looks like in terms of the blocks scanned and vacuumed:

The graph is starting to get a little busy! But there’s a lot of information here.

  • The green line represents the current phase. This is the same as in the graph above. The phases are indexed on the right y-axis, and the main ones are the three we saw earlier: 2 = “scanning heap”, 3 = “vacuuming indexes”, and 4 = “vacuuming heap”.
  • Just like before, during the “scanning heap” phase, the count of blocks scanned increases. However, instead of increasing all the way to the total count of heap blocks, it stops at what looks like the halfway point. PostgreSQL jumps to the “vacuuming indexes” phase.
  • As before, we have no progress information for the “vacuuming indexes” phase.
  • Once that finishes, we move to “vacuuming heap” (again, just like the example we looked at before). Here, the count of blocks vacuumed rises up to the count of blocks scanned — and stops there.

At this point, something very interesting happens:

  • PostgreSQL moves back to the “scanning heap” phase.
  • The count of “index scans” increases. We didn’t talk about this before because it wasn’t that interesting. In the previous example, there was one index scan for the whole vacuum. This time, we’ve completed one index scan partway through, but the whole vacuum will require a second one.
  • The heap scan picks up where it had previously stopped. (The yellow line showing blocks scanned, which had been flat for a while, starts moving up again. The orange line, which shows blocks vacuumed, now flattens.)
  • This time, the heap scan makes it all the way to the end. We have another “vacuuming indexes” phase, followed by another “vacuuming heap” phase in which the count of blocks vacuumed rises to the total count of blocks.
  • When the count of blocks vacuumed reaches the total count of blocks, the vacuum is essentially finished.

Essentially, the entire table vacuum has been broken up into two pieces: one covering about the first half of the heap, and another covering the rest.

We said the reason for this second scan was that PostgreSQL had filled the maintenance memory with dead tuples and had to vacuum indexes before finishing the heap scan. Can we be sure of this? Can we visualize it? Fortunately, the pg_stat_progress_vacuum view reports both the number of dead tuples accumulated as well as the total number of tuples that can be stored. We plot this against the phases:

We see clearly the count of dead tuples rising up to the maximum count during the “scanning heap” phase. It remains flat for the next two phases. Having finished vacuuming those tuples, the list is emptied and PostgreSQL goes back to “scanning heap” phase. This time, we don’t quite fill up the buffer — the count of dead tuples flatlines at about half the total buffer size.

It might be clearer to look at this in terms of percentages:

Here, we see the dead tuple memory filling up (from 0% to 100%) as we scan the heap, becoming full when the scan reaches about 55%. We vacuum indexes for a while, then start vacuuming tuples from the heap (the yellow line increasing from 0% up to the green line, at about 55%). The dead tuple memory empties (goes back to 0%), starts filling up again as the scan resumes, and flattens when we finish scanning the heap.

This visualization allows us to see clearly that the vacuum is running out of maintenance memory before finishing the heap scan, and we now understand that this results in an entire extra index scan.

How many index scans?

In this example, we saw two index scans. The vacuum was broken up into two pieces about halfway through the heap scan. Just to be clear, it’s only a coincidence that this happened about halfway through the scan. Here’s an example where we just barely ran out of maintenance memory (i.e., the first round made it through most of the heap before having to scan indexes):

Specifically, we can see that the heap scan makes it over 80% of the way through the heap the first time. The second round is needed only to finish the last bit of the heap. (Somehow, this case really emphasizes how expensive it turned out to be that we needed a second index scan.)

As far as I know, it’s possible to have any number of index scans. The worst cases we’ve noticed in production involve about 5 index scans. It’s easy to tell how many have been needed for a running vacuum because it’s exposed in the pg_stat_progress_vacuum view.

Is this a big deal? All three of the vacuums that we looked at scanned the whole heap twice (once to scan, and once to vacuum each tuple) — it was just broken into two pieces for the second and third vacuums. However: these last two vacuums took two whole passes over the index files, compared to just one pass in the previous vacuum. That’s a big deal! In our case, the indexes alone are a few TiB, and in the second example, this phase took almost 3 days. Having to do a whole second scan likely added about an extra 3 days to this vacuum, nearly doubling how long it took. Yes, that’s a big deal.

Putting these graphs together, this is what our PostgreSQL vacuum dashboard looks like:

I think of this a bit like an X-ray: it’s quite useful in some cases, but it requires pretty deep understanding to interpret. You really have to understand the multiple heap passes involved in each vacuum, the index scans in between, and the way maintenance memory can force multiple index scans.

Armed with this understanding, we’ve been able to generate promising hypotheses to explain some major points of confusion. For example: we long wondered why some vacuums on the same shard take so much longer than others. Sometimes a shard will experience a 12-day vacuum, and the next one will take 20 days. We can now guess that the second vacuum required an extra index scan. If true, this is very actionable: in principle, we could raise maintenance_work_mem. Tuning maintenance_work_mem can have an enormous impact on vacuum time, and the effect is highly non-linear. If you’ve got 520 MiB worth of dead tuples, with a maintenance_work_mem of 520 MiB the vacuum might take closer to 60% of the time required if maintenance_work_mem were just 512 MiB. That’s an extreme case, but the third example above isn’t far from that. (Unfortunately in our case, we’re already using a value larger than PostgreSQL can make use of, so there’s no low-hanging fruit here.)
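
To make that non-linearity concrete, here’s a toy model. All of the numbers below are made up for illustration; the point is the step function in the number of index scans when the dead tuples no longer fit in the buffer:

# Toy model of total vacuum time as a function of how many dead-tuple pointers
# the maintenance buffer can hold. Made-up numbers; it assumes the index
# passes dominate, as they do for us.
import math

def vacuum_hours(dead_tuples, buffer_tuples, heap_pass_hours, index_pass_hours):
    index_scans = max(1, math.ceil(dead_tuples / buffer_tuples))
    # one full heap scan + one full heap vacuum + N full index passes
    return 2 * heap_pass_hours + index_scans * index_pass_hours

# Buffer just barely too small: two index passes.
print(vacuum_hours(90_000_000, 88_000_000, 24, 70))   # 188 hours
# Slightly larger buffer: one index pass.
print(vacuum_hours(90_000_000, 92_000_000, 24, 70))   # 118 hours (about 60% of 188)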

In terms of understanding progress:

  • Once the heap scan has begun (which is essentially right away), we can estimate from its progress how long it will take. Anecdotally, the “vacuuming heap” pass has taken a comparable amount of time. By doubling the heap scan estimate, we get a lower bound on how long the whole vacuum will take. While coarse, that’s still helpful.
  • If we can assume that the density of dead tuples in the heap is approximately uniform throughout the heap, we can also estimate how many index scans will be necessary for a given vacuum based on how quickly we’re filling the maintenance memory (see the sketch after this list). We could even check this assumption by watching whether the dead tuple memory fills at a steady rate as we scan the heap.
  • Once we’ve completed a round of vacuuming indexes, we can put together a reasonable estimate of the whole vacuum’s completion time. That’s especially useful when there will be many subsequent index scans, since if there’s only one, we’re likely close to the end of the vacuum at this point anyway.
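
Here’s a rough sketch of those estimates, computed from a single sample of the progress view plus the elapsed time since the vacuum started. It’s only a back-of-the-envelope model: it assumes dead tuples are spread roughly uniformly through the heap, that the heap-vacuum pass runs at about the same rate as the heap scan, and that no index scan has happened yet (num_dead_tuples resets after each one):

# Rough estimates from one pg_stat_progress_vacuum sample. Assumes the heap
# scan is already underway (heap_blks_scanned > 0).
import math

def estimate(elapsed_s, heap_blks_total, heap_blks_scanned,
             num_dead_tuples, max_dead_tuples):
    scan_fraction = heap_blks_scanned / heap_blks_total
    scan_rate = heap_blks_scanned / elapsed_s              # blocks per second

    # Project total dead tuples for the whole heap, then how many times the
    # buffer will fill -- each fill costs one full pass over every index.
    projected_dead = num_dead_tuples / scan_fraction
    predicted_index_scans = max(1, math.ceil(projected_dead / max_dead_tuples))

    # Lower bound: finish the heap scan, then a heap-vacuum pass of similar
    # cost. Index scan time is unknown, so this is only a lower bound.
    remaining_scan_s = (heap_blks_total - heap_blks_scanned) / scan_rate
    lower_bound_s = elapsed_s + remaining_scan_s + heap_blks_total / scan_rate
    return predicted_index_scans, lower_bound_s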

This still feels pretty rough, though. We’ve got an awful lot of open questions, starting with:

  • How do we apply this to a system with several hundred shards? We can’t rely on human interpretation of a graphic.
  • Is it possible to get progress information during the “vacuuming indexes” phase? This is still a pretty long, opaque phase.
  • Can we get information about the number of tuples frozen during the vacuum? How about the number of tuples frozen (or unfrozen) in the whole table? (This would help us understand if we should be tuning the system to freeze more aggressively during normal vacuums or if that’s causing the vacuums to take too long.)
  • For each block updated, was the block updated because a tuple was frozen, some tuples were dead, or some other reason? (This would help us understand why some vacuums require more scans than others.)

Final notes

Believe it or not, this is a grossly simplified explanation of what’s going on in PostgreSQL. Peter’s talk (linked below) goes into much more detail, including opportunistic garbage collection, heap-only tuples, the fact that most space is actually freed during the heap scan pass, and much more. Other mechanisms like hint bits, the free space map, and the visibility map affect all of this.

The above represents my understanding from putting together information from the official docs, various talks, and the source. I’ve tried to cite sources where possible, but in some cases I’ve made inferences that could be way off. I’d appreciate any feedback where it’s incorrect or unclear.

The pg_stat_progress_vacuum view has been a huge step forward for us. Many thanks to the PostgreSQL community. PostgreSQL has helped us scale Manta several orders of magnitude over just a few years. The documentation is quite extensive. Thanks to those members of the community who help others understand these implementation details that play an outsize role in operating PostgreSQL at scale. And thanks especially to Peter Geoghegan for his talk at PostgresOpen in SF in 2018, which helped us put these visuals together.

It’s hard to convey how significant the vacuum issue has been for us over the last few years. To be clear, that’s in large part because of our own missteps related to database configuration, system configuration, and application behavior. (PostgreSQL defaults didn’t help either — the default behavior of vacuuming when 20% of the heap is dead means that vacuums on linearly-growing tables become farther apart and longer.) Nevertheless, some of our mistakes are quite hard to correct because they involve time-consuming, downtime-inducing schema migrations on hundreds of clusters. A lot of folks at Joyent have worked hard over the years to better understand and improve the performance of PostgreSQL in our deployments — too many to name!

Further reading:

  • “Introduction to PostgreSQL physical storage” by Rachid Belaid. This post covers relevant PostgreSQL internal structures in much more detail than I did above, with more diagrams, and without assuming too much knowledge of PostgreSQL internals already.
  • “Bloat in PostgreSQL: A Taxonomy”. Peter Geoghegan. PostgresOpen 2018 (slides, video). This talk inspired and informed most of this work.
  • “Routine Vacuuming” from the PostgreSQL documentation. This section is worth reading several times.
  • “Database Page Layout” from the PostgreSQL documentation.
  • “All the Dirt on Vacuum”. Jim Nasby. This link is to the latest instance of this popular talk on vacuum. You can find videos from previous versions elsewhere.

TCP puzzlers

This post was cross-posted to the Joyent blog.

It’s been said that we don’t really understand a system until we understand how it fails. Despite having written a (toy) TCP implementation in college and then working for several years in industry, I’m continuing to learn more deeply how TCP works — and how it fails. What’s been most surprising is how basic some of these failures are. They’re not at all obscure. I’m presenting them here as puzzlers, in the fashion of Car Talk and the old Java puzzlers. Like the best of those puzzlers, these are questions that are very simple to articulate, but the solutions are often surprising. And rather than focusing on arcane details, they hopefully elucidate some deep principles about how TCP works.

Prerequisites

These puzzlers assume some basic knowledge about working with TCP on Unix-like systems, but you don’t have to have mastered any of this before diving in. As a refresher:

  • TCP states, the three-way handshake used to establish a connection, and the way connections are terminated are described pretty concisely on the TCP Wikipedia page.
  • Programs typically interact with sockets using read, write, connect, bind, listen, and accept. There’s also send and recv, but for our purposes, these work the same way as read and write.
  • I’ll be talking about programs that use poll. Although most systems use something more efficient like kqueue, event ports, or epoll, these are all equivalent for our purposes. As for applications that use blocking operations instead of any of these mechanisms: once you understand how TCP failure modes affect poll, it’s pretty easy to understand how they affect blocking operations as well.

You can try all of these examples yourself. I used two virtual machines running under VMware Fusion. The results match my experiences in our production systems. I’m testing using the nc(1) tool on SmartOS, and I don’t believe any of the behavior shown here is OS-specific. I’m using the illumos-specific truss(1) tool to trace system calls and to get some coarse timing information. You may be able to get similar information using dtruss(1m) on OS X or strace(1) on GNU/Linux.

nc(1) is a pretty simple tool. We’ll use it in two modes:

  • As a server. In this mode, nc will set up a listening socket, call accept, and block until a connection is received.
  • As a client. In this mode, nc will create a socket and establish a connection to a remote server.

In both modes, once connected, each side uses poll to wait for either stdin or the connected socket to have data ready to be read. Incoming data is printed to the terminal. Data you type into the terminal is sent over the socket. Upon CTRL-C, the socket is closed and the process exits.
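
Here’s a rough Python sketch of that event loop. It’s not nc itself, just the shape of the behavior described above:

# Sketch of an nc-like client loop: poll on stdin and the socket, forward
# data between them, and treat a zero-length read from the socket as
# end-of-stream.
import select
import socket
import sys

def relay(host, port):
    sock = socket.create_connection((host, port))
    p = select.poll()
    p.register(sys.stdin.fileno(), select.POLLIN)
    p.register(sock.fileno(), select.POLLIN)

    while True:
        for fd, _ in p.poll():
            if fd == sys.stdin.fileno():
                data = sys.stdin.buffer.readline()
                if not data:              # EOF on stdin: we're done
                    sock.close()
                    return
                sock.sendall(data)
            else:
                data = sock.recv(1024)
                if not data:              # read returned 0: the peer sent a FIN
                    sock.close()          # nc closes its socket at this point
                    return
                sys.stdout.buffer.write(data)
                sys.stdout.buffer.flush()

if __name__ == "__main__":
    relay(sys.argv[1], int(sys.argv[2]))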

In these examples, my client is called kang and my server is called kodos.

Warmup: Normal TCP teardown

This one demonstrates a very basic case just to get the ball rolling. Suppose we set up a server on kodos:

Server

[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464310423.7650  [ Fri May 27 00:53:43 UTC 2016 ]
 0.0027 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0028 listen(3, 1, SOV_DEFAULT)                       = 0
accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) (sleeping...)

(Remember, in these examples, I’m using truss to print out the system calls that nc makes. The -d flag prints a relative timestamp and the -t flag selects which system calls we want to see.)

Now on kang, I establish a connection:

Client

[root@kang ~]# truss -d -t connect,pollsys,read,write,close nc 10.88.88.140 8080
Base time stamp:  1464310447.6295  [ Fri May 27 00:54:07 UTC 2016 ]
...
 0.0062 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Over on kodos, we see:

Server

23.8934 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)

At this point, the TCP connections are in state ESTABLISHED and both processes are in poll. We can see this using the netstat tool on each system:

Server

[root@kodos ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.140.8080    10.88.88.139.33226   1049792      0 1049800      0 ESTABLISHED
...
Client

[root@kang ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.139.33226   10.88.88.140.8080    32806      0 1049800      0 ESTABLISHED
...

The question is, when we shut down one of these processes, what happens in the other process? Does it find out? How? Try to predict the specific syscall behavior and explain why each system call does what it does.

Let’s try it. I’ll CTRL-C the server on kodos:

Server

pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)
^C127.6307          Received signal #2, SIGINT, in pollsys() [default]

Here’s what we see on kang:

Client

pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
126.1771        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
126.1774        read(3, 0x08043670, 1024)                       = 0
126.1776        close(3)                                        = 0
[root@kang ~]#

What happened here? Let’s be specific:

  1. We sent SIGINT to the server, causing the process to exit. Upon exit, file descriptors are closed.
  2. When the last file descriptor for an ESTABLISHED socket is closed, the TCP stack on kodos sends a FIN over the connection and enters the FIN_WAIT_1 state.
  3. The TCP stack on kang receives the FIN packet, transitions its own connection to CLOSE_WAIT, and sends an ACK back. Since the nc client is blocked on this socket being ready to read, the kernel wakes this thread with the event POLLIN.
  4. The nc client sees POLLIN for the socket and calls read, which returns 0 immediately. This indicates end-of-stream. nc presumes that we’re done with this socket and closes it.
  5. Meanwhile, the TCP stack on kodos receives the ACK and enters FIN_WAIT_2.
  6. Since the nc client on kang closed its socket, the TCP stack on kang sent a FIN to kodos. The connection on kang enters LAST_ACK.
  7. The TCP stack on kodos receives the FIN, the connection enters TIME_WAIT, and the stack on kodos acknowledges the FIN.
  8. The TCP stack on kang receives the ACK for the FIN and removes the connection entirely.
  9. Two minutes later, the TCP connection on kodos is closed and the stack removes the connection entirely.

(It’s possible for some of these steps to be slightly reordered and for kodos to transition through CLOSING instead of FIN_WAIT_2.)

Here’s the final state according to netstat:

Server

[root@kodos ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.140.8080    10.88.88.139.33226   1049792      0 1049800      0 TIME_WAIT

There’s no output for this connection at all on kang.

The intermediate states transition very quickly, but you could see them using the DTrace TCP provider. You can see the packet flow using snoop(1m) or tcpdump(1).

Conclusions: We’ve seen the normal path of system calls for connection establishment and teardown. Note that kang immediately found out when kodos’s connection was closed — it was woken up out of poll and read returned 0 to indicate end-of-stream. At that point, kang elected to close its socket, which cleaned up the connection state on kodos. We’ll revisit this later to see how the results can differ if kang doesn’t close its socket here.

Puzzler 1: Power cycling

What happens to an established, idle TCP connection if one system is power cycled?

Since most processes go through normal exit as part of a graceful reboot (e.g., if you use the “reboot” command), you get basically the same result if you type “reboot” on kodos instead of killing the server with CTRL-C. But what would happen if we’d power-cycled kodos in the previous example? Surely kang will eventually find out, right?

Let’s try it. Set up the connection:

Server

[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464312528.4308  [ Fri May 27 01:28:48 UTC 2016 ]
 0.0036 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0036 listen(3, 1, SOV_DEFAULT)                       = 0
 0.2518 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)
Client

[root@kang ~]# truss -d -t open,connect,pollsys,read,write,close nc 10.88.88.140 8080
Base time stamp:  1464312535.7634  [ Fri May 27 01:28:55 UTC 2016 ]
...
 0.0055 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Now I’ll use VMware’s “reboot” function to power-cycle the system. It’s important for this test that this be an actual power cycle (or OS panic) — anything that causes a graceful shut down will look more like the first case above.

20 minutes later, kang is still sitting in the same place:

Client

pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

We tend to believe that TCP’s job is to maintain a consistent abstraction (namely, the TCP connection) across multiple systems, so it’s surprising to discover cases where the abstraction is broken like this. And lest you think this is some nc(1) issue, it’s not. “netstat” on kodos shows no connection to kang, but kang shows a perfectly healthy connection to kodos:

Client

[root@kang ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.139.50277   10.88.88.140.8080    32806      0 1049800      0 ESTABLISHED
...

We can leave this forever without touching it and kang will never figure out that kodos rebooted.

Now, suppose at this point, kang tries to send data to kodos. What happens?

Let’s try it:

Client

pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
kodos, are you there?
3872.6918       pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
3872.6920       read(0, " k o d o s ,   a r e   y".., 1024)     = 22
3872.6924       write(3, " k o d o s ,   a r e   y".., 22)      = 22
3872.6932       pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
3872.6932       read(3, 0x08043670, 1024)                       Err#131 ECONNRESET
3872.6933       close(3)                                        = 0
[root@kang ~]#

When I type the message and hit enter, the nc process on kang gets woken up, reads the message from stdin, and sends it over the socket. The write call completes successfully! nc goes back to poll to wait for another event, eventually finds out that the socket can be read without blocking, and then calls read. This time, read fails with ECONNRESET. What does this mean? POSIX’s definition of read(2) says that this means:

[ECONNRESET]
A read was attempted on a socket and the connection was forcibly closed by its peer.

The illumos read(2) man page provides a little more context:

   ECONNRESET
              The filedes argument refers to a connection oriented
              socket and the connection was forcibly closed by the peer
              and is no longer valid.  I/O can no longer be performed to
              filedes.

This error doesn’t mean that something specific to the read call went wrong, but rather that the socket itself is disconnected. Most other socket operations would fail the same way at that point.

So what happened here? At the point when the nc on kang tried to send data, the TCP stack still didn’t know the connection was dead. kang sent a data packet over to kodos, which responded with an RST because it didn’t know anything about the connection. kang saw the RST and tore down its connection. The kernel cannot close the socket file descriptor on nc’s behalf — that’s not how file descriptors work — but subsequent operations on it fail with ECONNRESET until nc does close the fd.

Conclusions:

  1. A hard power-cycle is very different than a graceful shutdown. When testing distributed systems, this case needs to be specifically tested. You can’t just kill a process and expect it to test the same thing.
  2. There are situations when one side believes a TCP connection is established, but the other side does not, and where this situation will never be automatically resolved. It’s possible to manage this problem using TCP or application-level keep-alive.
  3. The only reason that kang eventually found out that the remote side had disappeared was that it took action: it sent data and received a response indicating the connection was gone.

That raises the question: what if kodos had not responded to the data message for some reason?

Puzzler 2: Power off

What happens if one endpoint of a TCP connection is powered off for a while? Does the other endpoint ever discover this? If so, how? And when?

Once again, I’ll set up the connection with nc:

Server

[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464385399.1661  [ Fri May 27 21:43:19 UTC 2016 ]
 0.0030 bind(3, 0x08065790, 32, SOV_SOCKBSD)        = 0
 0.0031 listen(3, 1, SOV_DEFAULT)           = 0
accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) (sleeping...)
 6.5491 accept(3, 0x08047B3C, 0x08047C3C, SOV_DEFAULT, 0) = 4
pollsys(0x08045680, 2, 0x00000000, 0x00000000) (sleeping...)
Client

[root@kang ~]# truss -d -t open,connect,pollsys,read,write,close nc 10.88.88.140 8080
Base time stamp:  1464330881.0984  [ Fri May 27 06:34:41 UTC 2016 ]
...
 0.0057 connect(3, 0x08066DD8, 16, SOV_DEFAULT)         = 0
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

Now, I’ll cut power to kodos abruptly and attempt to send data from kang:

Client

pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

114.4971        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
114.4974        read(0, "\n", 1024)                             = 1
114.4975        write(3, "\n", 1)                               = 1
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)

The write completes normally, and I don’t see anything for quite a while. Just over 5 minutes later, I see this:

Client

pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
425.5664        pollsys(0x08045670, 2, 0x00000000, 0x00000000)  = 1
425.5665        read(3, 0x08043670, 1024)                       Err#145 ETIMEDOUT
425.5666        close(3)                                        = 0

This looks similar to the case where the system was power-cycled instead of powered off, except for two things: it took 5 minutes for the system to notice, and the error reported was ETIMEDOUT. Note that again, it’s not that the read timed out per se. We would have seen the same error from other socket operations, and subsequent socket operations would likely fail immediately with the same ETIMEDOUT error. That’s because it’s the socket that has entered a state where the underlying connection has timed out. The specific reason for this is that the remote side failed to acknowledge a data packet for too long — 5 minutes, as configured on this system.

Conclusions:

  1. When the remote system is powered off instead of power cycled, the first system again only finds out when it tries to send data. If it doesn’t send packets, it will never find out about the terminated connection.
  2. When the system has been attempting to send data for long enough without receiving acknowledgments from the remote side, the TCP connection will be terminated and subsequent socket operations will fail with ETIMEDOUT.

Puzzler 3: Broken connection with no failure

This time, instead of giving you a specific case and asking what happens, I’ll flip it around: here’s an observation, and see if you can figure out how it could happen. We’ve discussed several cases where kang can believe it has a connection to kodos, but kodos doesn’t know about it. Is it possible for kang to have a connection to kodos without kodos knowing about it — indefinitely (i.e., where this will not resolve itself) and even if there’s been no power off, no power cycle, and no other failure of the operating system on kodos and no failure of networking equipment?

Here’s a hint: consider the case above where a connection is stuck in ESTABLISHED. In that case, it’s fair to say that the application has the responsibility to deal with the problem, since by definition it still has the socket open and could find out by sending data, at which point the connection will eventually be terminated. But what if the application didn’t have the socket open any more?

In the Warmup, we looked at the case where kodos’s nc closed its socket, and we said that kang’s nc read 0 (indicating end-of-stream) and then closed its socket. What if it didn’t close the socket, but kept it open? Obviously, it couldn’t read from it. But there’s nothing about TCP that says you can’t send more data to a peer that has sent you a FIN. A FIN only means end-of-data-stream in the direction in which it was sent.

In order to demonstrate this, we can’t use nc on kang because it automatically closes its socket when it reads 0. So I’ve written a demo version of nc called dnc, which simply skips this behavior. It also prints out explicitly which system calls it’s making. This will give us a chance to observe the TCP states.
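
The dnc source isn’t reproduced here, but the following hypothetical Python sketch shows the kind of behavior being described: the same event loop as before, except that reading end-of-stream does not cause it to close the socket.

# Hypothetical sketch of a dnc-like client: unlike nc, a zero-length read
# (end-of-stream) does NOT close the socket; we keep it open and keep
# forwarding stdin to the half-closed connection.
import select
import socket
import sys

def dnc_like(host, port):
    sock = socket.create_connection((host, port))
    print("connected; entering poll()")
    p = select.poll()
    p.register(sys.stdin.fileno(), select.POLLIN)
    p.register(sock.fileno(), select.POLLIN)

    while True:
        for fd, _ in p.poll():
            if fd == sys.stdin.fileno():
                line = sys.stdin.buffer.readline()
                sock.sendall(line)        # may still succeed after we've seen a FIN
            else:
                try:
                    data = sock.recv(1024)
                except ConnectionResetError as exc:
                    print("read:", exc)   # e.g., after the peer answers with a RST
                    return
                if not data:
                    # End-of-stream: keep the socket open. Stop polling it for
                    # POLLIN (POLLERR/POLLHUP are still reported regardless).
                    print("read end-of-stream from socket; keeping it open")
                    p.modify(sock.fileno(), 0)
                else:
                    sys.stdout.buffer.write(data)
                    sys.stdout.buffer.flush()

if __name__ == "__main__":
    dnc_like(sys.argv[1], int(sys.argv[2]))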

First, we’ll set up the connections as usual:

Server

[root@kodos ~]# truss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
Base time stamp:  1464392924.7841  [ Fri May 27 23:48:44 UTC 2016 ]
 0.0028 bind(3, 0x08065790, 32, SOV_SOCKBSD)            = 0
 0.0028 listen(3, 1, SOV_DEFAULT)                       = 0
accept(3, 0x08047B2C, 0x08047C2C, SOV_DEFAULT, 0) (sleeping...)
 1.9356 accept(3, 0x08047B2C, 0x08047C2C, SOV_DEFAULT, 0) = 4
pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
Client

[root@kang ~]# dnc 10.88.88.140 8080
2016-05-27T08:40:02Z: establishing connection
2016-05-27T08:40:02Z: connected
2016-05-27T08:40:02Z: entering poll()

Now let’s verify the connections are in the ESTABLISHED state we expect on both sides:

Server

[root@kodos ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.140.8080    10.88.88.139.37259   1049792      0 1049800      0 ESTABLISHED
Client

[root@kang ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.139.37259   10.88.88.140.8080    32806      0 1049800      0 ESTABLISHED

Now, let’s CTRL-C the nc process on kodos:

Server

pollsys(0x08045670, 2, 0x00000000, 0x00000000) (sleeping...)
^C[root@kodos ~]#

We immediately see this on kang:

Client

2016-05-27T08:40:12Z: poll returned events 0x0/0x1
2016-05-27T08:40:12Z: reading from socket
2016-05-27T08:40:12Z: read end-of-stream from socket
2016-05-27T08:40:12Z: read 0 bytes from socket
2016-05-27T08:40:12Z: entering poll()

Now let’s look at the TCP connection states:

Server

[root@kodos ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.140.8080    10.88.88.139.37259   1049792      0 1049800      0 FIN_WAIT_2
Client

[root@kang ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.139.37259   10.88.88.140.8080    1049792      0 1049800      0 CLOSE_WAIT

This makes sense: kodos sent a FIN to kang. FIN_WAIT_2 indicates that kodos received an ACK from kang for the FIN that it sent, and CLOSE_WAIT indicates that kang received the FIN but has not sent a FIN back. This is a perfectly valid state for a TCP connection for an indefinite period of time. Imagine kodos had sent a request to kang and was not planning to send any more requests; kang could happily send data in response for hours, and this would work. Only in our case, kodos has actually closed the socket.

Let’s wait a minute and check the TCP connection states again. A minute later, the connection is completely missing from kodos, but it’s still present on kang:

Client

[root@kang ~]# netstat -f inet -P tcp -n

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.88.88.139.37259   10.88.88.140.8080    1049792      0 1049800      0 CLOSE_WAIT

What happened here? We hit a lesser-known special case in the TCP stack: when an application has closed a socket, the TCP stack has sent a FIN, and the remote TCP stack has acknowledged the FIN, the local TCP stack waits a fixed period of time and then closes the connection. The reason? In case the remote side has rebooted. This case is actually an analog to the case above where one side has an ESTABLISHED connection and the other side doesn’t know about it. The difference is that in this specific case, the application has closed the socket, so there’s no other component that could deal with the problem. As a result, the TCP stack waits a fixed period of time and then tears down the connection (without sending anything to the other side).

Follow-up question: what happens if kang sends data to kodos at this point? Remember, kang still thinks the connection is open, but it’s been torn down on kodos.

Client

2016-05-27T08:40:12Z: entering poll()
kodos, are you there?
2016-05-27T08:41:34Z: poll returned events 0x1/0x0
2016-05-27T08:41:34Z: reading from stdin
2016-05-27T08:41:34Z: writing 22 bytes read from stdin to socket
2016-05-27T08:41:34Z: entering poll()
2016-05-27T08:41:34Z: poll returned events 0x0/0x10
2016-05-27T08:41:34Z: reading from socket
dnc: read: Connection reset by peer

This is the same as what we saw in Puzzler 1: the write() actually succeeds, since the TCP stack doesn’t yet know that the connection is closed. But it does get a RST, which wakes up the thread in poll(), and the subsequent read() fails with ECONNRESET.

Conclusions:

  • It’s possible for two sides to disagree about the state of the connection even when there’s been no OS, network, or hardware failure at all.
  • In a case like the one above, it’s not possible for kang to distinguish between the case where kodos is attentively waiting to receive data or kodos has closed the socket and isn’t listening (at least, not without sending a packet). For this reason, maybe it’s not a great idea to design a system that uses sockets in these half-open states for an extended period of time under normal operation.

Conclusions

TCP is typically presented as a protocol for maintaining a consistent abstraction — the “TCP connection” — across two systems. We know that in the face of certain types of network and software problems, connections will fail, but it’s not always obvious that there are cases where the abstraction fails, in that the two systems disagree about the state of the connection. Specifically:

  • It’s possible for one system to think it has a working, established TCP connection to a remote system, while the remote system knows nothing about that connection.
  • This is possible even when there has been no failure of the network, hardware, or operating system.

These behaviors do not reflect deficiencies in TCP. Quite the contrary, in all of these cases, the TCP implementation appears to behave as reasonably as it could given the situation. If we were tempted to implement our own communication mechanism instead of using TCP, the presence of these cases might well remind us how complex the underlying problems are. These are intrinsic problems with distributed systems, and a TCP connection is fundamentally a distributed system.

That said, the single most important lesson in all of this is that the notion of a TCP connection that spans multiple systems is a convenient fiction. When it goes wrong, it’s critical to think about it as two separate state machines that operate simultaneously to try to maintain a consistent view of the connection. It’s the responsibility of the application to handle the cases where these differ (often using a keep-alive mechanism).
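
As a concrete example of that last point, here’s a sketch of enabling TCP keep-alive from an application so that the kernel periodically probes an otherwise idle connection and eventually fails it (typically with ETIMEDOUT). The per-connection tuning options shown are platform-specific and guarded accordingly:

import socket

def enable_keepalive(sock, idle=60, interval=10, count=6):
    # Ask the kernel to probe the connection when it's been idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Finer-grained, per-connection knobs; not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.create_connection(("10.88.88.140", 8080))
enable_keepalive(sock)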

Furthermore, there’s a disconnect between the application’s file descriptor and the underlying TCP connection. TCP connections exist (in various states related to closing) even after an application has closed the file descriptor, and a file descriptor can be open when the underlying TCP connection has been closed as a result of a failure.

Other lessons to keep in mind:

  • Ungraceful reboot of a system (as happens when the OS crashes) is not the same as a userland process exiting or closing its socket. It’s important to test this case specifically. Reboot, when the remote system comes back online before the connection times out, is also different than power-off.
  • There’s no proactive notification from the kernel when a TCP socket is torn down. You only find this out when you call read(), write(), or some other socket operation on the socket file descriptor. If your program doesn’t do this for some reason, you’ll never find out about the connection failure.

Some related notes that I’ve found aren’t so commonly known:

  • ECONNRESET is a socket error you can see from read(), write(), and other operations that indicates that the remote peer has sent a RST.
  • ETIMEDOUT is a socket error you can see from read(), write(), and other operations that indicates that some timeout associated with the connection has elapsed. The cases I’ve seen most are when the remote side did not acknowledge some packet for too long. These are usually either data packets, a FIN packet, or a KeepAlive probe.

Importantly, neither of these errors means that there was anything wrong with the read or write operation that you tried to do — just that the socket itself is closed.

ACM Turing Centenary Celebration

This past weekend, I was very fortunate to have a chance to attend the ACM’s Turing Centenary Celebration in honor of the 100th anniversary of the birth of Alan Turing. The event brought together nearly every living Turing Award winner for a series of talks and panel discussions on subjects like AI, theory of computation, computer architecture, and the role of computation in other fields. A webcast of the entire event is already available online. Below are some of my own notes on the conference.

For many of us in the audience, this weekend was about seeing in person the people we’d learned so much from both academically and professionally. This awe-inspiring feeling was best articulated by Vint Cerf in closing yesterday: “it’s like the history books opened up and the people walked out”. For me, by far, the highlight was the panel on Computer Architecture, moderated by David Patterson (yes, that Patterson) and featuring Frederick Brooks (of OS/360 and Mythical Man-Month fame), Ivan Sutherland, and Chuck Thacker. More than the other panels, I found all of the speakers’ prepared remarks accessible (perhaps because my work and interests most closely align with theirs), but at the same time very instructional. Sutherland began with the “Tyranny of the Clock”, an eloquent articulation of an important barrier in modern chip design and a call to action for designers and researchers. Then, in a sort of reverential but thoughtful engineering-style postmortem, Brooks discussed why the machine that Turing actually built, unlike so much of his other work, was not very influential. Thacker discussed the nature of computer architecture research and the modern developments that have made it more accessible for students today. In the subsequent discussion, Patterson referenced a prophetic quote by Maurice Wilkes at the dawn of modern computing (that Bryan also cited in his QCon talk last year) in which Wilkes suddenly “realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs”.

Complexity

Complexity was a prominent theme in several sessions. Ken Thompson expressed his disappointment at the complexity of modern Linux, pointing to its 1000+ syscalls compared to the original Unix’s 200. In his prepared remarks, he also shared some grim reflections on how Turing would feel about the state of computing today: Windows, phishing, botnets, C++, and so on. He compared his feeling to that of an early television pioneer visiting our time and seeing people watching Maury Povich. On a more positive note in the same session, Fernando Corbato (leader on Multics) gave a brief but fascinating account of what it was like to work on computers in the early days. He actually called Unix one of the greatest results of Multics, Unix being a “distillation” of the better ideas of Multics without all the complexity. (It’s well worth watching Fernando’s and Ken’s remarks from the “Systems Architecture, Design, Engineering, and Verification” session.) Later, Thacker too called complexity “the enemy”, suggesting that we seriously reconsider many of the held-over assumptions in today’s systems that are costing us enormously today. (I’m sure that’s a good idea, and I’d have loved to hear some examples of these things that he had in mind.)

In the Programming Languages panel, Barbara Liskov lamented that the languages people use today for production systems are pretty complex for introducing new students to computer programming, but also admitted that building languages simple enough to be thoroughly understood in an introductory course and rich enough to support what professional engineers want is a serious challenge. She suggested starting from scratch with only the essential language features of modularity and encapsulation. In the same session, Niklaus Wirth (in an entertaining, light presentation) explained how he sought to design languages that were both simpler and more powerful than their contemporaries — that these are not opposing goals. All of the participants agreed that in practice, most popular languages seem to accrete lots of cruft from small changes that seem good at the time, but contribute to an overly complex system.

Lucky or good?

Another theme that came up quite a bit was the role of luck in the speakers’ success. Many of them attributed their success to luck, leaving it there, but I liked Dahlia Malkhi’s reference to a golfer who hit a hole-in-one. He was asked if that was a result of luck or training, and replied that he was lucky, but he had to train a lot in order to get that lucky.

Beware elegance?

Several speakers (notably Butler Lampson and Joseph Sifakis) mentioned that they tend to be suspicious of clean, elegant solutions to problems, because such solutions often don’t work well in the real world. I’d never heard it put so generally, especially by leaders in the field, as that goes against a common intuition among mathy people that’s usually nurtured as part of our education. (That’s still a good thing — as Einstein famously said, it’s important to strive for simplicity, but be careful not to go too far.) In fact, Sifakis attributed the lack of serious work in rigorous system design to researchers’ preference for nice theories, even if those theories don’t match reality. (While probably a factor, this explanation seems to leave out the economic cost of such rigor as an important reason why many systems today aren’t built the way he suggests.)

System verification

In the Systems Architecture and Verification session, Vint Cerf noted that automatic verifiers don’t seem to work well for many types of systems we build and asked Sifakis and E. Allen Emerson whether there existed interactive tools that would help programmers test assertions about their systems, rather than automatically trying to verify the whole thing. Emerson pointed out that this is called semi-automatic verification, but still seemed more interested in the fully-automatic kind. Vint’s idea made me think of a sort of extensible lint, since lint is already an admittedly limited tool for checking a fixed set of assertions about a program. But despite its limits, lint is incredibly useful (at least in languages like C and JavaScript) for rooting out large classes of bugs, and it would be interesting to think about a more interactive workflow that would free the tool from having to report only things it knows are problems. (People generally won’t use an error-checking tool that reports many false positives, but they might use a tool that can evaluate static assertions about their code in a less rigid context.)

“What”-based programming

Alan Kay and others talked about the idea of “what”-based programming, rather than the “how”-based approaches we typically use today. The idea is that humans tell the computer what to do, and some engine under the hood figures out how to do it. Kay demonstrated a graphical environment based on this idea, and then wondered why we couldn’t build more complex systems (including operating systems) that way. Bill, Robert, and I tried for a while to imagine what this would look like. On the one hand, many classes of device drivers are similar enough that you could imagine modeling some pieces of them with a declarative “what”-based description, but interaction with physical devices often requires precise sequences of register reads and writes and it’s hard to imagine encoding that without essentially describing “how”. Achieving good performance may be challenging, since humans writing code for such systems today necessarily describe how to organize them to be fast. And if you could solve this for something as constrained as device drivers, how could you generalize it for the schedulers or VM system, without encoding detailed knowledge into the engine that actually translates the “what” to the “how”? You could also imagine that debugging such systems would be very difficult. Still, I found the idea compelling, because there are many cases where we do build “what”-based descriptions and the result is that it’s much easier to verify both that the description does what the human wants it to do and that the “what”-to-”how” system properly translates it (e.g., Meta-D, a Cloud Analytics module that describes a family of DTrace scripts declaratively, or even the D language itself). It would be interesting to hear from Alan Kay what he was thinking in posing this question.

Computation in biology and physics

I was especially intrigued by Leonard Adleman’s remarks during the “Algorithmic View of the Universe” panel, in which he talked about vastly different notions of computation and how results based on the Turing model, plus the Church-Turing thesis, can inform physics and biology. He discussed protein folding in the cell as a form of computation, and what implications that has for biologists trying to understand cellular processes. Later he wondered what implications the proven constraints of the Turing model, taken as physical laws, would have on quantum mechanics (e.g., that certain types of time travel allowed by QM must actually be impossible).


These were just a few of the bits I found most interesting, but the whole weekend was a humbling experience. Besides being able to see so many important figures in the field, it was a good opportunity to step outside the confines of day-to-day engineering, which for me tends toward a time horizon of a few years. And most of the talks provoked interesting discussions. So thanks to all the speakers, and to the ACM for putting together the event and making the video available.