Per-thread caching in libumem
libumem was developed in 2001 by Jeff Bonwick and Jonathan Adams. While the Solaris implementation of malloc(3C) and free(3C) performed adequately for single threaded applications, it did not scale. Drawing on the work that was done to extend the original kernel slab allocator, Jeff and Jonathan brought it to userland in the form of libumem. Since then, libumem has even been brought to other operating systems. libumem offers great performance for multi-threaded applications, though there are a few cases where libumem doesn’t quite perform compared to libc and the allocators found in other operating systems like eglibc. The most common case for this is when you have short-lived small allocations, often less than 200 bytes.
What’s happening?
As part of our work with Voxer, they had uncovered some general performance problems that Brendan and I were digging into. We distilled this down to a small Node.js benchmark that was calculating a lot of md5 sums. As part of narrowing down the problem, I eventually broke out one of Brendan’s flame graphs. Since we had a CPU-bound problem, this allowed us to easily visualize and understand what’s going on. When you throw that under DTrace with jstack(), you get something that looks pretty similar to the following flame graph:
There are two main pieces to this flame graph. The first is performing the actual md5 operations. The second is creating and destroying the md5 objects and the initialization associated with that. In drilling down, we found that we were spending comparatively more time trying to handle the allocations. If you look at the flamegraph in detail, you’ll see that when calling malloc and free we’re spending a lot of that time in in the lock portions of libc. libc’s malloc has a global mutex. Using a simple DTrace script like dtrace -n 'pid$target::malloc:entry{ @[tid] = count(); }', we can verify that only one thread is calling into malloc, so we’re grabbing an uncontended lock. One’s next guess might be to try and run this with libumem to see if there is any difference. This gives us the following flame graph:
You can easily spot all of the libumem related allocations because they are a bit more like towers that consist of a series of three functions calls. First to malloc(3C), then umem_alloc(3MALLOC), and finally umem_cache_alloc(3MALLOC). On top of that are additional stacks related to grabbing locks. In umem_cache_alloc there is still a lock that a thread has to grab. Unlike libc’s malloc, this lock is not a global lock. Each cache has a lock per-CPU, which, when combined with magazines allows for highly-parallel allocations. However, we’re only doing mallocs from one thread so this is an uncontested mutex lock. The key takeaway from this is that the uncontested mutex lock can still be problematic. This is also much trickier in user-land where there is a lot more to deal with when grabbing a lock. Compare the kernel implementation with the user-land implementation. One conclusion that you reach from this is that we should do something to get rid of the lock.
When getting rid of mutexes, one might first think of using atomics and trying to rewrite this to be lock-free. But, aside from the additional complexity that rewriting portions of this to be lock-free might induce, that doesn’t solve the fundamental problem: we don’t want to be synchronizing at all. Instead this suggests a different idea that other memory allocators have taken: adding a thread-local cache.
A brief aside: libumem and malloc design
As part of libumem’s design it creates a series of fixed size caches which it uses to handle allocations. These caches are sized from 8 bytes to 128KB, with the difference between caches growing larger with the cache size. If a malloc comes in that is within the range of one of these caches then we use the cache. If the allocation is larger than 128KB then libumem uses a different means to allocate that. For the rest of this entry we’ll only talk about the allocations that are handled by one these caches. For the full details of libumem, I strongly suggest you read the two papers on libumem and the original slab allocator.
Keeping track of your allocations
When you call malloc(3C) you pass in a size, but you don’t need to pass that size back into free(3C). You only need to pass in the buffer. malloc ends up doing work to handle this for you. malloc will actually allocate an extra eight byte tag and prepend that to the buffer. So if you request a 36 bytes, malloc will actually allocate 42 bytes from the system and return you a pointer that starts right after that tag. This tag encodes two pieces of information. The first piece is the size and the second piece is a tag that is encoded with the size. It uses the second field to help detect programs that erroneously write to memory. The structure that it prepends looks like:
typedef struct malloc_data {
uint32_t malloc_size;
uint32_t malloc_stat;
} malloc_data_t;
When you call free, libumem grabs that structure, reads the buffer size, and validates the tag. If everything checks out, it releases the entire buffer back to the appropriate cache. If it doesn’t check out, libumem aborts the program.
Per-Thread Caching: High-level
To speed up the allocation and free process, we’re going to change how malloc and free work. When a given thread calls free, instead of releasing the buffer directly back to the cache, we will instead store it with the thread. That way if the thread comes around and requests a buffer that would be satisfied by that cache, it can just take the one that it stored. By creating this per-thread cache, we have a lock-free and contention-free means of servicing allocations. We store these allocations as a linked list and use one list per cache. When we want to add a buffer to the list, we make it the new head. When we remove an entry from the list, we remove the head. If the head is set to NULL then we know that the list is empty. When the list is empty, we simply go ahead and use the normal allocation path. When a thread exits, then all of the buffers in that thread are freed back to the underlying caches.
We don’t want the per-thread cache to end up storing an unbounded amount of memory. That would end up appearing no different from a memory leak. Instead, we have two mechanisms in place to control this.
- A cap on the amount of memory each thread may cache.
- We only enable this for a subset of umem caches.
By default, each thread is capped at 1 MB of cache. This can be configured on a per process basis using the UMEM_OPTIONS environment variable. Simply set perthread_cache=[size]. The size is in bytes and you can use the common K, M, G, and even T, suffixes. We only support doing this for sixteen caches at this time and we opt to make this the first sixteen caches. If you don’t tune the cache sizes, allocations up to 256 bytes for 32-bit applications and up to 448 bytes for 64-bit applications will be cached.
Finally, when a thread exits, all of the memory in that thread’s cache is released back to the underlying general purpose umem caches.
Another aside: Position Independent Code
Modern libraries are commonly built with Position Independent Code (PIC). The goal of building something PIC is that it can be loaded anywhere in the address space and no additional relocations will need to be performed. This means that all the offsets and addresses within a given library that are for the library itself are relative addresses. The means for doing this for amd64 programs is relatively straightforward. amd64 offers an addressing mode known as RIP-relative. RIP-relative addressing is where you specify an offset relative to the current instruction pointer which is stored in the register %rip. 32-bit i386 programs don’t have RIP-relative addressing, so compilers have to use different tricks to for relative addressing. One of the more common techniques is to use a call +0 instruction to establish a known address. Here is the disassembly of a simple function which happens to call a global function pointer in a library.
amd64-implementation > testfunc::dis testfunc: movq +0x1bb39(%rip),%rax <0x86230> testfunc+7: pushq %rbp testfunc+8: movq %rsp,%rbp testfunc+0xb: call *(%rax) testfunc+0xd: leave testfunc+0xe: ret
i386-implementation > testfunc::dis testfunc: pushl %ebp testfunc+1: movl %esp,%ebp testfunc+3: pushl %ebx testfunc+4: subl $0x10,%esp testfunc+7: call +0x0 <testfunc+0xc> testfunc+0xc: popl %ebx testfunc+0xd: addl $0x1a990,%ebx testfunc+0x13: pushl 0x8(%ebp) testfunc+0x16: movl 0x128(%ebx),%eax testfunc+0x1c: call *(%eax) testfunc+0x1e: addl $0x10,%esp testfunc+0x21: movl -0x4(%ebp),%ebx testfunc+0x24: leave testfunc+0x25: ret
Position independent code is still really quite useful, one just has to be aware that they do pay a small price for it. In this case, we’re doing several more loads and stores. When working in intensely performance-sensitive code, those can really add up.
Per-Thread Caching: Implementation
The first problem that we needed to solve for per-thread caching was to figure out how we would store the data for the per-thread caches. While we could have gone with some of the functionality provided by the threading libraries (see threads(5)), that would end up sending us through the Procedure Linkage Table (PLT). Because we are cycle-bumming here our goal is to minimize the number of such calls that we have to make. Instead, we’ve added some storage to the ulwp_t. The ulwp_t is libc’s per-thread data structure. It is the userland equivalent of the kthread_t. We extended the ulwp_t of each thread with the following structure:
typedef struct {
size_t tm_size;
void *tm_roots[16];
} tmem_t;
Each entry in the tm_roots array is the head of one of the linked lists that we use to store a set of similarly sized allocations. The tm_size field keeps track of how much data has been stored across all of the linked lists for this thread. Because these structures exist per-thread, there is no need for any synchronization.
Why we need to know about the size of the caches
The set of caches that libumem uses for allocations only exists once umem_init() has finished processing all of the UMEM_OPTIONS environment variables. One of the options can add additional caches and another one can remove caches. It is impossible to know what these caches are at compile time. Given that this is the case, you might ask why do we want to know the size of the caches that libumem creates? Why not just create our own set of sizes that we’re going to use for honoring other allocations?
Failing to use the size of the umem caches would not only cause us to use extra space, but it would also cause us to incur additional fragmentation. Our goal is to be able to reuse these allocations. We can’t have a bucket for every possible allocation size, that would grow quite unwieldy. Let’s say that we used the same bucketing scheme that libumem uses by default. Because we have no way of knowing what cache libumem honored something from, we instead have to assume that the returned allocation is the largest possible size we can use the buffer for. If we make a 65-byte allocation that actually comes from the 80 byte cache, we would instead bucket it in the thread’s 64-byte cache. Effectively, we have to always round an allocation down to the next cache. Items that could satisfy a 64-byte allocation would end up being items that were originally 64-79 bytes large. This is clearly wasteful of our memory.
If you look at the signature of umem_free(3MALLOC) you’ll see that it takes a size argument. This means that it is our responsibility to keep track of what the original size of the allocation was. Normally malloc and free wrap this up in the malloc tags, but since we are reusing buffers, we’ll want to keep track of both the original size and the currently requested size when we reuse it. To do this, we would have to extend the malloc tag structure that we talked about above. While there are some allocations that have extra space for something like this for 64-bit programs, that is not the case for 32-bit programs. To implement it this way, that would require at another eight-byte tag be prepended to every 32-bit malloc and some 64-bit allocations as well.
Obviously this is not an ideal way to go approach the problem. Instead, if we use the cache sizes we don’t have to suffer from either of the above problems. We know that when a umem_alloc comes in, it rounds the allocation size up to the nearest cache to satisfy the request. We leverage that same idea so that when a buffer is freed we put it into the per-thread list that corresponds to the cache that it originally came from. When new requests come in we can use that buffer to satisfy anything that the underlying cache would be able to. Because of this strategy we don’t have to have a second round of fragmentation for our buffers. Similarly, because we know what the cache size is, we don’t have to keep track of the original request size. We know exactly what cache the buffer came from initially.
Following in the footsteps of trapstat and fasttrap
Now that we have opted to know about the cache sizes at run time this means that we have a few approaches we can take to generate the function that handles this per-thread layer. Remember, that we’re here because of performance. We need to cycle-bum and minimize our use of cycles. Loads and stores to and from main memory are much more expensive than simple register arithmetic, comparisons, and small branches. While there is an array of allocations sizes, we don’t want to have to always load each entry from that array. We also have another challenge. We need to avoid doing anything that causes us to use the PLT. We don’t want to end up having a call +0 instruction so we can access 32-bit PIC code. There is one fortunate aspect of the umem caches. Once they are determined at runtime, they never change.
Armed with this information we end up down a different path: we are going to dynamically generate the assembly for our functions. This isn’t the first time this has been done in illumos. For an idea of what this looks like, see trapstat. Our code functions in a very similar way. We have blocks of assembly with holes for addresses and constants. These blocks get copied into executable memory and the appropriate constants get inserted in the correct place. One of the important pieces of this is that we don’t end up calling any other functions. If we detect an error or we can’t satisfy the allocation in the function itself, we end up jumping out to the original malloc and free reusing the same stack frame.
Reaching the assembly generated functions
Once we have generated the machine code, we have the challenge of making it be what applications reach when they call malloc(). This is complicated by the fact that calls to malloc can come in before we create and initialize the new functions as part of the libumem initialization. The standard way you might solve this is with a function pointer that you swap out at some point. However, having that global function pointer causes us to need to address that in a position independent way and adds noticeable overhead. Instead we utilized a small feature of the illumos linker, local auditing, to create a trampoline. Before we get into the details of the auditing library, here’s the data we used to support the decision. We made a small and simple benchmark that just does a fixed number of small mallocs and frees in a loop and compared the differences.
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#define MAX 1024 * 1024 * 512
int
main(int argc, char *argv[])
{
int ii;
void *f;
int size = 4;
hrtime_t start;
if (argc == 2)
size = atoi(argv[1]);
start = gethrtime();
for (ii = 0; ii < MAX; ii++) {
f = malloc(4);
free(f);
}
printf("%lld\n", (hrtime_t)(gethrtime() - start));
return (0);
}
| arch | libc (ns) | libumem (ns) | indirect call (ns) | audit library (ns) |
|---|---|---|---|---|
| i386 | 39833278751 | 57784034737 | 14522070966 | 9215801785 |
| amd64 | 32470572792 | 47828105321 | 9654626131 | 8980269581 |
From this data we can see that the audit library technique ended up being a small win on amd64, but for i386, it was a much more substantial win. This all comes down to how the compiler generated PIC code.
Audit libraries are a feature of the illumos linker that allow you to interpose on all the relocations that are being made to and from a given library. For the full details on how audit libraries work consult the Linkers and Libraries guide (one of the best pieces of Sun Documentation) or the linker source itself. We created a local auditing library that allows us to only audit libumem. As part of auditing the relocation for libumem's malloc and free symbols the audit library gives us an opportunity to replace the symbol with one of our own choice. The audit library instead returns the address of a local buffer which contains a jump instruction to the either the actual malloc or free. This installs our simple trampoline.
Later, when umem_init() runs we end up generating the assembly versions of our functions. libumem uses symbols which the auditing library interposes upon to be told where the buffers it should put the generated function are. After both the malloc and free implementations have been successfully generated, it removes the initial jump instruction and atomically replaces it with a five byte nop instruction. We looked at using both the multi-byte nop, five single byte nops, and just zeroing out the jump offset so it would become a jmp +0. Using the same microbenchmark we used earlier, we saw that the multi-byte nop made a noticeable difference.
| arch | jmp +0 (ns) | single-byte nop (ns) | multi-byte nop (ns) |
|---|---|---|---|
| i386 | 9215801785 | 9434344776 | 9091563309 |
| amd64 | 8980269581 | 8989382774 | 8562676893 |
For more details on how this all works and fits together, take a look at the updated libumem big theory statement and pay particular attention to section 8. You may also want to look at the i386 and amd64 specific files which generate the assembly.
Needed linker fixes
There are two changes that are necessary for local auditing to work correctly. Thanks to Bryan who went and made those changes and figured out the way to create this trampoline with the local auditing library.
Understanding per-thread cache utilization
Bryan worked to supply not only the necessary enhancements to the linker, but also supply enhancements to various mdb dcmds to better understand the behavior of the per-thread caching in libumem. ::umastat was enhanced to show both the amount that each thread has used and to show how many allocations each cache has in the per-thread cache (ptc).
> ::umastat
memory % % % % % % % % % % % % % % % % %
tid cached cap 8 16 32 48 64 80 96 112 128 160 192 224 256 320 384 448
--- ------- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
1 174K 16 0 6 6 1 4 0 0 18 2 50 0 4 0 1 0 2
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 201K 19 0 6 6 2 4 8 1 16 2 43 0 3 0 1 0 1
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 62.1K 6 0 8 7 3 9 0 0 13 5 38 0 6 0 2 1 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cache buf buf buf buf memory alloc alloc
name size in use in ptc total in use succeed fail
------------------------- ------ ------- ------- ------- ------- --------- -----
umem_magazine_1 16 33 - 252 4K 35 0
umem_magazine_3 32 36 - 126 4K 39 0
umem_magazine_7 64 0 - 0 0 0 0
umem_magazine_15 128 4 - 31 4K 4 0
umem_magazine_31 256 0 - 0 0 0 0
umem_magazine_47 384 0 - 0 0 0 0
umem_magazine_63 512 0 - 0 0 0 0
umem_magazine_95 768 0 - 0 0 0 0
umem_magazine_143 1152 0 - 0 0 0 0
umem_slab_cache 56 46 - 63 4K 48 0
umem_bufctl_cache 24 153 - 252 8K 155 0
umem_bufctl_audit_cache 192 0 - 0 0 0 0
umem_alloc_8 8 0 0 0 0 0 0
umem_alloc_16 16 2192 1827 2268 36K 2192 0
umem_alloc_32 32 1082 921 1134 36K 1082 0
umem_alloc_48 48 275 202 336 16K 275 0
umem_alloc_64 64 487 359 504 32K 487 0
umem_alloc_80 80 246 234 250 20K 246 0
umem_alloc_96 96 42 41 42 4K 42 0
umem_alloc_112 112 741 676 756 20K 741 0
umem_alloc_128 128 133 109 155 20K 133 0
umem_alloc_160 160 1425 1274 1425 36K 1425 0
umem_alloc_192 192 11 9 21 4K 11 0
umem_alloc_224 224 83 82 90 20K 83 0
umem_alloc_256 256 8 8 15 4K 8 0
umem_alloc_320 320 24 22 24 8K 24 0
umem_alloc_384 384 7 6 10 4K 7 0
umem_alloc_448 448 20 19 27 12K 20 0
umem_alloc_512 512 1 - 16 8K 138 0
umem_alloc_640 640 0 - 18 12K 130 0
umem_alloc_768 768 0 - 10 8K 87 0
umem_alloc_896 896 0 - 18 16K 114 0
...
In addition, this work inspired Bryan to go and add %H to mdb_printf for human readable sizes. As a part of the support for the enhanced ::umastat, there are also new walkers for the various ptc caches.
Performance of the Per-Thread Caching
The ultimate goal of this was to improve our performance. As part of that we did different testing to make sure that we understood what the impact and effects of this would be. We primarily compared ourselves to the libc malloc implementation and a libumem without our new bits. We also did some comparison to other allocators, notably eglibc. We chose eglibc because that is what the majority of customers coming to us from other systems are using and because it is a good allocator, particularly for small objects.
Tight 4 byte allocation loop
One of the basic things that we wanted to test, inspired in part by some of the behavior we had seen in applications, was to measure what a tight malloc and free loop of a small allocation looked like where we varied the number of threads. Below we included a test where we did this one thread and one where we did it with sixteen threads. The main take away we got from this is that libumem has historically been slow at these compared to a single threaded libc program. The sixteen thread graph shows why we really want to use libumem compared to libc. The graph shows the time per thread. As we can see, libc's single mutex for malloc is rather painful.


Time for all cached allocations
Another thing that we wanted to measure was how our allocation time scaled with the size of the cache. While our assembly is straightforward, it could probably be further optimized. We ran the test with both 32-bit and 64-bit code and the results are below. From the graphs you can see that scale fairly linearly across the caches.


The effects of the per-thread cache on uncacheable allocations
One of the things that we wanted to verify was that the presence of the per-thread caching did not unreasonably degrade the performance of other allocations. To look at this we compared what happened if you used libumem and what happened if you did not. We used pbind to lock the program to a single CPU, measured the time it took to do 1KB sized allocations, and compared the differences. We took that value and divided by the total number of allocations we had performed, 512 M in this case. The end result was that for a given loop of malloc and free, the overhead was 8-10ns. That was within reason for our acceptable overhead.
umem_init time
Another one of the areas we wanted to make sure that we didn't seriously regress was the time it takes umem_init. I've included a coarse graph that was created using DTrace. I simply rigged up something that traced the amount of wall and cpu time umem_init took. We repeated that 100 times and graphed the results. The graph below shows a roughly 50 microsecond increase in the wall and cpu time. In this case, a reasonable increase.
Our Original Flame Graph
The last thing that I want to look at is what the original flame graph now looks like using per-thread-caching. We increased the per-thread cache to 64MB because that allows us to cache the majority of the malloc activity which primarily comes from only one thread. The new flame graph is different from the previous two. The amount of time that we've spent in malloc and free has been minimized and compared to libumem previously, we are no longer three layers deep. In terms of raw improvement, while this normally took around 110 seconds with libc, with per-thread-caching we're down to around 78 seconds. Remember, this is a pure node.js benchmark. To have an improvement to malloc() result in a ~30% win was pretty surprising. Even in dynamic garbage collected languages, the performance of malloc is still important.
Wrapping Up
In this entry I've described the high-level design, implementation, and some of the results we've seen from our work on per-thread caching libumem. For some workloads there can be a substantial performance improvement by using per-thread caching. To try it out, grab the upcoming release of SmartOS and either add -lumem to your Makefile or simply try it out by running your application with LD_PRELOAD=libumem.so.
When you link with libumem, per-thread caching is enabled by default with a 1 MB per-thread cache. This value can be tuned via the UMEM_OPTIONS environment variable via UMEM_OPTION=perthread_cache=[size]. For example, to set it to 64 MB, you would do something like: UMEM_OPTIONS=perthread_cache=64M. If you enable any kind of the umem_debug(3MALLOC) facilities then this will be disabled. Similarly if you request nomagazines, this will be disabled.
If you have questions, feel free to ask here or join us on IRC.
In: SunOS · Tagged with: illumos, libc, libumem, mdb, SunOS
mdb tab completion
Last October, the first illumos hack-a-thon took place. Out of that a lot of interesting things were done and have since been integrated into illumos. Two of the more interesting gems were Adam Leventhal and Matt Ahrens adding dtrace -x temporal and Eric Schrock adding the DTrace print() action. Already print() is in the ranks of things where once you have it you really miss it when you don’t. During the hack-a-thon I had the chance to work Matt Amdur. Together we worked on another one of those nice to haves that has finally landed in illumos: tab completion for mdb.
md-what?
For those who have never used it, mdb is the Modular Debugger that comes as a part of illumos and was originally developed for Solaris 8. mdb is primarily used for post-mortem of user and kernel applications and kernel debugger. mdb isn’t a source level debugger, but it works quite well on core dumps from userland, inspects and modifies live kernel state without stopping the system, and provides facilities for traditional debugging where a program is stopped, stepped over, and inspected. mdb replaced adb which came out of AT&T. While mdb isn’t 100% compatible with adb, it does remind you that there’s ‘No algol 68 here’. For the full history, take a look at Mike Shapiro’s talk that he gave at the Brown CS 37th IPP Symposium.
One of the more useful pieces of mdb is its module API which allows other modules to deliver specifically tailored commands and walkers. This is used for everything from the v8 Javascript Engine to understanding cyclics. Between that, pipelines, and other niceties, there isn’t too much else you could want from your debugger.
What’s involved
The work that we’ve done falls into three parts:
- A tab completion engine.
- Changes to the module API and several new functions to allow dcmds
to implement their own tab completion. - Tab completion support for several dcmds
Thanks to CTF data in the kernel, we can tab complete everything from walker names, to types and their members. We went and added tab completion to the following dcmds:
- ::walk
- ::sizeof
- ::list
- ::offsetof
- The dcmds themselves
Seeing is believing: Tab completion in action
Completing dcmds
> ::pr[tab] print printf project prov_tab prtconf prtusb
Completing walkers
> ::walk ar[tab] arc_buf_hdr_t arc_buf_t > ::walk arc_buf_
Completing types
> ::sizeof struct dt[tab] struct dtrace_actdesc struct dtrace_action struct dtrace_aggbuffer struct dtrace_aggdesc struct dtrace_aggkey struct dtrace_aggregation struct dtrace_anon struct dtrace_argdesc struct dtrace_attribute struct dtrace_bufdesc struct dtrace_buffer struct dtrace_conf struct dtrace_cred struct dtrace_difo struct dtrace_diftype struct dtrace_difv struct dtrace_dstate struct dtrace_dstate_percpu struct dtrace_dynhash struct dtrace_dynvar struct dtrace_ecb struct dtrace_ecbdesc struct dtrace_enabling struct dtrace_eprobedesc struct dtrace_fmtdesc struct dtrace_hash ...
Completing members
> ::offsetof zio_t io_v[tab] io_vd io_vdev_tree io_vsd io_vsd_ops
Walking across types with ::print
> p0::print proc_t p_zone->zone_n[tab] zone_name zone_ncpus zone_ncpus_online zone_netstack zone_nlwps zone_nlwps_ctl zone_nlwps_lock zone_nodename zone_nprocs zone_nprocs_ctl zone_nprocs_kstat zone_ntasks
In addition, just as you can walk across structure (.) and array ([]) dereferences in ::print invocations, you can also do the same with tab completion.
What’s next?
Now that mdb tab completion change is in illumos there’s already some work to add backends to new dcmds including:
- ::printf
- ::help
- ::bp
What else would you like to see? Let us know in a comment or better yet, go ahead and implement it yourself!
illumos Hardware Compatibility List
One of the challenges when using any Operating System is answering the question ‘Is my hardware supported?’. To track this down, you often have to scour Internet sites, hoping someone else has already asked the question, or do other, more horrible machinations – or ask someone like me. If you’re running on an illumos-based system like SmartOS, OmniOS, or OpenIndiana, this just got a lot easier: I’ve created the list. Better yet, I’ve created a tool to automatically create the list.
The List
illumos now has a hardware compatibility list (HCL) available at http://illumos.org/hcl.
This list contains all the PCI and PCI Express devices that should work. If your PCI device isn’t listed there, don’t fret, it may still work. This list is a first strike at the problem of hardware compatibility, so things like specific motherboards aren’t listed there.
How it’s generated
The great thing about this list is that it’s automatically generated from the source code in illumos itself. Each driver on the system has a manifest that specifies what PCI IDs it supports. We parse each of these manifests and look up the names using the PCI ID Database, using a small library that I wrote. From there, we automatically generate the static web page that can be deployed. Thanks to K. Adam White for his invaluable help to stop me from fumbling around too much with front end web code and the others who have already come in and improved it.
All the code is available on github. The goal for all of this is to eventually be a part of the illumos-gate itself. If you have improvements or want to make the web page more visually appealing, we’d all welcome the contribution.
Figuring out where you’re longjmp(3c)ing with DTrace
Last Monday was the illumos hack-a-thon. There, I worked with Matt Amdur on adding tab completion support to mdb — the illumos modular debugger. The hack-a-thon was wildly successful and a lot of fun, I hope to put together an entry on the hack-a-thon and give an overview of the projects that were worked on over the course of the next few days. During the hack-a-thon, Matt and I created a working prototype that would complete the types and members using ::print, but there was still some good work for us to do. One of the challenges that we were facing was some unexpected behavior whenever the mdb pager was invoked. We were seeing different actions depending on which actions you took from the pager: continue, quit, or abort.
If you take a look at the source code, you’ll see that sometimes we’ll leave this function by calling longjmp(3c). There’s a lot of different places that we call setjmp(3c) and sigsetjmp(3c) in mdb, so tracking down where we were going just based on looking at the source code would be tricky. So, we want to answer the question, where are we jumping to? There are a few different ways we can do this (and more that aren’t listed):
- Inspect the source code
- Use mdb to debug mdb
- Use the DTrace pid provider and trace a certain number of instructions before we assume we’ve gotten there
- Use the DTrace pid provider and look at the jmp_buf to get the address of where we were jumping
Ultimately, I decided to go with option four, knowing that I would have to solve this problem again at some point in the future. The first step is to look at the definition of the jmp_buf definition. For the full definition take a look at setjmp_iso.h. Here’s the snippet that actually defines the type:
82 #if defined(__i386) || defined(__amd64) || \
83 defined(__sparc) || defined(__sparcv9)
84 #if defined(_LP64) || defined(_I32LPx)
85 typedef long jmp_buf[_JBLEN];
86 #else
87 typedef int jmp_buf[_JBLEN];
88 #endif
89 #else
90 #error "ISA not supported"
91 #endif
Basically, the jmp_buf is just an array where we store some of registers. Unfortunately this isn’t sufficient to figure out where to go. So instead, we need to take a look at the implementation. setjmp is implemented in assembly for the particular architecture. Here it is for x86 and amd64. Now that we have the implementation, let’s figure out what to do. As a heads up, if you’re looking at any of these .s files, the numbers are actually in base 10, which is different from what you get when you look at the mdb output which has them in hex. Let’s take a quick look at the longjmp source for a 32-bit system and dig into what’s going on and how we know what to do:
73 ENTRY(longjmp)
74 movl 4(%esp),%edx / first parameter after return addr
75 movl 8(%esp),%eax / second parameter
76 movl 0(%edx),%ebx / restore ebx
77 movl 4(%edx),%esi / restore esi
78 movl 8(%edx),%edi / restore edi
79 movl 12(%edx),%ebp / restore caller's ebp
80 movl 16(%edx),%esp / restore caller's esp
81
82 movl 24(%edx), %ecx
83 test %ecx, %ecx / test flag word
84 jz 1f
85 xorl %ecx, %ecx / if set, clear ul_siglink
86 movl %ecx, %gs:UL_SIGLINK
87 1:
88 test %eax,%eax / if val != 0
89 jnz 1f / return val
90 incl %eax / else return 1
91 1:
92 jmp *20(%edx) / return to caller
93 SET_SIZE(longjmp)
The function is pretty well commented, so we can follow along pretty easily. Basically we load the jmp_buf that was passed in into %edx, add 0×14 to that value and then jump to that piece of code. So now we know exactly what the address we’re returning to is. With this in hand, we only have two tasks left: transforming this address into a function and offset, and doing this all with a simple DTrace script. Solving the first problem is actually pretty easy. We can just use the DTrace uaddr function which will translate it into an address and offset for us. The script itself is now an exercise in copyin and arithmetic. Here’s the main part of the script:
/*
* Given a sigbuf translate that into where the longjmp is taking us.
*
* On i386 the address is 0x14 into the jmp_buf.
* On amd64 the address is 0x38 into the jmp_buf.
*/
pid$1::longjmp:entry
{
uaddr(curpsinfo->pr_dmodel == PR_MODEL_ILP32 ?
*(uint32_t *)copyin(arg0 + 0x14, sizeof (uint32_t)) :
*(uint64_t *)copyin(arg0 + 0x38, sizeof (uint64_t)));
}
Now, if we run this, here’s what we get:
[root@bh1-build2 (bh1) /var/tmp]# dtrace -s longjmp.d $(pgrep -z rm mdb) dtrace: script 'longjmp.d' matched 1 probe CPU ID FUNCTION:NAME 8 69580 longjmp:entry mdb`mdb_run+0x38
Now we know exactly where we ended up after the longjmp(), and this will method will work on both 32-bit and 64-bit x86 systems. If you’d like to download the script, you can just download it from here.
In: DTrace, Miscellaneous · Tagged with: DTrace, longjmp, mdb
Visualizing KVM
Last March, Bryan Cantrill and I joined Max Bruning on working towards bringing KVM to illumos. Six months ago we found ourselves looping in x86 real mode and today we’re booting everything from Linux to Plan 9 to Weenix! For a bit more background on how we got there take a gander at Bryan’s entry on KVM on illumos.
For the rest of this entry I’m going to talk about the exciting new analytics we get by integrating DTrace and kstats into KVM. We’ve only scratched the surface of what we can see, but already we’ve integrated several metrics into Cloud Analytics and have gained insight into different areas of guest behavior that the guests themselves haven’t really seen before. While we can never gain the same amount of insight into Virtual Machines (VMs) that we can with a zone, we easily have insight into three main resources of a VM: CPU, disk, and network. Cloud Operators can use these metrics to determine if there is a problem with a VM, determine which VMs are having issues, and what areas of the system are suffering. In addition, we’ve pushed the boundaries of observability by taking advantage of the fact that several components of the hardware stack are virtualized. All in all, we’ve added metrics in the following areas:
- Virtual NIC Throughput
- Virtual Disk IOps
- Virutal Disk Throughput
- Hardware Interrupts
- Virtual Machine Exits
- vCPU Samples
NICs and Disks
One of the things that we had to determine early on was how the guests virtual devices would interface with the host. For NICs, this was simple: rather than trying to map a guest’s NIC to a host’s TUN or TAP device; we just used a VNIC, which was introduced into the OS by the Crossbow project. Each guest NIC corresponds directly to a Crossbow VNIC. This allows us to leverage all of the benefits of using a VNIC including anti-spoof and the analytics that already exist. This lets us see the throughput in terms of either bytes or packets that the guest is sending and receiving on a per guest NIC basis.
The story with disks is quite similar. In this case each disk that the guest sees is backed by a zvol from ZFS. This means that guests unknowingly get the benefits of ZFS: data checksums, snapshots and clones, the ease of transfer via zfs send and zfs receive, redundant pooled storage, and a proven reliability. What is more powerful here is the insight that we can provide into the disk operations. We provide two different views of disk activity. The first is based on throughput and the second is based on I/O operations.
The throughput based analytics are a great way to understand the actual disk activity that the VM is doing. Yet the operations view gives us several interesting avenues to drill down into VM activity. The main decompositions are operation type, latency, offset, and size. This gives us insight into how guest filesystems are actually sending activity to disk. As an example, we generated the following screenshot from a guest running Ubuntu on an ext3 filesystem. The guest would loop creating a 3gb file, sleeping for a seconds, reading the file, and deleting the file before beginning again. In the image below we see operations decomposed by operation type and offset. This allows us to see where on disk ext3 is choosing to lay out blocks on the filesystem. The x-axis represents time; each unit is one second. The y-axis shows the virtual disk block number.
Hardware Interrupts
Brendan Gregg has been helping us out by diligently measuring our performance, comparing us to both a bare metal Linux system and KVM under Linux. While trying to understand the performance characteristics and ensuring that KVM on illumos didn’t have too many performance pathologies he stumbled across an interesting function in the kvm source code: vmx_inject_irq. This function is called any time a hardware interrupt is injected into the guest. We combined this information with an incredibly valuable idea for heatmaps that Brendan thought up. A heatmap based upon subsecond offset allows us to see when across a given second some action occurred. The x-axis is the same as the previous graph, one second. The y-axis though represents when in the second this item occurred, i.e. where in the 1,000,000 microseconds did this action occur. Take a look at the following image:
Here we are visualizing which interrupts occurred in a given second and looking at it based upon when in the second they occur. Each interrupt vector is colored differently. The red represents interrupts caused by the disk controller and yellow by the network controller. The blue is the most interesting: these are timer based interrupts generated by the APIC. Lines that are constant across the horizontal means that these are events that are happening at the same time every second. These represent actions caused by an interval timer, something that
always fires every second. However there are lines that look like a miniature stair; ones that go up at an angle. These represent an application that does work, calls sleep(3C) for an interval, does a little bit of work and sleeps again.
VM Exits
A VM exits when the processor ceases running the guest code and returns to the host kernel to handle something which must be emulated such as memory mapped I/O or accessing one of the emulated devices. One of the ways to increase VM performance is to minimize the number of exits. Early on during our work porting KVM we saw that there were various statistics that were being gathered and exported via debugfs in the Linux KVM code. Here we leveraged kstats and Bryan quickly wrote up kvmstat. kvmstat quickly became an incredibly useful tool for us to easily understand VM behavior. What we’ve done is leverage the kstats which allow us to know which VM, which vCPU, and which of a multitude of reasons the guest exited and add that insight into Cloud Analytics.
vCPU Samples
While working on KVM and reading through the Intel Architecture Manuals I reminded myself of a portion of x86 architecture that is quite important, mainly that the root of the page table is always going to be stored in cr3. Unique values in cr3 represent unique Memory Management Unit (MMU) contexts. Most modern operating systems that run on x86 use a different a different MMU context for each process running on the system and the kernel. Thus if we look at the value of cr3 we get an opaque token that represents something running in the guest.
Brendan had recently been adding metrics to Cloud Analytics based upon using DTrace’s profile provider and combining the gathered data with the subsecond offset heatmaps that we previously discussed. Bryan had added a new variable to D that allowed us to look at the register state of a given running vCPU. To get the value of cr3 that we wanted we could use something along the lines of vmregs[VMX_GUEST_CR3]. When you combine these two we get a heatmap that shows us what is running in the guest. Check out the image below:
Here, we’ve sampled at a frequency of 99 Hz. We avoid sampling at 100 Hz because that would catch a lot of periodic system activity. We’re looking at a one CPU guest running Ubuntu. The guest is initially idle. Next we start two CPU bound activities, highlighted in blue and yellow. What we can visualize are the scheduling decisions being made by the Linux kernel. To further see what happened, we used renice on one of the processes setting it to a value of 19. You can see the effect in the first image below as the blue process rarely runs in comparison to the yellow. In the second image we experimented with the effects of setting different nice values.
These visualizations are quite useful. They let us give someone an idea of what is running in their VM. While it can’t pinpoint it to the exact process, it does let the user understand what the characteristics of their workload are and whether it is a few long lived processes fighting for the CPU, lots of short lived processes coming and going, or something in between. Like the rest of these metrics this lets you understand where in your fleet of VMs the problem may be occurring and help narrow things down to which few VMs should be looked at with native tools.
Conclusion
We’ve only begun to scratch the surface of what we can understand about a virtual machine running under KVM on illumos. Needless to say, this wouldn’t be possible without DTrace and its guarantees of safety for use on production systems and only the overhead of a few NOPs when not in use. As time goes on we’ll be experimenting on what information can help us, operators, and end users better understand their VM’s performance and adding those abilities to Cloud Analytics.
In: DTrace · Tagged with: Cloud, DTrace, KVM, SunOS
Racing in the depths of SMF
I recently found myself having to dive into the depths of SMF — The SunOS (illumos / Solaris) Service Management Framework — to debug a nasty race condition between svccfg import and svcadm enable -s. Understanding what happened sent me chasing around and dealing with a cheerful cast of characters that you might or might not expect: svc.configd, svc.startd, the EMI (early manifest import) service, and the ON build process. I found myself digging and doing a lot of reading to understand how all these different pieces worked together and communicated, which made me realize that this would be incredibly useful for the next person (really when I forget) who has to make another trip back into this important yet quite complicated subsystem.
The Problem
We had a heavily loaded system that was doing boot up and initializing lots of zones. This was running on VMware Fusion, which while great for development, is understandably not a performance king. During this process we have lots of scripts that do something similar to the following shell snippet:
# svccfg import service.xml # svcadm enable -s service svcadm: svc:/SERVICE/:default is misconfigured (lacks "restarter" property group)
Well, that’s a problem. Now, you might say that obviously our manifest is misconfigured, but that actually isn’t the case. Manifests optionally may specify a restarter property group. If they don’t, svc.startd takes control of restarting the instance. This is what the majority of services want so the problem here isn’t that we didn’t specify the restarter group, but for some reason it’s missing after we imported! Before we can explain what actually happened and how to fix it, we need to do a bit of an explanation for how SMF works and communicates. Keep in mind I didn’t write SMF, so there may be one or two oversights.
Rough SMF Architecture
There are a few different components that make up SMF and are responsible for different pieces of functionality that are used:
- svc.startd(1M): A daemon that is in charge of starting, stopping, and restarting services and instances
- svc.configd: A daemon that manages the repository that stores information, property groups, and state of the different services and instances
- libscf(3LIB): A C library that provides the glue for communicating, accessing, and updating information about services and instances
- svccfg(1M): A utility to add and remove services as well as change the properties associated with different services and instances.
- svcadm(1M): A utility to control the different instance of a service. You can use this to enable and disable them among some other useful things.
- svcs(1): A utility that reports on the status of various services on the system
Now how all of these work together is far from simple, in fact it can be quite confusing. Here’s a block diagram I put together that helps explain everything and how they all communicate:
/* * The SMF Block Diagram * Repository * This attempts to show ___________ __________ * the relations between | | SQL | | * the different pieces | configd |<----------->| SQLite | * that make SMF work and | | Transaction | | * users/administrators ----------- ---------- * call into. /|\ /|\ * | | * door_call(3C)| | door_call(3C) * | | * \|/ \|/ * ____________ __________ __________ ____________ * | | | | | | | svccfg | * | startd |<--->| libscf | | libscf |<---->| svcadm | * | | | (3LIB) | | (3LIB) | | svcs | * ------------ ---------- ---------- ------------ * /|\ /|\ * | | fork(2)/exec(2) * | | libcontract(3LIB) * \|/ \|/ Various System/User services * --------------------------------------------------------------------- * | system/filesystem/local:default system/coreadm:default | * | network/lookpback:default system/zones:default | * | network/ntp:default system/cron:default | * | smartdc/agent/ca/cainstsvc:default network/ssh:default | * | appliance/kit/akd:default system/svc/restarter:default | * --------------------------------------------------------------------- */
Chatting with configd and sharing repository information
As you run commands with svcs, svccfg, and svcadm, they are all creating a libscf handle to communicate with configd. As calls are made via libscf they ultimately go and talk to configd to get information. However, how we actually are talking to configd is not as straightforward as it appears.
When configd starts up it creates a door located at /etc/svc/volatile/repository_door. This door runs the routine called main_switcher() from usr/src/cmd/svc/configd/maindoor.c. When you first invoke svc(cfg|s|adm), one of the first things that occurs is creating a scf_handle_t and binding it to configd by calling scf_handle_bind(). This function makes a door call to configd and gets returned a new file descriptor. This file descriptor is itself another door which calls into configd’s client_switcher(). This is the door that is actually used when getting and fetching properties, and many other useful things.
svc.startd needs a way to notice the changes that occur to the repository. For example, if you enabled a service that was not previously running, it’s up to startd to notice that this has happened, check dependencies, and eventually start up the service. The way it gets these notifications is via a thread who’s sole purpose in life is to call _scf_notify_wait(). This function acts like poll(2) but for changes that occur in the repository. Once this thread gets the event, it dispatches it handles the event appropriately.
The Events of svc.startd
svc.startd has to handle a lot of complexity. Understanding how you go from getting the notification that a service was enabled to actually enabling it is not obvious from a cursory glance. The first thing to keep in mind is that startd maintains a graph of all the related services and instances so it can keep track of what is enabled, what dependencies exist, etc. all so that it can answer the question of what is affected by a change. Internally there are a lot of different queues for events, threads to process these queues, and different paths to have events enter these queues. What follows is a diagram that attempts to explain some of those paths, though it’s important to note that for some of these pieces, such as the graph and vertex events, there are many additional ways and code paths these threads and functions can take. And yes, restarter_event_enqueue() is not the same thing as restarter_queue_event().
/* * Threads/Functions Queues Threads/Functions * * called by various * ------------------ --------- --------------- * --->| graph_protocol | graph_event | graph | graph_event_ | graph_event | * --->| _send_event() |------------>| event |----------------->| _thread | * ------------------ _enqueue() | queue | dequeue() --------------- * --------- | * _scf_notify_wait() vertex_send_event()| * | \|/ * | -------------------- ---------------------- * |->| repository_event | vertex_send_event() | restarter_protocol | * | _thread |----------------------------->| _send_event() | * -------------------- ---------------------- * | | out to other * restarter_ restarter_ | | restarters * event_dequeue() ------------- event_ | | not startd * |----------------| restarter |<------------| |-------------> * \|/ | event | enqueue() * ------------------- | queue | |------------------> * | restarter_event | ------------- ||-----------------> * | _thread | |||----------------> * ------------------- ||| start/stop inst * | ---------------- ---------------------- * | | instance | | restarter_process_ | * |-------------->| event |------>| events | * restarter_ | queue | | per-instance lwp | * queue_event() ---------------- ---------------------- * ||| various funcs * ||| controlling * ||| instance state * |||---------------> * ||----------------> * |-----------------> */
What’s important to take away is that there is a queue for each instance on the system that handles events related to dealing directly with that instance and that events can be added to it because of changes to properties that are made to configd and acted upon asynchronously by startd.
How does the restarter property group show up
The last thing that we wanted to answer was where does the restarter property actually get set if it is not specified. While looking around the source code, I finally came across an interesting function: libscf_inst_get_or_add_pg. This function was getting called in a few various places and specifies the restarter property group. However, none of this is done in configd or svccfg when you import the manifest. Rather it is all taken care of by startd asynchronously.
To test that this was getting called when you imported a service for the first time and verify that this was getting called by startd, I used the following DTrace snippet that utilizes the pid provider. For more on how to use it, consult Brendan’s blog articles on the pid provider.
[root@headnode (coal:0) ~]# dtrace -n 'pid8::libscf_inst_get_or_add_pg:entry{
printf("%s", copyinstr(arg1)); ustack(); }'
dtrace: description 'pid8::libscf_inst_get_or_add_pg:entry' matched 1 probe
CPU ID FUNCTION:NAME
0 82690 libscf_inst_get_or_add_pg:entry restarter
svc.startd`libscf_inst_get_or_add_pg
svc.startd`libscf_note_method_log+0x6c
svc.startd`method_run+0x132
svc.startd`method_thread+0x184
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start
0 82690 libscf_inst_get_or_add_pg:entry restarter
svc.startd`libscf_inst_get_or_add_pg
svc.startd`libscf_note_method_log+0x6c
svc.startd`method_run+0x132
svc.startd`method_thread+0x184
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start
0 82690 libscf_inst_get_or_add_pg:entry restarter
svc.startd`libscf_inst_get_or_add_pg
svc.startd`libscf_write_start_pid+0x6e
svc.startd`method_run+0x43a
svc.startd`method_thread+0x184
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start
1 82690 libscf_inst_get_or_add_pg:entry restarter
svc.startd`libscf_inst_get_or_add_pg
svc.startd`libscf_write_method_status+0xbc
svc.startd`write_status+0x2f
svc.startd`method_run+0x616
svc.startd`method_thread+0x184
libc.so.1`_thrp_setup+0x9b
libc.so.1`_lwp_start
From this, we see that as a part of getting ready to actually run the specified instance we’re writing out the restarter property group. Thus svccfg should not return until this this property group has been added by startd otherwise we will see invalid state that causes the tools like svcs and svcadm to complain.
The fix and some gotchas
So, the fix here is actually pretty straightforward. What we want to do is after we have imported all of the services and instances associated with a given manifest, we want to verify that every service and instance has a restarter property group. They will have this property group regardless of whether the instance is enabled, disabled, in maintenance, or can’t start due to missing dependencies. The logic here is very simple, iterate over each service and instance specified in the manifest and don’t move on until we can retrieve that property group. Once we can, move onto the next instance. This is pretty straightforward, but there are two times when this logic surprisingly breaks that we have to watch out for and special case.
Native Build
I discovered that as a part of the build process for ON, there is a phase where it builds a version of svc.configd and svccfg which it calls svc.configd-native and svccfg-native. These create initial repositories for the system. However, they are designed to run separately from the normal series of configd and startd that are on the system. In fact, there is no native startd while the native configd and svccfg are running. If we did this check, the restarter property groups will never be created and the build will always spin forever. The only solution is to not do the check. There are a few other places throughout configd and svccfg that already have to deal with the fact that we’re using the same source base and running it in two very different environments. We can work around it by using the preprocessor directive NATIVE_BUILD and a few #ifdefs. I did not introduce that directive, it was already being used liberally in configd and in a few places in svccfg.
Early Manifest Import
PSARC 2010/013 SMF Early Manifest Import introduced a substantial change in when various manifest are imported into the repository during boot. In this case svc.startd purposefully does not listen for notifications from configd while it is running EMI. This has two important ramifications:
- We cannot check for the restarter property while EMI is running its start method.
- This race condition could occur after the EMI service finishes and startd starts acting upon all the changes.
To deal with this, we check the state of the EMI service. If the instance is online, that means that EMI has successfully finished and will never run again until the next time the system boots. This is how svc.startd makes sure not to run it twice in case startd restarts. In our case, we do not try and verify that the instance has a restarter property group unless svc:/system/early-manifest-import:default is online.
The likelihood of the race condition occurring after EMI starts is very unlikely because most start methods are not calling svcadm enable -s on some other service that was imported via EMI, but that does not mean it does not exist and it is worth keeping that in mind if writing the manifest for such a service.
Takeaways
Hopefully the block diagrams here help someone who is making future dives into the depths of SMF. If you do, here are a couple things to keep in mind:
- Have the code base open and in cscope.
- Use DTrace on the different pieces of SMF doing known actions and use ustack().
- Watch out for the two special modes that these programs run in: EMI and NATIVE_BUILD.
- The SMF Codebase is huge. Do not try to keep it all in memory at once. Trace interactions and keep notes.
In: SunOS · Tagged with: SMF, SunOS
A trip down into <sys/regset.h>
Just the other day I was working with Ryan Dahl on debugging an issue he hit while working on adding support for Crankshaft – the new JIT for Google’s v8 — for SunOS. This came about from Bryan’s discovery of what can happen when magic collides. Now, this is a rather delicate operation and there is a lot of special stuff that is going on. Since Ryan and I had an interesting little debugging session and both learned something, I thought I’d share a bit of what was going on with an explanation.
As a part of Crankshaft, they are firing a signal to do a bit of the profiling. Some of the code that is in bleeding edge for src/platform-solaris.cc currently looks like:
615 static void ProfilerSignalHandler(int signal, siginfo_t* info, void* context) {
616 USE(info);
617 if (signal != SIGPROF) return;
618 if (active_sampler_ == NULL || !active_sampler_->IsActive()) return;
619 if (vm_tid_ != pthread_self()) return;
620
621 TickSample sample_obj;
622 TickSample* sample = CpuProfiler::TickSampleEvent();
623 if (sample == NULL) sample = &sample_obj;
624
625 // Extracting the sample from the context is extremely machine dependent.
626 ucontext_t* ucontext = reinterpret_cast(context);
627 mcontext_t& mcontext = ucontext->uc_mcontext;
628 sample->state = Top::current_vm_state();
629
630 #if V8_HOST_ARCH_IA32
631 sample->pc = reinterpret_cast(mcontext.gregs[KDIREG_EIP]);
632 sample->sp = reinterpret_cast(mcontext.gregs[KDIREG_ESP]);
633 sample->fp = reinterpret_cast(mcontext.gregs[KDIREG_EBP]);
634 #elif V8_HOST_ARCH_X64
635 sample->pc = reinterpret_cast(mcontext.gregs[KDIREG_RIP]);
636 sample->sp = reinterpret_cast(mcontext.gregs[KDIREG_RSP]);
637 sample->fp = reinterpret_cast(mcontext.gregs[KDIREG_RBP]);
638 #else
639 UNIMPLEMENTED();
640 #endif
641 active_sampler_->SampleStack(sample);
642 active_sampler_->Tick(sample);
643 }
Now for those of you who have spent a long time working with SunOS might notice what’s wrong with this right away. But in some ways it’s not quite so obvious, so let’s talk about what’s happening.
This code is being used as a signal handler, specifically for SIGPROF. If we pull up the manual page for sigaction(2), the Solaris version has the following comment in its notes section:
NOTES
The handler routine can be declared:
void handler (int sig, siginfo_t *sip, ucontext_t *ucp);
The sig argument is the signal number. The sip argument is a
pointer (to space on the stack) to a siginfo_t structure,
which provides additional detail about the delivery of the
signal. The ucp argument is a pointer (again to space on the
stack) to a ucontext_t structure (defined in <sys/ucon-
text.h>) which contains the context from before the signal.
It is not recommended that ucp be used by the handler to
restore the context from before the signal delivery.
SunOS 5.11 Last change: 23 Mar 2005 5
When a signal is delivered on an x86 UNIX system a program stops doing what it is currently doing and if there is a signal handler, executes the code for the signal handler and then returns to what it was previously doing (this is a bit more complicated in a multi-threaded program). We generally describe this as a signal interrupting the thread in question. This third argument to the handler is a context, which is all the information necessary to describe where a user program is executing. If we peek our heads into <sys/ucontext.h> on an x86 based system (the SPARC version is different)) we will find the following declaration for the structure (with a few #ifdefs along for the ride):
75 #if !defined(_XPG4_2) || defined(__EXTENSIONS__)
76 struct ucontext {
77 #else
78 struct __ucontext {
79 #endif
80 unsigned long uc_flags;
81 ucontext_t *uc_link;
82 sigset_t uc_sigmask;
83 stack_t uc_stack;
84 mcontext_t uc_mcontext;
85 long uc_filler[5]; /* see ABI spec for Intel386 */
86 };
Specifically here we are interested in the mcontext — what v8 is using. To best understand what the mcontext is, I took a look at what the OpenGroup defines for ucontext.h in SUSv2. They have the following to say about the mcontext:
mcontext_t uc_mcontext a machine-specific representation of the saved context
More specifically the mcontext_t has two members. From <sys/regset.h> we get:
378 /*
379 * Structure mcontext defines the complete hardware machine state.
380 * (This structure is specified in the i386 ABI suppl.)
381 */
382 typedef struct {
383 gregset_t gregs; /* general register set */
384 fpregset_t fpregs; /* floating point register set */
385 } mcontext_t;
Well, that’s exactly what v8 is looking for. From the code snippet there, they are saving three registers that describe how the machine works:
- The Base Pointer – ebp for i386 and rbp on amd64
- The Instruction Pointer – eip for i386 and rip for amd64
- The Stack Pointer – esp for i386 and rsp on amd64
Now keeping track of what each of these does can be quite confusing, so let’s do a quick review.
The instruction pointer holds the address of the next assembly instruction that the CPU should execute for this program. The Base Pointer and Stack Pointer are unfortunately, not quite as intuitive. Memory is laid out in the stack from high addresses towards low addresses. The stack pointer tells us where the bottom of the stack is, i.e. if we decrement the address we can store a new value. When we use the stack, we break it up into what are called stack frames. A stack frame contains everything necessary to run a function: arguments to the function, copies of registers that are expected to be saved, the instruction to return to after the function completes (the eip) and a pointer to the previous stack frame. The ebp points into the current stack frame.
After this brief interlude, we now return to the code that we were working on v8 src/platform_solaris.cc. Now, every so often that code would segfault. With a brief bit of debugging work and comparing the registers before the interrupt was taken with those in the mcontext, we found that we were using the wrong value! Now, if you look back, you’ll see that we’re using macros with prefix KDIREG. These are generally gotten from <sys/kdi_regs.h>. Specifically the definitions used are architecture dependent and for x86 will be found in <ia32/sys/kdi_regs.h> and in <amd64/sys/kdi_regs.h> for amd64. This is the interface that kmdb uses for operating.
In this context, kdi stands for the Kernel/Debugger Interface. So these definitions are meant for structures that are using that interface. When we specified KDIREGS_ESP the value it ended up actually getting out of the register actually was giving us the register ECX. ECX can be used as a general purpose and historically CX was used for loop counters, so the chances that we’re getting an invalid address are pretty high.
However, it turned out it was not too hard to use the correct registers. Looking at <sys/regset.h> had the answer right in front of us:
91 /* 92 * The names and offsets defined here are specified by i386 ABI suppl. 93 */ 94 95 #define SS 18 /* only stored on a privilege transition */ 96 #define UESP 17 /* only stored on a privilege transition */ 97 #define EFL 16 98 #define CS 15 99 #define EIP 14 100 #define ERR 13 101 #define TRAPNO 12 102 #define EAX 11 103 #define ECX 10 104 #define EDX 9 105 #define EBX 8 106 #define ESP 7 107 #define EBP 6 108 #define ESI 5 109 #define EDI 4 110 #define DS 3 111 #define ES 2 112 #define FS 1 113 #define GS 0
This led us to making the obvious substitutions:
630 #if V8_HOST_ARCH_IA32 631 sample->pc = reinterpret_cast(mcontext.gregs[EIP]); 632 sample->sp = reinterpret_cast(mcontext.gregs[ESP]); 633 sample->fp = reinterpret_cast(mcontext.gregs[EBP]);
Well, actually it was almost too obvious, because it segfaulted as well in the same location. However, instead of using address 0xf (a reasonable value for ECX), it actually had 0×0 in the ESP register! Now wait a minute, this is what tells us where the bottom of the stack is, that’s not right, if the bottom of the stack is at 0 we’re in a lot of trouble.
Now, on Solaris x86/amd64 we take interrupts on the stack. These days, most systems use a 1:1 threading model (for reasons why, ask Bryan or read his paper) so for each userland thread there is a kernel thread that corresponds to it which means that each thread has a stack in both userland and the kernel. So here ESP really could be called KESP — referring to the ESP of the kernel thread. So really what we are interested in here is the ESP for userland or the register UESP.
Now that we know that we need to be using UESP, I took another look at the header file and found the following snippet:
115 /* aliases for portability */ 116 117 #if defined(__amd64) 118 119 #define REG_PC REG_RIP 120 #define REG_FP REG_RBP 121 #define REG_SP REG_RSP 122 #define REG_PS REG_RFL 123 #define REG_R0 REG_RAX 124 #define REG_R1 REG_RDX 125 126 #else /* __i386 */ 127 128 #define REG_PC EIP 129 #define REG_FP EBP 130 #define REG_SP UESP 131 #define REG_PS EFL 132 #define REG_R0 EAX 133 #define REG_R1 EDX
One of the nice things about this here is that it makes it easier to write code that works across both the x86 and amd64 architectures. Of course, this doesn’t really work when looking at SPARC platforms because the ABI and calling conventions are different due to the differences in CPU architecture. This is one of the things that I personally enjoy about SunOS. The act of defining these more portable aliases is really helpful and if we ever get a 128 bit processor for some reason, those macros will be extended to make sense for it as well. Those portable definitions allowed us to take those architecture ifdefs and just replace it with the following three lines:
631 sample->pc = reinterpret_cast(mcontext.gregs[REG_PC]); 632 sample->sp = reinterpret_cast(mcontext.gregs[REG_SP]); 633 sample->fp = reinterpret_cast(mcontext.gregs[REG_FP]);
That’s about it for our little trip down to sys/regset.h. The fix should hopefully land in v8 (it may even have by the time I get around to posting this) shortly. It should be fun to play around with node and a proper Crankshaft on v8.
In: SunOS · Tagged with: SunOS
DTrace probes for Node v0.4.x
As a part of the work for Cloud Analytics that Dave, Bryan, Brendan, and I been doing, we’ve added several USDT probes to Node. As I’ve been doing a lot of the integration work and working with Ryan Dahl to get them into Node itself, I thought that I would document what we’ve added and talk a bit about what information we can get from the probes. Dave has a blog entry that explains how you can use these probes and Cloud Analytics to understand what’s going on with your application.
Currently Node has probes that let us track the following events:
- HTTP Server Requests and Responses
- HTTP Client Requests and Responses
- Socket Operations
- Garbage Collection
Before we jump too far ahead of ourselves, we should talk a bit about how to enable DTrace USDT probes with node. To get the DTrace probes in node, you can pass the option –with-dtrace to configure. This flag will fail on systems that aren’t SunOS (i.e. Illumos and OpenSolaris). Now, OS X and FreeBSD do have DTrace. You can certainly hack the wscript for node to get the probes compiled into OS X and FreeBSD; however, they have not been tested. On OS X you can get the probes to fire; however, the translator does not currently work on OS X, which limits their power. I have not tested this on FreeBSD 9 which added support for the USDT provider. So, let’s build node with support for DTrace!
$ ./configure --with-dtrace && make && make install
So, now we can go ahead and list all of the probes that DTrace has for node. To do this we need to fire up our newly compiled node and then run the following DTrace command to see all the probes that we have:
rm@devel ~ $ dtrace -l -n node*::: ID PROVIDER MODULE FUNCTION NAME 4331 node12090 node _ZN4node14dtrace_gc_doneEN2v86GCTypeENS0_15GCCallbackFlagsE gc-done 4332 node12090 node _ZN4node15dtrace_gc_startEN2v86GCTypeENS0_15GCCallbackFlagsE gc-start 4333 node12090 node _ZN4node26DTRACE_HTTP_CLIENT_REQUESTERKN2v89ArgumentsE http-client-request 4334 node12090 node _ZN4node27DTRACE_HTTP_CLIENT_RESPONSEERKN2v89ArgumentsE http-client-response 4335 node12090 node _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE http-server-request 4336 node12090 node _ZN4node27DTRACE_HTTP_SERVER_RESPONSEERKN2v89ArgumentsE http-server-response 4337 node12090 node _ZN4node28DTRACE_NET_SERVER_CONNECTIONERKN2v89ArgumentsE net-server-connection 4338 node12090 node _ZN4node22DTRACE_NET_SOCKET_READERKN2v89ArgumentsE net-socket-read 4339 node12090 node _ZN4node23DTRACE_NET_SOCKET_WRITEERKN2v89ArgumentsE net-socket-write 4340 node12090 node _ZN4node21DTRACE_NET_STREAM_ENDERKN2v89ArgumentsE net-stream-end
As we can see there are several probes here, for this entry we’re going to leave out net-stream-end and net-server-connection. If we include the path to the node.d file, we can even view all the arguments of the probe. The node.d file is a translator that is installed into $PREFIX/lib/dtrace. As long as you have the libdir patch (or move the file into /usr/lib/dtrace) then you should be all set and we can run:
rm@devel ~ $ dtrace -L $PREFIX/lib/dtrace -v -n node*:::
ID PROVIDER MODULE FUNCTION NAME
4331 node12090 node _ZN4node14dtrace_gc_doneEN2v86GCTypeENS0_15GCCallbackFlagsE gc-done
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: int
args[1]: int
4332 node12090 node _ZN4node15dtrace_gc_startEN2v86GCTypeENS0_15GCCallbackFlagsE gc-start
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: int
args[1]: int
4333 node12090 node _ZN4node26DTRACE_HTTP_CLIENT_REQUESTERKN2v89ArgumentsE http-client-request
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: node_http_request_t *
args[1]: node_connection_t *
4334 node12090 node _ZN4node27DTRACE_HTTP_CLIENT_RESPONSEERKN2v89ArgumentsE http-client-response
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: node_connection_t *
4335 node12090 node _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE http-server-request
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: node_http_request_t *
args[1]: node_connection_t *
4336 node12090 node _ZN4node27DTRACE_HTTP_SERVER_RESPONSEERKN2v89ArgumentsE http-server-response
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: node_connection_t *
4338 node12090 node _ZN4node22DTRACE_NET_SOCKET_READERKN2v89ArgumentsE net-socket-read
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: node_connection_t *
args[1]: int
4339 node12090 node _ZN4node23DTRACE_NET_SOCKET_WRITEERKN2v89ArgumentsE net-socket-write
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Evolving
Data Semantics: Evolving
Dependency Class: ISA
Argument Types
args[0]: node_connection_t *
args[1]: int
For the rest of this entry we’ll go over the different probes and their arguments. I’ll detail what they are and how you can get some useful information from them. In a future entry I’ll discuss how you can use this information.
HTTP Server Probes
When an HTTP request comes in it will fire the http-server-request probe. The probe is fired once node receives all of the HTTP headers for the incoming connection. Thus the probe fires right before the HTTP socket emits the event ‘request’. The http-server-request probe has two arguments. The first argument tells us information about the HTTP request. We can use it to get the method and url requested. The second argument tells us information about who is connecting to our HTTP server. We can view all this information with the following DTrace snippet:
# cat -n /tmp/node-httpd-request.d
1 #pragma D option quiet
2
3 node21291:::http-server-request
4 {
5 printf("Connection from %s:%d on fd %d - Request %s: %s\n",
6 args[1]->remoteAddress, args[1]->remotePort, args[1]->fd,
7 args[0]->method, args[0]->url);
8 }
# dtrace -L$PREFIX/lib/dtrace -s node-httpd-request.d
Connection from 10.99.99.254:63486 on fd 7 - Request GET: /
Connection from 10.99.99.254:63487 on fd 7 - Request GET: /
Connection from 10.99.99.254:63488 on fd 7 - Request GET: /
Connection from 10.99.99.254:63489 on fd 7 - Request GET: /
Connection from 10.99.99.254:63490 on fd 7 - Request GET: /
^C
The corresponding probe, http-server-response, will fire once we have finished sending all the data in our response to the client. This probe no longer has the information on the incoming HTTP request, but does have all of the information about who we are talking to. Here’s all the information that we have:
# cat node-http-response.d
1 #pragma D option quiet
2
3 node*:::http-server-response
4 {
5 printf("Connection from %s:%d on fd %d\n",
6 args[0]->remoteAddress, args[0]->remotePort, args[0]->fd);
7 }
# dtrace -L$PREFIX/lib/dtrace -s node-http-response.d
Connection from 127.0.0.1:59149 on fd 8
Connection from 127.0.0.1:39130 on fd 9
Connection from 10.99.99.254:63691 on fd 7
Connection from 10.99.99.254:55004 on fd 19
Connection from 10.99.99.254:55000 on fd 14
Connection from 10.99.99.254:55276 on fd 16
Connection from 10.99.99.254:54977 on fd 11
Now, if we want to track the time that has elapsed, we have to be a little smarter. In node we could have multiple connections going on at the same time, and these are all going to occur in the same thread. This means that we cannot use the traditional thread-local variables that we might do via self->foo. So, how do we actually correlate one http-server-request probe with its http-server-response probe? Each connection is using a separate socket and each socket has its own file descriptor. There is a one to one correlation between sockets and file descriptors. A file descriptor will only be reused if the socket it was previously associated with was closed. So if we wanted to track the latency we can use the file descriptor to correlate events and write a small script that looks something like:
rm@devel ~ $ cat -n/tmp/httpd-lat.d
1 #pragma D option quiet
2
3 node*:::http-server-request
4 {
5 latency[args[1]->fd] = timestamp;
6 }
7
8 node*:::http-server-response
9 {
10 printf("Request took %d nanoseconds to complete\n",
11 timestamp - latency[args[0]->fd]);
12 }
rm@devel ~ $ dtrace -L$PREFIX/lib/dtrace -s /tmp/httpd-lat.d
Request took 444589 nanoseconds to complete
Request took 939339 nanoseconds to complete
Request took 353588 nanoseconds to complete
Request took 1431926 nanoseconds to complete
Request took 334404 nanoseconds to complete
^C
Now, if we run this with the sample Node ‘Hello, world!’ HTTP server we certainly can get some rather interesting data. But the value comes from being able to visualize this data, something that Brendan has looked at extensively.
HTTP Client Probes
The HTTP Client probes have been designed to correspond to their HTTP Server counterparts. There are both http-client-request and http-client-response probes. The http-client-request gets fired when an outgoing request is made and the http-client response probe is fired when we have received the response. In terms of arguments, the HTTP Client probes work identically to their server counterparts. The http-client-request has information about what the underlying connection and the HTTP resource we have asked for. The http-client-response probe only has information on the underlying socket.
Let’s look at the same information as we did with the HTTP Server probes, but this time with the client:
# cat -n /tmp/node-httpc-request.d
1 #pragma D option quiet
2
3 node*:::http-client-request
4 {
5 printf("Connection to %s:%d on fd %d - Request %s: %s\n",
6 args[1]->remoteAddress, args[1]->remotePort, args[1]->fd,
7 args[0]->method, args[0]->url);
8 }
# dtrace -L$PREFIX/lib/dtrace -s /tmp/node-httpc-request.d
Connection to 127.0.0.1:23184 on fd 10 - Request GET: /ca/instrumentations/1/value/raw?start_time=1298733995
Connection to 127.0.0.1:23185 on fd 12 - Request GET: /ca/instrumentations/4/value/raw?start_time=1298733995
Connection to 127.0.0.1:23184 on fd 10 - Request GET: /ca/instrumentations/1/value/raw?start_time=1298733997
Connection to 127.0.0.1:23184 on fd 13 - Request GET: /ca/instrumentations/5/value/raw?start_time=1298733997
Connection to 127.0.0.1:23185 on fd 12 - Request GET: /ca/instrumentations/4/value/raw?start_time=1298733997
^C
Again, we can do a similar thing for the response:
# cat -n /tmp/node-httpc-response.d
1 #pragma D option quiet
2
3 node*:::http-client-response
4 {
5 printf("Connection to %s:%d on fd %d\n",
6 args[0]->remoteAddress, args[0]->remotePort, args[0]->fd);
7 }
# dtrace -L$PREFIX/lib/dtrace -s /tmp/node-httpc-response.d
Connection to 127.0.0.1:23185 on fd 15
Connection to 127.0.0.1:23185 on fd 13
Connection to 127.0.0.1:23184 on fd 18
Connection to 127.0.0.1:23184 on fd 12
Connection to 127.0.0.1:23184 on fd 10
^C
With all of this information we can gather a lot of useful information. Because both probes have the file descriptor, we can link them together to get the latency, as we did previously. Because a node program may be making several HTTP requests to external services, having the ability to know which request is inducing how much latency is quite valuable.
Socket Read/Write Probes
This next set of probes is in fact two different probes, one which fires on every read from a socket and one that fires on every write to a socket. In addition to the information about the socket itself, we get two very useful numbers: the size of the data and the amount buffered. Whenever node does a write to a socket, it attempts to write all of it to the kernel at once. However, if the write would block, node stores the data in userland and will write it out a bit later. When this happens, the call to write returns false. If you aren’t careful or checking the return value from write that buffer can grow quite quickly in userland and eat up a lot of memory. Let’s look at how we can use these two probes:
# cat -n socket.d
1 #pragma D option quiet
2
3 node*:::net-socket-read,
4 node*:::net-socket-write
5 {
6 printf("%s:%d -- %d bytes -- %d buffered\n", args[0]->remoteAddress,
7 args[0]->remotePort, arg1, args[0]->bufferSize);
8 }
# dtrace -L$PREFIX/lib/dtrace -s /tmp/socket.d
10.99.99.254:55000 -- 314 bytes -- 0 buffered
10.99.99.254:55000 -- 2740 bytes -- 0 buffered
10.99.99.254:55000 -- 2 bytes -- 0 buffered
10.99.99.254:55000 -- 5 bytes -- 0 buffered
^C
Garbage Collection Probes
Javascript is a language where you don’t have to explicitly manage memory like you do in C. Instead, the runtime uses various techniques to allocate and later determine what memory can be freed. This technique is known as garbage collection. Node is using v8 under the hood to do this work. v8 gives us the ability to be notified right before garbage collection starts and ends. We have two pieces of information from the garbage collection process: the type of garbage collection and various flags.
To understand the different types, we have to dive into the v8 header file. Node v0.4.0 is currently using v8 version v3.1. If we open up /usr/include/v8.h we see the following for valid garbage collection types:
2472 enum GCType {
2473 kGCTypeScavenge = 1 << 0,
2474 kGCTypeMarkSweepCompact = 1 << 1,
2475 kGCTypeAll = kGCTypeScavenge | kGCTypeMarkSweepCompact
2476 };
From this we can see that v8 supports two different kinds of garbage collection. To help understand the differences, it's important to know that v8 uses the notion of generations. Generational garbage collection makes the observation that objects fall into two camps: those that come and go rather quickly and those that stick around for a while. Mark and Sweep compaction is a traditional form of garbage collection which works well for older objects, but takes much longer. The Scavenge algorithm works well for short lived objects and is generally much quicker than Mark and Sweep. Dave helps explain why we care about GC.
The other argument to our GC probes are a series of flags. Currently this looks like:
2478 enum GCCallbackFlags {
2479 kNoGCCallbackFlags = 0,
2480 kGCCallbackFlagCompacted = 1 << 0
2481 };
When using the probes, there are a few useful things to know. When GC occurs, nothing else is going to run. Furthermore, v8 uses the same thread to call the GC prologue and epilogue handlers. These callbacks are respectively fired immediately before and immediately after the GC operation. Because we have the same thread calling it, we can use a thread local variable in D to keep track of and answer the question of how much latency was due to garbage collection. We can do this with the following D script:
# cat -n /tmp/node-gc.d
1 #pragma D option quiet
2
3 node*:::gc-start
4 {
5 self->ts = timestamp;
6 }
7
8 node*:::gc-done
9 {
10 printf("Spent %d nanoseconds doing Garbage Collection\n",
11 timestamp - self->ts);
12 self->ts = 0
13 }
# dtrace -s /tmp/node-gc.d
Spent 780604 nanoseconds doing Garbage Collection
Spent 50416942 nanoseconds doing Garbage Collection
Spent 33276517 nanoseconds doing Garbage Collection
Spent 14498463 nanoseconds doing Garbage Collection
Spent 3847561 nanoseconds doing Garbage Collection
^C
Those are all the probes that we have currently added to node. All of them are available in the stock node src that you can download and compile yourself. The idea behind all of the probes that we added is to be able to answer a question directly related to what your application is doing and to be able to understand why and where issues are coming from.
In: DTrace · Tagged with: Cloud Analytics, DTrace
Fixing DTrace libdir dependency resolution
As a part of the work Bryan, Dave, Brendan, and I have been doing to instrument Node, Bryan wrote a translator file. One of the nice things about DTrace is that you can define the arguments to a probe to be a structure and simply access them like a C struct. The way that this all works is via a translator for USDT probes which embeds knowledge of how to look up members of a probes struct. Unfortunately, there is currently a bug in how we handle dependencies for translators, but we also have a fix for that problem. But before I get there, let’s show an example of he use of translators. Let’s say that I want to get the method for every HTTP request that is coming in with node. I can use the following DTrace snippet:
# dtrace -n 'node*:::http-server-request{ printf("%s", args[0]->method); }'
dtrace: description 'node*:::http-server-request' matched 6 probes
CPU ID FUNCTION:NAME
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
0 3705 _ZN4node26DTRACE_HTTP_SERVER_REQUESTERKN2v89ArgumentsE:http-server-request GET
The DTrace probes themselves specify a specific argument type. If that translator argument is a struct, we can define an operation to transform the probe argument and get out a specific value. Generally this involves a lot of use of copyin and copyinstr and is quite messy o write, but once written makes all scripts that leverage it much more powerful. These translators is that you can express the dependencies between them. In this case, the Node translator relies on the procfs translator. Thus for this to work, you need to have procfs.d which ships with SunOS.
Now, the big question to ask is how does DTrace know where to look for these translators. By default DTrace looks in /usr/lib/dtrace. However, you can also tell DTrace to look in an additional location by using the -Lpath flag to dtrace(1M). So if you wanted to keep your node.d provider in say /opt/rm/lib/dtrace, we could run the above command as:
# dtrace -L/opt/rm/lib/dtrace -n 'node*:::http-server-request{ printf("%s", args[0]->method); }'
Now, unfortunately there is a bug in this; however, I have a fix for it. Without this fix when trying to resolve library dependencies, DTrace only looks for the file in the same directory. So it would try and look for the dependency procfs.d in /opt/rm/lib/dtrace/procfs.d. However, we don’t have it there, we have it in /usr/lib/dtrace.
Here is a patch against Illumos that fixes this issue. Please note that the patch there is released under the CDDL and copyrighted by Joyent. It causes us to search our entire library path to try and find the first occurrence of a file with the name we’re looking for. So in our case above, we will find that proc.d is in /usr/lib/dtrace and resolve the dependency correctly. Hopefully we’ll see this fix make it upstream into Illumos before too long, as well as, Mac OS X and FreeBSD.
In: DTrace, Joyent · Tagged with: DTrace, Joyent, SunOS
Solving Problems with Cloud Analytics
This past Wednesday Brendan Gregg and Bryan Cantrill gave a presentation on the work that’s been going on at Joyent related to visualizing performance and how we’re using this to solve customer problems. This is based on the work that the the four of us (Bryan, Brendan, Dave and I) have been doing.
You can watch the recorded live stream of the presentation Solving Big Problems (with Cloud Analytics) here.
In: Joyent · Tagged with: Cloud Analytics, DTrace, Joyent

