Behind the music: /system/object

September 28, 2004

About a month ago, I added a new pseudo filesystem to Solaris: ‘objfs’, mounted at /system/object. This is an “under the hood” filesystem that no user will interact with directly. But it’s a novel solution to a particularly thorny problem for DTrace, and may be of some interest to the curious techies out there.

When DTrace was first integrated into Solaris, it had a few hacks to get around the problem of accessing kernel module data from userland. In particular, it opened /dev/kmem in order to find the address ranges of module text and data segments, and introduced a private modctl call in order to extract CTF and symbol table information. The end result was something that mostly worked, but with a few drawbacks. Opening /dev/kmem requires all privileges or membership in group sys, so even a user who had been given the dtrace_kernel privilege was still unable to access kernel CTF and symbol information. Direct access via modctl necessitated a complicated (and sometimes broken) dance to allow 32-bit dtrace apps to work on a 64-bit kernel.

The solution was to create a new pseudo filesystem which would export information about the currently loaded objects in the kernel as standard ELF files. Choosing a pseudo filesystem over a new system call or modctl has several advantages:

Filesystems are great for exporting hierarchical data. We needed to export a collection of named data – perfect for a directory layout.
By modeling objects as directories and choosing an ELF file format, we have left room for expansion without having to go back and modify existing implementations.
We can leverage our existing toolset for working with ELF files: elfdump(1), nm(1), libelf(3LIB), and libctf. The userland changes to libdtrace(3LIB) were minimal because we already have established interfaces for working with ELF files.
Filesystems are easily virtualized in a local zone. DTrace is still not usable from within a local zone for a few small reasons, but we’re one step closer.
There are no data model issues. We simply export a 64-bit ELF object, and the gelf_xxx() routines handle the conversion transparently.
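
To make the directory layout concrete, here is a minimal sketch in Python. It assumes only what the elfdump example in this post shows: each loaded module appears as a directory under /system/object containing a file named object. The helper names list_kernel_objects and elf_class are made up for illustration; the real consumers are libdtrace and the gelf routines, not this code.

```python
import os

def list_kernel_objects(root="/system/object"):
    """Return the module directories under root that contain an 'object' file."""
    names = []
    for entry in sorted(os.listdir(root)):
        if os.path.isfile(os.path.join(root, entry, "object")):
            names.append(entry)
    return names

def elf_class(path):
    """Return 32 or 64 based on the EI_CLASS byte of the ELF identification."""
    with open(path, "rb") as f:
        ident = f.read(16)
    if len(ident) < 16 or ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: " + path)
    # EI_CLASS lives at offset 4: 1 is ELFCLASS32, 2 is ELFCLASS64.
    if ident[4] == 1:
        return 32
    if ident[4] == 2:
        return 64
    raise ValueError("unknown ELF class in " + path)
```

On a system with objfs mounted, list_kernel_objects() should include genunix, and elf_class("/system/object/genunix/object") reports which data model the kernel exported. A reader that branches on that one byte can handle either model, which is roughly the job the gelf_xxx() routines do for libdtrace.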

The final result is:

$ elfdump -c /system/object/genunix/object
Section Header[1]:  sh_name: .shstrtab
sh_addr:      0xa6eaea30      sh_flags:   [ SHF_STRINGS ]
sh_size:      0x46            sh_type:    [ SHT_STRTAB ]
sh_offset:    0x1c4           sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[2]:  sh_name: .SUNW_ctf
sh_addr:      0xa61f7000      sh_flags:   0
sh_size:      0x2e79d         sh_type:    [ SHT_PROGBITS ]
sh_offset:    0x20a           sh_entsize: 0
sh_link:      3               sh_info:    0
sh_addralign: 0x8
Section Header[3]:  sh_name: .symtab
sh_addr:      0xa61b5050      sh_flags:   0
sh_size:      0x1f7d0         sh_type:    [ SHT_SYMTAB ]
sh_offset:    0x2e9a7         sh_entsize: 0x10
sh_link:      4               sh_info:    0
sh_addralign: 0x8
Section Header[4]:  sh_name: .strtab
sh_addr:      0xa61d96dc      sh_flags:   [ SHF_STRINGS ]
sh_size:      0x1cd5e         sh_type:    [ SHT_STRTAB ]
sh_offset:    0x4e177         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[5]:  sh_name: .text
sh_addr:      0xfe87e4a0      sh_flags:   [ SHF_ALLOC  SHF_EXECINSTR ]
sh_size:      0x198dc0        sh_type:    [ SHT_NOBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[6]:  sh_name: .data
sh_addr:      0xfec3eba0      sh_flags:   [ SHF_WRITE  SHF_ALLOC ]
sh_size:      0x3e1c0         sh_type:    [ SHT_NOBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[7]:  sh_name: .bss
sh_addr:      0xfed7a5f0      sh_flags:   [ SHF_WRITE  SHF_ALLOC ]
sh_size:      0x7664          sh_type:    [ SHT_NOBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[8]:  sh_name: .info
sh_addr:      0x1             sh_flags:   0
sh_size:      0x4             sh_type:    [ SHT_PROGBITS ]
sh_offset:    0x6aed5         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8
Section Header[9]:  sh_name: .filename
sh_addr:      0xfec3e8e0      sh_flags:   0
sh_size:      0x10            sh_type:    [ SHT_PROGBITS ]
sh_offset:    0x6aed9         sh_entsize: 0
sh_link:      0               sh_info:    0
sh_addralign: 0x8

The string table, symbol table, and CTF data are all complete. You’ll notice that we also have text, data, and bss, but they’re marked SHT_NOBITS (which means they’re not present in the file). We use the section headers to extract the address range of each section, but we can’t actually export the contents for security reasons. Obviously, letting ordinary users see the data section of loaded modules would be a Bad Thing.
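
Here is a rough sketch of how a consumer could recover those address ranges from the section headers alone. It is written in Python against the standard ELF32/ELF64 header layouts, not against libelf, and it is not how libdtrace does it. The function name section_ranges is hypothetical.

```python
import struct

SHT_NULL, SHT_NOBITS = 0, 8

def section_ranges(image):
    """Map section name -> (start, end, in_file) for an ELF image held in bytes."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    order = "<" if image[5] == 1 else ">"        # EI_DATA: 1 means little-endian
    if image[4] == 2:                            # EI_CLASS: 2 means ELFCLASS64
        ehdr_fmt, shdr_fmt = "HHIQQQIHHHHHH", "IIQQQQIIQQ"
    else:                                        # otherwise treat as ELFCLASS32
        ehdr_fmt, shdr_fmt = "HHIIIIIHHHHHH", "IIIIIIIIII"
    # The ELF header fields follow the 16-byte e_ident array.
    ehdr = struct.unpack_from(order + ehdr_fmt, image, 16)
    e_shoff, e_shentsize, e_shnum, e_shstrndx = ehdr[5], ehdr[10], ehdr[11], ehdr[12]
    shdrs = [struct.unpack_from(order + shdr_fmt, image, e_shoff + i * e_shentsize)
             for i in range(e_shnum)]
    names_off = shdrs[e_shstrndx][4]             # sh_offset of the section name table
    ranges = {}
    for sh_name, sh_type, _flags, sh_addr, _off, sh_size, *_rest in shdrs:
        if sh_type == SHT_NULL:
            continue
        start = names_off + sh_name
        name = image[start:image.index(b"\0", start)].decode("ascii")
        # SHT_NOBITS sections have an address range but no bytes in the file.
        ranges[name] = (sh_addr, sh_addr + sh_size, sh_type != SHT_NOBITS)
    return ranges
```

Fed the bytes of /system/object/genunix/object, this should give the .text range as sh_addr through sh_addr + sh_size, with the last tuple element False because the section is SHT_NOBITS.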

To end in typical “Behind the Music” fashion – After a nightmare descent into drug and alcohol abuse, objfs once again was able to take control of its life (thanks mostly to a loving relationship with libdtrace), and now lives a relaxing life on a Montana ranch.

2 Responses

Jonathan Adams says:

September 28, 2004 at 11:43 pm

> Opening /dev/kmem requires all privileges, …
Well, opening it for writing requires all privileges, but opening it for reading (which is what dtrace needed) only needs membership in group ‘sys’… And you get the side-benefit of being able to run mdb -k without su(1M)ing first.
But /system/object is a much cleaner and better answer, I must say…

Eric Schrock says:

September 28, 2004 at 11:50 pm

Doh! Good point – I’ve been spending too much time with privileges and not enough with good old fashioned groups. I’ve updated the post.

Eric Schrock's Blog
