About a month ago, I added a new pseudo filesystem to Solaris: ‘objfs’, mounted at /system/object. This is an “under the hood” filesystem that no user will interact with directly. But it’s a novel solution to a particularly thorny problem for DTrace, and may be of some interest to the curious techies out there.
When DTrace was first integrated into Solaris, it had a few hacks to get around the problem of accessing kernel module data from userland. In particular, it opened /dev/kmem in order to find the address ranges of module text and data segments, and introduced a private modctl call in order to extract CTF and symbol table information. The end result was something that mostly worked, but with a few drawbacks. Opening /dev/kmem requires all privileges or membership in group sys, so even a user who had been given the dtrace_kernel privilege was still unable to access kernel CTF and symbol information. Direct access via modctl necessitated a complicated (and sometimes broken) dance to allow 32-bit dtrace apps to work on a 64-bit kernel.
The solution was to create a new pseudo filesystem which would export information about the currently loaded objects in the kernel as standard ELF files. Choosing a pseudo filesystem over a new system call or modctl has several advantages:
- Filesystems are great for exporting hierarchical data. We needed to export a collection of named data – perfect for a directory layout.
- By modeling objects as directories and choosing an ELF file format, we have left room for expansion without having to go back and modify existing implementations.
- We can leverage our existing toolset for working with ELF files: elfdump(1), nm(1), libelf(3LIB), and libctf. The userland changes to libdtrace(3LIB) were minimal because we already have established interfaces for working with ELF files.
- Filesystems are easily virtualized in a local zone. DTrace is still not usable from within a local zone for a few small reasons, but we’re one step closer.
- There are no data model issues. We simply export a 64-bit ELF object, and the gelf_xxx() routines handle the conversion transparently.
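To make the data model point concrete, here is a minimal sketch of the kind of widening the gelf routines perform: a section header is always handed back in 64-bit form, whatever the class of the file it came from. This is not the libelf implementation. The `shdr32_t` and `shdr64_t` types and the `shdr_widen()` function are hypothetical names defined locally so the example builds without libelf; real code would simply call `gelf_getshdr()`. The sketch also assumes the file's byte order matches the host's.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirrors of the ELF32 and ELF64 section header layouts. */
typedef struct {
	uint32_t sh_name, sh_type, sh_flags, sh_addr, sh_offset;
	uint32_t sh_size, sh_link, sh_info, sh_addralign, sh_entsize;
} shdr32_t;

typedef struct {
	uint32_t sh_name, sh_type;
	uint64_t sh_flags, sh_addr, sh_offset, sh_size;
	uint32_t sh_link, sh_info;
	uint64_t sh_addralign, sh_entsize;
} shdr64_t;

/* Same numeric values as ELFCLASS32 and ELFCLASS64 in the ELF spec. */
enum { CLASS32 = 1, CLASS64 = 2 };

/*
 * Copy one raw section header into the 64-bit form, widening each field
 * when the source is a 32-bit header. Assumes host byte order.
 * Returns 0 on success, -1 if the class is not recognized.
 */
int
shdr_widen(int elfclass, const void *raw, shdr64_t *out)
{
	if (elfclass == CLASS64) {
		memcpy(out, raw, sizeof (*out));
		return (0);
	}
	if (elfclass == CLASS32) {
		shdr32_t s;

		memcpy(&s, raw, sizeof (s));
		out->sh_name = s.sh_name;
		out->sh_type = s.sh_type;
		out->sh_flags = s.sh_flags;
		out->sh_addr = s.sh_addr;
		out->sh_offset = s.sh_offset;
		out->sh_size = s.sh_size;
		out->sh_link = s.sh_link;
		out->sh_info = s.sh_info;
		out->sh_addralign = s.sh_addralign;
		out->sh_entsize = s.sh_entsize;
		return (0);
	}
	return (-1);
}
```

A consumer written against the 64-bit form never needs to know which class the kernel exported, which is why a 32-bit dtrace binary can read a 64-bit kernel's objects without special cases.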
The final result is:
$ elfdump -c /system/object/genunix/object

Section Header[1]:  sh_name: .shstrtab
    sh_addr:      0xa6eaea30      sh_flags:   [ SHF_STRINGS ]
    sh_size:      0x46            sh_type:    [ SHT_STRTAB ]
    sh_offset:    0x1c4           sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[2]:  sh_name: .SUNW_ctf
    sh_addr:      0xa61f7000      sh_flags:   0
    sh_size:      0x2e79d         sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0x20a           sh_entsize: 0
    sh_link:      3               sh_info:    0
    sh_addralign: 0x8

Section Header[3]:  sh_name: .symtab
    sh_addr:      0xa61b5050      sh_flags:   0
    sh_size:      0x1f7d0         sh_type:    [ SHT_SYMTAB ]
    sh_offset:    0x2e9a7         sh_entsize: 0x10
    sh_link:      4               sh_info:    0
    sh_addralign: 0x8

Section Header[4]:  sh_name: .strtab
    sh_addr:      0xa61d96dc      sh_flags:   [ SHF_STRINGS ]
    sh_size:      0x1cd5e         sh_type:    [ SHT_STRTAB ]
    sh_offset:    0x4e177         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[5]:  sh_name: .text
    sh_addr:      0xfe87e4a0      sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
    sh_size:      0x198dc0        sh_type:    [ SHT_NOBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[6]:  sh_name: .data
    sh_addr:      0xfec3eba0      sh_flags:   [ SHF_WRITE SHF_ALLOC ]
    sh_size:      0x3e1c0         sh_type:    [ SHT_NOBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[7]:  sh_name: .bss
    sh_addr:      0xfed7a5f0      sh_flags:   [ SHF_WRITE SHF_ALLOC ]
    sh_size:      0x7664          sh_type:    [ SHT_NOBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[8]:  sh_name: .info
    sh_addr:      0x1             sh_flags:   0
    sh_size:      0x4             sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0x6aed5         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8

Section Header[9]:  sh_name: .filename
    sh_addr:      0xfec3e8e0      sh_flags:   0
    sh_size:      0x10            sh_type:    [ SHT_PROGBITS ]
    sh_offset:    0x6aed9         sh_entsize: 0
    sh_link:      0               sh_info:    0
    sh_addralign: 0x8
The string table, symbol table, and CTF data are all complete. You’ll notice that we also have text, data, and bss, but they’re marked SHT_NOBITS (which means they’re not present in the file). We use the section headers to extract information about the address range for each section, but we can’t actually export the contents for security reasons. Obviously, letting ordinary users see the data section of loaded modules would be a Bad Thing.
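As a rough illustration of how a consumer can recover a module's address ranges from headers alone, the sketch below looks a section up by name and returns its start and end addresses. It never touches section contents, so it works equally well for SHT_NOBITS sections. The `sec_t` type and both functions are hypothetical, trimmed-down stand-ins invented for this example; they are not libdtrace or libelf interfaces, and real code would fill in the same fields from `gelf_getshdr()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	SEC_NOBITS	8	/* numeric value of SHT_NOBITS in the ELF spec */

/* Hypothetical, trimmed-down view of an already decoded section header. */
typedef struct {
	uint32_t name_off;	/* offset of the section name in .shstrtab */
	uint32_t type;		/* sh_type */
	uint64_t addr;		/* sh_addr: address in the running kernel */
	uint64_t size;		/* sh_size */
} sec_t;

/* A NOBITS section has a header but no bytes stored in the file. */
int
sec_has_file_data(const sec_t *sec)
{
	return (sec->type != SEC_NOBITS);
}

/*
 * Look up the section called `name` and report its address range as
 * [*start, *end). Only the header is consulted.
 * Returns 0 if the section was found, -1 otherwise.
 */
int
sec_range(const sec_t *secs, size_t nsecs, const char *shstrtab,
    const char *name, uint64_t *start, uint64_t *end)
{
	for (size_t i = 0; i < nsecs; i++) {
		if (strcmp(shstrtab + secs[i].name_off, name) == 0) {
			*start = secs[i].addr;
			*end = secs[i].addr + secs[i].size;
			return (0);
		}
	}
	return (-1);
}
```

Fed the .text header from the genunix dump above (sh_addr 0xfe87e4a0, sh_size 0x198dc0), `sec_range()` would report the range of kernel text without a single byte of that text being readable through the file.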
To end in typical “Behind the Music” fashion: after a nightmare descent into drug and alcohol abuse, objfs was once again able to take control of its life (thanks mostly to a loving relationship with libdtrace), and now lives a relaxing life on a Montana ranch.
2 Responses
> Opening /dev/kmem requires all privileges, …
Well, opening it for writing requires all privileges, but opening it for reading (which is what dtrace needed) only needs membership in group ‘sys’… And you get the side-benefit of being able to run mdb -k without su(1M)ing first.
But /system/object is a much cleaner and better answer, I must say…
Doh! Good point – I’ve been spending too much time with privileges and not enough with good old fashioned groups. I’ve updated the post.