DTrace Safety

DTrace is a big piece of technology, and it can be easy to lose the principles in the details. But understanding these principles is key to understanding the design decisions that we have made — and to understanding the design decisions that we will make in the future. Of these principles, the most fundamental is the principle of safety: DTrace must not be able to accidentally induce system failure. It is our strict adherence to this principle that allows DTrace to be used with confidence on production systems — and it is its use on production systems that most fundamentally separates DTrace from what has come before it.

Of course, it’s easy to say that this should be true, but what does the safety constraint mean? First and foremost, given that DTrace allows for dynamic instrumentation, this means that the user must not be allowed to instrument code and contexts that are unsafe to instrument. In any sufficiently dynamic instrumentation framework, such code and contexts exist (if nothing else, the framework itself cannot be instrumented without inducing recursion), and this must be dealt with architecturally to assure safety. We have designed DTrace such that probes are provided by instrumentation providers that guarantee their safety. That is, instead of the user picking some random point to instrument, instrumentation providers make available only the points that can be safely instrumented — and the user is restricted to selecting among these published probes. This puts the responsibility for instrumentation safety where it belongs: in the provider. The specific techniques that the providers use to assure safety are a bit too arcane to discuss here,[1] but suffice it to say that the providers are very conservative in their instrumentation.
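The provider model described above can be sketched in a few lines of C. This is only an illustration, not the DTrace provider interface: the names `probe`, `published`, `probe_lookup` and `probe_enable`, and the probe names themselves, are all hypothetical. The point it tries to show is that the consumer can only name a probe the provider has already published; there is no call that accepts an arbitrary address.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the DTrace provider API. */
struct probe {
    const char *name;   /* identifier the provider chose to publish */
    int enabled;        /* nonzero once a consumer has enabled it */
};

/*
 * The provider alone decides what is instrumentable: only points it has
 * vetted as safe appear in this table. The entries are made up.
 */
static struct probe published[] = {
    { "entry:read",  0 },
    { "return:read", 0 },
    { "entry:write", 0 },
};

#define NPUBLISHED (sizeof (published) / sizeof (published[0]))

/* Look up a probe by name; NULL means the provider never published it. */
struct probe *
probe_lookup(const char *name)
{
    for (size_t i = 0; i < NPUBLISHED; i++) {
        if (strcmp(published[i].name, name) == 0)
            return (&published[i]);
    }
    return (NULL);
}

/*
 * Enable a probe by name. Returns 0 on success, -1 if the name was never
 * published. Deliberately, no interface here takes a raw address.
 */
int
probe_enable(const char *name)
{
    struct probe *p = probe_lookup(name);

    if (p == NULL)
        return (-1);
    p->enabled = 1;
    return (0);
}
```

In this sketch, a request for a point the provider never published (say, something inside the framework itself) simply fails the lookup, so the safety decision stays with the provider rather than the user.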

This addresses one aspect of instrumentation safety (instrumenting wholly unsafe contexts), but it doesn’t address the recursion issue, where the code required to process a probe (a context that we call probe context) ends up itself encountering an enabled probe. This kind of recursion can be dealt with in one of two ways: lazily (that is, the recursion can be detected when it happens, and processing of the probe that induced the recursion can be aborted) or proactively (the system can be designed such that recursion is impossible). For a myriad of reasons, we elected the second approach: to make recursion architecturally impossible. We achieve this by mandating that while in probe context, DTrace itself must not call into any facilities in the kernel at large. This covers both explicit and implicit transfers of control into the kernel at large: just as DTrace must avoid (for example) allocating memory in probe context, it must also avoid inducing scheduler activity by blocking.[2]
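To make the "no calls into the kernel at large" rule concrete, here is a minimal C sketch of what recording data under that constraint might look like. It is an assumption-laden illustration rather than DTrace's actual buffer code: `trace_buf`, `trace_record` and `TRACE_BUFSIZE` are hypothetical names, and the buffer size is arbitrary. The storage is set aside ahead of time, and when it runs out the record is counted as dropped instead of allocating or waiting.

```c
#include <stddef.h>

#define TRACE_BUFSIZE 64   /* illustrative size, chosen arbitrarily */

/* Hypothetical trace buffer; not DTrace's actual layout. */
struct trace_buf {
    unsigned char data[TRACE_BUFSIZE]; /* storage set aside before enabling */
    size_t used;                       /* bytes consumed so far */
    unsigned long drops;               /* records discarded for lack of space */
};

/*
 * Record len bytes from (simulated) probe context. Returns 1 if the record
 * was stored, 0 if it was dropped. The function touches only memory it
 * already owns: it never allocates, never takes a lock, and never waits
 * for space to appear.
 */
int
trace_record(struct trace_buf *b, const void *rec, size_t len)
{
    const unsigned char *src = rec;
    unsigned char *dst;

    if (len > TRACE_BUFSIZE - b->used) {
        b->drops++;     /* no room: count the loss and carry on */
        return (0);
    }

    /*
     * Copy by hand rather than through a library routine, since in this
     * sketch even a copy routine counts as an outside facility.
     */
    dst = b->data + b->used;
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    b->used += len;
    return (1);
}
```

The design choice worth noticing is the failure mode: when the buffer is full, the sketch loses data and says so through `drops`, rather than blocking or asking an allocator for more memory.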

Once the fundamental safety issues of instrumentation are addressed, focus turns to the safety of user-specified actions and predicates. Very early in our thinking on DTrace, we knew that we wanted actions and predicates to be completely programmable, giving rise to a natural question: how are they executed? For us, the answer was so clear that it was almost unspoken: we knew that we needed to develop a virtual machine that could act as a target instruction set for a custom compiler. Why was this the clear choice? Because the alternative, executing user-specified code natively in the kernel, is untenable from a safety perspective. Executing user-specified code natively in the kernel is untenable for many reasons:

We left these many drawbacks of native execution largely unspoken because the alternative, a purpose-built virtual machine for executing user-specified code, was so clearly the better choice. The virtual machine that we designed, the D Intermediate Format (DIF) virtual machine, has the following safety properties:
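As a rough illustration of how a restricted virtual machine can make user-supplied code safe to run, the C sketch below implements a toy interpreter. It is not DIF: the instruction set, the names (`insn`, `vm_validate`, `vm_run`, the `VM_*` codes) and the specific restrictions are assumptions chosen for the example. The sketch validates a program before running it, accepts only branches that move strictly forward (so every program terminates), offers no store instruction at all, and reports a division by zero as an error instead of faulting.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy instruction set; hypothetical, and far smaller than real DIF. */
enum op { OP_SETI, OP_ADD, OP_DIV, OP_BEQZ, OP_RET };

struct insn {
    enum op op;
    int rd, r1, r2;   /* register indices (rd is the destination) */
    uint64_t imm;     /* immediate for OP_SETI, target for OP_BEQZ */
};

#define NREGS      4
#define VM_OK      0
#define VM_BADPROG 1
#define VM_DIVZERO 2

static int
reg_ok(int r)
{
    return (r >= 0 && r < NREGS);
}

/*
 * Validate before execution: reject unknown opcodes, bad registers, any
 * branch that does not move strictly forward, and any program that does
 * not end in OP_RET.
 */
int
vm_validate(const struct insn *p, size_t n)
{
    if (n == 0 || p[n - 1].op != OP_RET)
        return (VM_BADPROG);
    for (size_t pc = 0; pc < n; pc++) {
        if ((unsigned)p[pc].op > OP_RET)
            return (VM_BADPROG);
        if (!reg_ok(p[pc].rd) || !reg_ok(p[pc].r1) || !reg_ok(p[pc].r2))
            return (VM_BADPROG);
        if (p[pc].op == OP_BEQZ && (p[pc].imm <= pc || p[pc].imm >= n))
            return (VM_BADPROG);
    }
    return (VM_OK);
}

/*
 * Run a program. Because branches only go forward, the loop executes at
 * most n instructions. Registers are unsigned, so addition wraps instead
 * of overflowing.
 */
int
vm_run(const struct insn *p, size_t n, uint64_t *result)
{
    uint64_t r[NREGS] = { 0 };
    size_t pc = 0;

    if (vm_validate(p, n) != VM_OK)
        return (VM_BADPROG);
    while (pc < n) {
        const struct insn *i = &p[pc++];

        switch (i->op) {
        case OP_SETI:
            r[i->rd] = i->imm;
            break;
        case OP_ADD:
            r[i->rd] = r[i->r1] + r[i->r2];
            break;
        case OP_DIV:
            if (r[i->r2] == 0)
                return (VM_DIVZERO);   /* flag the error; do not fault */
            r[i->rd] = r[i->r1] / r[i->r2];
            break;
        case OP_BEQZ:
            if (r[i->r1] == 0)
                pc = (size_t)i->imm;
            break;
        case OP_RET:
            *result = r[i->rd];
            return (VM_OK);
        }
    }
    return (VM_BADPROG);
}
```

In this toy, a program that branches to itself is rejected by `vm_validate` before it ever runs, and a divide by zero surfaces as `VM_DIVZERO` to the caller. Both are stand-ins for the general idea that an appropriately restricted instruction set turns whole classes of failure into checks.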

Just having an appropriately restricted virtual machine addressed many safety issues, but several niggling safety issues still had to be dealt with explicitly:

So DTrace is not safe by accident: DTrace is safe by deliberate design and by careful execution. DTrace’s safety comes from a probe discovery process that assures safe instrumentation, a purpose-built virtual machine that assures safe execution, and a careful implementation that assures safe exception handling. Could a safe system for dynamic instrumentation be built with a different set of design decisions? Perhaps, but we believe that were such a system to be as safe, it would be either so under-powered or so over-complicated as to invalidate those design decisions.

[1] The best place to see this provider-based safety is in the implementation of the FBT provider (x86, SPARC) and in the implementation of the pid provider (x86, SPARC).

[2] While it isn’t a safety issue per se, this led us to two other important design decisions: probe context is lock-free (and almost always wait-free), and interrupts are disabled in probe context.

Posted on July 19, 2005 at 9:55 pm by bmc
In: Solaris