sched Provider
The sched provider makes available probes related to CPU scheduling. Because CPUs are the one resource that all threads must consume, the sched provider is very useful for understanding systemic behavior. For example, using the sched provider, you can understand when and why threads sleep, run, change priority, or wake other threads.
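If you are exploring the provider for the first time, a quick way to get oriented is to count firings of every sched probe by probe name. The following one-liner is a minimal sketch; the probe names and counts you see will vary with your workload:

# dtrace -n 'sched::: {@[probename] = count()}'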
26.1. Probes
The sched probes are described in sched Probes.
Probe | Description
---|---
change-pri | Probe that fires whenever a thread's priority is about to be changed. The lwpsinfo_t of the thread is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The thread's new priority is contained in args[2].
dequeue | Probe that fires immediately before a runnable thread is dequeued from a run queue. The lwpsinfo_t of the thread being dequeued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU associated with the run queue is pointed to by args[2]; if the run queue is not associated with a particular CPU, the cpu_id member of that cpuinfo_t is -1.
enqueue | Probe that fires immediately before a runnable thread is enqueued to a run queue. The lwpsinfo_t of the thread being enqueued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU associated with the run queue is pointed to by args[2]. args[3] is a boolean value indicating whether the thread will be enqueued at the front of the run queue.
off-cpu | Probe that fires when the current CPU is about to end execution of a thread. The curcpu variable indicates the current CPU; the curlwpsinfo variable indicates the thread that is ending execution. The lwpsinfo_t of the thread that the current CPU will next execute is pointed to by args[0]; the psinfo_t of the process containing that thread is pointed to by args[1].
on-cpu | Probe that fires when a CPU has just begun execution of a thread. The curcpu variable indicates the current CPU; the curlwpsinfo and curpsinfo variables describe the thread beginning execution.
preempt | Probe that fires immediately before the current thread is preempted. After this probe fires, the current thread will select a thread to run and the off-cpu probe will fire for the current thread.
remain-cpu | Probe that fires when a scheduling decision has been made, but the dispatcher has elected to continue to run the current thread. The curcpu, curlwpsinfo, and curpsinfo variables describe the thread that continues to run.
schedctl-nopreempt | Probe that fires when a thread is preempted and then re-enqueued at the front of the run queue due to a preemption control request. See schedctl_init(3C) for details on preemption control. As with preempt, the off-cpu probe will fire after schedctl-nopreempt. The lwpsinfo_t and psinfo_t of the re-enqueued thread are pointed to by args[0] and args[1].
schedctl-preempt | Probe that fires when a thread that is using preemption control is nonetheless preempted and re-enqueued at the back of the run queue. See schedctl_init(3C) for details on preemption control. As with preempt, the off-cpu probe will fire after schedctl-preempt. The lwpsinfo_t and psinfo_t of the re-enqueued thread are pointed to by args[0] and args[1].
schedctl-yield | Probe that fires when a thread that had preemption control enabled and its time slice artificially extended executes code to yield the CPU to other threads.
sleep | Probe that fires immediately before the current thread sleeps on a synchronization object. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t pointed to by curlwpsinfo.
surrender | Probe that fires when a CPU has been instructed by another CPU to make a scheduling decision, often because a higher-priority thread has become runnable.
tick | Probe that fires as a part of clock tick-based accounting. In clock tick-based accounting, CPU accounting is performed by examining which threads and processes are running when a fixed-interval interrupt fires. The lwpsinfo_t of the thread being assigned CPU time is pointed to by args[0]; the psinfo_t of the process containing the thread is pointed to by args[1].
wakeup | Probe that fires immediately before the current thread wakes a thread sleeping on a synchronization object. The lwpsinfo_t of the sleeping thread is pointed to by args[0]; the psinfo_t of the process containing the sleeping thread is pointed to by args[1]. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t of the sleeping thread.
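Most of these probes are demonstrated in the examples later in this chapter; the surrender probe is not, so the following minimal sketch shows how a probe from this table can be enabled directly, counting surrender events by CPU:

# dtrace -n 'sched:::surrender {@[cpu] = count()}'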
26.2. Arguments
The argument types for the sched probes are listed in sched Probe Arguments; the arguments are described in sched Probes.
Probe | args[0] | args[1] | args[2] | args[3]
---|---|---|---|---
change-pri | lwpsinfo_t * | psinfo_t * | pri_t | —
dequeue | lwpsinfo_t * | psinfo_t * | cpuinfo_t * | —
enqueue | lwpsinfo_t * | psinfo_t * | cpuinfo_t * | int
off-cpu | lwpsinfo_t * | psinfo_t * | — | —
on-cpu | — | — | — | —
preempt | — | — | — | —
remain-cpu | — | — | — | —
schedctl-nopreempt | lwpsinfo_t * | psinfo_t * | — | —
schedctl-preempt | lwpsinfo_t * | psinfo_t * | — | —
schedctl-yield | lwpsinfo_t * | psinfo_t * | — | —
sleep | — | — | — | —
surrender | lwpsinfo_t * | psinfo_t * | — | —
tick | lwpsinfo_t * | psinfo_t * | — | —
wakeup | lwpsinfo_t * | psinfo_t * | — | —
As sched Probe Arguments indicates, many sched probes have arguments consisting of a pointer to an lwpsinfo_t and a pointer to a psinfo_t, indicating a thread and the process containing the thread, respectively. These structures are described in detail in lwpsinfo_t and psinfo_t, respectively.
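As a brief illustration of these arguments, the following sketch counts off-cpu events by the process name and thread ID of the thread that the CPU will run next; stringof() is required because pr_fname is stored as a character array:

sched:::off-cpu
{
    @[stringof(args[1]->pr_fname), args[0]->pr_lwpid] = count();
}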
26.2.1. cpuinfo_t
The cpuinfo_t structure defines a CPU. As sched Probe Arguments indicates, arguments to both the enqueue and dequeue probes include a pointer to a cpuinfo_t. Additionally, the cpuinfo_t corresponding to the current CPU is pointed to by the curcpu variable. The definition of the cpuinfo_t structure is as follows:
typedef struct cpuinfo {
processorid_t cpu_id; /* CPU identifier */
psetid_t cpu_pset; /* processor set identifier */
chipid_t cpu_chip; /* chip identifier */
lgrp_id_t cpu_lgrp; /* locality group identifier */
processor_info_t cpu_info; /* CPU information */
} cpuinfo_t;
The cpu_id member is the processor identifier, as returned by psrinfo(1M) and p_online(2).
The cpu_pset member is the processor set that contains the CPU, if any. See psrset(1M) for more details on processor sets.
The cpu_chip member is the identifier of the physical chip. Physical chips may contain several CPUs. See psrinfo(1M) for more information.
The cpu_lgrp member is the identifier of the latency group associated with the CPU. See liblgrp(3LIB) for details on latency groups.
The cpu_info member is the processor_info_t structure associated with the CPU, as returned by processor_info(2).
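To see these members in use, the following sketch aggregates on-cpu events by latency group and CPU identifier, giving a rough picture of where work is landing relative to the system's locality groups:

sched:::on-cpu
{
    @[curcpu->cpu_lgrp, curcpu->cpu_id] = count();
}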
26.3. Examples
26.3.1. on-cpu and off-cpu
One common question you might want answered is which CPUs are running threads and for how long. You can use the on-cpu and off-cpu probes to easily answer this question on a system-wide basis, as shown in the following example:
sched:::on-cpu
{
self->ts = timestamp;
}
sched:::off-cpu
/self->ts/
{
@[cpu] = quantize(timestamp - self->ts);
self->ts = 0;
}
Running the above script results in output similar to the following example:
# dtrace -s ./where.d
dtrace: script './where.d' matched 5 probes
^C

  0
           value  ------------- Distribution ------------- count
            2048 |  0
            4096 |@@  37
            8192 |@@@@@@@@@@@@@  212
           16384 |@  30
           32768 |  10
           65536 |@  17
          131072 |  12
          262144 |  9
          524288 |  6
         1048576 |  5
         2097152 |  1
         4194304 |  3
         8388608 |@@@@  75
        16777216 |@@@@@@@@@@@@  201
        33554432 |  6
        67108864 |  0

  1
           value  ------------- Distribution ------------- count
            2048 |  0
            4096 |@  6
            8192 |@@@@  23
           16384 |@@@  18
           32768 |@@@@  22
           65536 |@@@@  22
          131072 |@  7
          262144 |  5
          524288 |  2
         1048576 |  3
         2097152 |@  9
         4194304 |  4
         8388608 |@@@  18
        16777216 |@@@  19
        33554432 |@@@  16
        67108864 |@@@@  21
       134217728 |@@  14
       268435456 |  0
The above output shows that on CPU 1 threads tend to run either for less than 100 microseconds at a stretch or for approximately 10 milliseconds; a noticeable gap between the two clusters of data is visible in the histogram. You also might be interested in knowing which CPUs are running a particular process. You can use the on-cpu and off-cpu probes for answering this question as well. The following script displays which CPUs run a specified application over a period of ten seconds:
#pragma D option quiet
dtrace:::BEGIN
{
start = timestamp;
}
sched:::on-cpu
/execname == $$1/
{
self->ts = timestamp;
}
sched:::off-cpu
/self->ts/
{
@[cpu] = sum(timestamp - self->ts);
self->ts = 0;
}
profile:::tick-1sec
/++x == 10/
{
exit(0);
}
dtrace:::END
{
printf("CPU distribution of imapd over %d seconds:\n\n",
(timestamp - start) / 1000000000);
printf("CPU microseconds\n--- ------------\n");
normalize(@, 1000);
printa("%3d %@d\n", @);
}
Running the above script on a large mail server and specifying the IMAP daemon results in output similar to the following example:
# dtrace -s ./whererun.d imapd
CPU distribution of imapd over 10 seconds:

CPU microseconds
--- ------------
 15        10102
 12        16377
 21        25317
 19        25504
 17        35653
 13        41539
 14        46669
 20        57753
 22        70088
 16       115860
 23       127775
 18       160517
illumos takes into account the amount of time that a thread has been sleeping when selecting a CPU on which to run the thread: a thread that has been sleeping for less time tends not to migrate. You can use the off-cpu and on-cpu probes to observe this behavior:
sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
self->cpu = cpu;
self->ts = timestamp;
}
sched:::on-cpu
/self->ts/
{
@[self->cpu == cpu ?
"sleep time, no CPU migration" : "sleep time, CPU migration"] =
lquantize((timestamp - self->ts) / 1000000, 0, 500, 25);
self->ts = 0;
self->cpu = 0;
}
Running the above script for approximately 30 seconds results in output similar to the following example:
# dtrace -s ./howlong.d
dtrace: script './howlong.d' matched 5 probes
^C

  sleep time, CPU migration
           value  -------------- Distribution ------------ count
             < 0 |  0
               0 |@@@@@@@  6838
              25 |@@@@@  4714
              50 |@@@  3108
              75 |@  1304
             100 |@  1557
             125 |@  1425
             150 |  894
             175 |@  1526
             200 |@@  2010
             225 |@@  1933
             250 |@@  1982
             275 |@@  2051
             300 |@@  2021
             325 |@  1708
             350 |@  1113
             375 |  502
             400 |  220
             425 |  106
             450 |  54
             475 |  40
          >= 500 |@  1716

  sleep time, no CPU migration
           value  -------------- Distribution ------------ count
             < 0 |  0
               0 |@@@@@@@@@@@@  58413
              25 |@@@  14793
              50 |@@  10050
              75 |  3858
             100 |@  6242
             125 |@  6555
             150 |  3980
             175 |@  5987
             200 |@  9024
             225 |@  9070
             250 |@@  10745
             275 |@@  11898
             300 |@@  11704
             325 |@@  10846
             350 |@  6962
             375 |  3292
             400 |  1713
             425 |  585
             450 |  201
             475 |  96
          >= 500 |  3946
The example output shows that there are many more occurrences of non-migration than migration. Also, when sleep times are longer, migrations are more likely. The distributions are noticeably different in the sub-100 millisecond range, but look very similar as the sleep times get longer. This result would seem to indicate that sleep time is not factored into the scheduling decision once a certain threshold is exceeded.
The final example using off-cpu and on-cpu shows how to use these probes along with the pr_stype field to determine why threads sleep and for how long:
sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
/*
* We're sleeping. Track our sobj type.
*/
self->sobj = curlwpsinfo->pr_stype;
self->bedtime = timestamp;
}
sched:::off-cpu
/curlwpsinfo->pr_state == SRUN/
{
self->bedtime = timestamp;
}
sched:::on-cpu
/self->bedtime && !self->sobj/
{
@["preempted"] = quantize(timestamp - self->bedtime);
self->bedtime = 0;
}
sched:::on-cpu
/self->sobj/
{
@[self->sobj == SOBJ_MUTEX ? "kernel-level lock" :
self->sobj == SOBJ_RWLOCK ? "rwlock" :
self->sobj == SOBJ_CV ? "condition variable" :
self->sobj == SOBJ_SEMA ? "semaphore" :
self->sobj == SOBJ_USER ? "user-level lock" :
self->sobj == SOBJ_USER_PI ? "user-level prio-inheriting lock" :
self->sobj == SOBJ_SHUTTLE ? "shuttle" : "unknown"] =
quantize(timestamp - self->bedtime);
self->sobj = 0;
self->bedtime = 0;
}
Running the above script for several seconds results in output similar to the following example:
# dtrace -s ./whatfor.d
dtrace: script './whatfor.d' matched 12 probes
^C

  kernel-level lock
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |@@@@@@@@  3
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  11
          131072 |@@  1
          262144 |  0

  preempted
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |  4
           65536 |@@@@@@@@  408
          131072 |@@@@@@@@@@@@@@@@@@@@@@  1031
          262144 |@@@  156
          524288 |@@  116
         1048576 |@  51
         2097152 |  42
         4194304 |  16
         8388608 |  15
        16777216 |  4
        33554432 |  8
        67108864 |  0

  semaphore
           value  -------------- Distribution ------------ count
           32768 |  0
           65536 |@@  61
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@  553
          262144 |@@  63
          524288 |@  36
         1048576 |  7
         2097152 |  22
         4194304 |@  44
         8388608 |@@@  84
        16777216 |@  36
        33554432 |  3
        67108864 |  6
       134217728 |  0
       268435456 |  0
       536870912 |  0
      1073741824 |  0
      2147483648 |  0
      4294967296 |  0
      8589934592 |  0
     17179869184 |  1
     34359738368 |  0

  shuttle
           value  -------------- Distribution ------------ count
           32768 |  0
           65536 |@@@@@  2
          131072 |@@@@@@@@@@@@@@@@  6
          262144 |@@@@@  2
          524288 |  0
         1048576 |  0
         2097152 |  0
         4194304 |@@@@@  2
         8388608 |  0
        16777216 |  0
        33554432 |  0
        67108864 |  0
       134217728 |  0
       268435456 |  0
       536870912 |  0
      1073741824 |  0
      2147483648 |  0
      4294967296 |@@@@@  2
      8589934592 |  0
     17179869184 |@@  1
     34359738368 |  0

  condition variable
           value  -------------- Distribution ------------ count
           32768 |  0
           65536 |  122
          131072 |@@@@@  1579
          262144 |@  340
          524288 |  268
         1048576 |@@@  1028
         2097152 |@@@  1007
         4194304 |@@@  1176
         8388608 |@@@@  1257
        16777216 |@@@@@@@@@@@@@@  4385
        33554432 |  295
        67108864 |  157
       134217728 |  96
       268435456 |  48
       536870912 |  144
      1073741824 |  10
      2147483648 |  22
      4294967296 |  18
      8589934592 |  5
     17179869184 |  6
     34359738368 |  4
     68719476736 |  0
26.3.2. enqueue and dequeue
When a CPU becomes idle, the dispatcher looks for work enqueued on other (non-idle) CPUs. The following example uses the dequeue probe to understand how often applications are transferred and by which CPU:
#pragma D option quiet
sched:::dequeue
/args[2]->cpu_id != -1 && cpu != args[2]->cpu_id &&
(curlwpsinfo->pr_flag & PR_IDLE)/
{
@[stringof(args[1]->pr_fname), args[2]->cpu_id] =
lquantize(cpu, 0, 100);
}
END
{
printa("%s stolen from CPU %d by:\n%@d\n", @);
}
On a 4-CPU system, the tail of the output from the above script resembles the following example:
# dtrace -s ./whosteal.d
^C
...

  nscd stolen from CPU 1 by:
           value  -------------- Distribution ------------ count
               1 |  0
               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  28
               3 |  0

  snmpd stolen from CPU 1 by:
           value  -------------- Distribution ------------ count
             < 0 |  0
               0 |@  1
               1 |  0
               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  31
               3 |@@  2
               4 |  0

  sched stolen from CPU 1 by:
           value  -------------- Distribution ------------ count
             < 0 |  0
               0 |@@  3
               1 |  0
               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  36
               3 |@@@@  5
               4 |  0
Instead of knowing which CPUs took which work, you might want to know the CPUs on which processes and threads are waiting to run. You can use the enqueue and dequeue probes together to answer this question:
sched:::enqueue
{
self->ts = timestamp;
}
sched:::dequeue
/self->ts/
{
@[args[2]->cpu_id] = quantize(timestamp - self->ts);
self->ts = 0;
}
Running the above script for several seconds results in output similar to the following example:
# dtrace -s ./qtime.d
dtrace: script './qtime.d' matched 5 probes
^C

  -1
           value  -------------- Distribution ------------ count
            4096 |  0
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2
           16384 |  0

  0
           value  -------------- Distribution ------------ count
            1024 |  0
            2048 |@@@@@@@@@@@@@@@  262
            4096 |@@@@@@@@@@@@@  227
            8192 |@@@@@  87
           16384 |@@@  54
           32768 |  7
           65536 |  9
          131072 |  1
          262144 |  5
          524288 |  4
         1048576 |  2
         2097152 |  0
         4194304 |  0
         8388608 |  0
        16777216 |  1
        33554432 |  2
        67108864 |  2
       134217728 |  0
       268435456 |  0
       536870912 |  0
      1073741824 |  1
      2147483648 |  1
      4294967296 |  0

  1
           value  -------------- Distribution ------------ count
            1024 |  0
            2048 |@@@@  49
            4096 |@@@@@@@@@@@@@@@@@@@@  241
            8192 |@@@@@@@  91
           16384 |@@@@  55
           32768 |  7
           65536 |  3
          131072 |  2
          262144 |  1
          524288 |  0
         1048576 |  0
         2097152 |  0
         4194304 |  0
         8388608 |  0
        16777216 |  0
        33554432 |  3
        67108864 |  1
       134217728 |  4
       268435456 |  2
       536870912 |  0
      1073741824 |  3
      2147483648 |  2
      4294967296 |  0
Notice the non-zero values at the bottom of the example output. These data points reveal several instances on both CPUs where a thread sat enqueued for several seconds before it was dequeued to run.
Instead of looking at wait times, you might want to examine the length of the run queue over time. Using the enqueue and dequeue probes, you can set up an associative array to track the queue length:
sched:::enqueue
{
this->len = qlen[args[2]->cpu_id]++;
@[args[2]->cpu_id] = lquantize(this->len, 0, 100);
}
sched:::dequeue
/qlen[args[2]->cpu_id]/
{
qlen[args[2]->cpu_id]--;
}
Running the above script for approximately 30 seconds on a largely idle uniprocessor laptop system results in output similar to the following example:
# dtrace -s ./qlen.d
dtrace: script './qlen.d' matched 5 probes
^C

  0
           value  -------------- Distribution ------------ count
             < 0 |  0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@  110626
               1 |@@@@@@@@@  41142
               2 |@@  12655
               3 |@  5074
               4 |  1722
               5 |  701
               6 |  302
               7 |  63
               8 |  23
               9 |  12
              10 |  24
              11 |  58
              12 |  14
              13 |  3
              14 |  0
The output is roughly what you would expect for an idle system: the majority of the time that a runnable thread is enqueued, the run queue was very short (three or fewer threads in length). However, given that the system was largely idle, the exceptional data points at the bottom of the table might be unexpected. For example, why was the run queue as long as 13 runnable threads? To explore this question, you could write a D script that displays the contents of the run queue when the length of the run queue is long. This problem is complicated because D enablings cannot iterate over data structures, and therefore cannot simply iterate over the entire run queue. Even if D enablings could do so, you should avoid dependencies on the kernel's internal data structures.
For this type of script, you would enable the enqueue and dequeue probes and use both speculations and associative arrays.
Whenever a thread is enqueued, the script increments the length of the queue
and records the timestamp in an associative array keyed by the thread. You
cannot use a thread-local variable in this case because a thread might be
enqueued by another thread. The script then checks to see if the queue length
exceeds the maximum. If it does, the script starts a new speculation, and
records the timestamp and the new maximum. Then, when a thread is dequeued,
the script compares the enqueue timestamp to the timestamp of the longest
length: if the thread was enqueued before the timestamp
of the longest length, the thread was in the queue when the longest length
was recorded. In this case, the script speculatively traces the thread's information.
Once the kernel dequeues the last thread that was enqueued at the timestamp
of the longest length, the script commits the speculation data. This script
is shown below:
#pragma D option quiet
#pragma D option nspec=4
#pragma D option specsize=100k
int maxlen;
int spec[int];
sched:::enqueue
{
this->len = ++qlen[this->cpu = args[2]->cpu_id];
in[args[0]->pr_addr] = timestamp;
}
sched:::enqueue
/this->len > maxlen && spec[this->cpu]/
{
/*
* There is already a speculation for this CPU. We just set a new
* record, so we'll discard the old one.
*/
discard(spec[this->cpu]);
}
sched:::enqueue
/this->len > maxlen/
{
/*
* We have a winner. Set the new maximum length and set the
* timestamp of the longest length.
*/
maxlen = this->len;
longtime[this->cpu] = timestamp;
/*
* Now start a new speculation, and speculatively trace the
* length.
*/
this->spec = spec[this->cpu] = speculation();
speculate(this->spec);
printf("Run queue of length %d:\n", this->len);
}
sched:::dequeue
/(this->in = in[args[0]->pr_addr]) &&
this->in <= longtime[this->cpu = args[2]->cpu_id]/
{
speculate(spec[this->cpu]);
printf(" %d/%d (%s)\n",
args[1]->pr_pid, args[0]->pr_lwpid,
stringof(args[1]->pr_fname));
}
sched:::dequeue
/qlen[args[2]->cpu_id]/
{
in[args[0]->pr_addr] = 0;
this->len = --qlen[args[2]->cpu_id];
}
sched:::dequeue
/this->len == 0 && spec[this->cpu]/
{
/*
* We just processed the last thread that was enqueued at the
* time of longest length; commit the speculation, which by now
* contains each thread that was enqueued when the queue was
* longest.
*/
commit(spec[this->cpu]);
spec[this->cpu] = 0;
}
Running the above script on the same uniprocessor laptop results in output similar to the following example:
# dtrace -s ./whoqueue.d
Run queue of length 3:
  0/0 (sched)
  0/0 (sched)
  101170/1 (dtrace)
Run queue of length 4:
  0/0 (sched)
  100356/1 (Xsun)
  100420/1 (xterm)
  101170/1 (dtrace)
Run queue of length 5:
  0/0 (sched)
  0/0 (sched)
  100356/1 (Xsun)
  100420/1 (xterm)
  101170/1 (dtrace)
Run queue of length 7:
  0/0 (sched)
  100221/18 (nscd)
  100221/17 (nscd)
  100221/16 (nscd)
  100221/13 (nscd)
  100221/14 (nscd)
  100221/15 (nscd)
Run queue of length 16:
  100821/1 (xterm)
  100768/1 (xterm)
  100365/1 (fvwm2)
  101118/1 (xterm)
  100577/1 (xterm)
  101170/1 (dtrace)
  101020/1 (xterm)
  101089/1 (xterm)
  100795/1 (xterm)
  100741/1 (xterm)
  100710/1 (xterm)
  101048/1 (xterm)
  100697/1 (MozillaFirebird-)
  100420/1 (xterm)
  100394/1 (xterm)
  100368/1 (xterm)
^C
The output reveals that the long run queues are due to many runnable xterm processes. This experiment coincided with a change in virtual desktop, and therefore the results are probably due to some sort of X event processing.
26.3.3. sleep and wakeup
In enqueue and dequeue, the final example demonstrated that a burst in run queue length was due to runnable xterm processes. One hypothesis is that the observations resulted from a change in virtual desktop. You can use the wakeup probe to explore this hypothesis by determining who is waking the xterm processes, and when, as shown in the following example:
#pragma D option quiet
dtrace:::BEGIN
{
start = timestamp;
}
sched:::wakeup
/stringof(args[1]->pr_fname) == "xterm"/
{
@[execname] = lquantize((timestamp - start) / 1000000000, 0, 10);
}
profile:::tick-1sec
/++x == 10/
{
exit(0);
}
To investigate the hypothesis, run the above script, wait roughly five seconds, and switch your virtual desktop exactly once. If the burst of runnable xterm processes is due to switching the virtual desktop, the output should show a burst of wakeup activity at the five second mark.
# dtrace -s ./xterm.d

  Xsun
           value  -------------- Distribution ------------ count
               4 |  0
               5 |@  1
               6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  32
               7 |  0
The output does show that the X server is waking xterm processes, clustered around the time that you switched virtual desktops. If you wanted to understand the interaction between the X server and the xterm processes, you could aggregate on user stack traces when the X server fires the wakeup probe.
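The following minimal sketch shows that approach; the process names are assumptions carried over from the experiment above:

sched:::wakeup
/execname == "Xsun" && stringof(args[1]->pr_fname) == "xterm"/
{
    @[ustack()] = count();
}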
Understanding the performance of client/server systems like the X windowing system requires understanding the clients on whose behalf the server is doing work. This kind of question is difficult to answer with conventional performance analysis tools. However, if you have a model where a client sends a message to the server and sleeps pending the server's processing, you can use the wakeup probe to determine the client for whom the request is being performed, as shown in the following example:
self int last;
sched:::wakeup
/self->last && args[0]->pr_stype == SOBJ_CV/
{
@[stringof(args[1]->pr_fname)] = sum(vtimestamp - self->last);
self->last = 0;
}
sched:::wakeup
/execname == "Xsun" && self->last == 0/
{
self->last = vtimestamp;
}
Running the above script results in output similar to the following example:
# dtrace -s ./xwork.d
dtrace: script './xwork.d' matched 14 probes
^C
  xterm                 9522510
  soffice.bin           9912594
  fvwm2               100423123
  MozillaFirebird     312227077
  acroread            345901577
This output reveals that much Xsun work is being done on behalf of the processes acroread, MozillaFirebird and, to a lesser degree, fvwm2. Notice that the script only examined wakeups from condition variable synchronization objects (SOBJ_CV). As described in pr_stype Values, condition variables are the type of synchronization object typically used to synchronize for reasons other than access to a shared data region. In the case of the X server, a client will wait for data in a pipe by sleeping on a condition variable.
You can additionally use the sleep probe along with the wakeup probe to understand which applications are blocking on which applications, and for how long, as shown in the following example:
#pragma D option quiet
sched:::sleep
/!(curlwpsinfo->pr_flag & PR_ISSYS) && curlwpsinfo->pr_stype == SOBJ_CV/
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
@[stringof(args[1]->pr_fname), execname] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
END
{
printa("%s sleeping on %s:\n%@d\n", @);
}
The tail of the output from running the example script for several seconds on a desktop system resembles the following example:
# dtrace -s ./whofor.d
^C
...

  xterm sleeping on Xsun:
           value  -------------- Distribution ------------ count
          131072 |  0
          262144 |  12
          524288 |  2
         1048576 |  0
         2097152 |  5
         4194304 |@@@  45
         8388608 |  1
        16777216 |  9
        33554432 |@@@@@  83
        67108864 |@@@@@@@@@@@  164
       134217728 |@@@@@@@@@@  147
       268435456 |@@@@  56
       536870912 |@  17
      1073741824 |  9
      2147483648 |  1
      4294967296 |  3
      8589934592 |  1
     17179869184 |  0

  fvwm2 sleeping on Xsun:
           value  -------------- Distribution ------------ count
           32768 |  0
           65536 |@@@@@@@@@@@@@@@@@@@@@@  67
          131072 |@@@@@  16
          262144 |@@  6
          524288 |@  3
         1048576 |@@@@@  15
         2097152 |  0
         4194304 |  0
         8388608 |  1
        16777216 |  0
        33554432 |  0
        67108864 |  1
       134217728 |  0
       268435456 |  0
       536870912 |  1
      1073741824 |  1
      2147483648 |  2
      4294967296 |  2
      8589934592 |  2
     17179869184 |  0
     34359738368 |  2
     68719476736 |  0

  syslogd sleeping on syslogd:
           value  -------------- Distribution ------------ count
     17179869184 |  0
     34359738368 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  3
     68719476736 |  0

  MozillaFirebird sleeping on MozillaFirebird:
           value  -------------- Distribution ------------ count
           65536 |  0
          131072 |  3
          262144 |@@  14
          524288 |  0
         1048576 |@@@  18
         2097152 |  0
         4194304 |  0
         8388608 |  1
        16777216 |  0
        33554432 |  1
        67108864 |  3
       134217728 |@  7
       268435456 |@@@@@@@@@@  53
       536870912 |@@@@@@@@@@@@@@  78
      1073741824 |@@@@  25
      2147483648 |  0
      4294967296 |  0
      8589934592 |@  7
     17179869184 |  0
You might want to understand how and why MozillaFirebird is blocking on itself. You could modify the above script as shown in the following example to answer this question:
#pragma D option quiet
sched:::sleep
/execname == "MozillaFirebird" && curlwpsinfo->pr_stype == SOBJ_CV/
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/execname == "MozillaFirebird" && bedtime[args[0]->pr_addr]/
{
@[args[1]->pr_pid, args[0]->pr_lwpid, pid, curlwpsinfo->pr_lwpid] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
bedtime[args[0]->pr_addr] = 0;
}
END
{
printa("%d/%d sleeping on %d/%d:\n%@d\n", @);
}
Running the modified script for several seconds results in output similar to the following example:
# dtrace -s ./firebird.d
^C

  100459/1 sleeping on 100459/13:
           value  -------------- Distribution ------------ count
          262144 |  0
          524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1
         1048576 |  0

  100459/13 sleeping on 100459/1:
           value  -------------- Distribution ------------ count
        16777216 |  0
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1
        67108864 |  0

  100459/1 sleeping on 100459/2:
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |@@@@  5
           65536 |@  2
          131072 |@@@@@  6
          262144 |  1
          524288 |@  2
         1048576 |  0
         2097152 |@@  3
         4194304 |@@@@  5
         8388608 |@@@@@@@@  9
        16777216 |@@@@@  6
        33554432 |@@  3
        67108864 |  0

  100459/1 sleeping on 100459/5:
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |@@@@@  12
           65536 |@@  5
          131072 |@@@@@@  15
          262144 |  1
          524288 |  1
         1048576 |  2
         2097152 |@  4
         4194304 |@@@@@  13
         8388608 |@@@  8
        16777216 |@@@@@  13
        33554432 |@@  6
        67108864 |@@  5
       134217728 |@  4
       268435456 |  0
       536870912 |  1
      1073741824 |  0

  100459/2 sleeping on 100459/1:
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |@@@@@@@@@@@@@@  11
           65536 |  0
          131072 |@@  2
          262144 |  0
          524288 |  0
         1048576 |@@@@  3
         2097152 |@  1
         4194304 |@@  2
         8388608 |@@  2
        16777216 |@  1
        33554432 |@@@@@@  5
        67108864 |  0
       134217728 |  0
       268435456 |  0
       536870912 |@  1
      1073741824 |@  1
      2147483648 |@  1
      4294967296 |  0

  100459/5 sleeping on 100459/1:
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |  1
           65536 |  2
          131072 |  4
          262144 |  7
          524288 |  1
         1048576 |  5
         2097152 |  10
         4194304 |@@@@@@  77
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@  270
        16777216 |@@@  43
        33554432 |@  20
        67108864 |@  14
       134217728 |  5
       268435456 |  2
       536870912 |  1
      1073741824 |  0
You can also use the sleep and wakeup probes to understand the performance of door servers such as the name service cache daemon, as shown in the following example:
sched:::sleep
/curlwpsinfo->pr_stype == SOBJ_SHUTTLE/
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/execname == "nscd" && bedtime[args[0]->pr_addr]/
{
@[stringof(curpsinfo->pr_fname), stringof(args[1]->pr_fname)] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
bedtime[args[0]->pr_addr] = 0;
}
The tail of the output from running the above script on a large mail server resembles the following example:
  imapd
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |  2
           65536 |@@@@@@@@@@@@@@@@@  57
          131072 |@@@@@@@@@@@  37
          262144 |  3
          524288 |@@@  11
         1048576 |@@@  10
         2097152 |@@  9
         4194304 |  1
         8388608 |  0

  mountd
           value  -------------- Distribution ------------ count
           65536 |  0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  49
          262144 |@@@  6
          524288 |  1
         1048576 |  0
         2097152 |  0
         4194304 |@@@@  7
         8388608 |@  3
        16777216 |  0

  sendmail
           value  -------------- Distribution ------------ count
           16384 |  0
           32768 |@  18
           65536 |@@@@@@@@@@@@@@@@@  205
          131072 |@@@@@@@@@@@@@  154
          262144 |@  23
          524288 |  5
         1048576 |@@@@  50
         2097152 |  7
         4194304 |  5
         8388608 |  2
        16777216 |  0

  automountd
           value  -------------- Distribution ------------ count
           32768 |  0
           65536 |@@@@@@@@@@  22
          131072 |@@@@@@@@@@@@@@@@@@@@@@@  51
          262144 |@@  6
          524288 |  1
         1048576 |  0
         2097152 |  2
         4194304 |  2
         8388608 |  1
        16777216 |  1
        33554432 |  1
        67108864 |  0
       134217728 |  0
       268435456 |  1
       536870912 |  0
You might be interested in the unusual data points for automountd or the persistent data point at over one millisecond for sendmail. You can add additional predicates to the above script to home in on the causes of any exceptional or anomalous results.
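For example, the following sketch narrows the door-server script to wakeups that took longer than one millisecond, counting them by the name of the process being woken; the one-millisecond threshold is an assumption chosen to match the sendmail data point above:

sched:::sleep
/curlwpsinfo->pr_stype == SOBJ_SHUTTLE/
{
    bedtime[curlwpsinfo->pr_addr] = timestamp;
}

sched:::wakeup
/execname == "nscd" && bedtime[args[0]->pr_addr] &&
    timestamp - bedtime[args[0]->pr_addr] > 1000000/
{
    @[stringof(args[1]->pr_fname)] = count();
    bedtime[args[0]->pr_addr] = 0;
}

sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
    bedtime[args[0]->pr_addr] = 0;
}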
26.3.4. preempt, remain-cpu
Because illumos is a preemptive system, higher priority threads preempt lower priority ones. Preemption can induce a significant latency bubble in the lower priority thread, so you might want to know which threads are being preempted by which other threads. The following example shows how to use the preempt and remain-cpu probes to display this information:
#pragma D option quiet
sched:::preempt
{
self->preempt = 1;
}
sched:::remain-cpu
/self->preempt/
{
self->preempt = 0;
}
sched:::off-cpu
/self->preempt/
{
/*
* If we were told to preempt ourselves, see who we ended up
* giving the CPU to.
*/
@[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
curlwpsinfo->pr_pri] = count();
self->preempt = 0;
}
END
{
printf("%25s %3s %25 %3s %5s\n", "PREEMPTOR", "PRI",
"PREEMPTED", "PRI", "#");
printa("%25s %3d %25s %3d %5@d\n", @);
}
Running the above script for several seconds on a desktop system results in output similar to the following example:
# dtrace -s ./whopreempt.d
^C
                PREEMPTOR PRI                 PREEMPTED PRI     #
                    sched  60                      Xsun  53     1
                    xterm  59                      Xsun  53     1
          MozillaFirebird  57                      Xsun  53     1
                   mpstat 100                     fvwm2  59     1
                    sched  99           MozillaFirebird  57     1
                    sched  60                    dtrace  30     1
                   mpstat 100                      Xsun  59     2
                    sched  60                      Xsun  54     2
                    sched  99                     sched  60     2
                    fvwm2  59                      Xsun  44     2
                    sched  99                      Xsun  44     2
                    sched  60                     xterm  59     2
                    sched  99                      Xsun  53     2
                    sched  99                      Xsun  54     3
                    sched  60                     fvwm2  59     3
                    sched  60                      Xsun  59     3
                    sched  99                      Xsun  59     4
                    fvwm2  59                      Xsun  54     8
                    fvwm2  59                      Xsun  53     9
                     Xsun  59           MozillaFirebird  57    10
                    sched  60           MozillaFirebird  57    14
          MozillaFirebird  57                      Xsun  44    16
          MozillaFirebird  57                      Xsun  54    18
26.3.5. change-pri
Preemption is based on priorities, so you might want to observe changes in priority over time. The following example uses the change-pri probe to display this information:
sched:::change-pri
{
@[stringof(args[0]->pr_clname)] =
lquantize(args[2] - args[0]->pr_pri, -50, 50, 5);
}
The example script captures the degree to which priority is raised or lowered, and aggregates by scheduling class. Running the above script results in output similar to the following example:
# dtrace -s ./pri.d
dtrace: script './pri.d' matched 10 probes
^C

  IA
           value  -------------- Distribution ------------ count
           < -50 |  20
             -50 |@  38
             -45 |  4
             -40 |  13
             -35 |  12
             -30 |  18
             -25 |  18
             -20 |  23
             -15 |  6
             -10 |@@@@@@@@  201
              -5 |@@@@@@  160
               0 |@@@@@  138
               5 |@  47
              10 |@@  66
              15 |@  36
              20 |@  26
              25 |@  28
              30 |  18
              35 |  22
              40 |  8
              45 |  11
           >= 50 |@  34

  TS
           value  -------------- Distribution ------------ count
             -15 |  0
             -10 |@  1
              -5 |@@@@@@@@@@@@  7
               0 |@@@@@@@@@@@@@@@@@@@@  12
               5 |  0
              10 |@@@@@  3
              15 |  0
The output shows the priority manipulation of the Interactive (IA) scheduling class. Instead of seeing priority manipulation, you might want to see the priority values of a particular process and thread over time. The following script uses the change-pri probe to display this information:
#pragma D option quiet
BEGIN
{
start = timestamp;
}
sched:::change-pri
/args[1]->pr_pid == $1 && args[0]->pr_lwpid == $2/
{
printf("%d %d\n", timestamp - start, args[2]);
}
tick-1sec
/++n == 5/
{
exit(0);
}
To see the change in priorities over time, type the following command in one window:
$ echo $$
139208
$ while true ; do let i=0 ; done
In another window, run the script and redirect the output to a file:
# dtrace -s ./pritime.d 139208 1 > /tmp/pritime.out
#
You can use the file /tmp/pritime.out generated above as input to plotting software to graphically display priority over time. gnuplot is a freely available plotting package that is available for illumos through several distributors, including pkgsrc.
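For example, assuming gnuplot is installed and noting that the script's output places the timestamp offset in the first column and the priority in the second, a session like the following sketch would plot the data; the exact invocation may vary with your gnuplot version:

$ gnuplot
gnuplot> plot "/tmp/pritime.out" using 1:2 with lines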
26.3.6. tick
illumos uses tick-based CPU accounting, in which a system clock interrupt fires at a fixed interval and attributes CPU utilization to the threads and processes running at the time of the tick. The following example shows how to use the tick probe to observe this attribution:
# dtrace -n sched:::tick'{@[stringof(args[1]->pr_fname)] = count()}'
^C
  arch               1
  sh                 1
  sed                1
  echo               1
  ls                 1
  FvwmAuto           1
  pwd                1
  awk                2
  basename           2
  expr               2
  resize             2
  tput               2
  uname              2
  fsflush            2
  dirname            4
  vim                9
  fvwm2             10
  ksh               19
  xterm             21
  Xsun              93
  MozillaFirebird  260
The system clock frequency varies from operating system to operating system, but generally ranges from 25 hertz to 1024 hertz. The illumos system clock frequency is adjustable, but defaults to 100 hertz.
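If you want to confirm the clock frequency on a particular system, one approach is to read the kernel's clock-frequency variable directly. The following one-liner is a sketch that assumes the kernel exports an integer named hz, as illumos does:

# dtrace -qn 'BEGIN {printf("%d hertz\n", `hz); exit(0)}'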
The tick probe only fires if the system clock detects a runnable thread. To use the tick probe to observe the system clock's frequency, you must have a thread that is always runnable. In one window, create a looping shell as shown in the following example:
$ while true ; do let i=0 ; done
In another window, run the following script:
uint64_t last[int];
sched:::tick
/last[cpu]/
{
@[cpu] = min(timestamp - last[cpu]);
}
sched:::tick
{
last[cpu] = timestamp;
}
# dtrace -s ./ticktime.d
dtrace: script './ticktime.d' matched 2 probes
^C

        0          9883789
The minimum interval is 9.8 milliseconds, which indicates that the default clock tick interval is 10 milliseconds (a frequency of 100 hertz). The observed minimum is somewhat less than 10 milliseconds due to jitter.
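If you want to see the jitter itself rather than just the minimum, a variation of the above script can show the full distribution of intervals between ticks on each CPU; this sketch simply substitutes quantize() for min():

uint64_t last[int];

sched:::tick
/last[cpu]/
{
    @[cpu] = quantize(timestamp - last[cpu]);
}

sched:::tick
{
    last[cpu] = timestamp;
}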
One deficiency of tick-based accounting is that the system clock that performs accounting is often also responsible for dispatching any time-related scheduling activity. As a result, if a thread is to perform some amount of work every clock tick (that is, every 10 milliseconds), the system will either over-account or under-account for the thread, depending on whether the accounting is done before or after the time-related dispatching activity. In illumos, accounting is performed before time-related dispatching. As a result, the system will under-account for threads that run at a regular interval. If such threads run for less than the clock tick interval, they can effectively “hide” behind the clock tick. The following example shows the degree to which the system has such threads:
sched:::tick,
sched:::enqueue
{
@[probename] = lquantize((timestamp / 1000000) % 10, 0, 10);
}
The output of the example script is two distributions of the millisecond offset within a ten millisecond interval, one for the tick probe and another for enqueue:
# dtrace -s ./tick.d
dtrace: script './tick.d' matched 4 probes
^C

  tick
           value  -------------- Distribution ------------ count
               6 |  0
               7 |@  3
               8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  79
               9 |  0

  enqueue
           value  -------------- Distribution ------------ count
             < 0 |  0
               0 |@@  267
               1 |@@  300
               2 |@@  259
               3 |@@  291
               4 |@@@  360
               5 |@@  305
               6 |@@  295
               7 |@@@@  522
               8 |@@@@@@@@@@@@  1315
               9 |@@@  337
The output histogram named tick shows that the clock tick is firing at an 8 millisecond offset. If scheduling were not at all associated with the clock tick, the output for enqueue would be evenly spread across the ten millisecond interval. However, the output shows a spike at the same 8 millisecond offset, indicating that at least some threads in the system are being scheduled on a time basis.
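If you then wanted to identify which threads are being scheduled at that offset, you could restrict the enqueue probe to the offset in question. The following sketch assumes the 8 millisecond offset observed above and counts enqueues at that offset by process name:

sched:::enqueue
/((timestamp / 1000000) % 10) == 8/
{
    @[stringof(args[1]->pr_fname)] = count();
}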
26.4. Stability
The sched provider uses DTrace's stability mechanism to describe its stabilities, as shown in the following table. For more information about the stability mechanism, see Stability.
Element | Name stability | Data stability | Dependency class
---|---|---|---
Provider | Evolving | Evolving | ISA
Module | Private | Private | Unknown
Function | Private | Private | Unknown
Name | Evolving | Evolving | ISA
Arguments | Evolving | Evolving | ISA