Ever notice how two technical types can whip themselves up over unambiguous technical details they nonetheless regard quite differently? I bring you a case in point on the subject of virtual memory operations in Solaris. The outcome of this story became for me an object lesson in how communicating technical detail is at times more fraught with peril than the casual observer -- the nicest colloquial term I can come up with on the spot for certain manager types -- might appreciate.

At a train-the-trainer event for Sun's performance management course, it was suggested that output to the vmstat command, specifically the mf field, was widely misdocumented and misunderstood as "just" minor faults.

A minor fault occurs when the operating system must cannot complete a process' instruction until it fetches some datum from physical memory. This logically occurs before a major fault, fetching the datum from disk, which we'd need to do if memory didn't have our bits. You may hear now and then about micro faults, too. These occur when a fetch attempt misses an on-CPU cache. Major faults are the least-desirable thing, but a necessary consequence of chaining caches from really fast but really small storage (CPU registers) to really slow but really big storage (local/remote disk). On average, one hopes to wash out those very long round trips with many, many shorter ones.

So the rationale given was that some number of minor faults are "really" major faults that haven't happened yet. Or that's what I thought I heard, which could be wrong, teetering on brink of an impending heresy as I was. This news was problematic for me. It sounded as if the speaker meant to say minor faults are a superset that includes major faults. I didn't like the inference, but I also couldn't articulate why. First I went to vmstat's man page and read this under the mf field heading:

               mf    minor faults - but see the -S option for how
                     this field is modified.
Well, I guess that's one reason why people take a course, eh? Now, the speaker in this case is an internals guy, so if he says "misdocumented just about everywhere" I can assume he's lumping in the man page author with everyone else. More generally, I take it to mean just about everyone interprets the source code improperly. So I took a look for myself:
    298 		adjprintf(" %*.0f", 3, (kstat_delta(oldvm, newvm, "hat_fault") +
    299 		    kstat_delta(oldvm, newvm, "as_fault")) / etime);
These lines reside in the function dovmstats(). We see two values, hat_fault and as_fault, added together. The difference between them is that faults originating in a process's address space are managed in a platform-independent manner. Hardware address translation faults (HAT) on the other hand are managed according to the hardware at hand. Not SPARC or x86, mind you, but the memory management unit (MMU) and other implementation details that determine how virtual address space is hashed and mapped to physical memory. So both are treated as minor faults, presumably, because they live in the same time cost range, not because they are the same thing.

Then I pulled down my copies of Solaris Internals. The discussion in both editions is the same, so I cite page numbers here from the second edition. It seems odd now, in light of this question of hierarchy I'm researching, but the text defines major faults first, minor faults second. What further sows the seeds of my discomfort appears on pp. 473-4:

A major page fault occurs when an attempt to access a virtual memory location that is mapped by a segment does not have a physical page of memory mapped to it and the page does not exist in physical memory. ... A minor page fault occurs when an attempt is made to access a virtual memory location that resides within a segment and the page is in physical memory, but no current MMU translation is established to the physical page from the address space that caused the fault.

The first passage tells me, after some reflection, that when a process maps something in which later gets dumped from memory -- i.e., evicted from its cache -- and it's needed again, boom, major fault. The second passage tells me, after quite a bit more reflection and de-booleanizing, that address space faults and HAT faults are logically related events. Or I'm not getting the point, a distinct possibility. My inability to clarify it is compounded by two phenomena: one, the existence of an as_fault() function in the code, but no corresponding hat_fault() function; but two, the existence of both as_fault and hat_fault kstat counters.

The response I got back from the original speaker came with an opinion and a DTrace script. The opinion is that major and minor faults really should be subsets of page faults. A page fault represents access to virtual memory that is not yet or not currently associated with a physical memory location. So, these occur when a mapping has expired and needs recovering, the reason we're reviewing vmstat output in the first place, but also when a process calls a heap space location for the first time. We haven't narrowed things down here, but perhaps I am seeing now the bias that drove the statement that got me to hoist flags. Then the script:

as_fault:entry
{
  self-> as_flag =1;
}

page_lookup:return
/self->as_flag && !arg1/
{
  @major["major"] = count();
}

page_lookup:return
/self->as_flag && arg1/
{
  @minor["minor"] = count();
}

as_fault:return
{
   self-> as_flag =0;
}

tick-5sec
{
   normalize(@minor, 5);
   normalize(@major, 5);
   printa("%s\t%@d\n", @minor);
   printa("%s\t%@d\n", @major);
   clear(@minor);
   clear(@major);
}
This script uses as_fault() calls as a predicate for narrowing page_lookup() calls th those of interest. It assumes, that is, that either a major or minor fault is a result of a page_lookup(). The following clauses, which seem to say that major and minor faults can be sorted by the type of return from page_lookup itself, suggests we do know, at an early stage, which is which.

But I'm not ready to disagree with the speaker as I thought I heard him. It might well be that the way the kstat counters do their work happen to muddle this point along the way. Much as I'd like to know, tonight's flight to Boston is looming large and I'm not packed yet. Shazbat.

The moral to this story so far: technical discussions that have more to do with implementation detail than a generalizing principle or concept are often trouble. In subject areas where relatively few people need to tread, such as virtual memory operations, discussion can be problematic. What passes for a reasonable summary to the uninitiated might begin with a code dicusssion, or paraphrasing the code as briefly but fully as one can, or by using documentation of the intentions in a comprehensive framework.

And there are different forms of confusion that can ensue from each of these approaches. Giving the code-averse admin a code sample, as if it is invariably self-evident stuff, can also say "go away for a minute, I'll come back to your sort when it's important." Giving the overqualified student a paraphrase, however mildly inaccurate or overly smoothing in their estimation, distracting from the topic focus to verifying the understanding of the speaker. Then there's your Magic Garden Explained or Solaris Internals or the like, volumes no one is going to traverse quickly, much less absorb, cross-reference and verify for absence of contradiction.

What concerns me as a trainer -- and in the best of times, an educator -- is getting down audience expectations well enough to avoid the autonomic head-butt a surprised nerd will give you (present company included). Speaking a dialect that reassures all listeners they're in the right room, and that the person up front can ably facilitate the proceedings, is the first challenge I'm expected to overcome. What's surprising is that the technical depth of the subject isn't a key factor, but whether the listener feels the speaker regards the material in the proper light, and I do mean on a point by point, dotted i by crossed t basis.

The value of presenting general concepts (lecture) but tested implementation (labs) in these cases has several benefits. None of them reveal exactly how to articulate exactly how something works. But that's also the struggle: to put such things in a way that the principle at hand and the action that can be observed feels consistent, which lets the student walk away feeling they have both got the idea right and can apply it correctly.

Some time ago I noted a show-and-tell point on DTrace's operations that concerns hot-patching, or the means by which DTrace inserts and removes a probe from kernel code. In retrospect, that article falls short on at least three points. For one, even today I'm finding "Dtrace" where there should be "DTrace." A minor point, the forgiving among you may allow, but once a person is called out for it by one of the DTrace Three in public, a person might obsess.

Two, the blog title: DTrace Overhead Using SPARC. It isn't quite right. The idea of enabling probes and their impact on the system comes up at the end of the article, mostly in principle. The article nonetheless is still more a pointer to a neat-o observation a superuser can make on DTrace's probe insertion effect. One can't really determine probe cost without generating demand for it. But it has potential to become a hot topic, owing to a research paper with some dubious overreaching (slide deck here if you care for high points only). The paper and presentation discuss NTrace, a "dynamic tracing tool for the Windows kernel, drivers, system libraries, and applications that supports function boundary tracing." The bottom-line claim is that NTrace on Windows/IA-32 is faster than DTrace on Solaris/x86. And with a pioneering application of what I call the faith-based transitive property, the authors claim NTrace on Windows/IA-32 probably matches or creates less overhead than DTrace on Solaris/SPARC.

Maybe it's frustration with the current political climate in America that forces me to pick my battles lately, and so I have not read this paper very carefully. There's a certain amount of baiting and trademark graduate student petulance one expects in such a paper, but I don't have the patience to spare for it. Fortunately, Bryan Cantrill has weighed in with a summation for the Solaris team. And he hasn't yet pointed out any usage errors of mine, so I got that going for me, which is nice.

But it's the third point that pains me most about that blog entry: I completely neglected to point out a key difference in how DTrace works on SPARC and x86. I stared right at the assembler and, well, it stared right back. This week, also, I happened to have both Solaris/SPARC and Solaris/x86 on hand at once, and finally the difference struck me as something worth noting. So let's take another look.

To make things a little more interesting, I'm instrumenting the entry and return points for the kernel function ufs_write() as follows:

dtrace -n 'fbt::ufs_write:entry {}' -n 'fbt::ufs_write:return {}'
The function's assembly code before and after this enabling looks like this in SPARC:
ksh:muc19-host05# mdb -k
Loading modules: [ ... ]
> ufs_write::dis -n 3
ufs_write:                      save      %sp, -0x120, %sp
ufs_write+4:                    stx       %i4, [%sp + 0x8b7]
ufs_write+8:                    sethi     %hi(0xfffdfc00), %l0
ufs_write+0xc:                  sethi     %hi(0x20000), %l5
...
ufs_write+0x69c:                ret
ufs_write+0x6a0:                restore

> ufs_write::dis
ufs_write:                      ba,a      +0x23c5f4     <dt=0x4c>
ufs_write+4:                    stx       %i4, [%sp + 0x8b7]
ufs_write+8:                    sethi     %hi(0xfffdfc00), %l0
ufs_write+0xc:                  sethi     %hi(0x20000), %l5
...
ufs_write+0x69c:                ba,a      +0x23bf84     <dt=0x4c>
ufs_write+0x6a0:                restore
On SPARC, DTrace patches the save and ret instructions with a ba,a or branch always instruction. And you can follow that branch if you like. In mdb, just add the offset address given to the symbol/address you're at, then read out the instructions with dis(assemble) dcmd ("dee command").
> ufs_write+0x23c5f4::dis -n 11
...
0x14d3510:                      save      %sp, -0x120, %sp
0x14d3514:                      sethi     %hi(0x4c00), %o0
0x14d3518:                      or        %o0, 0x17, %o0
0x14d351c:                      mov       %i0, %o1
0x14d3520:                      mov       %i1, %o2
0x14d3524:                      mov       %i2, %o3
0x14d3528:                      mov       %i3, %o4
0x14d352c:                      mov       %i4, %o5
0x14d3530:                      sethi     %hi(0x1296c00), %g1
0x14d3534:                      call      +0x79d49dac   <dtrace_probe>
0x14d3538:                      or        %g1, 0x318, %o7
0x14d353c:                      sethi     %hi(0x4c00), %o0
0x14d3540:                      or        %o0, 0x18, %o0
0x14d3544:                      mov       0x69c, %o1
0x14d3548:                      call      +0x79d49d98   <dtrace_probe>
0x14d354c:                      mov       %i0, %o2
0x14d3550:                      ret
0x14d3554:                      restore
Judging by output alone, a plain ::dis dcmd will print out the ten instructions before and after the target address if there isn't a section boundary to anchor to. I suppose I could verify that somewhere, but documentation on dcmds isn't deep (::help dis doesn't). So where's Adam Leventhal now, hm? Anyway, I cheated a little. I used -n 11, but only after counting the actual instructions to the ret instruction by hand, then deleting the prior 0x11 instructions that were also displayed. And that "11" is in hex, mind you. If you want to phrase things decimally, use "0t17" instead. Also, bear in mind if the ufs_write() happens to change in a future release, this count may no longer get you exactly to the end.

Here's what we get by adding the branch offset value displayed at the address of the return instruction:

> ufs_write+0x69c+0x23bf84::dis -n 6
...
0x14d353c:                      sethi     %hi(0x4c00), %o0
0x14d3540:                      or        %o0, 0x18, %o0
0x14d3544:                      mov       0x69c, %o1
0x14d3548:                      call      +0x79d49d98   <dtrace_probe>
0x14d354c:                      mov       %i0, %o2
0x14d3550:                      ret
0x14d3554:                      restore
There's not much to see here, really, unless you like disassembled code. But the call to dtrace_probe() tells us we're in a right place. So upon entry to, and before returning from, the ufs_write() function, the patched instructions jump us to kernel-resident code where DTrace does some work, then restores control to the function. That's a fast way of doing things, by the way. Praise the goto!

On x86, it works differently. Pretty exciting punch-line, huh? Here's the short form for the probe-off, probe-on action in an x86 session:

mfernest@inkling:~# mdb -k
Loading modules: [...]
> ufs_write::dis -n 3
ufs_write:                      pushq  %rbp
ufs_write+1:                    movq   %rsp,%rbp
ufs_write+4:                    subq   $0x28,%rsp
ufs_write+8:                    movq   %rdi,-0x8(%rbp)

> ufs_write::dis -n 3
ufs_write:                      int    $0x3
ufs_write+1:                    movq   %rsp,%rbp
ufs_write+4:                    subq   $0x28,%rsp
ufs_write+8:                    movq   %rdi,-0x8(%rbp)
We're not branching here, we're trapping with an interrupt. Specifically, we're using an interrupt vector that is defined for use as a breakpoint. This approach is, to put it short and sweet, not so fast as branching. From here onward, the whys and wherefores get well past my comfort zone. I did find a slide deck by Frank Hofmann at Sun, but it's not skim-lightly kind of read for me. It could really, really use an audio supplement.

Bad news, huh? Well, maybe, if before now you've felt some real probe effect pain on Solaris/x86 and traced the cause back to the DTrace implementation. The authors of the paper mentioned above did that, using a test case that made 500 million calls to the system call getpid(), with probes enabled and disabled. Sure enough, trap-based handling is more expensive than branch-based handling. More to the point, I think, is that the paper makes certain shock-and-awe noises about something we could have guessed, and I'm sure quantifying the effect is not an actual discovery of any sort. You bang on an instrumentation point, half-a-billion times nonstop, you're going to get some probe effect, and the difference between adding instructions to the stream and halting the stream itself is going to take on its own peculiar dimensions.

The test case also appears to track those half-a-billion probe hits one at a time. I infer this from two points: first, they mention using the dtrace consumer's -q option, ostensibly to remove terminal I/O as a bottleneck. Who would do this with DTrace? We do have aggregations, along with buffer policies and tunings if we need them, to manage data at magnitudes like this one. Second, the authors note in their test case for NTrace/Windows that they wrote their trace data to file on at least one occasion -- and got 7.3 GB to wade through. That amounts to, I surely hope, a record per probe hit. Who cares if NTrace produces less run-time overhead? You're going to give that time back and a whole lot with post-processing. As Cantrill points out, certainly the more prominent performance issue is getting your data as close to a usable form upon dump-ation as you can.

The real point the paper tries to make, in my view, is that NTrace can do on Windows/IA-32 what Solaris cannot do on x86, because the magic -- aka Structured Exception Handling (SEH) -- is in Windows. And only Windows! Suckahs!! But there's a lot of asserting what is important and relevant to make the point, some speculating on improbable comparisons (which boils down to Windows the OS vs. SPARC the processor, hello?) and some heavy perspective-framing on the results to get there. But if you're up for a paper on the current state of German academic elbow-throwing vis-a-vis operating system wars, please to enjoy.

We're working our way through some DTrace providers to better understand what a simple Java application does when it's not doing much. On a Unix platform, a good place to start is with the system calls the application makes. The truss utility and its variants already do this work well. As an example, take a look at this truss-based analysis of the same Java program we're using. So why add DTrace to the mix? In short, greater flexibility, more power (as we learn more about it), and less impact on the system when using it. So we scratched the surface with DTrace's syscall provider in the first entry of this series, then used the proc provider to capture process-level events in the second. Now we'll look at the pid provider and see what it can tell us about our Java app.

The problem with diagnosing Java programs is that they use the JVM, a process-based interpreter, to manage platform details. They therefore sit at a remove from process observability, making many of your Solaris admins' most familiar tools less useful. And while we do have tools that expose Java code to Java practitioners in a Java-meaningful way, the gap between Java-friendly views and admin-friendly views isn't going to make many friends. To see the whole picture, we have to pick our way from system calls to process events to the pid provider. We will go on to a Java program-specific view via the hotspot provider next, but first we acknowledge the larger, process-centric world in which the JVM participates.

The pid provider, alas, can be an unwieldy little helper. Unlike the providers we've seen so far, it exists on a per-process basis. You won't see one until someone or some other process invokes a DTrace program against a named target. When you do see one, it will have a name like pid1244, where 1244 is the target's process ID. And how many probes are available can vary from a very few to a few hundred thousand.

You think I kid? Watch. We'll create our first pid provider using the sample Java program, but with a narrow definition: only the entry and return probes, only the functions defined in the main java code segment. We'll declare the probes separately as well. There isn't a smooth way to express both in one definition, for a reason we'll see shortly:

mfe@inkling$ dtrace -n 'pid$target:java::entry' -n 'pid$target:java::return' -c "java Sleeper"
dtrace: description 'pid$target:java::entry' matched 3 probes
dtrace: description 'pid$target:java::return' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0  61950                     _start:entry 
  0  61951                      __fsr:entry 
  0  61953                     __fsr:return 
  1  61945                     _start:entry 
  1  61946                      __fsr:entry 
  1  61948                     __fsr:return 
[ ~30-second pause ]
  1  61947                       main:entry 
dtrace: pid 814 has exited
The output reports which CPU processed each probe firing, the probe's ID, and the probe name. You can suppress this output with the -q option; using an aggregation, as we've been doing, precludes it. It's beyond our scope to explain why five entry probes fired but only two return probes did, not to mention the mismatch in count between the two names -- but you expected symmetry, right? Now look what happens if we enable all probes:
mfe@inkling$ dtrace -n 'pid$target::: {}' -c "java Sleeper"| wc -l
dtrace: description 'pid$target::: ' matched 247036 probes
[ ~30-second pause ]
dtrace: pid 859 has exited
400096
Almost a quarter-million probes matched! Some are for each function boundary, yes, but there's also one for each assembly-level instruction, the smallest unit of compiled code, in the program itself. That's right: the pid provider can instrument every instruction, which can impose quite a load on a resource-constrained system. Here's an excerpt from the same output without the counting utility wc:
mfe@inkling$ dtrace -n 'pid$target::: {}' -c "java Sleeper"
dtrace: description 'pid$target::: ' matched 247036 probes
...
CPU     ID                    FUNCTION:NAME
  0 310609                    aplist_test:0 
  0 310608                aplist_test:entry 
  0 310610                    aplist_test:1 
  0 310611                    aplist_test:3 
...
  0 310649                   aplist_test:67 
  0 310650                   aplist_test:68 
  0 310651                   aplist_test:69 
  0 310652                   aplist_test:6a 
  0 310607               aplist_test:return 
Now you see why we can't match just any probe name without big consequences. Most probe names are numbers that indicate an offset in bytes from the function's entry point. The probes aplist_test:0 and aplist_test:entry are therefore synonymous. Note also how the probe ID for aplist_test:return comes first in ID order. Later on we may return to this fact to explain some things about code layout at this level. For now, we'll remind ourselves we can still list individual probes as long as the process remains active:
mfe@inkling$ dtrace -ln 'pid$target::aplist_test:6a {}' -c "java Sleeper"  
   ID   PROVIDER            MODULE                          FUNCTION NAME
62478     pid895       LM1`ld.so.1                       aplist_test 6a
The backtick (`) shows that LM1 is a module namespace of sorts within the program code. The DTrace guide says these can range from LM1 to LM9. We'll accept that on faith for the sake of keeping focus. But see how the probe ID is back in five-digit territory? The previous probes disappeared when their process exited and their ID values recouped. With that in mind, beware of reading too much into the "unique value" of these indices. They will not only vary from one release of Solaris to the next, they will vary from one system to the next, even from one boot session to another on the same system. They aren't useful for much, although you can invoke a probe by its ID:
mfe@inkling$ dtrace -ln 4908                            
   ID   PROVIDER            MODULE                          FUNCTION NAME
 4908    syscall                                                read entry
But we do now know the module name that hosts our probe. The pid provider uses the module component of the tuple to distinguish among the code segments that make up the process' map. These modules refer to all shared-object libraries, such as ld.so.1 or libc.so.1, as well as the main executable segment (in our case, java). But we can also generalize the names for convenience. All main executable segments, for example, can be declared by a.out. We can also declare a shared-object libraries without their dot-extension qualifiers, so incanting LM1`ld works as well as LM1`ld.so.1:
mfe@inkling$ dtrace -ln 'pid$target:ld:aplist_test:6a {}' -c "java Sleeper"
   ID   PROVIDER            MODULE                          FUNCTION NAME
62478     pid902       LM1`ld.so.1                       aplist_test 6a
With the pid provider, we can drill right down to the nibs of any process. This capability primarily plays to the strengths of debuggers, and for anyone who knows their source code intimately well. Without those skills or that knowledge, however, the pid provider can make a huge amount of hay around any interesting needle. To use it wisely, an idea of general process structure would certainly help.

Once again, good news! Getting a process map on Solaris is easy: use the pmap utility. Get the PID of a target process with the pgrep utility. First we run "java Sleeper" in a separate window, then we incant:

mfe@inkling$ pmap `pgrep java`                          
977:	java Sleeper
08045000      12K rwx--    [ stack ]
08050000       4K r-x--  /usr/jdk/instances/jdk1.7.0/bin/java
08060000       4K rwx--  /usr/jdk/instances/jdk1.7.0/bin/java
08061000    1008K rwx--    [ heap ]
B6000000    2416K r-x--  /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3
B626B000       4K rwx--  /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3
B6303000       4K rwx-R    [ stack tid=12 ]
B6354000       4K rwx-R    [ stack tid=11 ]
B63D5000       4K rwx-R    [ stack tid=10 ]
B6456000       4K rwx-R    [ stack tid=5 ]
B6458000    1692K r--s-  /usr/jdk/instances/jdk1.7.0/jre/lib/rt.jar
B6600000      32K rw---    [ anon ]
B6620000     116K rw---    [ anon ]
B67E8000      16K rw---    [ anon ]
B6821000       4K rw---    [ anon ]
B68DB000       4K rwx-R    [ stack tid=9 ]
B692C000       4K rwx-R    [ stack tid=8 ]
B697A000      16K rwx-R    [ stack tid=7 ]
B69FE000       4K rwx-R    [ stack tid=4 ]
B6A00000   16384K rw---    [ anon ]
BAA00000   59392K rw---    [ anon ]
F3A00000    8192K rw---    [ anon ]
FAC10000      56K r-x--  /usr/lib/locale/common/methods_unicode.so.3
FAC2D000       4K rwx--  /usr/lib/locale/common/methods_unicode.so.3
FAC37000     116K rw---    [ anon ]
FAE00000    4096K rwx--    [ anon ]
FDE20000       4K rwx--    [ anon ]
FDE78000      16K rwx-R    [ stack tid=6 ]
FDEFC000       4K rwx-R    [ stack tid=3 ]
FDEFE000      32K rw---    [ anon ]
FDF1F000      64K rw---    [ anon ]
FDFE0000       4K rwx--    [ anon ]
FDFF0000      64K r-x--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libzip.so
FE000000       8K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libzip.so
FE010000      32K rw-s-  /tmp/hsperfdata_mfernest/977
FE020000      64K rwx--    [ anon ]
FE040000      32K r-x--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/native_threads/libhpi.so
FE058000       4K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/native_threads/libhpi.so
FE059000      12K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/native_threads/libhpi.so
FE060000       4K rw---    [ anon ]
FE070000       4K r----    [ anon ]
FE080000       4K rwx--    [ anon ]
FE090000     164K r-x--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libjava.so
FE0CA000       8K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libjava.so
FE0D0000      80K r-x--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libverify.so
FE0F4000       4K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libverify.so
FE100000      64K rwx--    [ anon ]
FE159000      56K rwx-R    [ anon ]
FE168000      28K rwx-R    [ stack tid=2 ]
FE170000      64K rw---    [ anon ]
FE181000     128K rw---    [ anon ]
FE1A2000       4K rwxs-    [ anon ]
FE1B0000      16K r-x--  /lib/libmp.so.2
FE1C4000       4K rw---  /lib/libmp.so.2
FE1D0000       4K rwx--    [ anon ]
FE1E0000      64K r-x--  /lib/libmd.so.1
FE1F0000       4K rw---  /lib/libmd.so.1
FE200000       4K rw---    [ anon ]
FE210000      24K r-x--  /lib/libgen.so.1
FE226000       4K rw---  /lib/libgen.so.1
FE230000       4K rwx--    [ anon ]
FE240000      28K r-x--  /lib/libuutil.so.1
FE257000       4K rw---  /lib/libuutil.so.1
FE260000     172K r-x--  /lib/libscf.so.1
FE29B000       4K rw---  /lib/libscf.so.1
FE2A0000     624K r-x--  /lib/libnsl.so.1
FE34C000      16K rw---  /lib/libnsl.so.1
FE350000      20K rw---  /lib/libnsl.so.1
FE360000       4K rwx--    [ anon ]
FE370000     320K r-x--  /lib/libm.so.2
FE3CF000       8K rwx--  /lib/libm.so.2
FE3E0000       4K rwx--    [ anon ]
FE3F0000       4K r-x--  /lib/libdoor.so.1
FE400000    8856K r-x--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/server/libjvm.so
FECB6000     356K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/server/libjvm.so
FED0F000      68K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/server/libjvm.so
FED30000      48K r-x--  /usr/lib/libCrun.so.1
FED4B000       8K rwx--  /usr/lib/libCrun.so.1
FED4D000      20K rwx--  /usr/lib/libCrun.so.1
FED60000       8K r-x--  /lib/libm.so.1
FED71000       4K rwx--  /lib/libm.so.1
FED80000       4K rwx--    [ anon ]
FED90000       4K r-x--  /usr/lib/libsched.so.1
FEDA0000      56K r-x--  /lib/libsocket.so.1
FEDBE000       4K rw---  /lib/libsocket.so.1
FEDC0000       4K rwx--    [ anon ]
FEDD0000      24K rwx--    [ anon ]
FEDE0000    1280K r-x--  /usr/lib/libc/libc_hwcap1.so.1
FEF20000      28K rwx--  /usr/lib/libc/libc_hwcap1.so.1
FEF27000       8K rwx--  /usr/lib/libc/libc_hwcap1.so.1
FEF30000       4K r-x--  /lib/libdl.so.1
FEF40000       4K rwx--    [ anon ]
FEF50000      92K r-x--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/jli/libjli.so
FEF76000       8K rwx--  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/jli/libjli.so
FEF80000      12K r-x--  /lib/libthread.so.1
FEF90000       4K rwx--    [ anon ]
FEFA0000       4K rw---    [ anon ]
FEFB0000       4K rw---    [ anon ]
FEFBE000     180K r-x--  /lib/ld.so.1
FEFFB000       8K rwx--  /lib/ld.so.1
FEFFD000       4K rwx--  /lib/ld.so.1
 total    106984K
You scrolled right to the bottom of that output, didn't you? I can't blame you. Java processes seem to map the whole world into their process spaces, even for small tasks. Each entry in this map is one segment to the whole: a virtual memory address, an allocation size, the segment's permissions, and the file that backs the mapping. If this final column shows [ anon ] or [ heap ] or [ stack ], it signifies a run-time allocation of memory that has no backing file. In which case, it is backed by the system's swap space. Some of these are virtual allocations, so the total memory value, about 107 MB, isn't necessarily completely charged against physical memory.

We can correlate a thing or two in this output to the work we've already done. Notice that the [ stack tid=num ] entries match the number of threads the proc provider reports. You might also notice there is no [ stack tid=1 ] in this pile. Care to guess why?

Relating an application's lifecycle to its structure, for starters, is a useful sanity check. Also, it's a reminder of the complexity that an interpreted program can bring to a system-level diagnostician. The Sleeper code is but an argument passed to the whole of what's going on here, and that remove seems to cause many a console-monitoring admin to throw hands in the air and throw reported problems back to you. But now that you, the Java programmer, can see what they see, we can move on to getting you both to see things from your perspective too. When we pick up with the hotspot provider, we'll aim for exactly that, leading to what I hope is a more tolerable limit to creating a lasting peace among admins and programmers.

Java sleeps, the proc Provider watches
Another DTrace perspective
Dec 31, 2009

In my I made some quick noise on a number of topics. For this entry and probably the next few, I want to dig in on various elements of that discussion, starting with DTrace providers and why tracing Java performance, initially, can seem like a hassle.

The JVM effects portability in Java code by managing the necessary interactions with the operating system. You can, up to a point, write Java without considering the details of the platform underneath it. The good news for developers is that most operating system and hardware concerns can even seem irrelevant when writing a lot of Java code. When and where it's necessary to integrate Java code with system libraries written in C/C++ -- DLLs in Windows, shared-object libraries in Unix -- the Java Native Interface (JNI) API provides a bridge for that gap. It's in fact through the JNI interface that many platform-independent elements of the JVM bind to the local system.

Tracing Java code outside the JVM, therefore, can be problematic. Since the JVM is a process -- a first-class object in the eyes of your system -- your Java code is a second-class thing. The OS cares about supplying resources to running processes and doesn't bother with details, such as how the JVM applies those resources. Much less does the OS care what bytecode is running. But that is the deal you ran with: you, the Java programmer don't want to concern yourself with OS management or JVM configuration, unless of course they've conspired to sully the pristine elegance that is your program code with their petulant truancy and irrational exception-throwing.

So what can we do with DTrace? We first have to determine what we're looking for when we examine a running program. Once we know that, we can select the appropriate DTrace provider. Let's say we'd like to start simple and just learn how our Java/JVM duo behave as a Solaris process. For that we can use proc, a modest provider with a mere fifteen probes to its name. The proc provider cares about four basic event types: process creation/termination, lightweight process creation/termination, executing new programs, and signalling. "Executing new programs" may seem redundant with process creation, but in Unix the action of creating a new process object (i.e., a fork) is separate from the action of giving that object a program file to run (i.e., exec). It's also no guarantee that the provided file will actually work, so we have to look out for that condition as well.

Using the same Java code listed here, we'll enable all the proc provider's probes to observe the process-level events of the program's life:

mfe@inkling$ dtrace -n 'proc::: /pid == $target/ { @[probename] = count();}' -c "java Sleeper"
dtrace: description 'proc::: ' matched 15 probes
dtrace: pid 1216 has exited

  exec                                                              1
  exec-success                                                      1
  exit                                                              1
  lwp-create                                                       11
  lwp-start                                                        11
  lwp-exit                                                         12
  fault                                                            39

Well, that's illuminating. We execute once; exec-success is one of two events that must follow exec, the other being exec-failure. We exit once, burn through eleven (or twelve?) system-level lightweight processes, and we generate a fault 39 times. What is that? Let's ask by listing that probe name:

mfe@inkling$ dtrace -ln fault                                        
   ID   PROVIDER            MODULE                          FUNCTION NAME
  571       proc              unix                              trap fault
Our trusty documentation tells us the fault probe will fire when a machine fault occurs, and that a fault code and some signal information are available. All this faulting, I would guess, occurs in the early going and is probably associated with starting up the process itself. But why guess when we can demonstrate? By looking at events one at a time and marking the time, we can see for ourselves.

To do that, we'll first drop the aggregation and set up two probe definitions, one for all proc probes and another for fault alone. We'll use the expression walltimestamp in the first block. A nanosecond-level timestamp expression is also available, but it's more than we need, and tedious to read in raw form. In the second block, we'll print out the fault code just for grins and show-how. Although I am not yet detailing specific goings-on here, I trust this code is not too mysterious:

proc:::
/ pid == $target /
{
   printf("%s fired at %Y\n", probename, walltimestamp);
}

proc:unix:trap:fault
/ pid == $target /
{
    printf("Fault code: %d\n", args[0]);
}

When we invoke the tuple proc:::, we're asking for any proc provider probes, regardless of the module they're in, the function they're in, or their names. Colons delimit the four tuple components available, so any full description takes the form:

provider:module:function:probe
When we don't specify a tuple components, we permit any possible match. If you want to see a particular probe's full definition, follow the example above (but don't bother with entry or return). The second probe definition allows no ambiguity, but the definition proc:::fault does the same thing. There are no wildcard matches possible and none likely to come into play in a later release. Just the same, it's too early to prefer brevity. I won't unless the alternative is to list a mess of probes, which is rather like importing each class you use on its own line -- potentially tedious to the point of obfuscating your intentions. In DTrace it's done with a comma-separated list like this:
proc:unix:lwp_rtt_initial:lwp-start,
proc:unix:lwp_rtt_initial:start,
proc:unix:trap:fault
{
   printf("probe fired: %s"\n, probename);
}
Ok, where were we? Right. So I committed this D program to a file, java_faults.d. We'll invoke it from the command line using the options -qs, the q to quiet default output, the s to read the following file as a DTrace script. Here we go:
mfe@inkling$ dtrace -qs java_faults.d -c "java Sleeper"                                   
lwp-start fired at 2009 Dec 31 11:29:17
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
lwp-start fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
lwp-start fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
exec fired at 2009 Dec 31 11:29:17
exec-success fired at 2009 Dec 31 11:29:17
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
lwp-create fired at 2009 Dec 31 11:29:17
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
lwp-create fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
fault fired at 2009 Dec 31 11:29:17
Fault code: 3
fault fired at 2009 Dec 31 11:29:17
Fault code: 4
lwp-create fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
lwp-start fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
lwp-create fired at 2009 Dec 31 11:29:17
[note the delay here]
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
lwp-exit fired at 2009 Dec 31 11:29:47
Sure enough, the faulting all occurs at the beginning of the program. We haven't divined the fault codes 3 and 4 yet, but this entry is already getting long, and time's a-wastin' on the ol' bargain shopping window.

For those of you ready to dive deeper on tracing Solaris faults, this article on debugging with Sun Studio dbx has a fault.d script that is worth a look.

In the next blog entry we'll bring in the pid provider, a potentially overwhelming creature that deserves its own introduction and space. Have a Happy New Year! Get smooched tonight! Or IM your bestest buddy and tell him why no self-respecting Continuum would ever let the Borg have their way with the universe, or do whatever it is someone like you would do after reading something this useless in the real world. See you in 2010!

It's been years now since I promised my JavaRanch colleague Ulf a newsletter article on DTrace and Java. In that same stretch, I've moved four times, had my son come live with me, and remarried. I've also proposed to publishers, in the form of a few books, only to acknowledge I had no real time to write them. And of course, there's regular living itself, which consumes more time than I'd otherwise allow. And so I have found that the close meditation I want for myself, before I again expose all my limits in print for all time, doesn't invite itself in. But it's time to get cracking, to start somewhere at the very least. So here's a rumination on DTrace I hope will start explaining why you want it, why you have been missing it, and why it's time to begin with it and Java.

It's not a small problem that many helpful tutorials in this vein start out pretty damn far, by my lights, with little or no idea that anyone who doesn't already get DTrace might want a boost. I don't want to drill into concepts here too deeply -- that's what my current proposed title (The Book of DTrace) is for. And if I ever write enough of it, my intended just might have cause to talk about it without fear of social ridicule. But if you're ready to try DTrace, it means you've got Solaris, Mac OS X, or hell, maybe even BSD, in front of you. And if not, but that Windows thing you have comes with a decent complement of memory, there's hope by way of VirtualBox. I'll discuss using that in a subsequent post.

DTrace, like Java, is many things at once. It has a programming language (of sorts). It is also a framework for plugging in enablers, albeit on a much more limited number of platforms. And it ships on Solaris with a series of enablers, known as providers, that make it possible to watch and record both application and system activity. Also like Java, D programs begin in a scrutable text form. Upon execution, they are converted to an intermediate format, known as DTrace Intermediate Format (DIF). Unlike Java, this bytecode, if you will, shuttles to a virtual machine that is implemented as a device driver, not a process. If the instructions are deemed safe and complete in that stage, they are passed to the appropriate provider(s), which also take the form of kernel-loaded modules. From that vantage point, it is possible to instrument almost every activity that occurs in the system.

Many other diagnostic and debugging tools do similar work. Those that examine user processes are processes themselves, such as truss. Such tools harness or attach to their process targets and collect data in a way that imposes a considerable tax. If you've had a manager that stops what you're doing to report on your last action, and only then allows you to perform the next, you have the idea. This overhead is called observation effect by the most tactful and generous among us. I prefer intrusion effect, for the same reason that plants I like and want to grow are called flowers, while those that deserve a rude yank from the earth or at least a fatal squirrel chomp are called weeds. If you've ever done work with post-mortem analysis tools like gdb or mdb, you're working with a history of something gone wrong at the very end, so you try to backtrack to the cause. Or perhaps you've learned something about debugging live kernels, and so you know that such intrusions can change system behavior such that the very behavior you want to observe might alter significantly.

DTrace's framework makes it possible to probe application and system alike, often without imposing noticeable cost. It costs nothing when not in use and takes little effort, incantation powers aside, to invoke. The DIF scheme helps squelch unsafe operations before they see the light of a CPU, which is why you may have heard claims that DTrace is completely safe, but also potentially destructive. DTrace is an unprecedented and still-unmatched tool for how it does what it does, but that doesn't mean its engineers and advocates worry over the appearance of total thematic consistency to the literal-minded or the impatient.

We'll skip the usual preliminaries of a careful, rules-based introduction to get right to it. Each DTrace provider offers a set of probes to watch and record process or system data within the provider's given scope. (Mac people who know this shmear refer to it as Instruments, because those people just have to have a different word for everything.) The vast majority of probes mark entry into or return from a function. Others mark events that aren't associated with a single function or a function boundary, such as memory paging, process scheduling, and I/O operations. There is also a time-based provider, called profile, that lets you stipulate a regular interval for taking a look around.

For our first look, we'll concern ourselves with relating Java program activity to a Solaris system. To do that, we're going to use the syscall provider, which supplies entry and return probes at the boundaries of every system call. We can list the provider's probes to get several screens of stuff. Or we can for now pipe the result to a word count program, ask for a line count, and get out:

mfe@inkling$ dtrace -lP syscall | wc -l              
471
Subtract one line for the column headers, and you've got 470 probes, 235 of them for entry and 235 for return. You might be wondering how this works if a function has multiple return points. Good question! The short answer is it works very well, thanks for asking. A longer answer is that the syscall provider supplies one probe name for all return points, as does any other provider that supplies them. Information on which return has been used, specifically, is available if you need it, but it's not treated as a topic of primary importance.

If you're like most students and consulting clients I've encountered, the first thing you want to know is how DTrace is going to solve your current problem, yes? Well, we have two prior problems to that. One, you're not ready; that's why I showed up. Two, I'm not a trained circus bear; I tend to insist that you learn, then apply. This approach helps keeps my business demand down, see. But what I will show you today is how to look at behavior from a Java program that simply sleeps for about 30 seconds:

public class Sleeper
{
    public static void main(String args[])
    {
        try {
            Thread.currentThread().sleep(30000);
        }
        catch (InterruptedException ie) {
            ie.printStackTrace();
        }
    }
}
"Sleeping code" isn't much of either, as I am sure you know. There's a measurable start cost so that it can request the time to sleep. The system has to check the current time against the request periodically, and shut things down when it's done. This code in fact relies heavily on the system to do all the work, and it expects you to do it without it caring how or even saying thank you. Did I mention that my son has moved in with me?

Using the syscall provider, let's count how many system calls this program makes. Our first DTrace program will be somewhat like truss in one respect, in that we'll attach to the process we're going to observe. Yes, I know I just maligned this approach a second ago, but I will leave it to a later post to detail how much less intrusive DTrace is by doing it. We're invoking our Java program with DTrace so we can refer to it by the variable name $target in our program. This technique saves us some amount of clumsiness we otherwise have to shoulder.

mfe@inkling$ dtrace -P 'syscall /pid == $target/ {@[probefunc] = count(); }' -c "java Sleeper"
dtrace: description 'syscall ' matched 470 probes
dtrace: pid 1167 has exited

  rexit                                         1
  exece                                         2
  getcwd                                        2
  getpagesizes                                  2
  gtime                                         2
  lseek                                         2
  lwp_mutex_timedlock                           2
  lwp_wait                                      2
  mkdir                                         2
  mmap64                                        2
  pathconf                                      2
  setrlimit                                     2
  sigpending                                    2
  uname                                         2
  unlink                                        2
  lwp_exit                                      3
  lxstat                                        4
  sysi86                                        4
  systeminfo                                    4
  yield                                         4
  getrlimit                                     6
  kill                                          6
  access                                        8
  doorfs                                        8
  fsat                                          8
  getdents64                                    8
  lwp_park                                      8
  pset                                          8
  getpid                                       10
  priocntlsys                                  10
  lwp_cond_signal                              12
  getuid                                       14
  ioctl                                        16
  open64                                       16
  xstat                                        18
  lwp_continue                                 22
  uucopy                                       22
  fcntl                                        24
  lwp_self                                     24
  schedctl                                     24
  fstat64                                      30
  lwp_create                                   33
  setcontext                                   34
  mprotect                                     38
  sysconfig                                    38
  mmapobj                                      46
  memcntl                                      48
  sigaction                                    54
  lwp_sigmask                                  60
  open                                         70
  close                                        76
  resolvepath                                  76
  lwp_cond_wait                                86
  mmap                                        102
  brk                                         176
  stat64                                      244
  pollsys                                    1004
  llseek                                     1234
  read                                       1238
There's quite a bit going on here before we even discuss the output, so let's break it down. Using dtrace -P syscall, we've asked the system to enable all 470 syscall probes. The /pid == $target/ line is a predicate. When one of these probes does fire, the predicate evaluates the conditions given at that point. In this case, we ask if the process ID responsible for the current system call matches our target program. If so, we want to continue; otherwise we're not interested.

The {@[probefunc] = count(); } statement block lists the actions we want to process. The @ specifies a data type called an aggregation. Like an associative array, an aggregation accepts one or more arbitrary expressions to generate an index of keys. The probefunc expression resolves to the name of the function just called. The aggregation then ties its keys to some manner of summarizing function, in this case a unit counter. So here we ask to keep a running count of whatever system call functions we enter or return from. The -c "java Sleeper" option invokes the given program and assigns the process ID to the $target variable, which the predicate test uses. Once the Java program terminates, the DTrace program has no more reason to hang around. Upon its own termination, it sorts and prints the aggregation's contents. Et voila.

The single and double quotes in the command-line invocation merely prevent the shell program from parsing our stuff before the dtrace program gets to it. You can see from this simple program that the need to migrate to a file-based program for more sophisticated work comes up pretty quick.

Now a couple of tuning points. It's probably more meaningful to count system call entries, if only to reduce the drama bigger numbers sometimes invite. To do that, we have to specify entry probes by name, which is easy enough. We use the -n option and supply a fully-specified probe description, also known as a tuple:

mfe@inkling$ dtrace -n 'syscall:::entry /pid == $target/ {@[probefunc] = count(); }' -c "java Sleeper"
dtrace: description 'syscall:::entry ' matched 235 probes
dtrace: pid 1172 has exited

  ...
  pollsys                                     502
  llseek                                      617
  read                                        619
No big changes here, just an acknowledgment that we can specify our tracing all the way down to probe names, even a single probe, when we want to. Now, a typical question might be, what is it we need to read 619 times in 30 seconds? For me, it's a concern that doubles with the nearly-as-many llseek calls, which suggest we're scanning big files, e.g., jars, to get what we need. How can we look at that?

We now know we can call out specific probes in tuple form, such as syscall::read:entry. What we haven't yet discovered is the wealth of information that's available when we get there. And it's a ton of stuff. For starters, we can ask which files are being read, and we can ask how much reading is going on. Check this out:

mfe@inkling$ dtrace -n 'syscall::read:entry /pid == $target / {@[arg0] = sum(arg2); }' -c "java Sleeper"          
dtrace: description 'syscall::read:entry ' matched 1 probe
dtrace: pid 1177 has exited

                3             4096
                4            16384
                6            19447
                5          1008249
Let's start with the good news. We've changed our aggregation to index on arg0, the first parameter passed to the read system call: this parameter happens to be the file to read from. We've also changed our aggregating function to sum() so that we can total up arg2, or the size of each read buffer. The bad news is that we've keyed on file descriptors, not actual files, and a descriptor may not refer to one file for the life of the program. Also, our aggregated values are useless if, for example, descriptor 5 refers to multiple files.

Incidentally, if you want to understand arg0 and arg2 for yourself, section 2 of the Solaris man pages documents most system calls. In a happier, more rose-colored time and place, we'd say all system calls are documented, but DTrace has removed the blush from that rose. Some reading comfort with C programming semantics goes a long way here. And if you're feeling so far out of touch that the anxiety wears on you, there is also an iPhone app for OpenSolaris man pages (search the app store for Solaris). Two bucks, no reviews yet. I'm starting to think I'm the first person to download it who isn't family or a close friend to the developer.

But this is where the power of DTrace really starts to kick in. Included among DTrace's many expressions is an fds[] array, which maps a file descriptor to a fileinfo_t data type. This type exposes legible members of file metadata so we can do this:

mfe@inkling$ dtrace -n 'syscall::read:entry /pid == $target / {@[fds[arg0].fi_pathname] = sum(arg2); }' -c "java Sleeper"
dtrace: description 'syscall::read:entry ' matched 1 probe
dtrace: pid 1184 has exited

  /export/home/mfernest/d/sampleApps/Sleeper.class                510
  /usr/share/lib/zoneinfo/America/Los_Angeles                    1017
  /etc/default/init                                              1536
  /usr/jdk/instances/jdk1.7.0/jre/lib/i386/jvm.cfg               4096
  /usr/jdk/instances/jdk1.7.0/jre/lib/meta-index                 4096
  /proc/1184/lstatus                                            16384
  /usr/jdk/instances/jdk1.7.0/jre/lib/ext/meta-index            16384
  /usr/jdk/instances/jdk1.7.0/jre/lib/rt.jar                  1004153
We've now put the file descriptor held by arg0 into an fds[] array index, which translates that value into a type from which we can select useful data, in this case the value of fi_pathname. We're also seeing every other file that's been read which, judging by the buffer sizes alone, don't seem interesting. We already know how many reads total are involved, so no need to worry that we're reading a byte at a time or something equally molar-grinding.

So to sleep for 30 seconds, we need to cover a storage megabyte's worth of rt.jar, among other things. Looking back to the previous list of all system calls invoked, there are even more interesting things to consider. "Interest" varies, of course, depending on your knowledge, experience and relative skepticism/cynicism of all things computed. But if you thought sleeping was free or merely a quiet activity, think again. And stay tuned! More in the next post.