At a train-the-trainer event for Sun's performance management course, it was suggested that output to the vmstat command, specifically the mf field, was widely misdocumented and misunderstood as "just" minor faults.
A minor fault occurs when the operating system must cannot complete a process' instruction until it fetches some datum from physical memory. This logically occurs before a major fault, fetching the datum from disk, which we'd need to do if memory didn't have our bits. You may hear now and then about micro faults, too. These occur when a fetch attempt misses an on-CPU cache. Major faults are the least-desirable thing, but a necessary consequence of chaining caches from really fast but really small storage (CPU registers) to really slow but really big storage (local/remote disk). On average, one hopes to wash out those very long round trips with many, many shorter ones.
So the rationale given was that some number of minor faults are "really" major faults that haven't happened yet. Or that's what I thought I heard, which could be wrong, teetering on brink of an impending heresy as I was. This news was problematic for me. It sounded as if the speaker meant to say minor faults are a superset that includes major faults. I didn't like the inference, but I also couldn't articulate why. First I went to vmstat's man page and read this under the mf field heading:
mf minor faults - but see the -S option for how
this field is modified.
Well, I guess that's one reason why people take a course, eh? Now, the speaker in this case is an internals guy, so if he says "misdocumented just about everywhere" I can assume he's lumping in the man page author with everyone else. More generally, I take it to mean just about everyone interprets the source code improperly. So I took a look for myself:
298 adjprintf(" %*.0f", 3, (kstat_delta(oldvm, newvm, "hat_fault") +
299 kstat_delta(oldvm, newvm, "as_fault")) / etime);
These lines reside in the function dovmstats(). We see two values, hat_fault and as_fault, added together. The difference between them is that faults originating in a process's address space are managed in a platform-independent manner. Hardware address translation faults (HAT) on the other hand are managed according to the hardware at hand. Not SPARC or x86, mind you, but the memory management unit (MMU) and other implementation details that determine how virtual address space is hashed and mapped to physical memory. So both are treated as minor faults, presumably, because they live in the same time cost range, not because they are the same thing.
Then I pulled down my copies of Solaris Internals. The discussion in both editions is the same, so I cite page numbers here from the second edition. It seems odd now, in light of this question of hierarchy I'm researching, but the text defines major faults first, minor faults second. What further sows the seeds of my discomfort appears on pp. 473-4:
A major page fault occurs when an attempt to access a virtual memory location that is mapped by a segment does not have a physical page of memory mapped to it and the page does not exist in physical memory. ... A minor page fault occurs when an attempt is made to access a virtual memory location that resides within a segment and the page is in physical memory, but no current MMU translation is established to the physical page from the address space that caused the fault.
The first passage tells me, after some reflection, that when a process maps something in which later gets dumped from memory -- i.e., evicted from its cache -- and it's needed again, boom, major fault. The second passage tells me, after quite a bit more reflection and de-booleanizing, that address space faults and HAT faults are logically related events. Or I'm not getting the point, a distinct possibility. My inability to clarify it is compounded by two phenomena: one, the existence of an as_fault() function in the code, but no corresponding hat_fault() function; but two, the existence of both as_fault and hat_fault kstat counters.
The response I got back from the original speaker came with an opinion and a DTrace script. The opinion is that major and minor faults really should be subsets of page faults. A page fault represents access to virtual memory that is not yet or not currently associated with a physical memory location. So, these occur when a mapping has expired and needs recovering, the reason we're reviewing vmstat output in the first place, but also when a process calls a heap space location for the first time. We haven't narrowed things down here, but perhaps I am seeing now the bias that drove the statement that got me to hoist flags. Then the script:
as_fault:entry
{
self-> as_flag =1;
}
page_lookup:return
/self->as_flag && !arg1/
{
@major["major"] = count();
}
page_lookup:return
/self->as_flag && arg1/
{
@minor["minor"] = count();
}
as_fault:return
{
self-> as_flag =0;
}
tick-5sec
{
normalize(@minor, 5);
normalize(@major, 5);
printa("%s\t%@d\n", @minor);
printa("%s\t%@d\n", @major);
clear(@minor);
clear(@major);
}
This script uses as_fault() calls as a predicate for narrowing page_lookup() calls th those of interest. It assumes, that is, that either a major or minor fault is a result of a page_lookup(). The following clauses, which seem to say that major and minor faults can be sorted by the type of return from page_lookup itself, suggests we do know, at an early stage, which is which.
But I'm not ready to disagree with the speaker as I thought I heard him. It might well be that the way the kstat counters do their work happen to muddle this point along the way. Much as I'd like to know, tonight's flight to Boston is looming large and I'm not packed yet. Shazbat.
The moral to this story so far: technical discussions that have more to do with implementation detail than a generalizing principle or concept are often trouble. In subject areas where relatively few people need to tread, such as virtual memory operations, discussion can be problematic. What passes for a reasonable summary to the uninitiated might begin with a code dicusssion, or paraphrasing the code as briefly but fully as one can, or by using documentation of the intentions in a comprehensive framework.
And there are different forms of confusion that can ensue from each of these approaches. Giving the code-averse admin a code sample, as if it is invariably self-evident stuff, can also say "go away for a minute, I'll come back to your sort when it's important." Giving the overqualified student a paraphrase, however mildly inaccurate or overly smoothing in their estimation, distracting from the topic focus to verifying the understanding of the speaker. Then there's your Magic Garden Explained or Solaris Internals or the like, volumes no one is going to traverse quickly, much less absorb, cross-reference and verify for absence of contradiction.
What concerns me as a trainer -- and in the best of times, an educator -- is getting down audience expectations well enough to avoid the autonomic head-butt a surprised nerd will give you (present company included). Speaking a dialect that reassures all listeners they're in the right room, and that the person up front can ably facilitate the proceedings, is the first challenge I'm expected to overcome. What's surprising is that the technical depth of the subject isn't a key factor, but whether the listener feels the speaker regards the material in the proper light, and I do mean on a point by point, dotted i by crossed t basis.
The value of presenting general concepts (lecture) but tested implementation (labs) in these cases has several benefits. None of them reveal exactly how to articulate exactly how something works. But that's also the struggle: to put such things in a way that the principle at hand and the action that can be observed feels consistent, which lets the student walk away feeling they have both got the idea right and can apply it correctly.
Two, the blog title: DTrace Overhead Using SPARC. It isn't quite right. The idea of enabling probes and their impact on the system comes up at the end of the article, mostly in principle. The article nonetheless is still more a pointer to a neat-o observation a superuser can make on DTrace's probe insertion effect. One can't really determine probe cost without generating demand for it. But it has potential to become a hot topic, owing to a research paper with some dubious overreaching (slide deck here if you care for high points only). The paper and presentation discuss NTrace, a "dynamic tracing tool for the Windows kernel, drivers, system libraries, and applications that supports function boundary tracing." The bottom-line claim is that NTrace on Windows/IA-32 is faster than DTrace on Solaris/x86. And with a pioneering application of what I call the faith-based transitive property, the authors claim NTrace on Windows/IA-32 probably matches or creates less overhead than DTrace on Solaris/SPARC.
Maybe it's frustration with the current political climate in America that forces me to pick my battles lately, and so I have not read this paper very carefully. There's a certain amount of baiting and trademark graduate student petulance one expects in such a paper, but I don't have the patience to spare for it. Fortunately, Bryan Cantrill has weighed in with a summation for the Solaris team. And he hasn't yet pointed out any usage errors of mine, so I got that going for me, which is nice.
But it's the third point that pains me most about that blog entry: I completely neglected to point out a key difference in how DTrace works on SPARC and x86. I stared right at the assembler and, well, it stared right back. This week, also, I happened to have both Solaris/SPARC and Solaris/x86 on hand at once, and finally the difference struck me as something worth noting. So let's take another look.
To make things a little more interesting, I'm instrumenting the entry and return points for the kernel function ufs_write() as follows:
dtrace -n 'fbt::ufs_write:entry {}' -n 'fbt::ufs_write:return {}'
The function's assembly code before and after this enabling looks like this in SPARC:
ksh:muc19-host05# mdb -k Loading modules: [ ... ] > ufs_write::dis -n 3 ufs_write: save %sp, -0x120, %sp ufs_write+4: stx %i4, [%sp + 0x8b7] ufs_write+8: sethi %hi(0xfffdfc00), %l0 ufs_write+0xc: sethi %hi(0x20000), %l5 ... ufs_write+0x69c: ret ufs_write+0x6a0: restore > ufs_write::dis ufs_write: ba,a +0x23c5f4 <dt=0x4c> ufs_write+4: stx %i4, [%sp + 0x8b7] ufs_write+8: sethi %hi(0xfffdfc00), %l0 ufs_write+0xc: sethi %hi(0x20000), %l5 ... ufs_write+0x69c: ba,a +0x23bf84 <dt=0x4c> ufs_write+0x6a0: restoreOn SPARC, DTrace patches the
save and ret instructions with a ba,a or branch always instruction. And you can follow that branch if you like. In mdb, just add the offset address given to the symbol/address you're at, then read out the instructions with dis(assemble) dcmd ("dee command").
> ufs_write+0x23c5f4::dis -n 11 ... 0x14d3510: save %sp, -0x120, %sp 0x14d3514: sethi %hi(0x4c00), %o0 0x14d3518: or %o0, 0x17, %o0 0x14d351c: mov %i0, %o1 0x14d3520: mov %i1, %o2 0x14d3524: mov %i2, %o3 0x14d3528: mov %i3, %o4 0x14d352c: mov %i4, %o5 0x14d3530: sethi %hi(0x1296c00), %g1 0x14d3534: call +0x79d49dac <dtrace_probe> 0x14d3538: or %g1, 0x318, %o7 0x14d353c: sethi %hi(0x4c00), %o0 0x14d3540: or %o0, 0x18, %o0 0x14d3544: mov 0x69c, %o1 0x14d3548: call +0x79d49d98 <dtrace_probe> 0x14d354c: mov %i0, %o2 0x14d3550: ret 0x14d3554: restoreJudging by output alone, a plain
::dis dcmd will print out the ten instructions before and after the target address if there isn't a section boundary to anchor to. I suppose I could verify that somewhere, but documentation on dcmds isn't deep (::help dis doesn't). So where's Adam Leventhal now, hm? Anyway, I cheated a little. I used -n 11, but only after counting the actual instructions to the ret instruction by hand, then deleting the prior 0x11 instructions that were also displayed. And that "11" is in hex, mind you. If you want to phrase things decimally, use "0t17" instead. Also, bear in mind if the ufs_write() happens to change in a future release, this count may no longer get you exactly to the end.
Here's what we get by adding the branch offset value displayed at the address of the return instruction:
> ufs_write+0x69c+0x23bf84::dis -n 6 ... 0x14d353c: sethi %hi(0x4c00), %o0 0x14d3540: or %o0, 0x18, %o0 0x14d3544: mov 0x69c, %o1 0x14d3548: call +0x79d49d98 <dtrace_probe> 0x14d354c: mov %i0, %o2 0x14d3550: ret 0x14d3554: restoreThere's not much to see here, really, unless you like disassembled code. But the call to
dtrace_probe() tells us we're in a right place. So upon entry to, and before returning from, the ufs_write() function, the patched instructions jump us to kernel-resident code where DTrace does some work, then restores control to the function. That's a fast way of doing things, by the way. Praise the goto!
On x86, it works differently. Pretty exciting punch-line, huh? Here's the short form for the probe-off, probe-on action in an x86 session:
mfernest@inkling:~# mdb -k Loading modules: [...] > ufs_write::dis -n 3 ufs_write: pushq %rbp ufs_write+1: movq %rsp,%rbp ufs_write+4: subq $0x28,%rsp ufs_write+8: movq %rdi,-0x8(%rbp) > ufs_write::dis -n 3 ufs_write: int $0x3 ufs_write+1: movq %rsp,%rbp ufs_write+4: subq $0x28,%rsp ufs_write+8: movq %rdi,-0x8(%rbp)We're not branching here, we're trapping with an interrupt. Specifically, we're using an interrupt vector that is defined for use as a breakpoint. This approach is, to put it short and sweet, not so fast as branching. From here onward, the whys and wherefores get well past my comfort zone. I did find a slide deck by Frank Hofmann at Sun, but it's not skim-lightly kind of read for me. It could really, really use an audio supplement.
Bad news, huh? Well, maybe, if before now you've felt some real probe effect pain on Solaris/x86 and traced the cause back to the DTrace implementation. The authors of the paper mentioned above did that, using a test case that made 500 million calls to the system call getpid(), with probes enabled and disabled. Sure enough, trap-based handling is more expensive than branch-based handling. More to the point, I think, is that the paper makes certain shock-and-awe noises about something we could have guessed, and I'm sure quantifying the effect is not an actual discovery of any sort. You bang on an instrumentation point, half-a-billion times nonstop, you're going to get some probe effect, and the difference between adding instructions to the stream and halting the stream itself is going to take on its own peculiar dimensions.
The test case also appears to track those half-a-billion probe hits one at a time. I infer this from two points: first, they mention using the dtrace consumer's -q option, ostensibly to remove terminal I/O as a bottleneck. Who would do this with DTrace? We do have aggregations, along with buffer policies and tunings if we need them, to manage data at magnitudes like this one. Second, the authors note in their test case for NTrace/Windows that they wrote their trace data to file on at least one occasion -- and got 7.3 GB to wade through. That amounts to, I surely hope, a record per probe hit. Who cares if NTrace produces less run-time overhead? You're going to give that time back and a whole lot with post-processing. As Cantrill points out, certainly the more prominent performance issue is getting your data as close to a usable form upon dump-ation as you can.
The real point the paper tries to make, in my view, is that NTrace can do on Windows/IA-32 what Solaris cannot do on x86, because the magic -- aka Structured Exception Handling (SEH) -- is in Windows. And only Windows! Suckahs!! But there's a lot of asserting what is important and relevant to make the point, some speculating on improbable comparisons (which boils down to Windows the OS vs. SPARC the processor, hello?) and some heavy perspective-framing on the results to get there. But if you're up for a paper on the current state of German academic elbow-throwing vis-a-vis operating system wars, please to enjoy.
We're working our way through some DTrace providers to better understand what a simple Java application does when it's not doing much. On a Unix platform, a good place to start is with the system calls the application makes. The truss utility and its variants already do this work well. As an example, take a look at this truss-based analysis of the same Java program we're using. So why add DTrace to the mix? In short, greater flexibility, more power (as we learn more about it), and less impact on the system when using it. So we scratched the surface with DTrace's syscall provider in the first entry of this series, then used the proc provider to capture process-level events in the second. Now we'll look at the pid provider and see what it can tell us about our Java app.
The problem with diagnosing Java programs is that they use the JVM, a process-based interpreter, to manage platform details. They therefore sit at a remove from process observability, making many of your Solaris admins' most familiar tools less useful. And while we do have tools that expose Java code to Java practitioners in a Java-meaningful way, the gap between Java-friendly views and admin-friendly views isn't going to make many friends. To see the whole picture, we have to pick our way from system calls to process events to the pid provider. We will go on to a Java program-specific view via the hotspot provider next, but first we acknowledge the larger, process-centric world in which the JVM participates.
The pid provider, alas, can be an unwieldy little helper. Unlike the providers we've seen so far, it exists on a per-process basis. You won't see one until someone or some other process invokes a DTrace program against a named target. When you do see one, it will have a name like pid1244, where 1244 is the target's process ID. And how many probes are available can vary from a very few to a few hundred thousand.
You think I kid? Watch. We'll create our first pid provider using the sample Java program, but with a narrow definition: only the entry and return probes, only the functions defined in the main java code segment. We'll declare the probes separately as well. There isn't a smooth way to express both in one definition, for a reason we'll see shortly:
mfe@inkling$ dtrace -n 'pid$target:java::entry' -n 'pid$target:java::return' -c "java Sleeper" dtrace: description 'pid$target:java::entry' matched 3 probes dtrace: description 'pid$target:java::return' matched 2 probes CPU ID FUNCTION:NAME 0 61950 _start:entry 0 61951 __fsr:entry 0 61953 __fsr:return 1 61945 _start:entry 1 61946 __fsr:entry 1 61948 __fsr:return[ ~30-second pause ]
1 61947 main:entry dtrace: pid 814 has exitedThe output reports which CPU processed each probe firing, the probe's ID, and the probe name. You can suppress this output with the
-q option; using an aggregation, as we've been doing, precludes it. It's beyond our scope to explain why five entry probes fired but only two return probes did, not to mention the mismatch in count between the two names -- but you expected symmetry, right? Now look what happens if we enable all probes:
mfe@inkling$ dtrace -n 'pid$target::: {}' -c "java Sleeper"| wc -l
dtrace: description 'pid$target::: ' matched 247036 probes
[ ~30-second pause ]dtrace: pid 859 has exited 400096Almost a quarter-million probes matched! Some are for each function boundary, yes, but there's also one for each assembly-level instruction, the smallest unit of compiled code, in the program itself. That's right: the pid provider can instrument every instruction, which can impose quite a load on a resource-constrained system. Here's an excerpt from the same output without the counting utility
wc:
mfe@inkling$ dtrace -n 'pid$target::: {}' -c "java Sleeper"
dtrace: description 'pid$target::: ' matched 247036 probes
...
CPU ID FUNCTION:NAME
0 310609 aplist_test:0
0 310608 aplist_test:entry
0 310610 aplist_test:1
0 310611 aplist_test:3
...
0 310649 aplist_test:67
0 310650 aplist_test:68
0 310651 aplist_test:69
0 310652 aplist_test:6a
0 310607 aplist_test:return
Now you see why we can't match just any probe name without big consequences. Most probe names are numbers that indicate an offset in bytes from the function's entry point. The probes aplist_test:0 and aplist_test:entry are therefore synonymous. Note also how the probe ID for aplist_test:return comes first in ID order. Later on we may return to this fact to explain some things about code layout at this level. For now, we'll remind ourselves we can still list individual probes as long as the process remains active:
mfe@inkling$ dtrace -ln 'pid$target::aplist_test:6a {}' -c "java Sleeper"
ID PROVIDER MODULE FUNCTION NAME
62478 pid895 LM1`ld.so.1 aplist_test 6a
The backtick (`) shows that LM1 is a module namespace of sorts within the program code. The DTrace guide says these can range from LM1 to LM9. We'll accept that on faith for the sake of keeping focus. But see how the probe ID is back in five-digit territory? The previous probes disappeared when their process exited and their ID values recouped. With that in mind, beware of reading too much into the "unique value" of these indices. They will not only vary from one release of Solaris to the next, they will vary from one system to the next, even from one boot session to another on the same system. They aren't useful for much, although you can invoke a probe by its ID:
mfe@inkling$ dtrace -ln 4908 ID PROVIDER MODULE FUNCTION NAME 4908 syscall read entryBut we do now know the module name that hosts our probe. The pid provider uses the module component of the tuple to distinguish among the code segments that make up the process' map. These modules refer to all shared-object libraries, such as ld.so.1 or libc.so.1, as well as the main executable segment (in our case, java). But we can also generalize the names for convenience. All main executable segments, for example, can be declared by
a.out. We can also declare a shared-object libraries without their dot-extension qualifiers, so incanting LM1`ld works as well as LM1`ld.so.1:
mfe@inkling$ dtrace -ln 'pid$target:ld:aplist_test:6a {}' -c "java Sleeper"
ID PROVIDER MODULE FUNCTION NAME
62478 pid902 LM1`ld.so.1 aplist_test 6a
With the pid provider, we can drill right down to the nibs of any process. This capability primarily plays to the strengths of debuggers, and for anyone who knows their source code intimately well. Without those skills or that knowledge, however, the pid provider can make a huge amount of hay around any interesting needle. To use it wisely, an idea of general process structure would certainly help.
Once again, good news! Getting a process map on Solaris is easy: use the pmap utility. Get the PID of a target process with the pgrep utility. First we run "java Sleeper" in a separate window, then we incant:
mfe@inkling$ pmap `pgrep java` 977: java Sleeper 08045000 12K rwx-- [ stack ] 08050000 4K r-x-- /usr/jdk/instances/jdk1.7.0/bin/java 08060000 4K rwx-- /usr/jdk/instances/jdk1.7.0/bin/java 08061000 1008K rwx-- [ heap ] B6000000 2416K r-x-- /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3 B626B000 4K rwx-- /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3 B6303000 4K rwx-R [ stack tid=12 ] B6354000 4K rwx-R [ stack tid=11 ] B63D5000 4K rwx-R [ stack tid=10 ] B6456000 4K rwx-R [ stack tid=5 ] B6458000 1692K r--s- /usr/jdk/instances/jdk1.7.0/jre/lib/rt.jar B6600000 32K rw--- [ anon ] B6620000 116K rw--- [ anon ] B67E8000 16K rw--- [ anon ] B6821000 4K rw--- [ anon ] B68DB000 4K rwx-R [ stack tid=9 ] B692C000 4K rwx-R [ stack tid=8 ] B697A000 16K rwx-R [ stack tid=7 ] B69FE000 4K rwx-R [ stack tid=4 ] B6A00000 16384K rw--- [ anon ] BAA00000 59392K rw--- [ anon ] F3A00000 8192K rw--- [ anon ] FAC10000 56K r-x-- /usr/lib/locale/common/methods_unicode.so.3 FAC2D000 4K rwx-- /usr/lib/locale/common/methods_unicode.so.3 FAC37000 116K rw--- [ anon ] FAE00000 4096K rwx-- [ anon ] FDE20000 4K rwx-- [ anon ] FDE78000 16K rwx-R [ stack tid=6 ] FDEFC000 4K rwx-R [ stack tid=3 ] FDEFE000 32K rw--- [ anon ] FDF1F000 64K rw--- [ anon ] FDFE0000 4K rwx-- [ anon ] FDFF0000 64K r-x-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libzip.so FE000000 8K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libzip.so FE010000 32K rw-s- /tmp/hsperfdata_mfernest/977 FE020000 64K rwx-- [ anon ] FE040000 32K r-x-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/native_threads/libhpi.so FE058000 4K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/native_threads/libhpi.so FE059000 12K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/native_threads/libhpi.so FE060000 4K rw--- [ anon ] FE070000 4K r---- [ anon ] FE080000 4K rwx-- [ anon ] FE090000 164K r-x-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libjava.so FE0CA000 8K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libjava.so FE0D0000 80K r-x-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libverify.so FE0F4000 4K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/libverify.so FE100000 64K rwx-- [ anon ] FE159000 56K rwx-R [ anon ] FE168000 28K rwx-R [ stack tid=2 ] FE170000 64K rw--- [ anon ] FE181000 128K rw--- [ anon ] FE1A2000 4K rwxs- [ anon ] FE1B0000 16K r-x-- /lib/libmp.so.2 FE1C4000 4K rw--- /lib/libmp.so.2 FE1D0000 4K rwx-- [ anon ] FE1E0000 64K r-x-- /lib/libmd.so.1 FE1F0000 4K rw--- /lib/libmd.so.1 FE200000 4K rw--- [ anon ] FE210000 24K r-x-- /lib/libgen.so.1 FE226000 4K rw--- /lib/libgen.so.1 FE230000 4K rwx-- [ anon ] FE240000 28K r-x-- /lib/libuutil.so.1 FE257000 4K rw--- /lib/libuutil.so.1 FE260000 172K r-x-- /lib/libscf.so.1 FE29B000 4K rw--- /lib/libscf.so.1 FE2A0000 624K r-x-- /lib/libnsl.so.1 FE34C000 16K rw--- /lib/libnsl.so.1 FE350000 20K rw--- /lib/libnsl.so.1 FE360000 4K rwx-- [ anon ] FE370000 320K r-x-- /lib/libm.so.2 FE3CF000 8K rwx-- /lib/libm.so.2 FE3E0000 4K rwx-- [ anon ] FE3F0000 4K r-x-- /lib/libdoor.so.1 FE400000 8856K r-x-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/server/libjvm.so FECB6000 356K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/server/libjvm.so FED0F000 68K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/server/libjvm.so FED30000 48K r-x-- /usr/lib/libCrun.so.1 FED4B000 8K rwx-- /usr/lib/libCrun.so.1 FED4D000 20K rwx-- /usr/lib/libCrun.so.1 FED60000 8K r-x-- /lib/libm.so.1 FED71000 4K rwx-- /lib/libm.so.1 FED80000 4K rwx-- [ anon ] FED90000 4K r-x-- /usr/lib/libsched.so.1 FEDA0000 56K r-x-- /lib/libsocket.so.1 FEDBE000 4K rw--- /lib/libsocket.so.1 FEDC0000 4K rwx-- [ anon ] FEDD0000 24K rwx-- [ anon ] FEDE0000 1280K r-x-- /usr/lib/libc/libc_hwcap1.so.1 FEF20000 28K rwx-- /usr/lib/libc/libc_hwcap1.so.1 FEF27000 8K rwx-- /usr/lib/libc/libc_hwcap1.so.1 FEF30000 4K r-x-- /lib/libdl.so.1 FEF40000 4K rwx-- [ anon ] FEF50000 92K r-x-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/jli/libjli.so FEF76000 8K rwx-- /usr/jdk/instances/jdk1.7.0/jre/lib/i386/jli/libjli.so FEF80000 12K r-x-- /lib/libthread.so.1 FEF90000 4K rwx-- [ anon ] FEFA0000 4K rw--- [ anon ] FEFB0000 4K rw--- [ anon ] FEFBE000 180K r-x-- /lib/ld.so.1 FEFFB000 8K rwx-- /lib/ld.so.1 FEFFD000 4K rwx-- /lib/ld.so.1 total 106984KYou scrolled right to the bottom of that output, didn't you? I can't blame you. Java processes seem to map the whole world into their process spaces, even for small tasks. Each entry in this map is one segment to the whole: a virtual memory address, an allocation size, the segment's permissions, and the file that backs the mapping. If this final column shows
[ anon ] or [ heap ] or [ stack ], it signifies a run-time allocation of memory that has no backing file. In which case, it is backed by the system's swap space. Some of these are virtual allocations, so the total memory value, about 107 MB, isn't necessarily completely charged against physical memory.
We can correlate a thing or two in this output to the work we've already done. Notice that the [ stack tid=num ] entries match the number of threads the proc provider reports. You might also notice there is no [ stack tid=1 ] in this pile. Care to guess why?
Relating an application's lifecycle to its structure, for starters, is a useful sanity check. Also, it's a reminder of the complexity that an interpreted program can bring to a system-level diagnostician. The Sleeper code is but an argument passed to the whole of what's going on here, and that remove seems to cause many a console-monitoring admin to throw hands in the air and throw reported problems back to you. But now that you, the Java programmer, can see what they see, we can move on to getting you both to see things from your perspective too. When we pick up with the hotspot provider, we'll aim for exactly that, leading to what I hope is a more tolerable limit to creating a lasting peace among admins and programmers.
In my previous post I made some quick noise on a number of topics. For this entry and probably the next few, I want to dig in on various elements of that discussion, starting with DTrace providers and why tracing Java performance, initially, can seem like a hassle.
The JVM effects portability in Java code by managing the necessary interactions with the operating system. You can, up to a point, write Java without considering the details of the platform underneath it. The good news for developers is that most operating system and hardware concerns can even seem irrelevant when writing a lot of Java code. When and where it's necessary to integrate Java code with system libraries written in C/C++ -- DLLs in Windows, shared-object libraries in Unix -- the Java Native Interface (JNI) API provides a bridge for that gap. It's in fact through the JNI interface that many platform-independent elements of the JVM bind to the local system.
Tracing Java code outside the JVM, therefore, can be problematic. Since the JVM is a process -- a first-class object in the eyes of your system -- your Java code is a second-class thing. The OS cares about supplying resources to running processes and doesn't bother with details, such as how the JVM applies those resources. Much less does the OS care what bytecode is running. But that is the deal you ran with: you, the Java programmer don't want to concern yourself with OS management or JVM configuration, unless of course they've conspired to sully the pristine elegance that is your program code with their petulant truancy and irrational exception-throwing.
So what can we do with DTrace? We first have to determine what we're looking for when we examine a running program. Once we know that, we can select the appropriate DTrace provider. Let's say we'd like to start simple and just learn how our Java/JVM duo behave as a Solaris process. For that we can use proc, a modest provider with a mere fifteen probes to its name. The proc provider cares about four basic event types: process creation/termination, lightweight process creation/termination, executing new programs, and signalling. "Executing new programs" may seem redundant with process creation, but in Unix the action of creating a new process object (i.e., a fork) is separate from the action of giving that object a program file to run (i.e., exec). It's also no guarantee that the provided file will actually work, so we have to look out for that condition as well.
Using the same Java code listed here, we'll enable all the proc provider's probes to observe the process-level events of the program's life:
mfe@inkling$ dtrace -n 'proc::: /pid == $target/ { @[probename] = count();}' -c "java Sleeper"
dtrace: description 'proc::: ' matched 15 probes
dtrace: pid 1216 has exited
exec 1
exec-success 1
exit 1
lwp-create 11
lwp-start 11
lwp-exit 12
fault 39
Well, that's illuminating. We execute once; exec-success is one of two events that must follow exec, the other being exec-failure. We exit once, burn through eleven (or twelve?) system-level lightweight processes, and we generate a fault 39 times. What is that? Let's ask by listing that probe name:
mfe@inkling$ dtrace -ln fault ID PROVIDER MODULE FUNCTION NAME 571 proc unix trap faultOur trusty documentation tells us the
fault probe will fire when a machine fault occurs, and that a fault code and some signal information are available. All this faulting, I would guess, occurs in the early going and is probably associated with starting up the process itself. But why guess when we can demonstrate? By looking at events one at a time and marking the time, we can see for ourselves.
To do that, we'll first drop the aggregation and set up two probe definitions, one for all proc probes and another for fault alone. We'll use the expression walltimestamp in the first block. A nanosecond-level timestamp expression is also available, but it's more than we need, and tedious to read in raw form. In the second block, we'll print out the fault code just for grins and show-how. Although I am not yet detailing specific goings-on here, I trust this code is not too mysterious:
proc:::
/ pid == $target /
{
printf("%s fired at %Y\n", probename, walltimestamp);
}
proc:unix:trap:fault
/ pid == $target /
{
printf("Fault code: %d\n", args[0]);
}
When we invoke the tuple proc:::, we're asking for any proc provider probes, regardless of the module they're in, the function they're in, or their names. Colons delimit the four tuple components available, so any full description takes the form:
provider:module:function:probeWhen we don't specify a tuple components, we permit any possible match. If you want to see a particular probe's full definition, follow the example above (but don't bother with
entry or return). The second probe definition allows no ambiguity, but the definition proc:::fault does the same thing. There are no wildcard matches possible and none likely to come into play in a later release. Just the same, it's too early to prefer brevity. I won't unless the alternative is to list a mess of probes, which is rather like importing each class you use on its own line -- potentially tedious to the point of obfuscating your intentions. In DTrace it's done with a comma-separated list like this:
proc:unix:lwp_rtt_initial:lwp-start,
proc:unix:lwp_rtt_initial:start,
proc:unix:trap:fault
{
printf("probe fired: %s"\n, probename);
}
Ok, where were we? Right. So I committed this D program to a file, java_faults.d. We'll invoke it from the command line using the options -qs, the q to quiet default output, the s to read the following file as a DTrace script. Here we go:
mfe@inkling$ dtrace -qs java_faults.d -c "java Sleeper" lwp-start fired at 2009 Dec 31 11:29:17 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 lwp-start fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 lwp-start fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 exec fired at 2009 Dec 31 11:29:17 exec-success fired at 2009 Dec 31 11:29:17 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 lwp-create fired at 2009 Dec 31 11:29:17 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 lwp-create fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17 fault fired at 2009 Dec 31 11:29:17 Fault code: 3 fault fired at 2009 Dec 31 11:29:17 Fault code: 4 lwp-create fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17 lwp-start fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17 lwp-create fired at 2009 Dec 31 11:29:17[note the delay here]
lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47 lwp-exit fired at 2009 Dec 31 11:29:47Sure enough, the faulting all occurs at the beginning of the program. We haven't divined the fault codes 3 and 4 yet, but this entry is already getting long, and time's a-wastin' on the ol' bargain shopping window.
For those of you ready to dive deeper on tracing Solaris faults, this article on debugging with Sun Studio dbx has a fault.d script that is worth a look.
In the next blog entry we'll bring in the pid provider, a potentially overwhelming creature that deserves its own introduction and space. Have a Happy New Year! Get smooched tonight! Or IM your bestest buddy and tell him why no self-respecting Continuum would ever let the Borg have their way with the universe, or do whatever it is someone like you would do after reading something this useless in the real world. See you in 2010!
It's not a small problem that many helpful tutorials in this vein start out pretty damn far, by my lights, with little or no idea that anyone who doesn't already get DTrace might want a boost. I don't want to drill into concepts here too deeply -- that's what my current proposed title (The Book of DTrace) is for. And if I ever write enough of it, my intended just might have cause to talk about it without fear of social ridicule. But if you're ready to try DTrace, it means you've got Solaris, Mac OS X, or hell, maybe even BSD, in front of you. And if not, but that Windows thing you have comes with a decent complement of memory, there's hope by way of VirtualBox. I'll discuss using that in a subsequent post.
DTrace, like Java, is many things at once. It has a programming language (of sorts). It is also a framework for plugging in enablers, albeit on a much more limited number of platforms. And it ships on Solaris with a series of enablers, known as providers, that make it possible to watch and record both application and system activity. Also like Java, D programs begin in a scrutable text form. Upon execution, they are converted to an intermediate format, known as DTrace Intermediate Format (DIF). Unlike Java, this bytecode, if you will, shuttles to a virtual machine that is implemented as a device driver, not a process. If the instructions are deemed safe and complete in that stage, they are passed to the appropriate provider(s), which also take the form of kernel-loaded modules. From that vantage point, it is possible to instrument almost every activity that occurs in the system.
Many other diagnostic and debugging tools do similar work. Those that examine user processes are processes themselves, such as truss. Such tools harness or attach to their process targets and collect data in a way that imposes a considerable tax. If you've had a manager that stops what you're doing to report on your last action, and only then allows you to perform the next, you have the idea. This overhead is called observation effect by the most tactful and generous among us. I prefer intrusion effect, for the same reason that plants I like and want to grow are called flowers, while those that deserve a rude yank from the earth or at least a fatal squirrel chomp are called weeds. If you've ever done work with post-mortem analysis tools like gdb or mdb, you're working with a history of something gone wrong at the very end, so you try to backtrack to the cause. Or perhaps you've learned something about debugging live kernels, and so you know that such intrusions can change system behavior such that the very behavior you want to observe might alter significantly.
DTrace's framework makes it possible to probe application and system alike, often without imposing noticeable cost. It costs nothing when not in use and takes little effort, incantation powers aside, to invoke. The DIF scheme helps squelch unsafe operations before they see the light of a CPU, which is why you may have heard claims that DTrace is completely safe, but also potentially destructive. DTrace is an unprecedented and still-unmatched tool for how it does what it does, but that doesn't mean its engineers and advocates worry over the appearance of total thematic consistency to the literal-minded or the impatient.
We'll skip the usual preliminaries of a careful, rules-based introduction to get right to it. Each DTrace provider offers a set of probes to watch and record process or system data within the provider's given scope. (Mac people who know this shmear refer to it as Instruments, because those people just have to have a different word for everything.) The vast majority of probes mark entry into or return from a function. Others mark events that aren't associated with a single function or a function boundary, such as memory paging, process scheduling, and I/O operations. There is also a time-based provider, called profile, that lets you stipulate a regular interval for taking a look around.
For our first look, we'll concern ourselves with relating Java program activity to a Solaris system. To do that, we're going to use the syscall provider, which supplies entry and return probes at the boundaries of every system call. We can list the provider's probes to get several screens of stuff. Or we can for now pipe the result to a word count program, ask for a line count, and get out:
mfe@inkling$ dtrace -lP syscall | wc -l 471Subtract one line for the column headers, and you've got 470 probes, 235 of them for entry and 235 for return. You might be wondering how this works if a function has multiple return points. Good question! The short answer is it works very well, thanks for asking. A longer answer is that the syscall provider supplies one probe name for all return points, as does any other provider that supplies them. Information on which return has been used, specifically, is available if you need it, but it's not treated as a topic of primary importance.
If you're like most students and consulting clients I've encountered, the first thing you want to know is how DTrace is going to solve your current problem, yes? Well, we have two prior problems to that. One, you're not ready; that's why I showed up. Two, I'm not a trained circus bear; I tend to insist that you learn, then apply. This approach helps keeps my business demand down, see. But what I will show you today is how to look at behavior from a Java program that simply sleeps for about 30 seconds:
public class Sleeper
{
public static void main(String args[])
{
try {
Thread.currentThread().sleep(30000);
}
catch (InterruptedException ie) {
ie.printStackTrace();
}
}
}
"Sleeping code" isn't much of either, as I am sure you know. There's a measurable start cost so that it can request the time to sleep. The system has to check the current time against the request periodically, and shut things down when it's done. This code in fact relies heavily on the system to do all the work, and it expects you to do it without it caring how or even saying thank you. Did I mention that my son has moved in with me?
Using the syscall provider, let's count how many system calls this program makes. Our first DTrace program will be somewhat like truss in one respect, in that we'll attach to the process we're going to observe. Yes, I know I just maligned this approach a second ago, but I will leave it to a later post to detail how much less intrusive DTrace is by doing it. We're invoking our Java program with DTrace so we can refer to it by the variable name $target in our program. This technique saves us some amount of clumsiness we otherwise have to shoulder.
mfe@inkling$ dtrace -P 'syscall /pid == $target/ {@[probefunc] = count(); }' -c "java Sleeper"
dtrace: description 'syscall ' matched 470 probes
dtrace: pid 1167 has exited
rexit 1
exece 2
getcwd 2
getpagesizes 2
gtime 2
lseek 2
lwp_mutex_timedlock 2
lwp_wait 2
mkdir 2
mmap64 2
pathconf 2
setrlimit 2
sigpending 2
uname 2
unlink 2
lwp_exit 3
lxstat 4
sysi86 4
systeminfo 4
yield 4
getrlimit 6
kill 6
access 8
doorfs 8
fsat 8
getdents64 8
lwp_park 8
pset 8
getpid 10
priocntlsys 10
lwp_cond_signal 12
getuid 14
ioctl 16
open64 16
xstat 18
lwp_continue 22
uucopy 22
fcntl 24
lwp_self 24
schedctl 24
fstat64 30
lwp_create 33
setcontext 34
mprotect 38
sysconfig 38
mmapobj 46
memcntl 48
sigaction 54
lwp_sigmask 60
open 70
close 76
resolvepath 76
lwp_cond_wait 86
mmap 102
brk 176
stat64 244
pollsys 1004
llseek 1234
read 1238
There's quite a bit going on here before we even discuss the output, so let's break it down. Using dtrace -P syscall, we've asked the system to enable all 470 syscall probes. The /pid == $target/ line is a predicate. When one of these probes does fire, the predicate evaluates the conditions given at that point. In this case, we ask if the process ID responsible for the current system call matches our target program. If so, we want to continue; otherwise we're not interested.
The {@[probefunc] = count(); } statement block lists the actions we want to process. The @ specifies a data type called an aggregation. Like an associative array, an aggregation accepts one or more arbitrary expressions to generate an index of keys. The probefunc expression resolves to the name of the function just called. The aggregation then ties its keys to some manner of summarizing function, in this case a unit counter. So here we ask to keep a running count of whatever system call functions we enter or return from. The -c "java Sleeper" option invokes the given program and assigns the process ID to the $target variable, which the predicate test uses. Once the Java program terminates, the DTrace program has no more reason to hang around. Upon its own termination, it sorts and prints the aggregation's contents. Et voila.
The single and double quotes in the command-line invocation merely prevent the shell program from parsing our stuff before the dtrace program gets to it. You can see from this simple program that the need to migrate to a file-based program for more sophisticated work comes up pretty quick.
Now a couple of tuning points. It's probably more meaningful to count system call entries, if only to reduce the drama bigger numbers sometimes invite. To do that, we have to specify entry probes by name, which is easy enough. We use the -n option and supply a fully-specified probe description, also known as a tuple:
mfe@inkling$ dtrace -n 'syscall:::entry /pid == $target/ {@[probefunc] = count(); }' -c "java Sleeper"
dtrace: description 'syscall:::entry ' matched 235 probes
dtrace: pid 1172 has exited
...
pollsys 502
llseek 617
read 619
No big changes here, just an acknowledgment that we can specify our tracing all the way down to probe names, even a single probe, when we want to. Now, a typical question might be, what is it we need to read 619 times in 30 seconds? For me, it's a concern that doubles with the nearly-as-many llseek calls, which suggest we're scanning big files, e.g., jars, to get what we need. How can we look at that?
We now know we can call out specific probes in tuple form, such as syscall::read:entry. What we haven't yet discovered is the wealth of information that's available when we get there. And it's a ton of stuff. For starters, we can ask which files are being read, and we can ask how much reading is going on. Check this out:
mfe@inkling$ dtrace -n 'syscall::read:entry /pid == $target / {@[arg0] = sum(arg2); }' -c "java Sleeper"
dtrace: description 'syscall::read:entry ' matched 1 probe
dtrace: pid 1177 has exited
3 4096
4 16384
6 19447
5 1008249
Let's start with the good news. We've changed our aggregation to index on arg0, the first parameter passed to the read system call: this parameter happens to be the file to read from. We've also changed our aggregating function to sum() so that we can total up arg2, or the size of each read buffer. The bad news is that we've keyed on file descriptors, not actual files, and a descriptor may not refer to one file for the life of the program. Also, our aggregated values are useless if, for example, descriptor 5 refers to multiple files.
Incidentally, if you want to understand arg0 and arg2 for yourself, section 2 of the Solaris man pages documents most system calls. In a happier, more rose-colored time and place, we'd say all system calls are documented, but DTrace has removed the blush from that rose. Some reading comfort with C programming semantics goes a long way here. And if you're feeling so far out of touch that the anxiety wears on you, there is also an iPhone app for OpenSolaris man pages (search the app store for Solaris). Two bucks, no reviews yet. I'm starting to think I'm the first person to download it who isn't family or a close friend to the developer.
But this is where the power of DTrace really starts to kick in. Included among DTrace's many expressions is an fds[] array, which maps a file descriptor to a fileinfo_t data type. This type exposes legible members of file metadata so we can do this:
mfe@inkling$ dtrace -n 'syscall::read:entry /pid == $target / {@[fds[arg0].fi_pathname] = sum(arg2); }' -c "java Sleeper"
dtrace: description 'syscall::read:entry ' matched 1 probe
dtrace: pid 1184 has exited
/export/home/mfernest/d/sampleApps/Sleeper.class 510
/usr/share/lib/zoneinfo/America/Los_Angeles 1017
/etc/default/init 1536
/usr/jdk/instances/jdk1.7.0/jre/lib/i386/jvm.cfg 4096
/usr/jdk/instances/jdk1.7.0/jre/lib/meta-index 4096
/proc/1184/lstatus 16384
/usr/jdk/instances/jdk1.7.0/jre/lib/ext/meta-index 16384
/usr/jdk/instances/jdk1.7.0/jre/lib/rt.jar 1004153
We've now put the file descriptor held by arg0 into an fds[] array index, which translates that value into a type from which we can select useful data, in this case the value of fi_pathname. We're also seeing every other file that's been read which, judging by the buffer sizes alone, don't seem interesting. We already know how many reads total are involved, so no need to worry that we're reading a byte at a time or something equally molar-grinding.
So to sleep for 30 seconds, we need to cover a storage megabyte's worth of rt.jar, among other things. Looking back to the previous list of all system calls invoked, there are even more interesting things to consider. "Interest" varies, of course, depending on your knowledge, experience and relative skepticism/cynicism of all things computed. But if you thought sleeping was free or merely a quiet activity, think again. And stay tuned! More in the next post.

