Yesterday, at dtrace.conf, Jarod Jenson gave a presentation on why he thinks dtrace has not seen greater adoption by system administrators across the spectrum of IT departments around the world. (Jarod starts his presentation at the 40-minute mark.) At the beginning of his talk, Jarod mentioned he didn’t think dtrace’s syntax was a problem, and I largely agree with that. Like any language syntax, dtrace becomes more familiar the more you use it. I believe Jarod hinted at the correct answer a little later in his presentation, but in my opinion he missed the mark a bit.
At my last job, I worked as the manager of a team of Solaris administrators for a large public university in Texas. This was my first exposure to an operating system that was dtrace enabled. At the beginning of my tenure, the shop ran mostly Solaris 9 systems with one or two Solaris 10 boxes hanging around looking cool. By the time I left the university five years later, we had migrated nearly our entire catalog of supported applications into zones running on Solaris 10 with ZFS backing the whole enchilada. During this migration, my cohorts and I rarely practiced the dark arts of dtrace. Only when things went really wonky did we start writing D to figure out why.
Herein lies the mark I think we all missed while discussing Jarod’s points yesterday. First and foremost, yes, dtrace is difficult to learn, but so is system administration, and that hasn’t stopped a lot of smart people from doing it every day. Clearly, being difficult is not an insurmountable barrier. There is a second issue which I believe is the real deal breaker for dtrace adoption: dtrace isn’t truly needed that often. I mean, “hard to learn” combined with “don’t need it that often” is a hard sell to any resource-constrained system administrator. My team at the university and I fell back on dtrace only when all other tools failed to do the job. In our view, the vast majority of system administration problems can be solved with well-worn tools like iostat, prstat, mpstat, snoop, vmstat, mdb, etc.
I believe it was Bryan who mentioned that you become a dtrace convert once it pulls your ass out of a fire once or twice. That was certainly the case for me. I solved some rather wonky problems and got to be the hero a couple of times. Thus, I fell in love with all the power dtrace affords and became a convert. Behind the scenes, however, I had to work damn hard to earn a “pulls asses out of fires with dtrace” achievement. The majority of the time I spent earning the achievement was at home and not during work hours. Things like LDAP clusters needing attention and new Oracle DB instance builds constantly get in the way of learning.
My dtrace abilities are founded on spending a lot of time with my nose stuck in Richard McDougall and Jim Mauro’s Solaris Internals book. Serious dtracing requires serious understanding of the OS you are instrumenting. I don’t think many professional administrators in companies that are just trying to keep the lights on have the time or the inclination to climb the wall of dtrace, since they can solve the vast majority of their system problems with well-worn tools (or hire Jarod). Quite frankly, I don’t believe I’ve ever met another administrator who was interested in learning dtrace and didn’t first have an obsessive devotion to computing that started at a young age. This describes me and probably most of the other folks at dtrace.conf yesterday.
(An entertaining thought entered my head shortly after I wrote this post. If the Oracle guys get their Linux dtrace port working well, we may see a significant uptick in dtrace adoption due to the fact that the well worn Linux stat tools are not as complete as their Solaris counterparts.)
I’d like to hear what other folks in the community think. I’ll open up the comments and make an attempt to keep the spammers down.
I’ve said that for years. It’s not limited to dtrace. How many people “really” know BGP? Or are really sharp at debugging in IOS? Consultants will, since it’s their job day in, day out. But for the folks at many higher-ed institutions and smaller enterprises, they simply wear too many hats during any given day to get a chance to concentrate on exploring (and passing into long-term memory) the richer and more arcane dark arts, er, OS command sets. I made my name on knowing strange, more arcane aspects of IOS and Cat OS waaaay back in the day. They always turned to me to save the day and damn that ego boost was a rush. But now, I’m lucky I remember some of the intermediate “show” commands. In the rush to implement for the sake of project milestones, I’ve lost so much of what I’ve learned. And I simply don’t have the time to properly learn the new OSes. Sure, I can config a Juniper in a PoC lab but I haven’t had the extra time to play in labs to learn how to debug the damn thing when it breaks. And (this speaks to your 2nd point) they just don’t break often enough for OJT to occur. Besides, when catastrophic failure occurs OJT is the last thing that is on your mind.
And so here I am…I’ve really gotten bad at break-fix in recent years as I’ve taken my accumulated protocol, OS and topology knowledge, mixed it with 15 years’ experience and become a design guy. I’ve been beating up Dan for the last few years to break up the Network Team into a Maintenance/Operations branch and a Design branch just for that purpose. The M&O guys will have to be the debug and break-fix heroes for the infrastructure I build. Makes me feel like a tard sometimes, being out-of-the-loop as I am wont to be now. And seeing all the glory that was once mine go to others bruises the pride. But ultimately, I believe most of us mere mortals with limited brain storage, advanced age and lives outside of work have to accept the fact that we can’t learn AND remember it all. So you choose a discipline–in my case, Design over Operations–and own it. Knowing how you are, you’ve chosen House Slytherin for the dark arts of dtrace. That’s my $.02 anyway.
What’s up, Keefers? I can understand the sentiment: you feel less and less connected to your work as you progress from a purely technical role to a more architectural one. Ryan K and I lamented this very topic shortly before my departure.
Granted, the technician roles are good at providing short bursts of ‘I rule!’ moments but being the architect allows you to play the long game and design something truly awesome and then put said design into practice. Nothing is more satisfying than seeing your creations given life.
You may be using a different definition of “solved”.
Let’s say I take my car to two mechanics. One fixes my issue by replacing the entire engine, the other by replacing a single spark plug. Both mechanics “solve” the issue, but the first at considerable time and cost. This roughly describes pre- and post-DTrace systems performance analysis.
The well worn toolset is great at identifying that resources themselves are bottlenecks. This often leads to solving issues by increasing resources (time and $$$) or by experimenting with tunables and configurations (time). With DTrace, the specific cause can often be identified, including inefficient workloads (eliminate unnecessary work), bugs, or specific tunables. We are solving issues differently now – and saving everyone a ton of time and money.
The real issue here is adoption. You may indeed be right that many sysadmins believe that the well-worn toolset is satisfactorily solving issues. So, one challenge may be awareness of what is now possible with DTrace – when it is appropriate to use it, and what for. I use it for more than half the perf issues I solve (although, the variety of issues I work is already biased towards the harder side).
The other issue you mention is time. I know sysadmins are typically busy and have limited time to learn new things (I used to be a University sysadmin as well, and later taught sysadmin). Some of the latest issues I solved with DTrace included tracing internals of the TCP/IP stack for analysis of bugs and tuning effects, which is too deep for a typical sysadmin to practically do (it takes weeks to get up to speed with TCP/IP stack internals). It was for this reason that I created the DTraceToolkit, so that the value of DTrace could be enjoyed in situations where it was impractical to invest the time. The DTraceToolkit should go much further though (like the DTrace book does), to really empower sysadmins to solve issues in the new way.
I don’t think sysadmin adoption is the real issue. Realistically, I’d expect there to be only one person at each company who could develop DTrace scripts (including taking DTraceToolkit/DTraceBook scripts and tweaking them), which various support teams would use. That person may be a senior sysadmin, a developer with OS knowledge, or a performance engineer. The problem seems to be adoption among these people. As Jarod said, there are sites where nobody is using DTrace at all.
So, if you are the top performance expert at a site with DTrace – what’s stopping you from using it? One reason may be this perception that it’s not needed so much. What other reasons are there?
Interesting point: why aren’t we seeing adoption among the senior administrators in these Fortune 1000 companies Jarod mentioned in his presentation? There is the possibility that his sample is biased, since the folks requiring his services are the ones not repairing the issues themselves due to a lack of DTrace knowledge. I repaired several frustrating performance problems using DTrace during my time at the university. Jarod may have gotten a call from them had I not been around.
Of course, my own observations and theories in the post are limited to the administrators I’ve interacted with in person and online. I find it odd when accomplished UNIX administrators, for whom I have a great deal of respect, tell me they simply do not want to learn dtrace. I can honestly say I’m baffled by this sentiment when I run into it.
Come to think of it, I wonder if DTrace has more visibility and uptake than we realize. I mean, Oracle is funding an effort to port DTrace to Linux. If Bryan’s lawnmower theory is true, they wouldn’t be doing that if their customers weren’t asking for it. 🙂
The barriers to DTrace are incredibly high – it should be wrapped in fmli, dtksh, or something.
For example, type a command, offer some options, use the options, see some more options, choose some additional options, see a result. Add a flag, see a sample script. Think: sar, vmstat
There are things I would like developers to use DTrace for, to actually do real work, and not for debugging. For example: watch a directory for changed files through a DTrace subscriber, then pipe into awk to rcp those files to destination machines (in near real time).
In short, DTrace is designed for Developers, without suggesting developer usage, and pitched at System Administrators, without a simple interface to make it easily usable.
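The directory-watching idea in this comment could be sketched roughly as follows, with Python standing in for the awk stage. This is only an illustration under stated assumptions: some DTrace script (not shown, and not specified anywhere in this thread) is assumed to print one modified file path per line, and the function name `build_copy_commands`, the host names, and the directories are all hypothetical.

```python
from typing import Iterable, List


def build_copy_commands(paths: Iterable[str], watch_dir: str,
                        hosts: List[str], dest_dir: str) -> List[List[str]]:
    """Turn a stream of changed-file paths into rcp command lines.

    `paths` is assumed to be the line-by-line output of a DTrace
    script that reports modified files (that script is hypothetical
    and not shown here). Only files under `watch_dir` are kept, and
    each file is copied at most once per call.
    """
    prefix = watch_dir.rstrip("/") + "/"
    seen = set()
    commands = []
    for raw in paths:
        path = raw.strip()
        # Skip blank lines, files outside the watched directory,
        # and paths that were already queued.
        if not path.startswith(prefix) or path in seen:
            continue
        seen.add(path)
        for host in hosts:
            # rcp takes a local source and a host:path destination.
            commands.append(["rcp", path, f"{host}:{dest_dir}"])
    return commands
```

In practice one would pass `sys.stdin` as `paths` and hand each resulting command to `subprocess.run`; the sketch stops at building the commands so that nothing is copied by accident.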
dtksh! Haven’t heard that in a while. But yes, it wouldn’t have been too hard to roll dtksh/dtrace scripts that were interactive. dtksh itself never seemed to catch on (despite my encouragement: http://www.brendangregg.com/dtkshdemos.html), and would need an X11 path to the server – which for me is becoming less common.
Putting a skin on DTrace (and doing it well) is a great way to drive adoption – users become familiar with the power of DTrace and develop a strong desire to use it before they reach for the command line. We did this at Fishworks with Analytics for the Sun ZFS Storage Appliance, and now at Joyent with Cloud Analytics. Those have had great success, but for their user bases. Perhaps the turning point for sysadmins will be a similar killer-app that is generally available for the enterprise OS.