I took much longer to post this than intended. Not that I expect anyone to have been sitting on the edge of their chair waiting for it, but still…
It’s been more than a month since my last post, and not a darned thing has changed. Monitoring today still sucks. In the last installment I ranted and moaned about “active” monitoring and how there’s all this information you’re not collecting that is being lost. This time I’ll bemoan the sorry handling of the data we actually do collect.
Let’s, for the sake of this argument, pretend that “pinging” a web service is actually a useful thing to do. A typical scenario is this: A monitoring server tries to fetch some URL. If it takes less than a second to respond, it’s considered UP (typically resulting in a calming green row in the monitoring system).
If it takes more than a second, but less than, say, 5 seconds, it’s considered WARNING (typically indicated by a yellow row in the monitoring system). If it hasn’t responded within 5 seconds, it’s considered DOWN (resulting in a red row).
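In code, that threshold logic is about as simple as it sounds. Here’s a minimal Python sketch. The function name, the parameter names, and the choice to treat exactly 1 second as WARNING are my own assumptions, not the behaviour of any particular monitoring system:

```python
def classify(response_seconds, warn_after=1.0, down_after=5.0):
    """Map a response time (in seconds) to UP, WARNING, or DOWN.

    Pass None when the check got no response at all.
    The 1 and 5 second defaults are the example thresholds from the text.
    """
    if response_seconds is None or response_seconds >= down_after:
        return "DOWN"      # red row
    if response_seconds >= warn_after:
        return "WARNING"   # yellow row
    return "UP"            # calming green row


for seconds in (0.3, 1.455, 7.0, None):
    print(seconds, classify(seconds))
```

Note what’s missing: the function looks at exactly one data point and remembers nothing.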
Transitions between states often result in an alert being sent out. These alerts typically contain the actual data point that triggered the transition:
"Oh, noes! HTTP changed state to WARNING. It took 1.455 seconds to respond."
It’s sad, really, but the data point mentioned in the alert and the most recent one you can see in the monitoring system’s web UI are often the only “record” of these data points. “Sad? Who cares? It’s all in the past!” *sigh* No. A wise man once said “those who ignore history are doomed to get bitten in the arse by it at some point” (paraphrasing ever so slightly). Here’s why:
Let’s look at a typical disk usage graph:
Sure, your graphs may be slightly bumpier, but this is basically how these things look. It doesn’t take a Ph.D. in statistics to figure out where that blue line is headed (towards the red area, if you hadn’t worked it out).
Say that that’s a graph for the last week. If you imagine you’re extending the line, you can see that the disks will be full in about another week and within the red area just a couple of days from now. Yikes.
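Extending the line doesn’t have to happen in your head. Here’s a rough Python sketch that fits a straight line through the samples (ordinary least squares) and works out when it crosses a threshold. The numbers are made up to resemble the scenario above, the 75% boundary for the red area is an assumption, and days_until is a hypothetical helper, not part of any monitoring tool:

```python
def days_until(samples, threshold):
    """Estimate days from the last sample until usage reaches threshold.

    samples is a list of (day, percent_used) pairs, oldest first.
    Returns None when usage is flat or shrinking, or when there is
    too little data to fit a line; returns 0.0 if already past it.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    if sxx == 0:
        return None                      # all samples share one timestamp
    slope = sxy / sxx                    # percent per day
    if slope <= 0:
        return None                      # not growing, nothing to predict
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - samples[-1][0])


# Invented data: one sample per day for a week, growing 5 points a day.
last_week = [(day, 30.0 + 5.0 * day) for day in range(8)]

print("days until full:", days_until(last_week, 100.0))
print("days until red area (assumed 75%):", days_until(last_week, 75.0))
```

With these invented samples the estimate comes out at about a week until the disk is full and a couple of days until the red area. A straight line is the crudest model there is, but it already tells you more than a green row does.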
The point here is that if you were limited by the temporal tunnel vision of today’s monitoring systems, all you’d have seen was a green row all along. You’d think everything was fine until it suddenly wasn’t. Sadly, lots of people happily ignore this information on a daily basis. Even if they actually do collect this information and make pretty graphs out of it, it’s not something you go and look at very often to see these trends. It’s used mostly as a debugging tool after the fact (“Oh, I just got an alert that the disk on server X is running full… Yup, the graph confirms it.”).
I’m not advocating spending all your precious time sifting through graphs, looking for metrics on a collision course with disaster. Sure, if you only have a few servers, it’s not that big of a deal to look at the disk usage graphs every couple of days and see where they’re headed. If you have a thousand servers, though, it’s a pretty big deal.
So what am I advocating? I want a monitoring system that doesn’t just tell me when a disk has entered the yellow area on the shit-is-about-to-hit-the-fan-o-meter. I want a monitoring system that tells me when the next filesystem is likely to enter the yellow area on said meter. See? Instead of a “current problems list”, I want a “These are the next problems you’re likely to have to deal with” list. I want it to feed into my calendar, so that I don’t accidentally schedule a day off on the same day /var on my db server is going to run full. I want it to automatically add a TODO item to my Remember the Milk account telling me to buy more/bigger drives for my file server.
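To make the “next problems” list a bit more concrete, here’s a small Python sketch. It uses the simplest estimate I can think of (growth rate between the first and last sample), and everything in it is hypothetical: the function name, the host and filesystem names, the 80% yellow threshold, and the data. Feeding the result into a calendar or a TODO service is left out entirely:

```python
def next_problems(histories, yellow=80.0):
    """Return (days_left, filesystem) pairs, soonest first.

    histories maps a filesystem name to a list of (day, percent_used)
    samples, oldest first. Filesystems that are flat, shrinking, or
    already past the yellow threshold are skipped: the latter belong
    on the current problems list, not this one.
    """
    upcoming = []
    for name, samples in histories.items():
        (first_day, first_pct), (last_day, last_pct) = samples[0], samples[-1]
        if last_day == first_day:
            continue                     # need two points in time
        rate = (last_pct - first_pct) / (last_day - first_day)  # percent/day
        if rate <= 0 or last_pct >= yellow:
            continue
        upcoming.append(((yellow - last_pct) / rate, name))
    return sorted(upcoming)


# Invented histories: usage a week ago and usage today.
histories = {
    "files1:/srv": [(0, 40.0), (7, 47.0)],
    "db1:/var": [(0, 50.0), (7, 71.0)],
    "web1:/": [(0, 30.0), (7, 30.0)],
}

for days_left, name in next_problems(histories):
    print(f"{name} is likely to turn yellow in about {days_left:.0f} days")
```

For this made-up data, db1:/var tops the list at roughly 3 days, files1:/srv follows at roughly 33, and web1:/ doesn’t show up at all because it isn’t growing.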
It shouldn’t be that hard!