All the things wrong with monitoring today – Part 2

I took much longer to post this than intended. Not that I expect anyone to have been sitting on the edge of their chair waiting for it, but still…

It’s been more than a month since my last post, and not a darned thing has changed. Monitoring today still sucks. In the last installment I ranted and moaned about “active” monitoring and how there’s all this information you’re not collecting that is being lost. This time I’ll decry the sorry handling of the data we actually do collect.

Temporal tunnel vision

Let’s for the sake of this argument pretend that “pinging” a web service is actually a useful thing to do. A typical scenario is this: A monitoring server tries to fetch some URL. If it takes less than a second to respond, it’s considered UP (typically resulting in a calming green row in the monitoring system).

If it’s more than a second, but less than, say, 5 seconds, it’s considered WARNING (typically indicated by a yellow row in the monitoring system). If it hasn’t responded within 5 seconds, it’s considered DOWN (resulting in a red row).
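The whole decision boils down to something like the following. This is only a minimal Python sketch of the scenario above: the names `classify`, `WARNING_AFTER`, `DOWN_AFTER` and `ROW_COLOUR` are made up for illustration, and what happens at exactly 1 or 5 seconds is an assumption.

```python
from typing import Optional

# Thresholds from the scenario above. Treating exactly 1.0 s as WARNING
# and exactly 5.0 s as DOWN is an assumption.
WARNING_AFTER = 1.0  # seconds
DOWN_AFTER = 5.0     # seconds

# Row colour shown for each state in the (hypothetical) web UI.
ROW_COLOUR = {"UP": "green", "WARNING": "yellow", "DOWN": "red"}


def classify(response_time: Optional[float]) -> str:
    """Map one response time in seconds to a state.

    None means the service never responded at all.
    """
    if response_time is None or response_time >= DOWN_AFTER:
        return "DOWN"
    if response_time >= WARNING_AFTER:
        return "WARNING"
    return "UP"


for sample in (0.2, 1.455, None):
    state = classify(sample)
    print(sample, state, ROW_COLOUR[state])
```

Note that the function only ever sees a single data point. Nothing that came before it plays any part in the verdict.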

Transitions between states often result in an alert being sent out. These alerts typically contain the actual data point that triggered the transition:

"Oh, noes! HTTP changed state to WARNING. It took 1.455 seconds to respond."

It’s sad really, but the data point mentioned in the alert and the most recent one you can see in the monitoring system’s web UI are often the only “record” of these data points. “Sad? Who cares? It’s all in the past!” *sigh* No. A wise man once said “those who ignore history are doomed to get bitten in the arse by it at some point” (paraphrasing ever so slightly). Here’s why:

Let’s look at a typical disk usage graph:

Sure, your graphs may be slightly bumpier, but this is basically how these things look. It doesn’t take a PhD in statistics to figure out where that blue line is headed (towards the red area, if you hadn’t worked it out).

Say that that’s a graph for the last week. If you imagine you’re extending the line, you can see that the disks will be full in about another week and within the red area just a couple of days from now. Yikes.
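Extending the line doesn’t have to be done by squinting at a graph. Here is a rough sketch using an ordinary least-squares fit. It assumes one sample per day, made-up usage numbers, and a red area that starts at 80% (my assumption; pick whatever your own thresholds are). The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(threshold, days, usage):
    """Days from the latest sample until the fitted line hits threshold.

    Returns None when usage is flat or shrinking, since the line then
    never reaches the threshold.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Made-up data: one sample per day for the last week, in percent used.
days = [0, 1, 2, 3, 4, 5, 6, 7]
usage = [44, 48, 52, 56, 60, 64, 68, 72]

print("red area in", days_until(80, days, usage), "days")
print("full in", days_until(100, days, usage), "days")
```

With these numbers the disk hits the assumed red area in two days and fills up in seven, while a plain threshold check would still be showing green today.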

The point here is that if you were limited by the temporal tunnel vision of today’s monitoring systems, all you’d have seen was a green row all along. You’d think everything was fine until it suddenly wasn’t. Sadly, lots of people happily ignore this information on a daily basis. Even if they actually do collect this information and make pretty graphs out of it, it’s not something you go and look at very often to see these trends. It’s used mostly as a debugging tool after the fact (“Oh, I just got an alert that the disk on server X is running full… Yup, the graph confirms it.”).

I’m not advocating spending all your precious time sifting through graphs, looking for metrics on a collision course with disaster. Sure, if you only have a few servers, it’s not that big of a deal to look at the disk usage graphs every couple of days and see where they’re headed. If you have a thousand servers, though, it’s a pretty big deal.

So what am I advocating? I want a monitoring system that doesn’t just tell me when a disk has entered the yellow area on the shit-is-about-to-hit-the-fan-o-meter. I want a monitoring system that tells me when the next filesystem is likely to enter the yellow area on said meter. See? Instead of a “current problems list”, I want a “These are the next problems you’re likely to have to deal with” list. I want it to feed into my calendar, so that I don’t accidentally schedule a day off on the same day /var on my db server is going to run full. I want it to automatically add a TODO item to my Remember the Milk account telling me to buy more/bigger drives for my file server.
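To make the “next problems” list a bit more concrete, here is a sketch of what it could look like. Everything in it is an assumption for illustration: the host and filesystem names, the sample data, the 80% yellow threshold, the fixed date, and the function names. It fits a straight line to each filesystem’s history and sorts the filesystems by the date they are predicted to cross the threshold, which is the kind of thing you could then push into a calendar or a TODO list.

```python
from datetime import date, timedelta


def least_squares(samples):
    """Return (slope, intercept) of the best-fit line through (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def next_problems(histories, threshold, today):
    """List (predicted_date, name) pairs, soonest first.

    histories maps a name to (day_offset, percent_used) samples, where
    day_offset 0 is today and negative offsets are days in the past.
    """
    upcoming = []
    for name, samples in histories.items():
        slope, usage_today = least_squares(samples)  # intercept is at x = 0
        if slope <= 0 or usage_today >= threshold:
            # Not growing, or already a current problem rather than a future one.
            continue
        days_left = (threshold - usage_today) / slope
        upcoming.append((today + timedelta(days=int(days_left)), name))
    return sorted(upcoming)


# Hypothetical hosts and made-up usage figures.
histories = {
    "files01:/srv": [(-3, 50), (-2, 51), (-1, 52), (0, 53)],
    "db01:/var": [(-3, 70), (-2, 72), (-1, 74), (0, 76)],
    "web01:/": [(-3, 40), (-2, 40), (-1, 40), (0, 40)],
}

forecast = next_problems(histories, threshold=80.0, today=date(2024, 1, 1))
for due, name in forecast:
    print(due.isoformat(), name)
```

A straight line is the simplest possible model here. It is a starting point for the idea, not a claim that real metrics are this well behaved.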

It shouldn’t be that hard!

7 thoughts on “All the things wrong with monitoring today – Part 2”

  1. Soren Post author

    Yeah, I’ve seen that. I love rrdtool, but it’s extremely tedious to set up and it’s awkward to hook into a monitoring system.

    Also, filesystem utilisation is a very simplistic use case. It’s almost universally increasing, and quite often at a steady, constant pace. A simple linear regression will probably fail against data that exhibits a more erratic pattern with a long-term linear trend, or if the data set has a non-linear trend.

  2. Christian

    Also consider that service utilisation is rarely uniformly distributed through the day or the week. The peak levels of some metrics are perhaps as valuable as, or even more valuable than, say, an hourly or daily average. A regression model with a smoothed fit wouldn’t do well in predicting critical peak levels. On the other hand, attempting a “tight” fit on e.g. pageviews/second would likely be overfitting and equally useless. Perhaps using the periodic peak levels and modeling the trend of these data points would be beneficial.

    Another thing is possibly interrelated metrics. An obvious example: When physical memory utilisation goes beyond 100% and pages start getting swapped out, disk I/O is likely to increase dramatically. Predicting the trend of one metric might benefit from a predicted trend of another metric.

    The possibilities are vast. One could collect data points on a global scale and mine them for association rules and correlations. Or, one could make do with the data available locally and discover interdependencies dynamically. Even a few handcrafted rules would probably do relatively well.

  3. Soren Post author

    I absolutely agree. There are plenty of excellent statistical models that could be applied and plenty of preprocessing of the input that could be done to get even better results.

    My primary point is that it’s high time we moved on from the 1990s-style monitoring that we’re still doing and started applying some more maths. The statistical field of prediction is well over 40 years old. Many of the known algorithms aren’t even particularly computationally intensive. There’s really no excuse.

  4. Mark Unwin

    “I want a monitoring system that tells me when the next filesystem is likely to enter the yellow area on said meter.”

    Open-AudIT v2.0 (OAv2) does exactly this.

    Disclaimer – I am the developer. I am/was a SysAdmin who (still) runs into this exact issue. If you “audit” your systems each day, it will create a report that says that in the next XX days, server ABC, partition 123 will reach 100% capacity, based on past usage stats and trends.
