

All the things wrong with monitoring today – Part 3

Erk. I found this sitting around as a draft:

Today’s normality is tomorrow’s abnormality

Last time, we looked at a disk usage graph. This week, we’ll look at CPU usage or something else that goes up and down instead of just up, up, up.

What’s the problem here? The problem is that it’s very hard to set up an alert for this. Some things are simply spiky by nature. Sometimes, that’s perfectly fine. Perhaps the load on this particular application is evenly distributed throughout the day, but at night it runs a bunch of batch processing jobs that peg the CPU for a couple of hours. For this sort of thing, you have a couple of options in terms of monitoring/alerting.

  • Don’t monitor CPU load.
  • Accept being alerted about this every single night.
  • Ignore CPU load during the time of day when this job runs.

All of these options suck.

  • You can’t just not monitor the CPU load. If you’re suddenly at 100% for an hour during the day, something’s wrong!
  • You don’t want to be alerted by something that is normal. That’s silly. You want your monitoring system to only alert you about stuff that’s worth waking up over.
  • Ignoring the CPU load based on the time of day is a step in the right direction, but this is not an isolated case. You probably have many different services, all with different usage patterns. I also don’t want to think about what it would do to your configuration files if you had to specify different thresholds for every hour of the day (and every day of the week, and so on – see the sketch just after this list).
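
Just to make that last point concrete, here’s a rough sketch (the service names and numbers are entirely made up) of what per-hour thresholds end up looking like once you actually write them down:

```python
# A made-up example of per-hour, per-service CPU thresholds.
# Every service needs 24 entries per day, 168 per week --
# and somebody has to keep all of these numbers up to date by hand.
CPU_ALERT_THRESHOLDS = {
    "webfrontend": {
        # hour of day -> max acceptable CPU usage (percent)
        0: 30, 1: 30, 2: 95, 3: 95,   # nightly batch jobs peg the CPU
        4: 40, 5: 40, 6: 60, 7: 70,
        # ... 16 more hours, times 7 days, times N services ...
    },
    "dbserver": {
        0: 80, 1: 80, 2: 80, 3: 50,   # backups run at a different time
        # ...
    },
}
```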

Think about that last option a bit. What would you use to define expected/acceptable levels? Pure guesswork? Of course not. You’ll use the data you already have. Maybe you’ve run this for a while and have cute graphs that can tell you what is expected. But seriously… Looking at graphs from your monitoring system and using them to type configuration back into your monitoring system? That’s the most ridiculous thing I’ve ever heard (yes, I should probably get out more).

Why can’t the monitoring system just tell me when something is out of the ordinary? It has all the data in the world to make that call. If a metric is unusual for that time of day, on that day of week, at that time of year, let me know. If it’s very unusual, send me a text message. Otherwise, I probably don’t care.
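
This doesn’t require magic, either. As a rough illustration (not anyone’s actual implementation), a monitoring system could flag a data point as unusual simply by comparing it to the values it has already recorded for the same hour of day and day of week. Something like this sketch:

```python
from statistics import mean, stdev

def is_unusual(history, value, hour, weekday, warn_sigma=3, page_sigma=6):
    """Compare a new data point against what has previously been seen at the
    same hour of day and day of week.

    `history` is assumed to be a list of (weekday, hour, value) tuples the
    monitoring system has already collected -- data it already has anyway.
    Returns "ok", "unusual" or "very unusual".
    """
    # Pick out earlier samples from the same time slot.
    baseline = [v for (wd, h, v) in history if wd == weekday and h == hour]
    if len(baseline) < 10:
        return "ok"  # not enough history to judge

    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        sigma = 1e-9  # avoid dividing by zero on a perfectly flat metric
    deviation = abs(value - mu) / sigma

    if deviation >= page_sigma:
        return "very unusual"   # send me a text message
    if deviation >= warn_sigma:
        return "unusual"        # mention it, but don't wake anyone up
    return "ok"
```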


All the things wrong with monitoring today – Part 2

I took much longer to post this than intended. Not that I expect anyone to have been sitting on the edge of their chair waiting for it, but still…

It’s been more than a month since my last post, and not a darned thing has changed. Monitoring today still sucks. In the last installment I ranted and moaned about “active” monitoring and all the information you’re not collecting and are therefore losing. This time I’ll bemoan the sorry handling of the data we actually do collect.

Temporal tunnel vision

Let’s for the sake of this argument pretend that “pinging” a web service is actually a useful thing to do. A typical scenario is this: A monitoring server tries to fetch some URL. If it takes less than a second to respond, it’s considered UP (typically resulting in a calming green row in the monitoring system). Kinda like this:

If it takes more than a second but less than, say, 5 seconds, it’s considered WARNING (typically indicated by a yellow row in the monitoring system), and if it hasn’t responded within 5 seconds, it’s considered DOWN (resulting in a red row).

Transitions between states often result in an alert being sent out. These alerts typically contain the actual data point that triggered the transition:

"Oh, noes! HTTP changed state to WARNING. It took 1.455 seconds to respond."

It’s sad really, but the data point mentioned in the alert and the most recent one you can see in the monitoring system’s web UI are often the only “record” of these data points. “Sad? Who cares? It’s all in the past!” *sigh* No. A wise man once said “those who ignore history are doomed to get bitten in the arse by it at some point” (paraphrasing ever so slightly). Here’s why:

Let’s look at a typical disk usage graph:


Sure, your graphs may be slightly bumpier, but this is basically how these things look. It doesn’t take a PhD in statistics to figure out where that blue line is headed (towards the red area, if you hadn’t worked it out).

Say that’s a graph of the last week. If you imagine extending the line, you can see that the disks will be within the red area just a couple of days from now and completely full in about another week. Yikes.

The point here is that if you were limited by the temporal tunnel vision of today’s monitoring systems, all you’d have seen was a green row all along. You’d think everything was fine until it suddenly wasn’t. Sadly, lots of people happily ignore this information on a daily basis. Even if they do collect this information and make pretty graphs out of it, it’s not something they go and look at very often to spot these trends. It’s used mostly as a debugging tool after the fact (“Oh, I just got an alert that the disk on server X is running full… Yup, the graph confirms it.”).

I’m not advocating spending all your precious time sifting through graphs, looking for metrics on a collision course with disaster. Sure, if you only have a few servers, it’s not that big of a deal to look at the disk usage graphs every couple of days and see where they’re headed. If you have a thousand servers, though, it’s a pretty big deal.

So what am I advocating? I want a monitoring system that doesn’t just tell me when a disk has entered the yellow area on the shit-is-about-to-hit-the-fan-o-meter. I want a monitoring system that tells me when the next filesystem is likely to enter the yellow area on said meter. See? Instead of a “current problems list”, I want a “These are the next problems you’re likely to have to deal with” list. I want it to feed into my calendar, so that I don’t accidentally schedule a day off on the same day /var on my db server is going to run full. I want it to automatically add a TODO item to my Remember the Milk account telling me to buy more/bigger drives for my file server.
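
None of this is rocket science. As a sketch of the idea (the mount points and numbers below are invented), a straight-line fit over the history the monitoring system already has is enough to produce exactly that sorted “next problems” list:

```python
import numpy as np

FULL = 100.0   # percent; the "red area" threshold

def days_until_full(samples):
    """Fit a straight line to (day, used_percent) samples and estimate how
    many days from the last sample until usage crosses FULL.
    Returns None if usage isn't growing."""
    days = np.array([d for d, _ in samples], dtype=float)
    used = np.array([u for _, u in samples], dtype=float)
    slope, intercept = np.polyfit(days, used, 1)
    if slope <= 0:
        return None
    return (FULL - intercept) / slope - days[-1]

# Invented per-filesystem history: (day number, used percent) over a week.
history = {
    "db1:/var":   [(0, 71), (1, 74), (2, 78), (3, 81), (4, 85), (5, 88), (6, 92)],
    "web1:/":     [(0, 40), (1, 40), (2, 41), (3, 41), (4, 42), (5, 42), (6, 43)],
    "files:/srv": [(0, 60), (1, 63), (2, 65), (3, 68), (4, 70), (5, 73), (6, 75)],
}

upcoming = []
for fs, samples in history.items():
    eta = days_until_full(samples)
    if eta is not None:
        upcoming.append((eta, fs))

# "These are the next problems you're likely to have to deal with":
for eta, fs in sorted(upcoming):
    print(f"{fs} is on course to be full in about {eta:.1f} days")
```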

It shouldn’t be that hard!


All the things wrong with monitoring today – Part 1

Monitoring today sucks. Big time. It sucks so bad, it’s not even funny. The time spent configuring stuff, the time spent dealing with problems when it’s already too late, and the number of things your monitoring system could be monitoring but isn’t are all staggering. I’ll be spending a couple of posts whining about this. Who knows? Maybe I’ll have a solution up my sleeve by the end of it.

Active vs. passive monitoring

Active monitoring… It sounds cool. Way cooler than passive. Most of the time, if you have a choice between an active and a passive something, you go with the active one, right? Well, not this time.

The number of times I’ve seen people set up their monitoring system to access an HTTP URL specially crafted to be useless, doing nothing but responding to the probe as quickly as possible, is ridiculous. It’s certainly active, but it’s almost entirely useless. Sure, if this is a service no one uses, it’s probably fine, but if this is a service that has almost any sort of real-world use, in the customary 5 minutes between each of these “pings”, there will have been dozens, scores, if not hundreds or thousands of actual requests. Requests that actually did something. That exercised your service at least to some extent. Sadly, this information is almost universally ignored.

Telling Apache to log the amount of time it took to serve a request is trivial. Collecting this information is trivial. Feeding that data to your monitoring system (if not on a per-request basis, then at least the maximum request time over the last 10 seconds, which would already be a vast improvement) really shouldn’t be too hard. So why don’t you?
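
For the Apache side, adding %D (the time taken to serve the request, in microseconds) to your LogFormat is that one trivial change. The reporting side could be as small as the sketch below, where the log path and the send_to_monitoring() function are placeholders for whatever you actually use, and the request time in microseconds is assumed to be the last field on each log line:

```python
import time

LOGFILE = "/var/log/apache2/access.log"   # assumed path; adjust to taste

def send_to_monitoring(metric, value):
    # Placeholder: in reality this would push to whatever system you use.
    print(f"{metric} = {value}")

def tail_max_request_time(interval=10):
    """Follow the access log and report the slowest request seen in each
    `interval`-second window. Assumes Apache's %D (microseconds) is the
    last field on every log line."""
    window_max = 0.0
    deadline = time.time() + interval
    with open(LOGFILE) as log:
        log.seek(0, 2)                      # start at the end of the file
        while True:
            line = log.readline()
            if line:
                try:
                    micros = int(line.split()[-1])
                    window_max = max(window_max, micros / 1_000_000)
                except (ValueError, IndexError):
                    pass                    # ignore lines we can't parse
            else:
                time.sleep(0.1)
            if time.time() >= deadline:
                send_to_monitoring("apache.max_request_time", window_max)
                window_max = 0.0
                deadline += interval

tail_max_request_time()
```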