Monthly Archives: January 2012

Facepalm

All the things wrong with monitoring today – Part 3

Erk. I found this sitting around as a draft:

Today’s normality is tomorrow’s abnormality

Last time, we looked at a disk usage graph. This week, we’ll look at CPU usage or something else that goes up and down instead of just up, up, up.

What’s the problem here? The problem is that it’s very hard to set up an alert for this. Some things are simply spiky by nature. Sometimes, that’s perfectly fine. Perhaps the load on this particular application is evenly distributed throughout the day, but at night it runs a bunch of batch processing jobs that pegs the CPU for a couple of hours. For this sort of thing, you have a couple of options in terms of monitoring/alerting.

  • Don’t monitor CPU load.
  • Accept being alerted about this every single night.
  • Ignore CPU load during the time of day when this job runs.

All of these options suck.

  • You can’t just not monitor the CPU load. If you’re suddenly at 100% for an hour during the day, something’s wrong!
  • You don’t want to be alerted by something that is normal. That’s silly. You want your monitoring system to only alert you about stuff that’s worth waking up over.
  • Ignoring the CPU load based on the time of day is a step in the right direction, but this is not an isolated case. You probably have many different services, all with different usage patterns. I also don’t really want to think about what it will do to your configuration files if you had to specify different thresholds for every hour of the day (and every day of the week, etc).

Think about that last option a bit.. What would you use to define expected/acceptable levels? Pure guesswork? Of course not. You’ll use the data you already have. Maybe you’ve run this for a while and have cute graphs that can tell you what is expected. But seriously… Looking at graphs from your monitoring system and using them to type configuration back into your monitoring system? That’s the most ridiculous thing I’ve ever heard (yes, I should probably get out more).

Why can’t the monitoring system just tell me when something is out of the ordinary? It has all the data in the world to make that call. If a metric is unusual for that time of day, on that day of week, at that time of year, let me know. If it’s very unusual, send me a text message. Otherwise, I probably don’t care.