Starting out: a new approach to systems monitoring.

Posted: October 2nd, 2012 | Filed under: DevOps | 2 Comments »

OK, not new to some. Circonus does it this way, and so do some very large sites like Netflix.

But new to me, and certainly new to anyone currently using nagios/zenoss/zabbix/etc. Here’s the story:

The Idea


At work (Krux), we have graphite and tons of graphs on the wall. We can see application-level response times in the same view as cache hit/miss rates and requests per second. That’s nice. It’s also not very proactive.


We also have cloudkick (think: nagios with an API), with tons of plugins checking thresholds, running locally on each box. We recently re-evaluated our monitoring solution and ultimately decided to build our own loosely coupled monitoring infrastructure out of a variety of awesome tools. We migrated from cloudkick to collectd with a bunch of plugins we wrote, built on a custom python library I wrote called monitorlib (the collectd and pagerduty parts). The functionality is basically the same: run scripts on each node every 60 seconds, check whether some threshold is met, and alert directly to pagerduty. Meh.
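To make the "meh" concrete, here's a minimal sketch of what one of these per-node threshold plugins boils down to. The function names and thresholds are illustrative, not the actual monitorlib API:

```python
# Hypothetical sketch of a 60-second threshold check, in the spirit of
# the collectd plugins described above (not the real monitorlib API).

def check_threshold(value, warn, error):
    """Map a metric value to a status string."""
    if value >= error:
        return "error"
    if value >= warn:
        return "warn"
    return "ok"

def check_disk_usage(percent_used):
    # A plugin like this runs on every node once a minute; anything
    # other than "ok" goes straight to pagerduty, with no wider context.
    return check_threshold(percent_used, warn=80, error=95)
```

The point is what's missing: each check sees only its own value, so it can't ask "is the rest of the pool healthy?" before alerting.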


What I really want is a decision engine.

I want applications to push events, when they know something a poll-based monitoring script doesn’t.
I want to suppress threshold-based alerts, based on a set of rules, and only alert some people.
I want to check the load balancer to see how many nodes are healthy, before alerting that a single node went down.
I want to check response-time graphs in graphite, by polling the holt-winters confidence bands, and then alert based on configured rules.

Basically, we are in a world where we have great graphs, and old-school threshold-based alerts. I want to alert on graphs, but also much more – I want to combine multiple bits of information before paging someone at 2am.
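The node-down case above can be sketched as a single decision rule: only page if the load balancer says the pool is actually degraded. Everything here is hypothetical (the function, the states, the 50% cutoff), just to show the shape of the logic:

```python
# Sketch of the kind of rule a decision engine should apply before
# paging anyone at 2am. Names and thresholds are assumptions, not
# taken from any real tool.

def decide(event, healthy_nodes, total_nodes, min_healthy_fraction=0.5):
    """Return 'page', 'notify', or 'suppress' for a node health event."""
    if event["state"] == "ok":
        return "suppress"
    if total_nodes and healthy_nodes / total_nodes >= min_healthy_fraction:
        # One node is sick but the pool is fine: log it, email someone,
        # but don't wake anyone up.
        return "notify"
    return "page"
```

The interesting part is that `healthy_nodes` comes from polling the load balancer at decision time, not from the node that raised the alert.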

How to get there

Going to the next level requires processing events, accepting event data from multiple sources, and configuring rules.

This blog post has some good ideas and it outlines a few options.

Basically, I want *something* to sit and listen for events. I want all my collectd scripts to send data via HTTP POST (JSON), or protobufs, along with the status (ok, warn, error) every minute. Then, the *thing* that’s receiving these events, will decide – based on state it knows or gathers by polling graphite/load balancers/etc – whether to alert, update a status board, neither, or both.
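A sketch of what one of those JSON events might look like on the wire. The field names are an assumption (there's no defined schema yet), and the endpoint URL is made up:

```python
import json
import time

# Hypothetical event payload a collectd script would POST every minute.
# Field names are illustrative, not a defined schema.

def make_event(host, service, state, metric, description=""):
    return {
        "host": host,
        "service": service,
        "state": state,          # "ok" | "warn" | "error"
        "metric": metric,
        "description": description,
        "time": int(time.time()),
    }

payload = json.dumps(make_event("web01", "disk /", "warn", 87.5))
# Shipping it is then one HTTP call, e.g.:
# urllib.request.urlopen("http://decider.example/events", payload.encode())
```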

Building that *thing* is the hard part. There are Complex Event Processing (CEP) frameworks available, most notably Esper, which is written in Java and requires writing a lot of Java to use. There is also rocksteady, a Google open-source project that looks like a bundle of code published but not maintained; it may take the edge off the "ugh, I don't want to write Java" problem.

Then there is Riemann, which is what I'm starting with. After learning a bit of Clojure, it should provide immediate benefit. It's actively developed, and the author is very responsive. We'll see how it goes!

Final notes

I think what I’m trying to do is a bit different than most.

I don’t want to send all my data (graphite metrics – we do around 150K metrics/sec to our graphite cluster) through this decision engine. I want it to get *events* which would historically have been something to page or email about. Then, it needs to make decisions: check graphs as another source of data; check load balancers; re-check to make sure it’s still a problem; maybe even spin up new EC2 instances. I may also want to poll graphite periodically to check various things, perhaps with graphite-tattle.

At this point, I don’t know what else it can/should do. The first step is to send all alerts to the decision engine, and define rules. It shall grow from there 🙂



2 Comments on “Starting out: a new approach to systems monitoring.”

  1. Ron Yorgason said at 16:23 on October 2nd, 2012:

    I’m not sure what your entire cluster is doing, but you might consider Moab by Adaptive Computing (disclaimer, I work for them).

    Moab is used as the brain of a cluster. It takes in all sorts of information from Resource Managers (programs that collect information and pass it to Moab in a way it can understand) and makes decisions based on that information. It's also used as a scheduler in HPC clusters, finding places to run jobs on particular machines for a certain period of time and consuming the resources the job needs (mem, cpu, disk space, etc.).

    It sounds like you're more interested in generic events, though. You can configure various events to be triggered, with custom scripts written to handle each one. I haven't dealt much with these, but here's a link to some documentation on it:

    I don’t know if you can have event dependencies built in, such that the trigger only fires if two events are met. But you might be able to have the script that gets triggered on an event check for the other events, and then only send out the notification. There’s certainly a way to customize it to do what you’re looking for. And if it’s a feature other customers might use, we could build it into the product roadmap for a future release.

  2. charlie said at 21:09 on October 2nd, 2012:

    To be honest, Moab sounds cool, but events and alerting feel like an afterthought, to say nothing of CEP. And event processing and querying of state is a whole domain in itself that I don't expect any multi-purpose product to ever do well.

    You’re right – I want many sources of information (more than I talked about), and I really, really, really want pluggable architecture. Each piece of software doing what it does well, communicating over HTTP 🙂

    A quick read through the comments of the sample riemann config shows some examples (but not quite all I want):
