danryan edited this page May 25, 2011 · 6 revisions

Checks are the bread and butter of Overwatch. A check runs the Snapshot of a given node through one or more rules to detect whether the snapshot is within acceptable bounds. If a check finds the node's snapshot is not what it expected, an alert is triggered.

Rules

Rules can be simple, like the ones you've seen before if you've used another monitoring service (Is the HTTP service running?). Rules can also be chained together to express a more complicated requirement (Is the HTTP service running, and is its response time less than 100ms?). Behind the scenes, rules are written as plain Ruby.

A rule has two basic pieces: the attribute to check on the snapshot (load_average.one_minute), and one or more conditions that the attribute's value must satisfy. Attributes are selected using dot notation. For example, if your snapshot hash looks like

{
  :one => {
    :two => {
      :three => "3"
    }
  }
}

and you wanted to check the value of :three, you would access it via one.two.three.
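To make the dot-notation lookup concrete, here is a minimal sketch of how such a path could be resolved against a nested hash with symbol keys. The helper name fetch_attribute is hypothetical and is not taken from Overwatch itself; the real implementation may work differently.

```ruby
# Hypothetical helper for illustration only; not part of Overwatch's documented API.
# Walks a nested hash (symbol keys) one path segment at a time.
def fetch_attribute(snapshot, path)
  path.split(".").reduce(snapshot) do |node, key|
    # Give up with nil if the path continues past a non-hash value.
    return nil unless node.is_a?(Hash)
    node[key.to_sym]
  end
end

snapshot = {
  :one => {
    :two => {
      :three => "3"
    }
  }
}

fetch_attribute(snapshot, "one.two.three")     # => "3"
fetch_attribute(snapshot, "one.missing.three") # => nil
```

Returning nil for a missing segment, rather than raising, is a design choice of this sketch: it lets a condition decide how to treat an absent attribute.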

Example 1

check = Check.new
check.rules << Rule.new(:attr => "load_average.one_minute").less_than(4)
check.rules << Rule.new(:attr => "httpd.state").is("running") 
check.rules << Rule.new(:attr => "services.httpd.response_time").less_than(100)
check.rules << Rule.new(:attr => "node.last_updated_at").since(5.minutes.ago)
check.run(snapshot)

If every rule returns true, the check passes; if at any point a rule returns false, the check fails and an event is created.
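The pass/fail behavior described above can be sketched with plain Ruby. SketchRule and SketchCheck are hypothetical stand-ins written for this illustration, assuming symbol-keyed snapshots; Overwatch's real Rule and Check classes may be structured differently, and event creation is left out.

```ruby
# Hypothetical stand-ins for illustration; not Overwatch's actual classes.
class SketchRule
  def initialize(attr)
    @attr = attr
    @conditions = []
  end

  # Each condition method records a predicate and returns self so calls can chain.
  def less_than(limit)
    @conditions << lambda { |value| !value.nil? && value < limit }
    self
  end

  def is(expected)
    @conditions << lambda { |value| value == expected }
    self
  end

  # A rule passes only if every recorded condition holds for the attribute's value.
  def passes?(snapshot)
    value = @attr.split(".").reduce(snapshot) do |node, key|
      node.is_a?(Hash) ? node[key.to_sym] : nil
    end
    @conditions.all? { |condition| condition.call(value) }
  end
end

class SketchCheck
  attr_reader :rules

  def initialize
    @rules = []
  end

  # all? stops at the first failing rule, so one false result fails the check.
  def run(snapshot)
    @rules.all? { |rule| rule.passes?(snapshot) }
  end
end

snapshot = {
  :load_average => { :one_minute => 1.5 },
  :httpd => { :state => "running" }
}

check = SketchCheck.new
check.rules << SketchRule.new("load_average.one_minute").less_than(4)
check.rules << SketchRule.new("httpd.state").is("running")
check.run(snapshot) # => true

check.rules << SketchRule.new("httpd.state").is("stopped")
check.run(snapshot) # => false
```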

Other monitoring applications have you set up failure conditions. Overwatch, on the other hand, expects success. The end result is the same (if a failure occurs, you'll be alerted) but it changes your thinking slightly.
