monitoring

Thomas Mangin edited this page Apr 8, 2026 · 1 revision

Pre-Alpha. This page describes behavior that may change.

Ze gives you three different ways to watch a running daemon: an auto-refreshing peer dashboard for the operator at the keyboard, a live event stream for the engineer chasing a bug, and a Prometheus endpoint for everything else. They all read the same in-process state, so a number you see in the dashboard matches a metric you scrape from Prometheus.

The live peer dashboard

ze cli monitor bgp

The dashboard refreshes automatically every two seconds. It shows a sortable, color-coded peer table with router identity, state, uptime, and update rates. Navigate with j and k, sort with s or S, press Enter to drop into a peer's detail page, and press Esc to exit.

This is the screen you leave open on a side monitor when you are about to touch something risky. It is also the fastest way to confirm at a glance that a session you just brought up is actually moving messages.

The event stream

ze cli monitor event

The raw firehose. Filter it before it overwhelms you.

ze cli monitor event peer upstream event update direction received
ze cli monitor event event state                # Just up and down

The filters compose. peer <selector> narrows to one peer, event <type>[,<type>] selects which event types to include, and direction received or direction sent narrows to one side of the conversation. There is no exclude capability; you specify which types to include via event. The recognised event types are update, open, notification, keepalive, refresh, state, negotiated, eor, and rpki.
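The way the filters compose can be sketched as a small command builder. This is only an illustration: build_monitor_command is a hypothetical helper, not part of Ze, and it assumes the filters are always given in the order used in the examples above (peer, then event, then direction).

```python
# Event types listed in the paragraph above.
RECOGNISED_EVENT_TYPES = {
    "update", "open", "notification", "keepalive",
    "refresh", "state", "negotiated", "eor", "rpki",
}


def build_monitor_command(peer=None, events=(), direction=None):
    """Return an argv list for `ze cli monitor event` with optional filters.

    Hypothetical helper for illustration; it is not shipped with Ze.
    """
    unknown = [e for e in events if e not in RECOGNISED_EVENT_TYPES]
    if unknown:
        raise ValueError(f"unrecognised event type(s): {', '.join(unknown)}")
    if direction not in (None, "received", "sent"):
        raise ValueError("direction must be 'received' or 'sent'")

    argv = ["ze", "cli", "monitor", "event"]
    if peer is not None:
        argv += ["peer", peer]
    if events:
        # Several types are joined with commas: event <type>[,<type>]
        argv += ["event", ",".join(events)]
    if direction is not None:
        argv += ["direction", direction]
    return argv


if __name__ == "__main__":
    # Reproduces the two example invocations shown above.
    print(" ".join(build_monitor_command("upstream", ["update"], "received")))
    print(" ".join(build_monitor_command(events=["state"])))
```

Because there is no exclude capability, the builder only ever adds types to the include list; leaving events empty leaves the stream unfiltered by type.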

For scripts, pipe the stream through | json and parse it. Every event uses the same envelope: a peer block, a message block with id, direction, and type, and the per-event payload (update carries next-hop, as-path, local-preference, and the per-family NLRI lists; state carries the new FSM state).
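A script consuming that output might look like the sketch below. It rests on assumptions the text above does not confirm: that the JSON output is one object per line, and that the envelope keys are spelled message, type, and direction. The sample events are invented for illustration, and the contents of the peer block are left untouched because its fields are not described here.

```python
import json
from collections import Counter


def count_events(lines):
    """Count events by (message type, direction) from lines of JSON text.

    Assumes one JSON object per line with a "message" block holding
    "type" and "direction"; these key names are assumptions.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a complete JSON value
        if not isinstance(event, dict):
            continue
        message = event.get("message")
        if not isinstance(message, dict):
            continue
        counts[(message.get("type"), message.get("direction"))] += 1
    return counts


if __name__ == "__main__":
    # Invented sample events; real payloads carry more fields.
    sample = [
        '{"peer": {}, "message": {"id": 1, "direction": "received", "type": "update"}}',
        '{"peer": {}, "message": {"id": 2, "direction": "received", "type": "update"}}',
        '{"peer": {}, "message": {"id": 3, "direction": "sent", "type": "keepalive"}}',
    ]
    for (msg_type, direction), n in count_events(sample).items():
        print(f"{msg_type} {direction}: {n}")
```

In a real script, count_events would be given sys.stdin (or any iterable of lines) instead of the sample list.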

Prometheus

Ze exposes a Prometheus endpoint when you set telemetry { prometheus { ... } } in the config. Metrics refresh every ten seconds.

The metrics you actually need most of the time are these.

| Metric | Why you care |
| --- | --- |
| ze_peer_state{peer} | 3 means Established. Anything else is a graphable problem. |
| ze_peer_messages_received_total{peer,type} | Update rate, broken down by message type. |
| ze_peer_messages_sent_total{peer,type} | Same for outbound. |
| ze_bgp_prefix_count{peer,family} | Current per-family prefix count for a peer. |
| ze_bgp_prefix_ratio{peer,family} | count / maximum. Alert above 0.9. |
| ze_bgp_prefix_warning_exceeded{peer,family} | 1 if the warning threshold has been crossed. |
| ze_bgp_pool_used_ratio | Forwarding pool utilisation. Anything above 0.8 means you are close to congestion teardown. |
| ze_uptime_seconds | The reactor's uptime. |
| ze_info{version,router_id,local_as} | Tag for joins. |

There are also histograms for connection timing (ze_peer_dial_seconds, ze_peer_connect_attempt_seconds, ze_peer_backoff_seconds), per-peer overflow counters, and prefix-limit teardown counters. The full list lives in the in-tree monitoring guide.
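The thresholds in the table above can be checked with a few lines of script, sketched here. The parser covers only plain `name{labels} value` samples from the Prometheus text exposition format, not the whole format. The sample scrape is invented: the peer names and the family label value are hypothetical, and find_problems is not a Ze tool.

```python
import re

# name{labels} value, with the label block optional. A simplification of
# the Prometheus text exposition format (timestamps are ignored).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue
        samples.append((name, dict(LABEL_RE.findall(raw_labels or "")), value))
    return samples


def find_problems(samples):
    """Apply the thresholds from the metrics table to parsed samples."""
    problems = []
    for name, labels, value in samples:
        if name == "ze_peer_state" and value != 3:
            problems.append(
                f"peer {labels.get('peer')} is not Established (state={value:g})")
        elif name == "ze_bgp_prefix_ratio" and value > 0.9:
            problems.append(
                f"prefix ratio {value:.2f} for peer {labels.get('peer')} "
                f"family {labels.get('family')}")
        elif name == "ze_bgp_pool_used_ratio" and value > 0.8:
            problems.append(f"forwarding pool at {value:.2f}")
    return problems


# Invented scrape output; label values are hypothetical.
SAMPLE = """\
# TYPE ze_peer_state gauge
ze_peer_state{peer="upstream"} 3
ze_peer_state{peer="transit"} 1
ze_bgp_prefix_ratio{peer="upstream",family="ipv4/unicast"} 0.95
ze_bgp_pool_used_ratio 0.42
"""

if __name__ == "__main__":
    for problem in find_problems(parse_samples(SAMPLE)):
        print(problem)
```

In production the same conditions would more naturally live in Prometheus alerting rules; the script form is just a quick way to sanity-check a scrape by hand.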

Plugin event subscription

Plugins are first-class consumers of the same event stream. A plugin binding under a peer declares which events it wants:

peer upstream {
    process my-plugin {
        receive [ update state ]
    }
}

The plugin gets each event through its OnEvent callback. The set of recognised event types is the same as the CLI filter list, plus any extra types a plugin registers itself.

See also

Adapted from main/docs/guide/monitoring.md.
