Aggregated metrics for payjoin-service (with native OTLP) #1327
spacebear21 merged 7 commits into payjoin:master
Conversation
Pull Request Test Coverage Report for Build 22003578659 (Coveralls)
Are we including logs and traces in the information stored by the central instance? I prefer the previous approach because Prometheus is entirely metrics-focused, but if we ever need logs and traces in the central instance then that becomes a problem. I don't see a situation where we would need logs/traces from directory runners, though. This is a lot more straightforward to reason about.
The problem with Prometheus is that it's "pull" only. For a Grafana dashboard to aggregate metrics from multiple servers, each operator would have to make their 9090 port publicly reachable, and we'd need to register each operator's IP/domain as a scrape target in Grafana. On the other hand, OpenTelemetry can "push" metrics (and traces/logs optionally) directly to the target because outbound connections don't need additional configuration. Operators just need an auth token with write access to the central instance. So the options are:
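To make the push model concrete, here is a minimal sketch of initializing an OTLP/HTTP metrics exporter with the `opentelemetry-otlp` crate: the service opens an outbound connection to the central instance and authenticates with a token, so no inbound port needs to be reachable. The endpoint URL, header, and token below are placeholders, not this PR's actual configuration.

```rust
use std::collections::HashMap;

use opentelemetry_otlp::{MetricExporter, WithExportConfig, WithHttpConfig};
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};

fn init_push_metrics() -> anyhow::Result<SdkMeterProvider> {
    // Outbound OTLP/HTTP push: only the operator's auth token is needed,
    // no publicly reachable scrape port.
    let exporter = MetricExporter::builder()
        .with_http()
        .with_endpoint("https://telemetry.example.com/v1/metrics") // placeholder
        .with_headers(HashMap::from([(
            "authorization".to_string(),
            "Bearer <operator-token>".to_string(), // placeholder
        )]))
        .build()?;
    // Periodically flushes accumulated metrics to the exporter.
    let reader = PeriodicReader::builder(exporter).build();
    Ok(SdkMeterProvider::builder().with_reader(reader).build())
}
```

A Prometheus setup would instead require each operator's `/metrics` port to be registered as a scrape target, which is the configuration burden described above.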
My rationale for including those is based on Yuval's comment here. I agree this is sensitive though and share Dan's sentiment that we should "be very specific with shared metrics rather than enable logs in general." Both PRs include logs/traces, but we can choose to remove them from the emitted telemetry. But even if we only emit metrics, we still need the OTel dependency somewhere due to the push/pull issue I described above.
DanGould left a comment
Kicking myself for not reviewing this first thing yesterday. This approach makes way more sense to me than what had been proposed before. It's just a lot simpler.
I see we've deleted the health test on /metrics; it'd be nice to have a functional replacement, which I recommended below. But I wouldn't block this merge on that. The fact that `provider` wasn't used in that first commit was smelly and tipped me off that we were missing a test or something, even if the next commit did use the provider.
```diff
 pub async fn serve(config: Config) -> anyhow::Result<()> {
     let sentinel_tag = generate_sentinel_tag();
-    let metrics = MetricsService::new()?;
+    let metrics = MetricsService::new(None);
```
This is always set to None? Overly complex implementation?

It's a transitory commit; the next commit passes an optional value.
```rust
#[tokio::test]
async fn metrics_endpoint_works() {
    let cert = local_cert_key();
    let cert_der = cert.cert.der().to_vec();
    let key_der = cert.signing_key.serialize_der();

    let (port, metrics_port, _handle, _tempdir) =
        start_service(cert_der.clone(), key_der).await;

    let client = Arc::new(http_agent(cert_der).unwrap());
    let base_url = format!("https://localhost:{}", port);
    wait_for_service_ready(&base_url, client.clone()).await.unwrap();

    let metrics_url = format!("http://localhost:{}/metrics", metrics_port);
    let http_client = reqwest::Client::new();
    let response =
        http_client.get(&metrics_url).send().await.expect("metrics request should work");

    assert_eq!(response.status(), axum::http::StatusCode::OK);
    let body = response.text().await.unwrap();
    assert!(body.contains("http_request_total"));
    assert!(body.contains("active_connections"));
}
```
It'd be good to keep a plain "it runs" sanity check, probably in metrics.rs.
Requires `opentelemetry_sdk = { version = "0.31", features = ["testing"] }`.
```rust
#[tokio::test]
async fn metrics_endpoint_works() {
    let cert = local_cert_key();
    let cert_der = cert.cert.der().to_vec();
    let key_der = cert.signing_key.serialize_der();
    let (port, metrics_port, _handle, _tempdir) =
        start_service(cert_der.clone(), key_der).await;
    let client = Arc::new(http_agent(cert_der).unwrap());
    let base_url = format!("https://localhost:{}", port);
    wait_for_service_ready(&base_url, client.clone()).await.unwrap();
    let metrics_url = format!("http://localhost:{}/metrics", metrics_port);
    let http_client = reqwest::Client::new();
    let response =
        http_client.get(&metrics_url).send().await.expect("metrics request should work");
    assert_eq!(response.status(), axum::http::StatusCode::OK);
    let body = response.text().await.unwrap();
    assert!(body.contains("http_request_total"));
    assert!(body.contains("active_connections"));
}

#[test]
fn metrics_are_recorded() {
    use opentelemetry_sdk::metrics::{InMemoryMetricExporter, PeriodicReader, SdkMeterProvider};
    let exporter = InMemoryMetricExporter::default();
    let reader = PeriodicReader::builder(exporter.clone()).build();
    let provider = SdkMeterProvider::builder().with_reader(reader).build();
    let svc = MetricsService::new(Some(provider.clone()));
    svc.record_http_request("directory", "POST", 200);
    svc.record_connection_open();
    svc.record_connection_close();
    provider.force_flush().expect("flush failed");
    let finished = exporter.get_finished_metrics().expect("metrics");
    let metric_names: Vec<&str> = finished
        .iter()
        .flat_map(|rm| rm.scope_metrics())
        .flat_map(|sm| sm.metrics())
        .map(|m| m.name())
        .collect();
    assert!(metric_names.contains(&"http_request_total"), "missing http_request_total");
    assert!(metric_names.contains(&"total_connections"), "missing total_connections");
    assert!(metric_names.contains(&"active_connections"), "missing active_connections");
}
```
Force-pushed 16f8df1 to adde247
Force-pushed adde247 to e50cf0b
Push instead of Pull. Swap the `prometheus` crate for `opentelemetry` + `opentelemetry_sdk`; metrics instruments use the OTel Metrics API. Remove the standalone `/metrics` Prometheus endpoint, its dedicated listener/port, and all related code no longer needed for OTLP push.
This enables structured log output and configures exporters for OpenTelemetry.
Force-pushed e50cf0b to 309ac37
Instead of relying on opaque default env variables, make telemetry configurable via config.toml or `PJ_` env variables.
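As a hypothetical illustration of what such configuration could look like (the section and key names here are my invention, not the PR's actual schema):

```toml
# Illustrative only: the PR's actual key names may differ.
[telemetry]
otlp_endpoint = "https://telemetry.example.com/v1/metrics"
otlp_auth_token = "<operator-token>"
```

with the same settings overridable through `PJ_`-prefixed environment variables.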
The latest push adds a metrics test per Dan's feedback. I also removed the logs/traces exporters and moved telemetry config into payjoin-service. I also deployed this to lets.payjo.in via the docker image created in the PR build artifact, h/t @benalleng for this awesome workflow. Metrics are flowing to Grafana. Kind of fun to look at scrapers trying to steal secrets.
DanGould left a comment
This is elite. Won't work on Core without exposing port 80 but that's a problem for later.
nix / docker / inline config is very convenient. Great!
```rust
use opentelemetry_sdk::metrics::{InMemoryMetricExporter, PeriodicReader, SdkMeterProvider};
use payjoin_test_utils::{http_agent, local_cert_key, wait_for_service_ready};
use rustls::pki_types::CertificateDer;
use rustls::RootCertStore;
use tempfile::tempdir;

use super::*;
use crate::metrics::{ACTIVE_CONNECTIONS, HTTP_REQUESTS, TOTAL_CONNECTIONS};
```
nit: importing these as `crate::metrics` is a good sign this belongs in metrics.rs
```rust
pub(crate) const TOTAL_CONNECTIONS: &str = "total_connections";
pub(crate) const ACTIVE_CONNECTIONS: &str = "active_connections";
pub(crate) const HTTP_REQUESTS: &str = "http_request_total";
```
Along with these requiring `pub(crate)`.
```toml
opentelemetry-otlp = { version = "0.31", optional = true, features = [
    "reqwest-rustls",
] }
opentelemetry_sdk = "0.31"
```
Moving Config requires a new feature because of `WithHttpConfig`?

Because `.with_endpoint()` does extra validation and rejects https without that feature.

This PR sketches an alternative approach to #1323. I realized that for a distributed ecosystem of payjoin-service operators, OpenTelemetry's "push" model is better suited than Prometheus's "pull" approach. Instead of setting up an OTel Collector sidecar that scrapes /metrics and pushes it to a Grafana instance, this approach removes Prometheus entirely and collects metrics with the OpenTelemetry API, pushing them from the payjoin-service app directly; no sidecar needed.
Architecture diagram for comparison with the one in #1323:
Unless there is a really good reason for us to serve /metrics on a separate interface, I think this approach is much more straightforward to conceptualize, and is simpler for operators to run since everything remains encapsulated in the payjoin-service binary (compare this PR's README with the other PR's). If for some reason an operator wants to expose a Prometheus /metrics interface, they could still do so by ingesting the OTLP traffic in their Prometheus instance.

AFAICT the only downside is that the OTel API is somewhat less concise than the Prometheus API, e.g. `self.total_connections.add(1, &[]);` instead of `self.total_connections.inc();`.

I prompted Opus 4.6 for a no-sidecar approach and had it do the rewrite.
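To make that API difference concrete, here is a minimal sketch of the OTel Metrics counter calls; the meter name and attribute are illustrative, not necessarily the PR's actual instruments.

```rust
use opentelemetry::global;
use opentelemetry::KeyValue;

fn record_request() {
    // Illustrative meter/instrument names.
    let meter = global::meter("payjoin-service");
    let total = meter.u64_counter("http_request_total").build();
    // OTel counters take a slice of attributes on every call, hence
    // `.add(1, &[...])` where the Prometheus crate offers the terser `.inc()`.
    total.add(1, &[KeyValue::new("endpoint", "directory")]);
}
```

The attribute slice is the trade-off for dimensionality: Prometheus encodes labels at instrument-creation time, while OTel accepts them per measurement.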
Pull Request Checklist
Please confirm the following before requesting review:
AI
in the body of this PR.