Aggregated metrics for payjoin-service (with native OTLP)#1327

Merged
spacebear21 merged 7 commits into payjoin:master from spacebear21:open-telemetry-pure-metrics
Feb 14, 2026

Conversation

@spacebear21
Collaborator

@spacebear21 spacebear21 commented Feb 12, 2026

This PR sketches an alternative approach to #1323. I realized that for a distributed ecosystem of payjoin-service operators, OpenTelemetry's "push" model is better suited than Prometheus's "pull" approach. Instead of setting up an OTel Collector sidecar that scrapes /metrics and pushes the results to a Grafana instance, this approach removes Prometheus entirely and collects metrics with the OpenTelemetry API, pushing them directly from the payjoin-service app with no sidecar needed.

Architecture diagram for comparison with the one in #1323:

  ┌───────────────────┐
  │  Operator Server A│
  │  ┌───────────────┐│
  │  │ payjoin-      ││     OTLP/HTTP
  │  │ service       ├┼──────────────────────┐
  │  │               ││                      │
  │  └───────────────┘│                      │
  └───────────────────┘                      │
                                             ▼
  ┌───────────────────┐         ┌────────────────────────┐
  │  Operator Server B│         │  Grafana Cloud         │
  │  ┌───────────────┐│         │                        │
  │  │ payjoin-      ││  OTLP   │  Mimir  (metrics)      │
  │  │ service       ├┼────────►│  Loki   (logs)         │
  │  │               ││         │  Tempo  (traces)       │
  │  └───────────────┘│         │                        │
  └───────────────────┘         │  Grafana (dashboards)  │
                                └────────────────────────┘

Unless there is a really good reason for us to serve /metrics on a separate interface, I think this approach is much more straightforward to conceptualize, and is simpler for operators to run since everything remains encapsulated in the payjoin-service binary (compare this PR's README with the other PR's). If for some reason an operator wants to expose a Prometheus /metrics interface, they could still do so by ingesting the OTLP traffic in their Prometheus instance.

AFAICT the only downside is that the otel API is somewhat less concise than the Prometheus API, e.g. self.total_connections.add(1, &[]); instead of self.total_connections.inc();.
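
If the extra verbosity ever becomes annoying, a small extension trait could bring back the `inc()` shorthand. The sketch below is std-only and uses a local stand-in `Counter` type that merely mirrors the `add(value, attributes)` call shape quoted above; it is not the `opentelemetry` crate's counter, and the `Inc` trait is a hypothetical helper, not something in this PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Local stand-in for an OTel-style counter (NOT the `opentelemetry` crate's type).
/// It mirrors only the call shape quoted above: `add(value, attributes)`.
struct Counter {
    value: AtomicU64,
}

impl Counter {
    fn new() -> Self {
        Counter { value: AtomicU64::new(0) }
    }

    /// OTel-style call: an explicit increment plus a (possibly empty) attribute slice.
    fn add(&self, value: u64, _attributes: &[(&str, &str)]) {
        self.value.fetch_add(value, Ordering::Relaxed);
    }

    /// Reads the current total (only needed for this illustration).
    fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

/// Hypothetical extension trait that restores the Prometheus-style `inc()` shorthand.
trait Inc {
    fn inc(&self);
}

impl Inc for Counter {
    fn inc(&self) {
        self.add(1, &[]);
    }
}

fn main() {
    let total_connections = Counter::new();
    total_connections.add(1, &[]); // OTel-style call site
    total_connections.inc(); // shorthand via the extension trait
    println!("total_connections = {}", total_connections.get());
}
```

Whether such a wrapper is worth it is a judgment call; the explicit attribute slice is what allows labels to be attached per call.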

I prompted Opus 4.6 for a no-sidecar approach and had it do the rewrite.

@spacebear21 spacebear21 changed the title Replace Prometheus with OpenTelemetry Aggregated metrics for payjoin-service (with native OTLP) Feb 12, 2026
@coveralls
Collaborator

coveralls commented Feb 12, 2026

Pull Request Test Coverage Report for Build 22003578659

Details

  • 73 of 114 (64.04%) changed or added relevant lines in 5 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.08%) to 83.18%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| payjoin-service/src/config.rs | 3 | 5 | 60.0% |
| payjoin-service/src/lib.rs | 41 | 50 | 82.0% |
| payjoin-service/src/main.rs | 0 | 30 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| payjoin-service/src/lib.rs | 2 | 76.21% |
| payjoin-service/src/main.rs | 2 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 21997882549 | -0.08% |
| Covered Lines | 10217 |
| Relevant Lines | 12283 |

💛 - Coveralls

@zealsham
Collaborator

zealsham commented Feb 12, 2026

Are we including logs and traces in the information stored by the central instance? I prefer the previous approach because Prometheus is entirely metrics-focused, but if we ever need logs and traces in the central instance then that becomes a problem. I don't see a situation where we would need logs/traces from directory runners though.

This is a lot more straightforward to reason about.

@spacebear21
Collaborator Author

spacebear21 commented Feb 12, 2026

> I prefer the previous approach because Prometheus is entirely metrics-focused, but if we ever need logs and traces in the central instance then that becomes a problem.

The problem with Prometheus is that it's "pull" only. For a Grafana dashboard to aggregate metrics from multiple servers, each operator would have to make their 9090 port publicly reachable, and we'd need to register each operator's IP/domain as a scrape target in Grafana. On the other hand, OpenTelemetry can "push" metrics (and traces/logs optionally) directly to the target because outbound connections don't need additional configuration. Operators just need an auth token with write access to the central instance.
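
To make the push side of that argument concrete, here is a std-only toy model, not the OTLP protocol or the code in this PR. A channel sender stands in for "endpoint URL plus write token", and the `Sample` type, operator names, and function names are all hypothetical. The point it illustrates is that the central side aggregates whatever arrives and never needs a list of operator addresses.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// One pushed data point: which operator sent it and how many requests it
/// handled since its last push. (Hypothetical type; real OTLP payloads are richer.)
struct Sample {
    operator: String,
    http_requests_delta: u64,
}

/// Operator side: only an outbound handle is needed (standing in for an
/// endpoint URL plus a write token); no inbound port is opened.
fn push_metrics(operator: &str, http_requests_delta: u64, central: Sender<Sample>) {
    central
        .send(Sample { operator: operator.to_string(), http_requests_delta })
        .expect("central receiver should be alive");
}

/// Central side: sums the deltas that arrive, without any list of scrape targets.
fn aggregate(samples: impl Iterator<Item = Sample>) -> HashMap<String, u64> {
    let mut totals = HashMap::new();
    for s in samples {
        *totals.entry(s.operator).or_insert(0) += s.http_requests_delta;
    }
    totals
}

fn main() {
    let (tx, rx) = channel();
    let handles: Vec<_> = vec![("operator-a", 3u64), ("operator-b", 5u64)]
        .into_iter()
        .map(|(name, count)| {
            let tx = tx.clone();
            thread::spawn(move || push_metrics(name, count, tx))
        })
        .collect();
    drop(tx); // close the channel once every operator holds its own sender
    for h in handles {
        h.join().unwrap();
    }
    let totals = aggregate(rx.into_iter());
    println!("{:?}", totals);
}
```

In a pull model the roles would be inverted: the central side would need every operator's address up front and each operator would have to accept inbound connections.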

So the options are:

> I don't see a situation where we would need logs/traces from directory runners though.

My rationale for including those is based on Yuval's comment here. I agree this is sensitive though and share Dan's sentiment that we should "be very specific with shared metrics rather than enable logs in general."

Both PRs include logs/traces, but we can choose to remove them from the emitted telemetry. But even if we only emit metrics, we still need the OTel dependency somewhere due to the push/pull issue I described above.

Contributor

@DanGould DanGould left a comment

Kicking myself for not reviewing this first thing yesterday. This approach makes way more sense to me than what had been proposed before. It's just a lot simpler.

I see we've deleted the health test on /metrics. It'd be nice to have a functional replacement for it, which I've suggested below, but I wouldn't block this merge on that. The provider going unused in that first commit was a smell that tipped me off that we were missing a test or something, even if the next commit did use the provider.

Comment thread payjoin-service/src/lib.rs Outdated
  pub async fn serve(config: Config) -> anyhow::Result<()> {
      let sentinel_tag = generate_sentinel_tag();
-     let metrics = MetricsService::new()?;
+     let metrics = MetricsService::new(None);
Contributor

This is always set to None? Overly complex implementation?

Collaborator Author

It's a transitory commit, the next commit passes an optional value.

Comment on lines -382 to -405

#[tokio::test]
async fn metrics_endpoint_works() {
    let cert = local_cert_key();
    let cert_der = cert.cert.der().to_vec();
    let key_der = cert.signing_key.serialize_der();

    let (port, metrics_port, _handle, _tempdir) =
        start_service(cert_der.clone(), key_der).await;

    let client = Arc::new(http_agent(cert_der).unwrap());
    let base_url = format!("https://localhost:{}", port);
    wait_for_service_ready(&base_url, client.clone()).await.unwrap();

    let metrics_url = format!("http://localhost:{}/metrics", metrics_port);
    let http_client = reqwest::Client::new();
    let response =
        http_client.get(&metrics_url).send().await.expect("metrics request should work");

    assert_eq!(response.status(), axum::http::StatusCode::OK);
    let body = response.text().await.unwrap();
    assert!(body.contains("http_request_total"));
    assert!(body.contains("active_connections"));
}
Contributor

It'd be good to keep a plain "it runs" sanity check, probably in metrics.rs.

Requires opentelemetry_sdk = { version = "0.31", features = ["testing"] }.

Suggested change
- #[tokio::test]
- async fn metrics_endpoint_works() {
-     let cert = local_cert_key();
-     let cert_der = cert.cert.der().to_vec();
-     let key_der = cert.signing_key.serialize_der();
-     let (port, metrics_port, _handle, _tempdir) =
-         start_service(cert_der.clone(), key_der).await;
-     let client = Arc::new(http_agent(cert_der).unwrap());
-     let base_url = format!("https://localhost:{}", port);
-     wait_for_service_ready(&base_url, client.clone()).await.unwrap();
-     let metrics_url = format!("http://localhost:{}/metrics", metrics_port);
-     let http_client = reqwest::Client::new();
-     let response =
-         http_client.get(&metrics_url).send().await.expect("metrics request should work");
-     assert_eq!(response.status(), axum::http::StatusCode::OK);
-     let body = response.text().await.unwrap();
-     assert!(body.contains("http_request_total"));
-     assert!(body.contains("active_connections"));
- }
+ #[test]
+ fn metrics_are_recorded() {
+     use opentelemetry_sdk::metrics::{InMemoryMetricExporter, PeriodicReader, SdkMeterProvider};
+     let exporter = InMemoryMetricExporter::default();
+     let reader = PeriodicReader::builder(exporter.clone()).build();
+     let provider = SdkMeterProvider::builder().with_reader(reader).build();
+     let svc = MetricsService::new(Some(provider.clone()));
+     svc.record_http_request("directory", "POST", 200);
+     svc.record_connection_open();
+     svc.record_connection_close();
+     provider.force_flush().expect("flush failed");
+     let finished = exporter.get_finished_metrics().expect("metrics");
+     let metric_names: Vec<&str> = finished
+         .iter()
+         .flat_map(|rm| rm.scope_metrics())
+         .flat_map(|sm| sm.metrics())
+         .map(|m| m.name())
+         .collect();
+     assert!(metric_names.contains(&"http_request_total"), "missing http_request_total");
+     assert!(metric_names.contains(&"total_connections"), "missing total_connections");
+     assert!(metric_names.contains(&"active_connections"), "missing active_connections");
+ }

@spacebear21 spacebear21 force-pushed the open-telemetry-pure-metrics branch from 16f8df1 to adde247 Compare February 13, 2026 21:28
@spacebear21 spacebear21 marked this pull request as ready for review February 13, 2026 21:29
@spacebear21 spacebear21 force-pushed the open-telemetry-pure-metrics branch from adde247 to e50cf0b Compare February 13, 2026 21:31
Push instead of Pull.

Swap the `prometheus` crate for `opentelemetry` + `opentelemetry_sdk`.
Metrics instruments use the OTel Metrics API.

Remove the standalone `/metrics` Prometheus endpoint, its dedicated
listener/port, and all related code no longer needed for OTLP push.
This enables structured log output and configures exporters for
OpenTelemetry.
@spacebear21 spacebear21 force-pushed the open-telemetry-pure-metrics branch from e50cf0b to 309ac37 Compare February 13, 2026 21:32
Instead of relying on obfuscated default env variables, make telemetry
configurable via config.toml or `PJ_` env variables.
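
As a rough illustration of what "config.toml or `PJ_` env variables" could mean in practice, the std-only sketch below resolves one telemetry setting from two sources. Everything named here is an assumption: the `TelemetryConfig` struct, the `PJ_TELEMETRY_ENDPOINT` variable name, and the rule that the environment overrides the file are placeholders for illustration, not the names or precedence used by payjoin-service.

```rust
use std::collections::HashMap;

/// Hypothetical telemetry settings; the field name is illustrative only.
#[derive(Debug, PartialEq)]
struct TelemetryConfig {
    endpoint: Option<String>,
}

/// Resolves the endpoint from a file value and an environment map.
/// Assumption: a `PJ_`-prefixed variable overrides the value from the file.
fn resolve(file_endpoint: Option<&str>, env: &HashMap<String, String>) -> TelemetryConfig {
    let endpoint = env
        .get("PJ_TELEMETRY_ENDPOINT") // hypothetical variable name
        .map(String::as_str)
        .or(file_endpoint)
        .map(str::to_owned);
    TelemetryConfig { endpoint }
}

fn main() {
    // Snapshot the process environment and pretend the file supplied a default.
    let env: HashMap<String, String> = std::env::vars().collect();
    let cfg = resolve(Some("https://otlp.example.invalid"), &env);
    println!("{:?}", cfg);
}
```

Leaving the endpoint as an `Option` keeps "telemetry not configured" representable, so an operator who sets nothing exports nothing.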
@spacebear21
Collaborator Author

The latest push adds a metrics test per Dan's feedback. I also removed the logs/traces exporters, and moved telemetry config to the payjoin-service Config instead of a dedicated .env file. Each of these is a standalone commit. They could be squashed, but I think keeping them separate makes the changes easier to review and to revert individually if needed (e.g. adding logs/traces back at some point).

I also deployed this to lets.payjo.in via the docker image created in the PR build artifact, h/t @benalleng for this awesome workflow. Metrics are flowing to Grafana. Kind of fun to look at scrapers trying to steal secrets.

[Image: Screenshot 2026-02-13 at 17 19 35]

Contributor

@DanGould DanGould left a comment

This is elite. Won't work on Core without exposing port 80 but that's a problem for later.

nix / docker / inline config is very convenient. Great!

Comment on lines +245 to +252
use opentelemetry_sdk::metrics::{InMemoryMetricExporter, PeriodicReader, SdkMeterProvider};
use payjoin_test_utils::{http_agent, local_cert_key, wait_for_service_ready};
use rustls::pki_types::CertificateDer;
use rustls::RootCertStore;
use tempfile::tempdir;

use super::*;
use crate::metrics::{ACTIVE_CONNECTIONS, HTTP_REQUESTS, TOTAL_CONNECTIONS};
Contributor

nit: importing these as crate::metrics is a good sign this belongs in metrics.rs

Comment on lines +5 to 8
pub(crate) const TOTAL_CONNECTIONS: &str = "total_connections";
pub(crate) const ACTIVE_CONNECTIONS: &str = "active_connections";
pub(crate) const HTTP_REQUESTS: &str = "http_request_total";

Contributor

Along with these requiring pub(crate)

Comment on lines +37 to 40
opentelemetry-otlp = { version = "0.31", optional = true, features = [
    "reqwest-rustls",
] }
opentelemetry_sdk = "0.31"
Contributor

Moving Config requires a new feature because of "WithHttpConfig"?

Collaborator Author

Because .with_endpoint() does extra validation and rejects https without that feature.

@spacebear21 spacebear21 merged commit 4959ff0 into payjoin:master Feb 14, 2026
16 checks passed
4 participants