Aggregated metrics for payjoin-service (with native OTLP) #1327
spacebear21 merged 7 commits into payjoin:master
Conversation
Pull Request Test Coverage Report for Build 22003578659 (Coveralls)
Are we including logs and traces in the information stored by the central instance? I prefer the previous approach because Prometheus is entirely metrics-focused, but if we ever need logs and traces in the central instance then that becomes a problem. I don't see a situation where we would need logs/traces from directory runners, though. This is a lot more straightforward to reason about.
The problem with Prometheus is that it's "pull" only. For a Grafana dashboard to aggregate metrics from multiple servers, each operator would have to make their 9090 port publicly reachable, and we'd need to register each operator's IP/domain as a scrape target in Grafana. On the other hand, OpenTelemetry can "push" metrics (and traces/logs optionally) directly to the target because outbound connections don't need additional configuration. Operators just need an auth token with write access to the central instance. So the options are:
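To make the push model concrete, here is a minimal sketch of initializing an OTLP/HTTP metrics exporter with the `opentelemetry-otlp` crate: the service opens an outbound connection to the central instance and authenticates with a token, so no inbound port needs to be reachable. The endpoint URL, header, and token below are placeholders, not this PR's actual configuration.

```rust
use std::collections::HashMap;

use opentelemetry_otlp::{MetricExporter, WithExportConfig, WithHttpConfig};
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};

fn init_push_metrics() -> anyhow::Result<SdkMeterProvider> {
    // Outbound OTLP/HTTP push: only the operator's auth token is needed,
    // no publicly reachable scrape port.
    let exporter = MetricExporter::builder()
        .with_http()
        .with_endpoint("https://telemetry.example.com/v1/metrics") // placeholder
        .with_headers(HashMap::from([(
            "authorization".to_string(),
            "Bearer <operator-token>".to_string(), // placeholder
        )]))
        .build()?;
    // Periodically flushes accumulated metrics to the exporter.
    let reader = PeriodicReader::builder(exporter).build();
    Ok(SdkMeterProvider::builder().with_reader(reader).build())
}
```

A Prometheus setup would instead require each operator's `/metrics` port to be registered as a scrape target, which is the configuration burden described above.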
My rationale for including those is based on Yuval's comment here. I agree this is sensitive though and share Dan's sentiment that we should "be very specific with shared metrics rather than enable logs in general." Both PRs include logs/traces, but we can choose to remove them from the emitted telemetry. But even if we only emit metrics, we still need the OTel dependency somewhere due to the push/pull issue I described above.
DanGould left a comment
Kicking myself for not reviewing this first thing yesterday. This approach makes way more sense to me than what had been proposed before. It's just a lot simpler.
I see we've deleted the health test on /metrics; it'd be nice to have a functional replacement, which I recommended below. But I wouldn't block this merge on that. The fact that `provider` wasn't used in that first commit was smelly and tipped me off that we were missing a test or something, even if the next commit did use the provider.
```diff
 pub async fn serve(config: Config) -> anyhow::Result<()> {
     let sentinel_tag = generate_sentinel_tag();
-    let metrics = MetricsService::new()?;
+    let metrics = MetricsService::new(None);
```
This is always set to None? Overly complex implementation?

It's a transitory commit; the next commit passes an optional value.
```rust
#[tokio::test]
async fn metrics_endpoint_works() {
    let cert = local_cert_key();
    let cert_der = cert.cert.der().to_vec();
    let key_der = cert.signing_key.serialize_der();

    let (port, metrics_port, _handle, _tempdir) =
        start_service(cert_der.clone(), key_der).await;

    let client = Arc::new(http_agent(cert_der).unwrap());
    let base_url = format!("https://localhost:{}", port);
    wait_for_service_ready(&base_url, client.clone()).await.unwrap();

    let metrics_url = format!("http://localhost:{}/metrics", metrics_port);
    let http_client = reqwest::Client::new();
    let response =
        http_client.get(&metrics_url).send().await.expect("metrics request should work");

    assert_eq!(response.status(), axum::http::StatusCode::OK);
    let body = response.text().await.unwrap();
    assert!(body.contains("http_request_total"));
    assert!(body.contains("active_connections"));
}
```
It'd be good to keep a plain "it runs" sanity check, probably in metrics.rs.
Requires `opentelemetry_sdk = { version = "0.31", features = ["testing"] }`.
```rust
#[tokio::test]
async fn metrics_endpoint_works() {
    let cert = local_cert_key();
    let cert_der = cert.cert.der().to_vec();
    let key_der = cert.signing_key.serialize_der();
    let (port, metrics_port, _handle, _tempdir) =
        start_service(cert_der.clone(), key_der).await;
    let client = Arc::new(http_agent(cert_der).unwrap());
    let base_url = format!("https://localhost:{}", port);
    wait_for_service_ready(&base_url, client.clone()).await.unwrap();
    let metrics_url = format!("http://localhost:{}/metrics", metrics_port);
    let http_client = reqwest::Client::new();
    let response =
        http_client.get(&metrics_url).send().await.expect("metrics request should work");
    assert_eq!(response.status(), axum::http::StatusCode::OK);
    let body = response.text().await.unwrap();
    assert!(body.contains("http_request_total"));
    assert!(body.contains("active_connections"));
}

#[test]
fn metrics_are_recorded() {
    use opentelemetry_sdk::metrics::{InMemoryMetricExporter, PeriodicReader, SdkMeterProvider};
    let exporter = InMemoryMetricExporter::default();
    let reader = PeriodicReader::builder(exporter.clone()).build();
    let provider = SdkMeterProvider::builder().with_reader(reader).build();
    let svc = MetricsService::new(Some(provider.clone()));
    svc.record_http_request("directory", "POST", 200);
    svc.record_connection_open();
    svc.record_connection_close();
    provider.force_flush().expect("flush failed");
    let finished = exporter.get_finished_metrics().expect("metrics");
    let metric_names: Vec<&str> = finished
        .iter()
        .flat_map(|rm| rm.scope_metrics())
        .flat_map(|sm| sm.metrics())
        .map(|m| m.name())
        .collect();
    assert!(metric_names.contains(&"http_request_total"), "missing http_request_total");
    assert!(metric_names.contains(&"total_connections"), "missing total_connections");
    assert!(metric_names.contains(&"active_connections"), "missing active_connections");
}
```
Force-pushed 16f8df1 to adde247
Force-pushed adde247 to e50cf0b
Push instead of Pull. Swap the `prometheus` crate for `opentelemetry` + `opentelemetry_sdk`; metrics instruments use the OTel Metrics API. Remove the standalone `/metrics` Prometheus endpoint, its dedicated listener/port, and all related code no longer needed for OTLP push.
This enables structured log output and configures exporters for OpenTelemetry.
Force-pushed e50cf0b to 309ac37
Instead of relying on opaque default env variables, make telemetry configurable via config.toml or `PJ_` env variables.
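As a hypothetical illustration of what such configuration could look like (the section and key names here are my invention, not the PR's actual schema):

```toml
# Illustrative only: the PR's actual key names may differ.
[telemetry]
otlp_endpoint = "https://telemetry.example.com/v1/metrics"
otlp_auth_token = "<operator-token>"
```

with the same settings overridable through `PJ_`-prefixed environment variables.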
The latest push adds a metrics test per Dan's feedback. I also removed the logs/traces exporters and moved telemetry config into payjoin-service. I also deployed this to lets.payjo.in via the docker image created in the PR build artifact, h/t @benalleng for this awesome workflow. Metrics are flowing to Grafana. Kind of fun to look at scrapers trying to steal secrets.
DanGould left a comment
This is elite. Won't work on Core without exposing port 80 but that's a problem for later.
nix / docker / inline config is very convenient. Great!
```rust
use opentelemetry_sdk::metrics::{InMemoryMetricExporter, PeriodicReader, SdkMeterProvider};
use payjoin_test_utils::{http_agent, local_cert_key, wait_for_service_ready};
use rustls::pki_types::CertificateDer;
use rustls::RootCertStore;
use tempfile::tempdir;

use super::*;
use crate::metrics::{ACTIVE_CONNECTIONS, HTTP_REQUESTS, TOTAL_CONNECTIONS};
```
nit: importing these as `crate::metrics` is a good sign this belongs in metrics.rs
```rust
pub(crate) const TOTAL_CONNECTIONS: &str = "total_connections";
pub(crate) const ACTIVE_CONNECTIONS: &str = "active_connections";
pub(crate) const HTTP_REQUESTS: &str = "http_request_total";
```
Along with these requiring `pub(crate)`.
```toml
opentelemetry-otlp = { version = "0.31", optional = true, features = [
    "reqwest-rustls",
] }
opentelemetry_sdk = "0.31"
```
Moving Config requires a new feature because of `WithHttpConfig`?

Because `.with_endpoint()` does extra validation and rejects https without that feature.

This PR sketches an alternative approach to #1323. I realized that for a distributed ecosystem of payjoin-service operators, OpenTelemetry's "push" model is better suited than Prometheus's "pull" approach. Instead of setting up an OTel Collector sidecar that scrapes /metrics and pushes it to a Grafana instance, this approach removes Prometheus entirely and collects metrics with the OpenTelemetry API, pushing them from the payjoin-service app directly; no sidecar needed.
Architecture diagram for comparison with the one in #1323:
Unless there is a really good reason for us to serve /metrics on a separate interface, I think this approach is much more straightforward to conceptualize, and is simpler for operators to run since everything remains encapsulated in the payjoin-service binary (compare this PR's README with the other PR's). If for some reason an operator wants to expose a Prometheus /metrics interface, they could still do so by ingesting the OTLP traffic in their Prometheus instance.

AFAICT the only downside is that the OTel API is somewhat less concise than the Prometheus API, e.g. `self.total_connections.add(1, &[]);` instead of `self.total_connections.inc();`.

I prompted Opus 4.6 for a no-sidecar approach and had it do the rewrite.
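To make that API difference concrete, here is a minimal sketch of the OTel Metrics counter calls; the meter name and attribute are illustrative, not necessarily the PR's actual instruments.

```rust
use opentelemetry::global;
use opentelemetry::KeyValue;

fn record_request() {
    // Illustrative meter/instrument names.
    let meter = global::meter("payjoin-service");
    let total = meter.u64_counter("http_request_total").build();
    // OTel counters take a slice of attributes on every call, hence
    // `.add(1, &[...])` where the Prometheus crate offers the terser `.inc()`.
    total.add(1, &[KeyValue::new("endpoint", "directory")]);
}
```

The attribute slice is the trade-off for dimensionality: Prometheus encodes labels at instrument-creation time, while OTel accepts them per measurement.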
Pull Request Checklist
Please confirm the following before requesting review:
AI
in the body of this PR.