
[feat(infra)] Add Prometheus and Grafana observability stack #14

Merged
nxdun merged 19 commits into main from nadun/observe-telem
May 12, 2026
Conversation

nxdun (Owner) commented May 11, 2026

Description

Introduces a full observability layer using Prometheus and Grafana across local dev and production environments. This PR adds deep instrumentation to both core services (ytdlp) and security middleware (captcha), exposed via a protected /metrics endpoint.

Key additions:

  • Observability Framework: Added metrics and metrics-exporter-prometheus crates; extracted telemetry setup into src/telemetry.rs.
  • YTDLP Instrumentation: Track job lifecycle (enqueued, active, completed) and processing duration.
  • Captcha & Security Monitoring: Instrumented the middleware with captcha_check_total (labels: status, reason) and captcha_verify_duration_seconds to monitor hits, API key bypasses, and validation failures.
  • Automated Dashboards: Provisioned "Nadzu API - Worker Health" and "Captcha & Security Dashboard" as code.
  • Infra Automation: Updated Makefile, Terraform, and Cloud-init to handle S3 uploads and automated fetching of all observability artifacts (configs + dashboards).
  • Caddy Refactor: Implemented global @internal_only matcher to protect /nadun/fs/*, /nadun/grafana/*, and /metrics across all environments.
  • Dev Workflow Standardization: Standardized docker-compose.dev.yml to use production-style image builds, resolving previous binary staleness and linking issues in Alpine/musl environments.

Types of changes

  • Feature
  • Infra
  • Refactor

Checklist

  • Confirmed Grafana provisioning configs are valid and dashboards load automatically.
  • Verified captcha_check_total metrics are correctly ingested and labeled in Prometheus.
  • Tested /metrics scrape endpoint is reachable by Prometheus and blocked externally via @internal_only.
  • Rebuilt local stack using image-based deployment to verify fix for stale volume mounts.
  • Updated ChangeLog
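The captcha and scrape checks above can be spot-checked from the Prometheus UI. The query shapes below are a sketch using the metric and job names from this PR; the exact label values are illustrative:

```promql
# Rate of captcha checks over 5m, split by outcome labels
sum by (status, reason) (rate(captcha_check_total[5m]))

# p99 captcha verification latency (reads the histogram's _bucket series)
histogram_quantile(0.99, sum by (le) (rate(captcha_verify_duration_seconds_bucket[5m])))

# Confirm the scrape target is up
up{job="nadzu-api"}
```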

Summary by CodeRabbit

New Features

  • Added integrated observability platform with metrics collection and visualization dashboards
  • Exposed /metrics endpoint for monitoring system health
  • Added worker health dashboard displaying success rates, active workers, error counts, and latency metrics
  • Added security dashboard for monitoring verification and authentication events
  • Implemented comprehensive metrics tracking for job processing and request verification

Infrastructure & DevOps

  • Updated routing configuration with access controls for monitoring endpoints
  • Enhanced development environment with monitoring services
  • Improved build environment dependencies for cryptographic operations

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds centralized tracing and a /metrics endpoint, instruments yt-dlp and CAPTCHA middleware with Prometheus metrics, provisions Prometheus and Grafana (datasource + dashboards), protects observability routes in Caddy, and wires Makefile/terraform/cloud-init to publish and fetch presigned provisioning artifacts.
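A minimal prometheus.yml matching the scrape job described in this walkthrough would look roughly like the following. This is a sketch: only the job name, target, and interval come from this PR; everything else is Prometheus defaults:

```yaml
global:
  scrape_interval: 10s        # 10s global interval per this PR

scrape_configs:
  - job_name: nadzu-api
    static_configs:
      - targets: ["app:8080"] # the app service exposes GET /metrics
```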

Changes

Observability and Monitoring System

  • Build deps & packaging (Cargo.toml, Dockerfile, Dockerfile.dev, Makefile): Added metrics and metrics-exporter-prometheus crates; added OpenSSL dev/static libs to the Alpine build; updated the Makefile to publish observability assets to S3.
  • Telemetry module (src/telemetry.rs, src/lib.rs): New module: init_tracing() initializes the tracing subscriber; the AppTraceLayer type and build_trace_layer() configure HTTP tracing; setup_metrics_router() installs the Prometheus recorder and exposes GET /metrics.
  • App startup wiring (src/app.rs): App startup now calls telemetry::init_tracing() and telemetry::build_trace_layer() and merges the telemetry /metrics router into the main router.
  • Ytdlp job instrumentation (src/services/ytdlp/mod.rs): Recorded metrics: ytdlp_jobs_enqueued_total, ytdlp_active_jobs, ytdlp_job_duration_seconds, and ytdlp_jobs_completed_total{status=...}, with decrements on job finish; measures job start time and sets the success flag on successful file moves.
  • Captcha middleware instrumentation (src/middleware/captcha.rs): Adds counters and a histogram for CAPTCHA verification outcomes and latency; labels emitted for bypass, missing/config errors, upstream/parse failures, invalid, and success cases.
  • Prometheus config (infra/common/prometheus.yml): Added Prometheus scrape job nadzu-api targeting app:8080 with a 10s global scrape interval.
  • Grafana provisioning (infra/common/grafana/provisioning/*): Configured the Grafana datasource for Prometheus (http://prometheus:9090) and a file-based dashboard provider; added ytdlp-health.json and captcha-security.json dashboards.
  • Production & local Caddy routing (Caddyfile, Caddyfile.local): Parameterized the production domain, enabled trusted_proxies/client_ip_headers, added the @internal_only matcher, protected /nadun/fs/*, /nadun/grafana/*, and /metrics, and added analogous local dev routes and redirects.
  • Cloud-init, Terraform & startup (infra/common/cloud-init.template, infra/digitalocean/...): Added presigned URL and Grafana credential variables, wired them into module inputs and cloud-init locals, tightened generated file permissions, created provisioning dirs in startup.sh, and fetches presigned provisioning artifacts.
  • Makefile S3 publishing (Makefile): Uploads Prometheus config, Grafana datasource/provider, and dashboard JSON to S3 and exports presigned TF_VAR_* URLs for Terraform.
  • Local docker-compose & tooling (docker-compose.dev.yml, .env.example, postman/*, .gemini/settings.json, README.md, docker-compose.yml): Switches the local app to nadzu:local, sources secrets from env, adds prometheus/grafana services, adds a Postman metrics request, updates the env example, and adds a .gemini grafana entry, a README note, and a compose header comment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • nxdun/rust-codebase#13: Both PRs modify src/services/ytdlp/mod.rs: this PR adds Prometheus metrics instrumentation, while that PR changes ytdlp job handling and reporting.

Poem

🐰 Metrics bloom where the workers run,
Prometheus counts each job that's done,
Grafana paints the rabbit's view,
Caddy guards the paths anew,
Little hops for observability fun!

🚥 Pre-merge checks | ✅ 5 passed

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title clearly and specifically describes the main change: adding a Prometheus and Grafana observability stack to the infrastructure, which aligns with the comprehensive set of changes across monitoring setup, metrics instrumentation, and observability configuration.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.
  • Linked Issues Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.


nxdun self-assigned this May 11, 2026

@coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (11)
Cargo.toml (1)

18-19: ⚡ Quick win

Update dependency versions to the latest stable releases.

The current versions (metrics = "0.22" and metrics-exporter-prometheus = "0.13") are outdated. Consider updating to the latest versions: metrics = "0.24.5" and metrics-exporter-prometheus = "0.18.3". This is low-effort maintenance that keeps the project on current, supported releases.
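The suggested bump is a two-line change in Cargo.toml (version numbers as given in the comment above; verify against crates.io before pinning, since the handle-based macro API changed across these releases):

```toml
[dependencies]
metrics = "0.24.5"
metrics-exporter-prometheus = "0.18.3"
```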

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Cargo.toml` around lines 18 - 19, Update the two crate versions in
Cargo.toml: change the metrics dependency from "0.22" to "0.24.5" and change
metrics-exporter-prometheus from "0.13" to "0.18.3", then run cargo update /
cargo build to ensure compatibility and resolve any resulting dependency issues;
target the dependency entries named metrics and metrics-exporter-prometheus when
making the edits.
Caddyfile.local (1)

26-33: ⚡ Quick win

Consider adding IP-based access restrictions for internal monitoring paths.

The local Caddyfile exposes /nadun/grafana/* and /metrics without any access control, whereas the PR objectives mention a @internal_only IP matcher protecting these paths in production. Even in local development, exposing metrics and dashboards can leak sensitive operational data if the ports are accidentally exposed or in shared development environments.

Consider adding a matcher similar to production:

+@internal_only {
+    remote_ip 127.0.0.1 ::1
+}
+
 # 2. Local Grafana Access
 handle /nadun/grafana/* {
+    @internal_only
     reverse_proxy nadzu-grafana:3000
 }

 # 3. Local Metrics Access
 handle /metrics {
+    @internal_only
     reverse_proxy app:8080
 }

This ensures consistency between local and production configurations and prevents accidental exposure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Caddyfile.local` around lines 26 - 33, Add IP-based access restrictions to
the local Caddy routing for the Grafana and metrics endpoints by applying the
same `@internal_only` matcher used in production: update the handle blocks for
"handle /nadun/grafana/*" and "handle /metrics" to include the `@internal_only`
matcher (or an equivalent local matcher that limits to loopback/private IPs) so
only internal IPs can reach those routes; ensure the matcher name matches the
production matcher ("@internal_only") or define a local equivalent at the top of
the Caddyfile so routing is consistent and protected in development.
docker-compose.dev.yml (1)

5-6: 💤 Low value

Document the required build step for local development.

Line 6 changes from a build directive to image: nadzu:local, which means developers must explicitly run make bd (build Docker) before running make rd (run Docker compose). This is a workflow change that could cause confusion.

Consider adding a comment or updating documentation to clarify this dependency:

   app:
-    # Build Docker image from local Dockerfile (make bd)
+    # Requires pre-built image: run `make bd` first to build nadzu:local
     image: nadzu:local

Alternatively, you could keep the build context alongside the image tag, which allows Docker Compose to build if the image doesn't exist.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docker-compose.dev.yml` around lines 5 - 6, The docker-compose change
replaced a build context with a static image ("image: nadzu:local") which
requires developers to run the build step (make bd) before make rd; update the
repo to document this by either adding a clear comment near the "image:
nadzu:local" line in docker-compose.dev.yml explaining "run make bd before make
rd" and referencing the Makefile target, or restore a build context alongside
the image (e.g., add a "build:" block for the same service while keeping "image:
nadzu:local") so docker-compose can build the image automatically when missing;
modify README or developer docs to mention the required make bd step and the
intended workflow.
infra/common/grafana/provisioning/dashboards/ytdlp-health.json (1)

43-56: ⚖️ Poor tradeoff

Hardcoded max saturation value may not match runtime configuration.

Line 46 hardcodes "max": 10 for the active workers gauge, but the actual MAX_CONCURRENT_DOWNLOADS is a configurable variable (see infra/digitalocean/components/variables.tf line 94). If the runtime configuration differs from 10, the gauge visualization will be misleading.

Consider either:

  1. Documenting that this dashboard assumes max=10, or
  2. Making the dashboard max value configurable via a Grafana variable that reads from a metric or annotation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json` around lines
43 - 56, The dashboard currently hardcodes "max": 10 in the active workers gauge
under "fieldConfig" -> "defaults", which can misrepresent the real
MAX_CONCURRENT_DOWNLOADS runtime value; update the dashboard JSON to use a
Grafana templating variable (e.g., add a new dashboard variable like
"max_concurrent_downloads" and reference it as the max value) or remove the
hardcoded max and let Grafana autoscale; specifically modify the
"fieldConfig.defaults.max" entry in ytdlp-health.json to reference the new
variable (or delete it) and add the corresponding variable definition in the
dashboard's "templating" section so the UI reflects the actual configured
concurrency.
src/services/ytdlp/mod.rs (1)

228-228: ⚖️ Poor tradeoff

Consider whether duration should include queue wait time.

Line 228 starts the timer immediately upon entering run_job(), but the semaphore is acquired on line 231, meaning the duration includes queue wait time (when all worker slots are occupied).

This depends on your observability goals:

  • Current behavior: Measures end-to-end latency from user perspective (enqueue → completion)
  • Alternative: Start timer after acquiring the semaphore to measure only processing time

If you want to distinguish queue time from processing time, consider emitting two separate histograms:

let queue_start = Instant::now();
let _permit = self.semaphore.acquire().await.expect("semaphore closed");
histogram!("ytdlp_job_queue_duration_seconds").record(queue_start.elapsed().as_secs_f64());

let processing_start = Instant::now();
// ... job processing ...
histogram!("ytdlp_job_processing_duration_seconds").record(processing_start.elapsed().as_secs_f64());

This would provide better visibility into whether latency is due to queuing vs. slow downloads.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/services/ytdlp/mod.rs` at line 228, The current timer in run_job() starts
before acquiring the semaphore (start_time = Instant::now()), so measured
duration includes queue wait time; to separate queue vs processing latency,
record two histograms: capture queue_start before semaphore.acquire().await and
record ytdlp_job_queue_duration_seconds after acquiring, then set
processing_start = Instant::now() and record
ytdlp_job_processing_duration_seconds at the end of processing; update
references in run_job(), start_time, and semaphore.acquire() logic and ensure
labels/units match existing histogram usage.
Makefile (1)

186-191: 💤 Low value

Consider preserving original file paths in S3 for clarity.

Lines 186, 188, and 190 rename files when uploading to S3:

  • grafana/provisioning/datasources/prometheus.yml → grafana_datasource.yml
  • grafana/provisioning/dashboards/default.yml → grafana_provider.yml
  • The original path structure is lost

This renaming could cause confusion when debugging or auditing the S3 bucket. Consider either:

  1. Preserve the directory structure in S3:

    aws s3 cp infra/common/grafana/provisioning/datasources/prometheus.yml \
      "s3://$AWS_S3_BUCKET_NAME/terraform/data/grafana/provisioning/datasources/prometheus.yml"
  2. Document the naming mapping in a comment:

    # Upload Grafana datasource config (datasources/prometheus.yml → grafana_datasource.yml)

The current approach saves a few characters but reduces traceability between local files and S3 objects.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 186 - 191, The S3 uploads in the Makefile currently
flatten filenames (the aws s3 cp lines that set TF_VAR_GRAFANA_DATASOURCE_URL,
TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL) losing the original
infra/common/grafana/... directory structure; update the destination keys in
those aws s3 cp commands to preserve the full path (e.g., use
terraform/data/infra/common/grafana/provisioning/...) and update the
corresponding aws s3 presign invocations that set TF_VAR_GRAFANA_DATASOURCE_URL,
TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL to reference the new
object keys, or alternatively add a clear comment above each aws s3 cp
explaining the mapping if you intentionally want to keep the flattened names.
infra/common/cloud-init.template (3)

242-246: ⚡ Quick win

Presigned-URL fetches are single-shot on first boot — re-provisioning after URL expiry will fail.

These curl calls run as part of cloud-init runcmd, which executes only on first boot. If the instance is rebuilt or the cloud-init template re-runs after the presigned URLs expire (typical TTL is hours to a few days for S3), the script aborts via set -euo pipefail and the box never reaches docker compose up. Two cheap mitigations worth considering:

  1. Add a small retry/backoff per fetch (e.g., curl --retry 5 --retry-delay 3 --retry-connrefused) for transient network blips.
  2. Document/enforce in the Terraform tf target that these URLs must have a TTL longer than the longest expected provisioning window, and skip the fetch if the artifact already exists locally (so manual recovery is easy).
-      curl -sSfL "${PROMETHEUS_CONFIG_URL}" -o /opt/app/prometheus/prometheus.yml
-      curl -sSfL "${GRAFANA_DATASOURCE_URL}" -o /opt/app/grafana/provisioning/datasources/prometheus.yml
-      curl -sSfL "${GRAFANA_PROVIDER_URL}" -o /opt/app/grafana/provisioning/dashboards/default.yml
-      curl -sSfL "${YTDLP_DASHBOARD_URL}" -o /opt/app/grafana/provisioning/dashboards/ytdlp-health.json
+      curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${PROMETHEUS_CONFIG_URL}" -o /opt/app/prometheus/prometheus.yml
+      curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${GRAFANA_DATASOURCE_URL}" -o /opt/app/grafana/provisioning/datasources/prometheus.yml
+      curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${GRAFANA_PROVIDER_URL}" -o /opt/app/grafana/provisioning/dashboards/default.yml
+      curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${YTDLP_DASHBOARD_URL}" -o /opt/app/grafana/provisioning/dashboards/ytdlp-health.json
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/common/cloud-init.template` around lines 242 - 246, The cloud-init
runcmd curl fetches for PROMETHEUS_CONFIG_URL, GRAFANA_DATASOURCE_URL,
GRAFANA_PROVIDER_URL and YTDLP_DASHBOARD_URL are single-shot and will fail the
whole boot if presigned URLs expire; update the cloud-init snippet that runs
these curl commands to (1) skip each download if the target file already exists
(check /opt/app/... paths before calling curl) and (2) add robust curl retry
flags (e.g., --retry, --retry-delay, --retry-connrefused) so transient/network
errors don’t abort due to set -euo pipefail, and ensure the commands return
non-fatal status when skipping or after retries.

58-75: 💤 Low value

Grafana has no persistent volume — dashboards, alerts, API keys, and users vanish on container restart.

Only /etc/grafana/provisioning is bind-mounted; Grafana's data directory (/var/lib/grafana, SQLite DB, alert state, dashboard versions, API tokens, user sessions) is ephemeral. On every docker compose up cycle, alert history and any non-provisioned state is lost. If you intend "everything is code via provisioning" that is defensible — but at least the alert state and rendered images are usually worth keeping:

💾 Suggested addition
         grafana:
           image: grafana/grafana:latest
           restart: ${DOCKER_RESTART_POLICY}
           environment:
             - GF_AUTH_ANONYMOUS_ENABLED=true
             - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
             - GF_SERVER_SERVE_FROM_SUB_PATH=true
             - GF_SERVER_ROOT_URL=https://${PRODUCTION_DOMAIN}/nadun/grafana/
           volumes:
             - /opt/app/grafana/provisioning:/etc/grafana/provisioning
+            - grafana-data:/var/lib/grafana
           depends_on:
             - prometheus
+
+      volumes:
+        grafana-data:

Same consideration applies to Prometheus (TSDB at /prometheus) — scrape history is lost on restart, which can hurt incident retrospectives.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/common/cloud-init.template` around lines 58 - 75, The Grafana service
declaration lacks a persistent data volume so its runtime state (SQLite DB,
dashboards, alerts, API keys, sessions) is lost on container restart; update the
grafana service in the cloud-init.template (service name "grafana") to add a
bind or named volume mapping the Grafana data dir (/var/lib/grafana) to a host
path or Docker volume and do the same for Prometheus by mapping Prometheus TSDB
(/prometheus) to persist scrape history; ensure the volumes entries and any
necessary ownership/permissions are set so Grafana and Prometheus can write to
those paths.

58-62: 💤 Low value

Prometheus should depend on app so the initial scrape target resolves.

prometheus has no depends_on, so it can start before app and log connection errors against app:8080 until the app comes up. Not fatal (Prometheus retries), but the cleaner ordering matches what grafana already does for prometheus:

         prometheus:
           image: prom/prometheus:latest
           restart: ${DOCKER_RESTART_POLICY}
           volumes:
             - /opt/app/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+          depends_on:
+            - app
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/common/cloud-init.template` around lines 58 - 62, Add a depends_on
entry to the prometheus service so it waits for the app service like grafana
does; specifically update the prometheus service block (service name
"prometheus") to include a depends_on: - app line so Prometheus starts after the
"app" service and its initial scrape target app:8080 resolves cleanly (mirror
the existing pattern used by the "grafana" service).
src/telemetry.rs (1)

45-56: ⚡ Quick win

Consider splitting global recorder install from router construction.

setup_metrics_router has two responsibilities: it installs the global Prometheus recorder (a one-shot, process-wide side effect) and returns a router. Calling it twice (e.g., in an integration test, a re-init path, or future code that builds multiple routers) will panic via expect, and the name does not signal that side effect. Splitting them keeps the surface honest and testable:

♻️ Suggested refactor
-/// Initializes the Prometheus metrics registry and returns the router.
-pub fn setup_metrics_router() -> Router<AppState> {
-    let builder = PrometheusBuilder::new();
-
-    // Install the global recorder
-    #[allow(clippy::expect_used)]
-    let handle = builder
-        .install_recorder()
-        .expect("Failed to install Prometheus recorder");
-
-    // Route that Prometheus will scrape
-    Router::new().route("/metrics", get(move || std::future::ready(handle.render())))
-}
+/// Installs the global Prometheus recorder. Must be called exactly once at startup.
+#[must_use]
+pub fn init_metrics() -> PrometheusHandle {
+    #[allow(clippy::expect_used)]
+    PrometheusBuilder::new()
+        .install_recorder()
+        .expect("Failed to install Prometheus recorder")
+}
+
+/// Builds the router exposing `GET /metrics` for Prometheus scraping.
+#[must_use]
+pub fn metrics_router(handle: PrometheusHandle) -> Router<AppState> {
+    Router::new().route("/metrics", get(move || std::future::ready(handle.render())))
+}

You'll also need use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle}; and to update the call site in app.rs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/telemetry.rs` around lines 45 - 56, setup_metrics_router currently both
installs the global Prometheus recorder and builds the Router, causing a
process-wide side effect and a panic if called twice; extract the recorder
installation into a separate function (e.g., install_prometheus_recorder) that
calls PrometheusBuilder::new().install_recorder() and returns the
PrometheusHandle (use metrics_exporter_prometheus::{PrometheusBuilder,
PrometheusHandle}), then change setup_metrics_router to only construct and
return Router<AppState> that takes a PrometheusHandle (or captures it) for the
/metrics handler; update callers (e.g., in app.rs) to call
install_prometheus_recorder once at process init and pass the handle into
setup_metrics_router.
Caddyfile (1)

18-20: 💤 Low value

Document expected format and consider validation for SSH_ALLOWED_IPS to prevent configuration errors.

SSH_ALLOWED_IPS is implicitly space-separated via join(" ", var.SSH_ALLOWED_IPS) in locals.tf, but the Terraform variable description does not specify this format or explain that values should be CIDR blocks or IP addresses. Additionally, there is no validation preventing empty arrays; if var.SSH_ALLOWED_IPS = [], the Caddyfile matcher becomes not client_ip (empty), which will fail to parse at startup. More problematically, if the array contains syntactically valid but stale/non-matching CIDR blocks, the matcher silently denies all external traffic without obvious indication until runtime, making it hard to detect configuration mistakes.

Consider either:

  • Adding a variable constraint in Terraform to reject empty arrays
  • Documenting the expected format in the variable description (space-separated CIDR blocks/IPs)
  • Adding a startup validation check to confirm at least one IP is reachable
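The first two suggestions can be sketched as Terraform validation blocks on the variable (variable name from this PR; the IPv4 regex is a rough illustration, not a complete validator):

```hcl
variable "SSH_ALLOWED_IPS" {
  type        = list(string)
  description = "IPs/CIDR blocks allowed to reach internal routes; joined space-separated into the Caddy @internal_only matcher."

  validation {
    condition     = length(var.SSH_ALLOWED_IPS) > 0
    error_message = "SSH_ALLOWED_IPS must contain at least one IP or CIDR block."
  }

  validation {
    condition = alltrue([
      for ip in var.SSH_ALLOWED_IPS :
      can(cidrhost(ip, 0)) || can(regex("^\\d+\\.\\d+\\.\\d+\\.\\d+$", ip))
    ])
    error_message = "Each SSH_ALLOWED_IPS entry must be an IPv4 address or CIDR block."
  }
}
```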
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Caddyfile` around lines 18 - 20, The Caddyfile matcher '@internal_only' is
built from locals.tf using join(" ", var.SSH_ALLOWED_IPS) and can produce an
empty/invalid matcher when SSH_ALLOWED_IPS is empty or malformed; add a
Terraform variable validation block for SSH_ALLOWED_IPS to reject empty lists
and validate each entry as a CIDR/IP (use a validation rule with
alltrue/can(regexmatch(...)) or a CIDR helper), update the SSH_ALLOWED_IPS
variable description to document the expected space-separated CIDR/IP format,
and change the Caddyfile rendering logic to conditionally emit the "not
client_ip ${SSH_ALLOWED_IPS}" matcher only when SSH_ALLOWED_IPS is non-empty to
avoid producing "not client_ip " at startup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Caddyfile`:
- Around line 48-51: The /metrics handler hardcodes reverse_proxy target
"app:8080" causing divergence from the rest of the config; update the
reverse_proxy in the handle /metrics block to use the same APP_PORT placeholder
used elsewhere (replace the literal 8080 with ${APP_PORT}) so the /metrics proxy
remains consistent with the public API proxy (also mirror this change in
Caddyfile.local and infra/common/cloud-init.template as noted).

In `@infra/common/cloud-init.template`:
- Around line 248-249: The command "chmod -R 644" in the diff removes the
execute bit from directories (breaking Grafana provisioning); replace it with a
mode that preserves execute on directories, e.g. "chmod -R a=rwX" against the
prometheus and grafana provisioning directories or use the explicit pair of find
commands to set dirs to 755 and files to 644, and optionally add "chown -R
472:472" for the grafana provisioning tree to ensure the Grafana UID can access
the files.
- Around line 64-75: The Grafana service in the template currently enables
anonymous access with an Admin role (GF_AUTH_ANONYMOUS_ENABLED=true and
GF_AUTH_ANONYMOUS_ORG_ROLE=Admin) which is unsafe; update the grafana service
environment to either disable anonymous access (set
GF_AUTH_ANONYMOUS_ENABLED=false) or at minimum change GF_AUTH_ANONYMOUS_ORG_ROLE
to Viewer, and remove any Admin privileges for anonymous users; ensure the
change is applied to the grafana block that sets GF_SERVER_ROOT_URL and volumes,
and update any deployment notes/config to require authenticated access (and
validate Caddy `@internal_only` controls still protect admin endpoints).

In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Around line 18-31: The "noValue" property inside the JSON block under
fieldConfig.defaults is currently a string ("100") which prevents Grafana from
doing numeric threshold comparisons; change the noValue value to a numeric
literal (100) so panels like the stat panel will evaluate the thresholds
correctly — locate the fieldConfig -> defaults -> noValue entry and replace the
quoted string with a numeric value.
- Around line 117-128: The dashboard queries currently reference
ytdlp_job_duration_seconds with quantile labels which won't exist because the
metric is emitted as a histogram via histogram!() in src/services/ytdlp/mod.rs;
update the grafana panel targets (the expressions that reference
ytdlp_job_duration_seconds) to use PromQL histogram_quantile, e.g.
histogram_quantile(0.5, rate(ytdlp_job_duration_seconds_bucket[5m])) and
histogram_quantile(0.99, rate(ytdlp_job_duration_seconds_bucket[5m])) so they
read from the _bucket series, or alternatively change the metric in
src/services/ytdlp/mod.rs from a histogram to a summary if you really want
direct quantile labels.

In `@src/telemetry.rs`:
- Around line 45-52: The metrics route created by setup_metrics_router
(returning Router<AppState>) is merged into the main app without authentication,
exposing /metrics; protect it by either adding authentication middleware to the
Router returned by setup_metrics_router (apply the existing auth middleware
function/method used elsewhere before merging), or stop merging it into the
public router and instead serve that Router on a separate internal listener
(bind a dedicated internal port) or gate it behind an env-driven check that only
enables metrics when Caddy/internal proxy is present; update the code that
merges the returned Router into the main app to use the protected Router (or
move the merge to the internal listener) so /metrics is no longer directly
accessible.

---

Nitpick comments:
In `@Caddyfile`:
- Around line 18-20: The Caddyfile matcher '@internal_only' is built from
locals.tf using join(" ", var.SSH_ALLOWED_IPS) and can produce an empty/invalid
matcher when SSH_ALLOWED_IPS is empty or malformed; add a Terraform variable
validation block for SSH_ALLOWED_IPS to reject empty lists and validate each
entry as a CIDR/IP (use a validation rule with alltrue/can(regexmatch(...)) or a
CIDR helper), update the SSH_ALLOWED_IPS variable description to document the
expected space-separated CIDR/IP format, and change the Caddyfile rendering
logic to conditionally emit the "not client_ip ${SSH_ALLOWED_IPS}" matcher only
when SSH_ALLOWED_IPS is non-empty to avoid producing "not client_ip " at
startup.
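A sketch of the Terraform validation the comment describes, assuming `SSH_ALLOWED_IPS` is a `list(string)` (the exact rules and error message here are illustrative):

```hcl
variable "SSH_ALLOWED_IPS" {
  description = "Non-empty list of IPv4 addresses or CIDR blocks, joined space-separated into the Caddy @internal_only matcher."
  type        = list(string)

  validation {
    condition = length(var.SSH_ALLOWED_IPS) > 0 && alltrue([
      for entry in var.SSH_ALLOWED_IPS :
      # accept either a CIDR block or a bare IPv4 address
      can(cidrhost(entry, 0)) || can(regex("^\\d{1,3}(\\.\\d{1,3}){3}$", entry))
    ])
    error_message = "SSH_ALLOWED_IPS must be a non-empty list of IPv4 addresses or CIDR blocks."
  }
}
```

The Caddyfile rendering can then guard the matcher line with a `%{ if length(var.SSH_ALLOWED_IPS) > 0 }` template directive so an empty list never emits a bare `not client_ip `.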

In `@Caddyfile.local`:
- Around line 26-33: Add IP-based access restrictions to the local Caddy routing
for the Grafana and metrics endpoints by applying the same `@internal_only`
matcher used in production: update the handle blocks for "handle
/nadun/grafana/*" and "handle /metrics" to include the `@internal_only` matcher
(or an equivalent local matcher that limits to loopback/private IPs) so only
internal IPs can reach those routes; ensure the matcher name matches the
production matcher ("@internal_only") or define a local equivalent at the top of
the Caddyfile so routing is consistent and protected in development.
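A hedged sketch of what such a local matcher could look like (the IP ranges and route paths are illustrative, and the production `@internal_only` definition should remain the source of truth):

```caddyfile
@not_internal {
	path /metrics /nadun/grafana/* /nadun/fs/*
	not client_ip private_ranges 127.0.0.1/8 ::1
}
respond @not_internal 403
```

Placed before the `handle` blocks, this denies the protected routes to any non-internal client while leaving the existing routing untouched.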

In `@Cargo.toml`:
- Around line 18-19: Update the two crate versions in Cargo.toml: change the
metrics dependency from "0.22" to "0.24.5" and change
metrics-exporter-prometheus from "0.13" to "0.18.3", then run cargo update /
cargo build to ensure compatibility and resolve any resulting dependency issues;
target the dependency entries named metrics and metrics-exporter-prometheus when
making the edits.

In `@docker-compose.dev.yml`:
- Around line 5-6: The docker-compose change replaced a build context with a
static image ("image: nadzu:local") which requires developers to run the build
step (make bd) before make rd; update the repo to document this by either adding
a clear comment near the "image: nadzu:local" line in docker-compose.dev.yml
explaining "run make bd before make rd" and referencing the Makefile target, or
restore a build context alongside the image (e.g., add a "build:" block for the
same service while keeping "image: nadzu:local") so docker-compose can build the
image automatically when missing; modify README or developer docs to mention the
required make bd step and the intended workflow.

In `@infra/common/cloud-init.template`:
- Around line 242-246: The cloud-init runcmd curl fetches for
PROMETHEUS_CONFIG_URL, GRAFANA_DATASOURCE_URL, GRAFANA_PROVIDER_URL and
YTDLP_DASHBOARD_URL are single-shot and will fail the whole boot if presigned
URLs expire; update the cloud-init snippet that runs these curl commands to (1)
skip each download if the target file already exists (check /opt/app/... paths
before calling curl) and (2) add robust curl retry flags (e.g., --retry,
--retry-delay, --retry-connrefused) so transient/network errors don’t abort due
to set -euo pipefail, and ensure the commands return non-fatal status when
skipping or after retries.
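A minimal sketch of the resilient fetch this comment asks for — the function name and flag values are assumptions, not the template's actual code. It skips the download when a non-empty artifact already exists, bounds each curl attempt with timeouts, and backs off between attempts:

```shell
# Sketch only: fetch_url and its flag values are illustrative.
fetch_url() {
  url="$1"; dest="$2"
  # (1) Skip if a non-empty artifact already exists from a previous boot.
  if [ -s "$dest" ]; then
    return 0
  fi
  # (2) Bound each attempt and retry transient failures with backoff.
  for i in 1 2 3; do
    if curl -sSfL --connect-timeout 10 --max-time 120 \
         --retry 3 --retry-delay 2 --retry-connrefused \
         "$url" -o "$dest"; then
      return 0
    fi
    sleep "$i"
  done
  echo "fetch_url: giving up on $url" >&2
  return 1
}
```

Because every path returns an explicit status, the function plays well with `set -euo pipefail`: a skip is success, and the script only aborts after retries are exhausted.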
- Around line 58-75: The Grafana service declaration lacks a persistent data
volume so its runtime state (SQLite DB, dashboards, alerts, API keys, sessions)
is lost on container restart; update the grafana service in the
cloud-init.template (service name "grafana") to add a bind or named volume
mapping the Grafana data dir (/var/lib/grafana) to a host path or Docker volume
and do the same for Prometheus by mapping Prometheus TSDB (/prometheus) to
persist scrape history; ensure the volumes entries and any necessary
ownership/permissions are set so Grafana and Prometheus can write to those
paths.
- Around line 58-62: Add a depends_on entry to the prometheus service so it
waits for the app service like grafana does; specifically update the prometheus
service block (service name "prometheus") to include a depends_on: - app line so
Prometheus starts after the "app" service and its initial scrape target app:8080
resolves cleanly (mirror the existing pattern used by the "grafana" service).
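Taken together, the two comments above amount to a compose shape like this (volume names are illustrative; service names come from the template):

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    depends_on:
      - app                            # start after the scrape target
    volumes:
      - prometheus-data:/prometheus    # persist the TSDB across restarts
  grafana:
    image: grafana/grafana:latest
    depends_on:
      - app
    volumes:
      - grafana-data:/var/lib/grafana  # persist SQLite DB, dashboards, sessions

volumes:
  prometheus-data:
  grafana-data:
```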

In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Around line 43-56: The dashboard currently hardcodes "max": 10 in the active
workers gauge under "fieldConfig" -> "defaults", which can misrepresent the real
MAX_CONCURRENT_DOWNLOADS runtime value; update the dashboard JSON to use a
Grafana templating variable (e.g., add a new dashboard variable like
"max_concurrent_downloads" and reference it as the max value) or remove the
hardcoded max and let Grafana autoscale; specifically modify the
"fieldConfig.defaults.max" entry in ytdlp-health.json to reference the new
variable (or delete it) and add the corresponding variable definition in the
dashboard's "templating" section so the UI reflects the actual configured
concurrency.

In `@Makefile`:
- Around line 186-191: The S3 uploads in the Makefile currently flatten
filenames (the aws s3 cp lines that set TF_VAR_GRAFANA_DATASOURCE_URL,
TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL) losing the original
infra/common/grafana/... directory structure; update the destination keys in
those aws s3 cp commands to preserve the full path (e.g., use
terraform/data/infra/common/grafana/provisioning/...) and update the
corresponding aws s3 presign invocations that set TF_VAR_GRAFANA_DATASOURCE_URL,
TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL to reference the new
object keys, or alternatively add a clear comment above each aws s3 cp
explaining the mapping if you intentionally want to keep the flattened names.

In `@src/services/ytdlp/mod.rs`:
- Line 228: The current timer in run_job() starts before acquiring the semaphore
(start_time = Instant::now()), so measured duration includes queue wait time; to
separate queue vs processing latency, record two histograms: capture queue_start
before semaphore.acquire().await and record ytdlp_job_queue_duration_seconds
after acquiring, then set processing_start = Instant::now() and record
ytdlp_job_processing_duration_seconds at the end of processing; update
references in run_job(), start_time, and semaphore.acquire() logic and ensure
labels/units match existing histogram usage.

In `@src/telemetry.rs`:
- Around line 45-56: setup_metrics_router currently both installs the global
Prometheus recorder and builds the Router, causing a process-wide side effect
and a panic if called twice; extract the recorder installation into a separate
function (e.g., install_prometheus_recorder) that calls
PrometheusBuilder::new().install_recorder() and returns the PrometheusHandle
(use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle}), then
change setup_metrics_router to only construct and return Router<AppState> that
takes a PrometheusHandle (or captures it) for the /metrics handler; update
callers (e.g., in app.rs) to call install_prometheus_recorder once at process
init and pass the handle into setup_metrics_router.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: f66c7dd5-b030-4a63-9bb0-0a08ea105651

📥 Commits

Reviewing files that changed from the base of the PR and between 4d4d399 and d28fa0a.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (19)
  • Caddyfile
  • Caddyfile.local
  • Cargo.toml
  • Dockerfile
  • Makefile
  • docker-compose.dev.yml
  • infra/common/cloud-init.template
  • infra/common/grafana/provisioning/dashboards/default.yml
  • infra/common/grafana/provisioning/dashboards/ytdlp-health.json
  • infra/common/grafana/provisioning/datasources/prometheus.yml
  • infra/common/prometheus.yml
  • infra/digitalocean/accounts/naduns-team/main.tf
  • infra/digitalocean/accounts/naduns-team/variables.tf
  • infra/digitalocean/components/locals.tf
  • infra/digitalocean/components/variables.tf
  • src/app.rs
  • src/lib.rs
  • src/services/ytdlp/mod.rs
  • src/telemetry.rs

Comment thread Caddyfile
Comment thread infra/common/cloud-init.template
Comment thread infra/common/cloud-init.template Outdated
Comment thread infra/common/grafana/provisioning/dashboards/ytdlp-health.json
Comment thread infra/common/grafana/provisioning/dashboards/ytdlp-health.json
Comment thread src/telemetry.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@infra/common/cloud-init.template`:
- Around line 239-244: The presigned URL curl downloads (invocations using
CADDY_CUSTOM_BROWSE_FILE_URL, PROMETHEUS_CONFIG_URL, GRAFANA_DATASOURCE_URL,
GRAFANA_PROVIDER_URL, YTDLP_DASHBOARD_URL that write to /opt/app/...) must be
made resilient: replace the direct curl -sSfL calls with a small retry/backoff
wrapper (either use curl's --retry/--retry-delay/--retry-max-time flags or a
bash function that retries with exponential backoff) that attempts each download
several times before failing; ensure each download logs attempts and exits
non‑zero only after retries are exhausted so provisioning can fail fast if
permanently unreachable.
- Around line 70-71: The generated docker-compose file currently embeds
GF_SECURITY_ADMIN_PASSWORD and ends up world-readable; update
cloud-init.template so that after rendering the compose content (including
GF_SECURITY_ADMIN_PASSWORD) you set strict ownership and permissions—chown
root:root and chmod 0600 on the resulting /opt/app/docker-compose.yml (or write
the secret into a root-owned env file with 0600 and reference it from the
compose file); alternatively switch to using Docker secrets, but at minimum
ensure the file containing GF_SECURITY_ADMIN_PASSWORD is created with root-only
read/write (0600) and owned by root.
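One way to sketch the 0600 requirement without a window where the file is world-readable (the helper name is illustrative): render the content under a restrictive umask in a subshell, so the file is created owner-only rather than chmod-ed after the fact. The `chown root:root` step from the comment still applies and is omitted here because it needs root.

```shell
# Illustrative helper: write rendered compose content (stdin) to $1 with
# owner-only permissions from the moment of creation. Assumes $1 does not
# already exist with looser permissions.
render_secure() (
  umask 077        # subshell: the umask change does not leak to the caller
  cat > "$1"       # new file is created 0600
)
```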

In `@infra/digitalocean/accounts/naduns-team/variables.tf`:
- Around line 203-216: The three Terraform variables PROMETHEUS_CONFIG_URL,
GRAFANA_DATASOURCE_URL, and GRAFANA_PROVIDER_URL should be marked sensitive so
their presigned URL values are redacted in plan/apply output; update each
variable declaration (variable "PROMETHEUS_CONFIG_URL", variable
"GRAFANA_DATASOURCE_URL", variable "GRAFANA_PROVIDER_URL" and any other similar
presigned-URL variables) to include sensitive = true in the block and keep the
description/type as-is.

In `@infra/digitalocean/components/variables.tf`:
- Around line 170-183: The Terraform variable blocks for PROMETHEUS_CONFIG_URL,
GRAFANA_DATASOURCE_URL, and GRAFANA_PROVIDER_URL should be marked sensitive to
avoid exposing presigned URLs; edit the variable declarations (variable
"PROMETHEUS_CONFIG_URL", variable "GRAFANA_DATASOURCE_URL", variable
"GRAFANA_PROVIDER_URL") and add sensitive = true to each block, and apply the
same change to the other presigned URL variables referenced in the file (the
blocks around lines 196-199) so all presigned URL inputs are declared sensitive.
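The change is a one-line addition per variable block; sketched here for one of them (the description text is illustrative):

```hcl
variable "PROMETHEUS_CONFIG_URL" {
  description = "Presigned S3 URL for the Prometheus config artifact."
  type        = string
  sensitive   = true  # redacts the presigned URL in plan/apply output
}
```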

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6fdbea62-f911-4f15-b145-758a4f7af67d

📥 Commits

Reviewing files that changed from the base of the PR and between d28fa0a and 0d3dc0a.

📒 Files selected for processing (8)
  • .env.example
  • docker-compose.dev.yml
  • infra/common/cloud-init.template
  • infra/common/grafana/provisioning/dashboards/ytdlp-health.json
  • infra/digitalocean/accounts/naduns-team/main.tf
  • infra/digitalocean/accounts/naduns-team/variables.tf
  • infra/digitalocean/components/locals.tf
  • infra/digitalocean/components/variables.tf
🚧 Files skipped from review as they are similar to previous changes (3)
  • docker-compose.dev.yml
  • infra/digitalocean/components/locals.tf
  • infra/common/grafana/provisioning/dashboards/ytdlp-health.json

Comment thread infra/common/cloud-init.template
Comment thread infra/common/cloud-init.template Outdated
Comment thread infra/digitalocean/accounts/naduns-team/variables.tf
Comment thread infra/digitalocean/components/variables.tf

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
infra/common/grafana/provisioning/dashboards/ytdlp-health.json (1)

120-125: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use histogram_quantile() over _bucket series for P50/P99 latency.

The current expressions expect quantile labels on ytdlp_job_duration_seconds, which is not how Prometheus histograms are queried. This will produce empty series unless the metric is a summary.

🔧 Proposed fix
-          "expr": "ytdlp_job_duration_seconds{quantile=\"0.5\"}",
+          "expr": "histogram_quantile(0.5, sum(rate(ytdlp_job_duration_seconds_bucket[5m])) by (le))",
           "legendFormat": "P50 (Median)"
@@
-          "expr": "ytdlp_job_duration_seconds{quantile=\"0.99\"}",
+          "expr": "histogram_quantile(0.99, sum(rate(ytdlp_job_duration_seconds_bucket[5m])) by (le))",
           "legendFormat": "P99 (Worst Case)"
#!/bin/bash
# Verify metric type emitted by app code (histogram vs summary) and dashboard query usage.
rg -n -C2 'histogram!\(|summary!\(|ytdlp_job_duration_seconds' src || true
rg -n -C2 'ytdlp_job_duration_seconds\{quantile=|histogram_quantile|_bucket' infra/common/grafana/provisioning/dashboards/ytdlp-health.json
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json` around lines
120 - 125, The dashboard queries use a non-existent quantile label on
ytdlp_job_duration_seconds (summary-style) and should instead use Prometheus
histogram queries: replace the exprs that reference
ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} with
histogram_quantile(0.5, ...) and histogram_quantile(0.99, ...) over the
ytdlp_job_duration_seconds_bucket series (typically wrapping a
sum(rate(...[5m])) by (le, <other_labels>)); ensure you aggregate/group by the
same identifying labels you want in the legend and keep legendFormat as "P50
(Median)" / "P99" accordingly so the panels return data from the histogram
buckets.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Line 14: The current PromQL expression uses a raw division of two
sum(rate(...)) calls causing NaN/Inf when there are zero completions; update the
expression that references the metric ytdlp_jobs_completed_total so the
denominator is protected with clamp_min (e.g., use
clamp_min(sum(rate(ytdlp_jobs_completed_total[5m])), 1e-9)) before dividing and
then multiply by 100, ensuring the success-rate calculation never divides by
zero and yields 0% when there are no successes.
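The guarded success-rate expression could look like this (the 5m window matches the dashboard; the `status="success"` filter in the numerator is an assumption based on how `ytdlp_jobs_completed_total` is labeled elsewhere in this PR):

```promql
100 * sum(rate(ytdlp_jobs_completed_total{status="success"}[5m]))
    / clamp_min(sum(rate(ytdlp_jobs_completed_total[5m])), 1e-9)
```

With zero completions the numerator is 0 and the clamped denominator is 1e-9, so the panel reads 0% instead of NaN.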

---

Duplicate comments:
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Around line 120-125: The dashboard queries use a non-existent quantile label
on ytdlp_job_duration_seconds (summary-style) and should instead use Prometheus
histogram queries: replace the exprs that reference
ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} with
histogram_quantile(0.5, ...) and histogram_quantile(0.99, ...) over the
ytdlp_job_duration_seconds_bucket series (typically wrapping a
sum(rate(...[5m])) by (le, <other_labels>)); ensure you aggregate/group by the
same identifying labels you want in the legend and keep legendFormat as "P50
(Median)" / "P99" accordingly so the panels return data from the histogram
buckets.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 26fb9a99-5669-4d4e-a303-77f1dba885a4

📥 Commits

Reviewing files that changed from the base of the PR and between 667071c and 1a8eb73.

📒 Files selected for processing (4)
  • docker-compose.dev.yml
  • infra/common/cloud-init.template
  • infra/common/grafana/provisioning/dashboards/ytdlp-health.json
  • postman/collections/Nadzu API/Metrics.request.yaml
✅ Files skipped from review due to trivial changes (1)
  • postman/collections/Nadzu API/Metrics.request.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • infra/common/cloud-init.template
  • docker-compose.dev.yml

Comment thread infra/common/grafana/provisioning/dashboards/ytdlp-health.json Outdated
@nxdun
Owner Author

nxdun commented May 12, 2026

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 12, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (1)
infra/common/grafana/provisioning/dashboards/ytdlp-health.json (1)

120-120: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Past issue unresolved: queries still use direct quantile labels instead of histogram_quantile().

The past review (marked as addressed in commit 0d3dc0a) flagged that these queries reference ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} directly, but the metric is emitted as a histogram!() in src/services/ytdlp/mod.rs. Histograms don't expose quantile labels by default—they export _bucket, _sum, and _count suffixes.

The correct PromQL pattern for histogram quantiles is:

histogram_quantile(0.5, rate(ytdlp_job_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(ytdlp_job_duration_seconds_bucket[5m]))

Alternatively, if quantiles are intended to be exposed as labels, change the metric type from histogram!() to summary!() in the Rust code.

🔧 Proposed fix using histogram_quantile
         {
           "datasource": "Prometheus",
-          "expr": "ytdlp_job_duration_seconds{quantile=\"0.5\"}",
+          "expr": "histogram_quantile(0.5, rate(ytdlp_job_duration_seconds_bucket[5m]))",
           "legendFormat": "P50 (Median)"
         },
         {
           "datasource": "Prometheus",
-          "expr": "ytdlp_job_duration_seconds{quantile=\"0.99\"}",
+          "expr": "histogram_quantile(0.99, rate(ytdlp_job_duration_seconds_bucket[5m]))",
           "legendFormat": "P99 (Worst Case)"
         }

Also applies to: 125-125

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json` at line 120,
The dashboard query is incorrectly using label-based quantiles for the emitted
histogram metric ytdlp_job_duration_seconds; update the panels that use
ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} to use
histogram_quantile with the _bucket series and a rate window, e.g. replace those
expressions with histogram_quantile(0.5,
rate(ytdlp_job_duration_seconds_bucket[5m])) and histogram_quantile(0.99,
rate(ytdlp_job_duration_seconds_bucket[5m])); alternatively, if you intend to
expose quantile labels instead, change the metric definition in
src/services/ytdlp/mod.rs from histogram!() to summary!() so the quantile labels
exist.
🧹 Nitpick comments (2)
Cargo.toml (1)

18-19: Update metrics and metrics-exporter-prometheus to the latest stable versions.

The current versions (metrics 0.22 and metrics-exporter-prometheus 0.13) are outdated. Latest stable versions are metrics 0.24.5 and metrics-exporter-prometheus 0.18.3. While no known security vulnerabilities exist for the current versions, updating to the latest releases ensures access to bug fixes and improvements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Cargo.toml` around lines 18 - 19, Update the Cargo.toml dependency entries
for metrics and metrics-exporter-prometheus by bumping the versions: set metrics
= "0.24.5" and metrics-exporter-prometheus = "0.18.3" (replace the existing
metrics = "0.22" and metrics-exporter-prometheus = "0.13" lines). Ensure
Cargo.toml still validates and run cargo update / cargo build to verify there
are no new compatibility issues.
src/telemetry.rs (1)

45-57: ⚡ Quick win

Past security concern mitigated by infrastructure layer.

The past review flagged that /metrics is exposed without authentication at the application layer (critical issue). While the endpoint remains unauthenticated in-app, the PR's infrastructure changes add protection:

  • Caddyfile and infra/common/cloud-init.template implement @internal_only matchers that restrict /metrics access to internal clients only
  • The production deployment routes external traffic through Caddy, which blocks unauthorized /metrics requests before they reach the app

This is a valid defense-in-depth approach for observability endpoints. However, if the app is ever deployed without the Caddy reverse proxy (or if the Caddy config is misconfigured), /metrics would be publicly exposed.

Consider adding an environment-driven toggle or request source validation in-app as a defense-in-depth measure, especially if the app might be deployed in environments where the reverse proxy protection isn't guaranteed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/telemetry.rs` around lines 45 - 57, The setup currently exposes /metrics
unauthenticated in setup_metrics_router (Router::new().route("/metrics",
get(move || std::future::ready(handle.render())))); add an in-app, opt-in
defense-in-depth by reading an env toggle (e.g., METRICS_INTERNAL_ONLY or
REQUIRE_METRICS_AUTH) during setup_metrics_router and, when enabled, wrap the
metrics handler with a simple request validator that checks either the
X-Forwarded-For/remote addr or a shared internal header/token and returns 403
for disallowed requests; implement the validation as a small async handler (or
middleware) that calls handle.render() only when the check passes so deployments
without the Caddy proxy remain protected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docker-compose.dev.yml`:
- Around line 49-52: Remove the hardcoded Grafana admin credentials in
docker-compose.dev.yml (GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD);
instead read them from an external secret source (e.g., a .env file or Docker
secrets) or omit them so Grafana generates a random admin password on first
start, and update any local docs/scripts to load ADMIN_USER/ADMIN_PASSWORD from
that secure source rather than committing values into the repo.

In `@docker-compose.yml`:
- Line 1: Fix the typo in the top-of-file comment in docker-compose.yml: change
the word "Intergrated" to "Integrated" in the comment line that currently reads
"Intergrated WARP Proxy." to correct the spelling.

In `@infra/common/cloud-init.template`:
- Around line 243-246: The fetch_url retry helper currently calls curl without
timeouts which can hang; update the fetch_url function (the curl invocation that
uses flags -sSfL "$1" -o "$2") to include explicit connect and overall timeouts
(e.g. --connect-timeout 10 and --max-time 30) so each attempt cannot block
indefinitely while keeping the existing retry loop intact.
- Around line 59-66: Replace the mutable :latest tags for the Prometheus and
Grafana services in cloud-init.template by pinning to explicit versioned image
tags or using environment variables; specifically update the image lines
referencing "prom/prometheus:latest" and "grafana/grafana:latest" to either
hard-versioned tags (e.g., prom/prometheus:vX.Y.Z and grafana/grafana:vA.B.C) or
to variables like ${PROMETHEUS_IMAGE} and ${GRAFANA_IMAGE} and
document/propagate those environment variables for controlled version
management.

In `@Makefile`:
- Around line 183-193: The tr -d '\\r' usages are inconsistently escaped;
standardize all tr -d invocations to use a single backslash escape (tr -d '\r')
inside the Makefile command substitutions so they behave the same in the bash
-lc double-quoted context. Update the tr calls associated with
TF_VAR_CADDY_CUSTOM_BROWSE_FILE_URL, TF_VAR_PROMETHEUS_CONFIG_URL,
TF_VAR_GRAFANA_DATASOURCE_URL, TF_VAR_GRAFANA_PROVIDER_URL,
TF_VAR_YTDLP_DASHBOARD_URL and TF_VAR_CAPTCHA_SECURITY_DASHBOARD_URL to use tr
-d '\r' (or document why any deviation is required).

---

Duplicate comments:
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Line 120: The dashboard query is incorrectly using label-based quantiles for
the emitted histogram metric ytdlp_job_duration_seconds; update the panels that
use ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} to use
histogram_quantile with the _bucket series and a rate window, e.g. replace those
expressions with histogram_quantile(0.5,
rate(ytdlp_job_duration_seconds_bucket[5m])) and histogram_quantile(0.99,
rate(ytdlp_job_duration_seconds_bucket[5m])); alternatively, if you intend to
expose quantile labels instead, change the metric definition in
src/services/ytdlp/mod.rs from histogram!() to summary!() so the quantile labels
exist.

---

Nitpick comments:
In `@Cargo.toml`:
- Around line 18-19: Update the Cargo.toml dependency entries for metrics and
metrics-exporter-prometheus by bumping the versions: set metrics = "0.24.5" and
metrics-exporter-prometheus = "0.18.3" (replace the existing metrics = "0.22"
and metrics-exporter-prometheus = "0.13" lines). Ensure Cargo.toml still
validates and run cargo update / cargo build to verify there are no new
compatibility issues.

In `@src/telemetry.rs`:
- Around line 45-57: The setup currently exposes /metrics unauthenticated in
setup_metrics_router (Router::new().route("/metrics", get(move ||
std::future::ready(handle.render())))); add an in-app, opt-in defense-in-depth
by reading an env toggle (e.g., METRICS_INTERNAL_ONLY or REQUIRE_METRICS_AUTH)
during setup_metrics_router and, when enabled, wrap the metrics handler with a
simple request validator that checks either the X-Forwarded-For/remote addr or a
shared internal header/token and returns 403 for disallowed requests; implement
the validation as a small async handler (or middleware) that calls
handle.render() only when the check passes so deployments without the Caddy
proxy remain protected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c5af420d-d4e8-43b4-a133-2dc70d5040ab

📥 Commits

Reviewing files that changed from the base of the PR and between 4d4d399 and 36afb34.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (29)
  • .env.example
  • .gemini/settings.json
  • Caddyfile
  • Caddyfile.local
  • Cargo.toml
  • Dockerfile
  • Dockerfile.dev
  • Makefile
  • README.md
  • docker-compose.dev.yml
  • docker-compose.yml
  • infra/common/cloud-init.template
  • infra/common/grafana/provisioning/dashboards/captcha-security.json
  • infra/common/grafana/provisioning/dashboards/default.yml
  • infra/common/grafana/provisioning/dashboards/ytdlp-health.json
  • infra/common/grafana/provisioning/datasources/prometheus.yml
  • infra/common/prometheus.yml
  • infra/digitalocean/accounts/naduns-team/main.tf
  • infra/digitalocean/accounts/naduns-team/variables.tf
  • infra/digitalocean/components/locals.tf
  • infra/digitalocean/components/variables.tf
  • postman/collections/Nadzu API/Metrics.request.yaml
  • postman/collections/Nadzu API/YT-DLP Enqueue.request.yaml
  • postman/environments/Nadzu Local.environment.yaml
  • src/app.rs
  • src/lib.rs
  • src/middleware/captcha.rs
  • src/services/ytdlp/mod.rs
  • src/telemetry.rs

Comment thread docker-compose.dev.yml
Comment thread docker-compose.yml Outdated
Comment thread infra/common/cloud-init.template
Comment on lines +243 to +246
if curl -sSfL "$1" -o "$2"; then return 0; fi
sleep $((2**i))
done
return 1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add curl timeouts in fetch_url to avoid bootstrap hangs.

Retries are present, but without connect/overall timeouts a single attempt can block for a long time and stall provisioning.

🔧 Suggested fix
-          if curl -sSfL "$1" -o "$2"; then return 0; fi
+          if curl -sSfL --connect-timeout 10 --max-time 120 "$1" -o "$2"; then return 0; fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/common/cloud-init.template` around lines 243 - 246, The fetch_url retry
helper currently calls curl without timeouts which can hang; update the
fetch_url function (the curl invocation that uses flags -sSfL "$1" -o "$2") to
include explicit connect and overall timeouts (e.g. --connect-timeout 10 and
--max-time 30) so each attempt cannot block indefinitely while keeping the
existing retry loop intact.

Comment thread Makefile
Comment thread src/services/ytdlp/mod.rs
Comment on lines +154 to +155
counter!("ytdlp_jobs_enqueued_total").increment(1);
gauge!("ytdlp_active_jobs").increment(1.0);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Finalize job metrics on all exit paths.

run_job returns early on setup/spawn failures, so ytdlp_active_jobs is never decremented and failure duration/completion metrics are skipped for those jobs.

Suggested patch
@@
     async fn run_job(
@@
-        let start_time = Instant::now();
+        let start_time = Instant::now();
@@
         let temp_dir = PathBuf::from(&output_dir).join("tmp").join(&id);
         if let Err(err) = fs::create_dir_all(&temp_dir).await {
             self.mark_job_failed(&id, format!("failed to create temp directory: {err}"));
+            self.record_job_metrics(start_time, false);
             return;
         }
@@
         let mut child = match cmd.spawn() {
             Ok(child) => child,
             Err(err) => {
                 self.mark_job_failed(&id, format!("failed to spawn yt-dlp: {err}"));
                 error!("failed ytdlp job id={id}: {err}");
                 let _ = fs::remove_dir_all(&temp_dir).await;
+                self.record_job_metrics(start_time, false);
                 return;
             }
         };
@@
-        histogram!("ytdlp_job_duration_seconds").record(start_time.elapsed().as_secs_f64());
-        if job_success {
-            counter!("ytdlp_jobs_completed_total", "status" => "success").increment(1);
-        } else {
-            counter!("ytdlp_jobs_completed_total", "status" => "error").increment(1);
-        }
-        gauge!("ytdlp_active_jobs").decrement(1.0);
+        self.record_job_metrics(start_time, job_success);
     }
+
+    fn record_job_metrics(&self, start_time: Instant, job_success: bool) {
+        histogram!("ytdlp_job_duration_seconds").record(start_time.elapsed().as_secs_f64());
+        if job_success {
+            counter!("ytdlp_jobs_completed_total", "status" => "success").increment(1);
+        } else {
+            counter!("ytdlp_jobs_completed_total", "status" => "error").increment(1);
+        }
+        gauge!("ytdlp_active_jobs").decrement(1.0);
+    }

Also applies to: 228-239, 309-316, 400-407

@nxdun nxdun merged commit 0faa6e2 into main May 12, 2026
2 checks passed
@nxdun nxdun deleted the nadun/observe-telem branch May 12, 2026 09:53