[feat(infra)] Add Prometheus and Grafana observability stack #14
Conversation
…hanced monitoring
…guration and add Prometheus datasource
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with comments from the influx of new commits. This behavior can be configured in the settings.
📝 Walkthrough

Adds centralized tracing and a `/metrics` endpoint, instruments yt-dlp and CAPTCHA middleware with Prometheus metrics, provisions Prometheus and Grafana (datasource + dashboards), protects observability routes in Caddy, and wires Makefile/terraform/cloud-init to publish and fetch presigned provisioning artifacts.

Changes: Observability and Monitoring System
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🧹 Nitpick comments (11)
Cargo.toml (1)
18-19: ⚡ Quick win: Update dependency versions to the latest stable releases.

The current versions (`metrics = "0.22"` and `metrics-exporter-prometheus = "0.13"`) are outdated. Consider updating to the latest versions: `metrics = "0.24.5"` and `metrics-exporter-prometheus = "0.18.3"`. This is low-effort maintenance that keeps the project on current, supported releases.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Cargo.toml` around lines 18 - 19: update the two crate versions in Cargo.toml; change the `metrics` dependency from "0.22" to "0.24.5" and change `metrics-exporter-prometheus` from "0.13" to "0.18.3", then run `cargo update` / `cargo build` to ensure compatibility and resolve any resulting dependency issues; target the dependency entries named `metrics` and `metrics-exporter-prometheus` when making the edits.
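For reference, a minimal sketch of the suggested dependency bump (version numbers taken from the comment above; other entries omitted):

```toml
[dependencies]
metrics = "0.24.5"
metrics-exporter-prometheus = "0.18.3"
```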
Caddyfile.local (1)

26-33: ⚡ Quick win: Consider adding IP-based access restrictions for internal monitoring paths.

The local Caddyfile exposes `/nadun/grafana/*` and `/metrics` without any access control, whereas the PR objectives mention an `@internal_only` IP matcher protecting these paths in production. Even in local development, exposing metrics and dashboards can leak sensitive operational data if the ports are accidentally exposed or in shared development environments.

Consider adding a matcher similar to production:

```diff
+@internal_only {
+    remote_ip 127.0.0.1 ::1
+}
+
 # 2. Local Grafana Access
 handle /nadun/grafana/* {
+    @internal_only
     reverse_proxy nadzu-grafana:3000
 }

 # 3. Local Metrics Access
 handle /metrics {
+    @internal_only
     reverse_proxy app:8080
 }
```

This ensures consistency between local and production configurations and prevents accidental exposure.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Caddyfile.local` around lines 26 - 33: add IP-based access restrictions to the local Caddy routing for the Grafana and metrics endpoints by applying the same `@internal_only` matcher used in production. Update the handle blocks for "handle /nadun/grafana/*" and "handle /metrics" to include the `@internal_only` matcher (or an equivalent local matcher that limits to loopback/private IPs) so only internal IPs can reach those routes; ensure the matcher name matches the production matcher ("@internal_only") or define a local equivalent at the top of the Caddyfile so routing is consistent and protected in development.

docker-compose.dev.yml (1)
5-6: 💤 Low value: Document the required build step for local development.

Line 6 changes from a `build` directive to `image: nadzu:local`, which means developers must explicitly run `make bd` (build Docker) before running `make rd` (run Docker compose). This is a workflow change that could cause confusion.

Consider adding a comment or updating documentation to clarify this dependency:

```diff
 app:
-  # Build Docker image from local Dockerfile (make bd)
+  # Requires pre-built image: run `make bd` first to build nadzu:local
   image: nadzu:local
```

Alternatively, you could keep the `build` context alongside the `image` tag, which allows Docker Compose to build if the image doesn't exist; see the sketch after the AI prompt below.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docker-compose.dev.yml` around lines 5 - 6: the docker-compose change replaced a build context with a static image ("image: nadzu:local"), which requires developers to run the build step (make bd) before make rd. Either add a clear comment near the "image: nadzu:local" line in docker-compose.dev.yml explaining "run make bd before make rd" and referencing the Makefile target, or restore a build context alongside the image (e.g., add a "build:" block for the same service while keeping "image: nadzu:local") so docker-compose can build the image automatically when missing; update the README or developer docs to mention the required make bd step and the intended workflow.
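A hedged sketch of the build-plus-image variant (the build context and Dockerfile path are assumptions, not taken from the repo):

```yaml
app:
  image: nadzu:local
  # Compose builds the image when it is missing; `make bd` can still pre-build it.
  build:
    context: .
    dockerfile: Dockerfile
```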
infra/common/grafana/provisioning/dashboards/ytdlp-health.json (1)

43-56: ⚖️ Poor tradeoff: Hardcoded max saturation value may not match runtime configuration.

Line 46 hardcodes `"max": 10` for the active workers gauge, but the actual `MAX_CONCURRENT_DOWNLOADS` is a configurable variable (see `infra/digitalocean/components/variables.tf` line 94). If the runtime configuration differs from 10, the gauge visualization will be misleading.

Consider either:
- Documenting that this dashboard assumes max=10, or
- Making the dashboard max value configurable via a Grafana variable that reads from a metric or annotation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json` around lines 43 - 56: the dashboard currently hardcodes "max": 10 in the active workers gauge under "fieldConfig" -> "defaults", which can misrepresent the real MAX_CONCURRENT_DOWNLOADS runtime value. Update the dashboard JSON to use a Grafana templating variable (e.g., add a new dashboard variable like "max_concurrent_downloads" and reference it as the max value) or remove the hardcoded max and let Grafana autoscale; specifically, modify the "fieldConfig.defaults.max" entry in ytdlp-health.json to reference the new variable (or delete it) and add the corresponding variable definition in the dashboard's "templating" section so the UI reflects the actual configured concurrency.

src/services/ytdlp/mod.rs (1)
228-228: ⚖️ Poor tradeoff: Consider whether duration should include queue wait time.

Line 228 starts the timer immediately upon entering `run_job()`, but the semaphore is acquired on line 231, meaning the duration includes queue wait time (when all worker slots are occupied).

This depends on your observability goals:
- Current behavior: Measures end-to-end latency from user perspective (enqueue → completion)
- Alternative: Start timer after acquiring the semaphore to measure only processing time
If you want to distinguish queue time from processing time, consider emitting two separate histograms:
```rust
let queue_start = Instant::now();
let _permit = self.semaphore.acquire().await.expect("semaphore closed");
histogram!("ytdlp_job_queue_duration_seconds").record(queue_start.elapsed().as_secs_f64());

let processing_start = Instant::now();
// ... job processing ...
histogram!("ytdlp_job_processing_duration_seconds").record(processing_start.elapsed().as_secs_f64());
```

This would provide better visibility into whether latency is due to queuing vs. slow downloads.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/ytdlp/mod.rs` at line 228: the current timer in run_job() starts before acquiring the semaphore (start_time = Instant::now()), so the measured duration includes queue wait time. To separate queue vs processing latency, record two histograms: capture queue_start before semaphore.acquire().await and record ytdlp_job_queue_duration_seconds after acquiring, then set processing_start = Instant::now() and record ytdlp_job_processing_duration_seconds at the end of processing; update references in run_job(), start_time, and the semaphore.acquire() logic, and ensure labels/units match existing histogram usage.

Makefile (1)
186-191: 💤 Low value: Consider preserving original file paths in S3 for clarity.

Lines 186, 188, and 190 rename files when uploading to S3:

- `grafana/provisioning/datasources/prometheus.yml` → `grafana_datasource.yml`
- `grafana/provisioning/dashboards/default.yml` → `grafana_provider.yml`
- Original path structure is lost

This renaming could cause confusion when debugging or auditing the S3 bucket. Consider either:

Preserve the directory structure in S3:

```sh
aws s3 cp infra/common/grafana/provisioning/datasources/prometheus.yml \
  "s3://$AWS_S3_BUCKET_NAME/terraform/data/grafana/provisioning/datasources/prometheus.yml"
```

Or document the naming mapping in a comment:

```make
# Upload Grafana datasource config (datasources/prometheus.yml → grafana_datasource.yml)
```

The current approach saves a few characters but reduces traceability between local files and S3 objects.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 186 - 191: the S3 uploads in the Makefile currently flatten filenames (the aws s3 cp lines that set TF_VAR_GRAFANA_DATASOURCE_URL, TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL), losing the original infra/common/grafana/... directory structure. Update the destination keys in those aws s3 cp commands to preserve the full path (e.g., use terraform/data/infra/common/grafana/provisioning/...) and update the corresponding aws s3 presign invocations that set TF_VAR_GRAFANA_DATASOURCE_URL, TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL to reference the new object keys, or alternatively add a clear comment above each aws s3 cp explaining the mapping if you intentionally want to keep the flattened names.

infra/common/cloud-init.template (3)
242-246: ⚡ Quick win: Presigned-URL fetches are single-shot on first boot — re-provisioning after URL expiry will fail.

These `curl` calls run as part of cloud-init `runcmd`, which executes only on first boot. If the instance is rebuilt or the cloud-init template re-runs after the presigned URLs expire (typical TTL is hours to a few days for S3), the script aborts via `set -euo pipefail` and the box never reaches `docker compose up`. Two cheap mitigations worth considering:

- Add a small retry/backoff per fetch (e.g., `curl --retry 5 --retry-delay 3 --retry-connrefused`) for transient network blips.
- Document/enforce in the Terraform `tf` target that these URLs must have a TTL longer than the longest expected provisioning window, and skip the fetch if the artifact already exists locally (so manual recovery is easy).

```diff
-  curl -sSfL "${PROMETHEUS_CONFIG_URL}" -o /opt/app/prometheus/prometheus.yml
-  curl -sSfL "${GRAFANA_DATASOURCE_URL}" -o /opt/app/grafana/provisioning/datasources/prometheus.yml
-  curl -sSfL "${GRAFANA_PROVIDER_URL}" -o /opt/app/grafana/provisioning/dashboards/default.yml
-  curl -sSfL "${YTDLP_DASHBOARD_URL}" -o /opt/app/grafana/provisioning/dashboards/ytdlp-health.json
+  curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${PROMETHEUS_CONFIG_URL}" -o /opt/app/prometheus/prometheus.yml
+  curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${GRAFANA_DATASOURCE_URL}" -o /opt/app/grafana/provisioning/datasources/prometheus.yml
+  curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${GRAFANA_PROVIDER_URL}" -o /opt/app/grafana/provisioning/dashboards/default.yml
+  curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "${YTDLP_DASHBOARD_URL}" -o /opt/app/grafana/provisioning/dashboards/ytdlp-health.json
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/common/cloud-init.template` around lines 242 - 246, The cloud-init runcmd curl fetches for PROMETHEUS_CONFIG_URL, GRAFANA_DATASOURCE_URL, GRAFANA_PROVIDER_URL and YTDLP_DASHBOARD_URL are single-shot and will fail the whole boot if presigned URLs expire; update the cloud-init snippet that runs these curl commands to (1) skip each download if the target file already exists (check /opt/app/... paths before calling curl) and (2) add robust curl retry flags (e.g., --retry, --retry-delay, --retry-connrefused) so transient/network errors don’t abort due to set -euo pipefail, and ensure the commands return non-fatal status when skipping or after retries.
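A minimal sketch of such a guard, assuming the artifact paths above (the helper name is hypothetical):

```bash
# Hypothetical helper: skip the fetch when the artifact is already present,
# and retry transient failures instead of aborting under `set -euo pipefail`.
fetch_if_missing() {
  local url="$1" dest="$2"
  [ -s "$dest" ] && return 0
  curl -sSfL --retry 5 --retry-delay 3 --retry-connrefused "$url" -o "$dest"
}

fetch_if_missing "${PROMETHEUS_CONFIG_URL}" /opt/app/prometheus/prometheus.yml
```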
58-75: 💤 Low value: Grafana has no persistent volume — dashboards, alerts, API keys, and users vanish on container restart.

Only `/etc/grafana/provisioning` is bind-mounted; Grafana's data directory (`/var/lib/grafana`: SQLite DB, alert state, dashboard versions, API tokens, user sessions) is ephemeral. On every `docker compose up` cycle, alert history and any non-provisioned state is lost. If you intend "everything is code via provisioning", that is defensible — but at least the alert state and rendered images are usually worth keeping:

💾 Suggested addition

```diff
   grafana:
     image: grafana/grafana:latest
     restart: ${DOCKER_RESTART_POLICY}
     environment:
       - GF_AUTH_ANONYMOUS_ENABLED=true
       - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
       - GF_SERVER_SERVE_FROM_SUB_PATH=true
       - GF_SERVER_ROOT_URL=https://${PRODUCTION_DOMAIN}/nadun/grafana/
     volumes:
       - /opt/app/grafana/provisioning:/etc/grafana/provisioning
+      - grafana-data:/var/lib/grafana
     depends_on:
       - prometheus
+
+volumes:
+  grafana-data:
```

Same consideration applies to Prometheus (TSDB at `/prometheus`) — scrape history is lost on restart, which can hurt incident retrospectives.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/common/cloud-init.template` around lines 58 - 75, The Grafana service declaration lacks a persistent data volume so its runtime state (SQLite DB, dashboards, alerts, API keys, sessions) is lost on container restart; update the grafana service in the cloud-init.template (service name "grafana") to add a bind or named volume mapping the Grafana data dir (/var/lib/grafana) to a host path or Docker volume and do the same for Prometheus by mapping Prometheus TSDB (/prometheus) to persist scrape history; ensure the volumes entries and any necessary ownership/permissions are set so Grafana and Prometheus can write to those paths.
58-62: 💤 Low value: Prometheus should depend on `app` so the initial scrape target resolves.

`prometheus` has no `depends_on`, so it can start before `app` and log connection errors against `app:8080` until the app comes up. Not fatal (Prometheus retries), but the cleaner ordering matches what `grafana` already does for `prometheus`:

```diff
   prometheus:
     image: prom/prometheus:latest
     restart: ${DOCKER_RESTART_POLICY}
     volumes:
       - /opt/app/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
+    depends_on:
+      - app
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/common/cloud-init.template` around lines 58 - 62: add a depends_on entry to the prometheus service so it waits for the app service like grafana does; specifically, update the prometheus service block (service name "prometheus") to include a depends_on: - app line so Prometheus starts after the "app" service and its initial scrape target app:8080 resolves cleanly (mirror the existing pattern used by the "grafana" service).
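For context, a hedged sketch of the scrape configuration implied by the `app:8080` target; the real `infra/common/prometheus.yml` may differ in job name and interval (both are assumptions here):

```yaml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: nadzu-app        # job name is an assumption
    metrics_path: /metrics
    static_configs:
      - targets: ["app:8080"]  # Docker Compose service DNS name
```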
src/telemetry.rs (1)

45-56: ⚡ Quick win: Consider splitting global recorder install from router construction.

`setup_metrics_router` has two responsibilities: it installs the global Prometheus recorder (a one-shot, process-wide side effect) and returns a router. Calling it twice (e.g., in an integration test, a re-init path, or future code that builds multiple routers) will panic via `expect`, and the name does not signal that side effect. Splitting them keeps the surface honest and testable:

♻️ Suggested refactor

```diff
-/// Initializes the Prometheus metrics registry and returns the router.
-pub fn setup_metrics_router() -> Router<AppState> {
-    let builder = PrometheusBuilder::new();
-
-    // Install the global recorder
-    #[allow(clippy::expect_used)]
-    let handle = builder
-        .install_recorder()
-        .expect("Failed to install Prometheus recorder");
-
-    // Route that Prometheus will scrape
-    Router::new().route("/metrics", get(move || std::future::ready(handle.render())))
-}
+/// Installs the global Prometheus recorder. Must be called exactly once at startup.
+#[must_use]
+pub fn init_metrics() -> PrometheusHandle {
+    #[allow(clippy::expect_used)]
+    PrometheusBuilder::new()
+        .install_recorder()
+        .expect("Failed to install Prometheus recorder")
+}
+
+/// Builds the router exposing `GET /metrics` for Prometheus scraping.
+#[must_use]
+pub fn metrics_router(handle: PrometheusHandle) -> Router<AppState> {
+    Router::new().route("/metrics", get(move || std::future::ready(handle.render())))
+}
```

You'll also need `use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle};` and to update the call site in `app.rs`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/telemetry.rs` around lines 45 - 56: setup_metrics_router currently both installs the global Prometheus recorder and builds the Router, causing a process-wide side effect and a panic if called twice. Extract the recorder installation into a separate function (e.g., install_prometheus_recorder) that calls PrometheusBuilder::new().install_recorder() and returns the PrometheusHandle (use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle}), then change setup_metrics_router to only construct and return Router<AppState>, taking a PrometheusHandle (or capturing it) for the /metrics handler; update callers (e.g., in app.rs) to call install_prometheus_recorder once at process init and pass the handle into setup_metrics_router.
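A hedged sketch of the updated call site (the `AppState` wiring and router composition in `app.rs` are assumptions, not the PR's actual code):

```rust
// Assumed wiring: install the recorder exactly once, then mount the metrics route.
fn build_app(state: AppState) -> axum::Router {
    let handle = telemetry::init_metrics(); // one-shot, process-wide side effect
    axum::Router::new()
        .merge(telemetry::metrics_router(handle))
        .with_state(state)
}
```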
Caddyfile (1)

18-20: 💤 Low value: Document the expected format and consider validation for `SSH_ALLOWED_IPS` to prevent configuration errors.

`SSH_ALLOWED_IPS` is implicitly space-separated via `join(" ", var.SSH_ALLOWED_IPS)` in locals.tf, but the Terraform variable description does not specify this format or explain that values should be CIDR blocks or IP addresses. Additionally, there is no validation preventing empty arrays; if `var.SSH_ALLOWED_IPS = []`, the Caddyfile matcher becomes `not client_ip` (empty), which will fail to parse at startup. More problematically, if the array contains syntactically valid but stale/non-matching CIDR blocks, the matcher silently denies all external traffic without obvious indication until runtime, making it hard to detect configuration mistakes.

Consider either:
- Adding a variable constraint in Terraform to reject empty arrays (a hedged sketch follows this list)
- Documenting the expected format in the variable description (space-separated CIDR blocks/IPs)
- Adding a startup validation check to confirm at least one IP is reachable
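A minimal sketch of such a Terraform constraint, assuming entries use CIDR notation and that the variable lives alongside the others in the components module (both assumptions):

```hcl
variable "SSH_ALLOWED_IPS" {
  type        = list(string)
  description = "Space-joined into the Caddy @internal_only matcher; CIDR blocks, e.g. [\"203.0.113.0/24\"]."

  validation {
    # Reject empty lists and any entry that is not valid CIDR notation.
    condition = length(var.SSH_ALLOWED_IPS) > 0 && alltrue([
      for ip in var.SSH_ALLOWED_IPS : can(cidrhost(ip, 0))
    ])
    error_message = "SSH_ALLOWED_IPS must contain at least one valid CIDR block."
  }
}
```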
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Caddyfile` around lines 18 - 20, The Caddyfile matcher '@internal_only' is built from locals.tf using join(" ", var.SSH_ALLOWED_IPS) and can produce an empty/invalid matcher when SSH_ALLOWED_IPS is empty or malformed; add a Terraform variable validation block for SSH_ALLOWED_IPS to reject empty lists and validate each entry as a CIDR/IP (use a validation rule with alltrue/can(regexmatch(...)) or a CIDR helper), update the SSH_ALLOWED_IPS variable description to document the expected space-separated CIDR/IP format, and change the Caddyfile rendering logic to conditionally emit the "not client_ip ${SSH_ALLOWED_IPS}" matcher only when SSH_ALLOWED_IPS is non-empty to avoid producing "not client_ip " at startup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Caddyfile`:
- Around line 48-51: The /metrics handler hardcodes reverse_proxy target
"app:8080" causing divergence from the rest of the config; update the
reverse_proxy in the handle /metrics block to use the same APP_PORT placeholder
used elsewhere (replace the literal 8080 with ${APP_PORT}) so the /metrics proxy
remains consistent with the public API proxy (also mirror this change in
Caddyfile.local and infra/common/cloud-init.template as noted).
In `@infra/common/cloud-init.template`:
- Around line 248-249: The command "chmod -R 644" in the diff removes the
execute bit from directories (breaking Grafana provisioning); replace it with a
mode that preserves execute on directories, e.g. "chmod -R a=rwX" against the
prometheus and grafana provisioning directories or use the explicit pair of find
commands to set dirs to 755 and files to 644, and optionally add "chown -R
472:472" for the grafana provisioning tree to ensure the Grafana UID can access
the files.
- Around line 64-75: The Grafana service in the template currently enables
anonymous access with an Admin role (GF_AUTH_ANONYMOUS_ENABLED=true and
GF_AUTH_ANONYMOUS_ORG_ROLE=Admin) which is unsafe; update the grafana service
environment to either disable anonymous access (set
GF_AUTH_ANONYMOUS_ENABLED=false) or at minimum change GF_AUTH_ANONYMOUS_ORG_ROLE
to Viewer, and remove any Admin privileges for anonymous users; ensure the
change is applied to the grafana block that sets GF_SERVER_ROOT_URL and volumes,
and update any deployment notes/config to require authenticated access (and
validate Caddy `@internal_only` controls still protect admin endpoints).
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Around line 18-31: The "noValue" property inside the JSON block under
fieldConfig.defaults is currently a string ("100") which prevents Grafana from
doing numeric threshold comparisons; change the noValue value to a numeric
literal (100) so panels like the stat panel will evaluate the thresholds
correctly — locate the fieldConfig -> defaults -> noValue entry and replace the
quoted string with a numeric value.
- Around line 117-128: The dashboard queries currently reference
ytdlp_job_duration_seconds with quantile labels which won't exist because the
metric is emitted as a histogram via histogram!() in src/services/ytdlp/mod.rs;
update the grafana panel targets (the expressions that reference
ytdlp_job_duration_seconds) to use PromQL histogram_quantile, e.g.
histogram_quantile(0.5, rate(ytdlp_job_duration_seconds_bucket[5m])) and
histogram_quantile(0.99, rate(ytdlp_job_duration_seconds_bucket[5m])) so they
read from the _bucket series, or alternatively change the metric in
src/services/ytdlp/mod.rs from a histogram to a summary if you really want
direct quantile labels.
In `@src/telemetry.rs`:
- Around line 45-52: The metrics route created by setup_metrics_router
(returning Router<AppState>) is merged into the main app without authentication,
exposing /metrics; protect it by either adding authentication middleware to the
Router returned by setup_metrics_router (apply the existing auth middleware
function/method used elsewhere before merging), or stop merging it into the
public router and instead serve that Router on a separate internal listener
(bind a dedicated internal port) or gate it behind an env-driven check that only
enables metrics when Caddy/internal proxy is present; update the code that
merges the returned Router into the main app to use the protected Router (or
move the merge to the internal listener) so /metrics is no longer directly
accessible.
---
Nitpick comments:
In `@Caddyfile`:
- Around line 18-20: The Caddyfile matcher '@internal_only' is built from
locals.tf using join(" ", var.SSH_ALLOWED_IPS) and can produce an empty/invalid
matcher when SSH_ALLOWED_IPS is empty or malformed; add a Terraform variable
validation block for SSH_ALLOWED_IPS to reject empty lists and validate each
entry as a CIDR/IP (use a validation rule with alltrue/can(regexmatch(...)) or a
CIDR helper), update the SSH_ALLOWED_IPS variable description to document the
expected space-separated CIDR/IP format, and change the Caddyfile rendering
logic to conditionally emit the "not client_ip ${SSH_ALLOWED_IPS}" matcher only
when SSH_ALLOWED_IPS is non-empty to avoid producing "not client_ip " at
startup.
In `@Caddyfile.local`:
- Around line 26-33: Add IP-based access restrictions to the local Caddy routing
for the Grafana and metrics endpoints by applying the same `@internal_only`
matcher used in production: update the handle blocks for "handle
/nadun/grafana/*" and "handle /metrics" to include the `@internal_only` matcher
(or an equivalent local matcher that limits to loopback/private IPs) so only
internal IPs can reach those routes; ensure the matcher name matches the
production matcher ("@internal_only") or define a local equivalent at the top of
the Caddyfile so routing is consistent and protected in development.
In `@Cargo.toml`:
- Around line 18-19: Update the two crate versions in Cargo.toml: change the
metrics dependency from "0.22" to "0.24.5" and change
metrics-exporter-prometheus from "0.13" to "0.18.3", then run cargo update /
cargo build to ensure compatibility and resolve any resulting dependency issues;
target the dependency entries named metrics and metrics-exporter-prometheus when
making the edits.
In `@docker-compose.dev.yml`:
- Around line 5-6: The docker-compose change replaced a build context with a
static image ("image: nadzu:local") which requires developers to run the build
step (make bd) before make rd; update the repo to document this by either adding
a clear comment near the "image: nadzu:local" line in docker-compose.dev.yml
explaining "run make bd before make rd" and referencing the Makefile target, or
restore a build context alongside the image (e.g., add a "build:" block for the
same service while keeping "image: nadzu:local") so docker-compose can build the
image automatically when missing; modify README or developer docs to mention the
required make bd step and the intended workflow.
In `@infra/common/cloud-init.template`:
- Around line 242-246: The cloud-init runcmd curl fetches for
PROMETHEUS_CONFIG_URL, GRAFANA_DATASOURCE_URL, GRAFANA_PROVIDER_URL and
YTDLP_DASHBOARD_URL are single-shot and will fail the whole boot if presigned
URLs expire; update the cloud-init snippet that runs these curl commands to (1)
skip each download if the target file already exists (check /opt/app/... paths
before calling curl) and (2) add robust curl retry flags (e.g., --retry,
--retry-delay, --retry-connrefused) so transient/network errors don’t abort due
to set -euo pipefail, and ensure the commands return non-fatal status when
skipping or after retries.
- Around line 58-75: The Grafana service declaration lacks a persistent data
volume so its runtime state (SQLite DB, dashboards, alerts, API keys, sessions)
is lost on container restart; update the grafana service in the
cloud-init.template (service name "grafana") to add a bind or named volume
mapping the Grafana data dir (/var/lib/grafana) to a host path or Docker volume
and do the same for Prometheus by mapping Prometheus TSDB (/prometheus) to
persist scrape history; ensure the volumes entries and any necessary
ownership/permissions are set so Grafana and Prometheus can write to those
paths.
- Around line 58-62: Add a depends_on entry to the prometheus service so it
waits for the app service like grafana does; specifically update the prometheus
service block (service name "prometheus") to include a depends_on: - app line so
Prometheus starts after the "app" service and its initial scrape target app:8080
resolves cleanly (mirror the existing pattern used by the "grafana" service).
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Around line 43-56: The dashboard currently hardcodes "max": 10 in the active
workers gauge under "fieldConfig" -> "defaults", which can misrepresent the real
MAX_CONCURRENT_DOWNLOADS runtime value; update the dashboard JSON to use a
Grafana templating variable (e.g., add a new dashboard variable like
"max_concurrent_downloads" and reference it as the max value) or remove the
hardcoded max and let Grafana autoscale; specifically modify the
"fieldConfig.defaults.max" entry in ytdlp-health.json to reference the new
variable (or delete it) and add the corresponding variable definition in the
dashboard's "templating" section so the UI reflects the actual configured
concurrency.
In `@Makefile`:
- Around line 186-191: The S3 uploads in the Makefile currently flatten
filenames (the aws s3 cp lines that set TF_VAR_GRAFANA_DATASOURCE_URL,
TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL) losing the original
infra/common/grafana/... directory structure; update the destination keys in
those aws s3 cp commands to preserve the full path (e.g., use
terraform/data/infra/common/grafana/provisioning/...) and update the
corresponding aws s3 presign invocations that set TF_VAR_GRAFANA_DATASOURCE_URL,
TF_VAR_GRAFANA_PROVIDER_URL, and TF_VAR_YTDLP_DASHBOARD_URL to reference the new
object keys, or alternatively add a clear comment above each aws s3 cp
explaining the mapping if you intentionally want to keep the flattened names.
In `@src/services/ytdlp/mod.rs`:
- Line 228: The current timer in run_job() starts before acquiring the semaphore
(start_time = Instant::now()), so measured duration includes queue wait time; to
separate queue vs processing latency, record two histograms: capture queue_start
before semaphore.acquire().await and record ytdlp_job_queue_duration_seconds
after acquiring, then set processing_start = Instant::now() and record
ytdlp_job_processing_duration_seconds at the end of processing; update
references in run_job(), start_time, and semaphore.acquire() logic and ensure
labels/units match existing histogram usage.
In `@src/telemetry.rs`:
- Around line 45-56: setup_metrics_router currently both installs the global
Prometheus recorder and builds the Router, causing a process-wide side effect
and a panic if called twice; extract the recorder installation into a separate
function (e.g., install_prometheus_recorder) that calls
PrometheusBuilder::new().install_recorder() and returns the PrometheusHandle
(use metrics_exporter_prometheus::{PrometheusBuilder, PrometheusHandle}), then
change setup_metrics_router to only construct and return Router<AppState> that
takes a PrometheusHandle (or captures it) for the /metrics handler; update
callers (e.g., in app.rs) to call install_prometheus_recorder once at process
init and pass the handle into setup_metrics_router.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: f66c7dd5-b030-4a63-9bb0-0a08ea105651
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (19)
- Caddyfile
- Caddyfile.local
- Cargo.toml
- Dockerfile
- Makefile
- docker-compose.dev.yml
- infra/common/cloud-init.template
- infra/common/grafana/provisioning/dashboards/default.yml
- infra/common/grafana/provisioning/dashboards/ytdlp-health.json
- infra/common/grafana/provisioning/datasources/prometheus.yml
- infra/common/prometheus.yml
- infra/digitalocean/accounts/naduns-team/main.tf
- infra/digitalocean/accounts/naduns-team/variables.tf
- infra/digitalocean/components/locals.tf
- infra/digitalocean/components/variables.tf
- src/app.rs
- src/lib.rs
- src/services/ytdlp/mod.rs
- src/telemetry.rs
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@infra/common/cloud-init.template`:
- Around line 239-244: The presigned URL curl downloads (invocations using
CADDY_CUSTOM_BROWSE_FILE_URL, PROMETHEUS_CONFIG_URL, GRAFANA_DATASOURCE_URL,
GRAFANA_PROVIDER_URL, YTDLP_DASHBOARD_URL that write to /opt/app/...) must be
made resilient: replace the direct curl -sSfL calls with a small retry/backoff
wrapper (either use curl's --retry/--retry-delay/--retry-max-time flags or a
bash function that retries with exponential backoff) that attempts each download
several times before failing; ensure each download logs attempts and exits
non‑zero only after retries are exhausted so provisioning can fail fast if
permanently unreachable.
- Around line 70-71: The generated docker-compose file currently embeds
GF_SECURITY_ADMIN_PASSWORD and ends up world-readable; update
cloud-init.template so that after rendering the compose content (including
GF_SECURITY_ADMIN_PASSWORD) you set strict ownership and permissions—chown
root:root and chmod 0600 on the resulting /opt/app/docker-compose.yml (or write
the secret into a root-owned env file with 0600 and reference it from the
compose file); alternatively switch to using Docker secrets, but at minimum
ensure the file containing GF_SECURITY_ADMIN_PASSWORD is created with root-only
read/write (0600) and owned by root.
In `@infra/digitalocean/accounts/naduns-team/variables.tf`:
- Around line 203-216: The three Terraform variables PROMETHEUS_CONFIG_URL,
GRAFANA_DATASOURCE_URL, and GRAFANA_PROVIDER_URL should be marked sensitive so
their presigned URL values are redacted in plan/apply output; update each
variable declaration (variable "PROMETHEUS_CONFIG_URL", variable
"GRAFANA_DATASOURCE_URL", variable "GRAFANA_PROVIDER_URL" and any other similar
presigned-URL variables) to include sensitive = true in the block and keep the
description/type as-is.
In `@infra/digitalocean/components/variables.tf`:
- Around line 170-183: The Terraform variable blocks for PROMETHEUS_CONFIG_URL,
GRAFANA_DATASOURCE_URL, and GRAFANA_PROVIDER_URL should be marked sensitive to
avoid exposing presigned URLs; edit the variable declarations (variable
"PROMETHEUS_CONFIG_URL", variable "GRAFANA_DATASOURCE_URL", variable
"GRAFANA_PROVIDER_URL") and add sensitive = true to each block, and apply the
same change to the other presigned URL variables referenced in the file (the
blocks around lines 196-199) so all presigned URL inputs are declared sensitive.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 6fdbea62-f911-4f15-b145-758a4f7af67d
📒 Files selected for processing (8)
- .env.example
- docker-compose.dev.yml
- infra/common/cloud-init.template
- infra/common/grafana/provisioning/dashboards/ytdlp-health.json
- infra/digitalocean/accounts/naduns-team/main.tf
- infra/digitalocean/accounts/naduns-team/variables.tf
- infra/digitalocean/components/locals.tf
- infra/digitalocean/components/variables.tf
🚧 Files skipped from review as they are similar to previous changes (3)
- docker-compose.dev.yml
- infra/digitalocean/components/locals.tf
- infra/common/grafana/provisioning/dashboards/ytdlp-health.json
…iguration and security
…rometheus query expressions in Grafana dashboard
Actionable comments posted: 1
♻️ Duplicate comments (1)
infra/common/grafana/provisioning/dashboards/ytdlp-health.json (1)
120-125: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Use `histogram_quantile()` over `_bucket` series for P50/P99 latency.

The current expressions expect `quantile` labels on `ytdlp_job_duration_seconds`, which is not how Prometheus histograms are queried. This will produce empty series unless the metric is a summary.

🔧 Proposed fix

```diff
-          "expr": "ytdlp_job_duration_seconds{quantile=\"0.5\"}",
+          "expr": "histogram_quantile(0.5, sum(rate(ytdlp_job_duration_seconds_bucket[5m])) by (le))",
           "legendFormat": "P50 (Median)"
@@
-          "expr": "ytdlp_job_duration_seconds{quantile=\"0.99\"}",
+          "expr": "histogram_quantile(0.99, sum(rate(ytdlp_job_duration_seconds_bucket[5m])) by (le))",
           "legendFormat": "P99 (Worst Case)"
```

```bash
#!/bin/bash
# Verify metric type emitted by app code (histogram vs summary) and dashboard query usage.
rg -n -C2 'histogram!\(|summary!\(|ytdlp_job_duration_seconds' src || true
rg -n -C2 'ytdlp_job_duration_seconds\{quantile=|histogram_quantile|_bucket' infra/common/grafana/provisioning/dashboards/ytdlp-health.json
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json` around lines 120 - 125, The dashboard queries use a non-existent quantile label on ytdlp_job_duration_seconds (summary-style) and should instead use Prometheus histogram queries: replace the exprs that reference ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} with histogram_quantile(0.5, ...) and histogram_quantile(0.99, ...) over the ytdlp_job_duration_seconds_bucket series (typically wrapping a sum(rate(...[5m])) by (le, <other_labels>)); ensure you aggregate/group by the same identifying labels you want in the legend and keep legendFormat as "P50 (Median)" / "P99" accordingly so the panels return data from the histogram buckets.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Line 14: The current PromQL expression uses a raw division of two
sum(rate(...)) calls causing NaN/Inf when there are zero completions; update the
expression that references the metric ytdlp_jobs_completed_total so the
denominator is protected with clamp_min (e.g., use
clamp_min(sum(rate(ytdlp_jobs_completed_total[5m])), 1e-9)) before dividing and
then multiply by 100, ensuring the success-rate calculation never divides by
zero and yields 0% when there are no successes.
---
Duplicate comments:
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Around line 120-125: The dashboard queries use a non-existent quantile label
on ytdlp_job_duration_seconds (summary-style) and should instead use Prometheus
histogram queries: replace the exprs that reference
ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} with
histogram_quantile(0.5, ...) and histogram_quantile(0.99, ...) over the
ytdlp_job_duration_seconds_bucket series (typically wrapping a
sum(rate(...[5m])) by (le, <other_labels>)); ensure you aggregate/group by the
same identifying labels you want in the legend and keep legendFormat as "P50
(Median)" / "P99" accordingly so the panels return data from the histogram
buckets.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 26fb9a99-5669-4d4e-a303-77f1dba885a4
📒 Files selected for processing (4)
- docker-compose.dev.yml
- infra/common/cloud-init.template
- infra/common/grafana/provisioning/dashboards/ytdlp-health.json
- postman/collections/Nadzu API/Metrics.request.yaml
✅ Files skipped from review due to trivial changes (1)
- postman/collections/Nadzu API/Metrics.request.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- infra/common/cloud-init.template
- docker-compose.dev.yml
… rate expression in Grafana dashboard
…for CAPTCHA verification
…additional insights
…onymous access settings
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 6
♻️ Duplicate comments (1)
infra/common/grafana/provisioning/dashboards/ytdlp-health.json (1)
120-120: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Past issue unresolved — queries still use direct `quantile` labels instead of `histogram_quantile()`.

The past review (marked as addressed in commit 0d3dc0a) flagged that these queries reference `ytdlp_job_duration_seconds{quantile="0.5"}` and `{quantile="0.99"}` directly, but the metric is emitted as a `histogram!()` in `src/services/ytdlp/mod.rs`. Histograms don't expose `quantile` labels by default—they export `_bucket`, `_sum`, and `_count` suffixes.

The correct PromQL pattern for histogram quantiles is:

```promql
histogram_quantile(0.5, rate(ytdlp_job_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(ytdlp_job_duration_seconds_bucket[5m]))
```

Alternatively, if quantiles are intended to be exposed as labels, change the metric type from `histogram!()` to `summary!()` in the Rust code.

🔧 Proposed fix using histogram_quantile

```diff
 {
   "datasource": "Prometheus",
-  "expr": "ytdlp_job_duration_seconds{quantile=\"0.5\"}",
+  "expr": "histogram_quantile(0.5, rate(ytdlp_job_duration_seconds_bucket[5m]))",
   "legendFormat": "P50 (Median)"
 },
 {
   "datasource": "Prometheus",
-  "expr": "ytdlp_job_duration_seconds{quantile=\"0.99\"}",
+  "expr": "histogram_quantile(0.99, rate(ytdlp_job_duration_seconds_bucket[5m]))",
   "legendFormat": "P99 (Worst Case)"
 }
```

Also applies to: 125-125
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json` at line 120, The dashboard query is incorrectly using label-based quantiles for the emitted histogram metric ytdlp_job_duration_seconds; update the panels that use ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} to use histogram_quantile with the _bucket series and a rate window, e.g. replace those expressions with histogram_quantile(0.5, rate(ytdlp_job_duration_seconds_bucket[5m])) and histogram_quantile(0.99, rate(ytdlp_job_duration_seconds_bucket[5m])); alternatively, if you intend to expose quantile labels instead, change the metric definition in src/services/ytdlp/mod.rs from histogram!() to summary!() so the quantile labels exist.
🧹 Nitpick comments (2)
Cargo.toml (1)
18-19: Update `metrics` and `metrics-exporter-prometheus` to the latest stable versions.

The current versions (`metrics` 0.22 and `metrics-exporter-prometheus` 0.13) are outdated. Latest stable versions are `metrics` 0.24.5 and `metrics-exporter-prometheus` 0.18.3. While no known security vulnerabilities exist for the current versions, updating to the latest releases ensures access to bug fixes and improvements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Cargo.toml` around lines 18 - 19: update the Cargo.toml dependency entries for metrics and metrics-exporter-prometheus by bumping the versions; set metrics = "0.24.5" and metrics-exporter-prometheus = "0.18.3" (replace the existing metrics = "0.22" and metrics-exporter-prometheus = "0.13" lines). Ensure Cargo.toml still validates and run cargo update / cargo build to verify there are no new compatibility issues.

src/telemetry.rs (1)
45-57: ⚡ Quick win: Past security concern mitigated by infrastructure layer.

The past review flagged that `/metrics` is exposed without authentication at the application layer (critical issue). While the endpoint remains unauthenticated in-app, the PR's infrastructure changes add protection:

- `Caddyfile` and `infra/common/cloud-init.template` implement `@internal_only` matchers that restrict `/metrics` access to internal clients only
- The production deployment routes external traffic through Caddy, which blocks unauthorized `/metrics` requests before they reach the app

This is a valid defense-in-depth approach for observability endpoints. However, if the app is ever deployed without the Caddy reverse proxy (or if the Caddy config is misconfigured), `/metrics` would be publicly exposed.

Consider adding an environment-driven toggle or request source validation in-app as a defense-in-depth measure, especially if the app might be deployed in environments where the reverse proxy protection isn't guaranteed; a hedged sketch follows the AI prompt below.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/telemetry.rs` around lines 45 - 57, The setup currently exposes /metrics unauthenticated in setup_metrics_router (Router::new().route("/metrics", get(move || std::future::ready(handle.render())))); add an in-app, opt-in defense-in-depth by reading an env toggle (e.g., METRICS_INTERNAL_ONLY or REQUIRE_METRICS_AUTH) during setup_metrics_router and, when enabled, wrap the metrics handler with a simple request validator that checks either the X-Forwarded-For/remote addr or a shared internal header/token and returns 403 for disallowed requests; implement the validation as a small async handler (or middleware) that calls handle.render() only when the check passes so deployments without the Caddy proxy remain protected.
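A minimal sketch of such an in-app guard, assuming axum and a shared-token header; the env var name, header name, and `render_metrics` stand-in are all hypothetical, not the PR's code:

```rust
use axum::http::{HeaderMap, StatusCode};

// Hypothetical env-gated guard: when METRICS_TOKEN is set, require a matching
// x-internal-token header before rendering metrics; otherwise serve openly.
async fn guarded_metrics(headers: HeaderMap) -> Result<String, StatusCode> {
    if let Ok(expected) = std::env::var("METRICS_TOKEN") {
        let presented = headers
            .get("x-internal-token")
            .and_then(|v| v.to_str().ok());
        if presented != Some(expected.as_str()) {
            return Err(StatusCode::FORBIDDEN);
        }
    }
    Ok(render_metrics()) // stand-in for PrometheusHandle::render()
}

// Stand-in so the sketch is self-contained; real code would capture the handle.
fn render_metrics() -> String {
    String::new()
}
```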
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker-compose.dev.yml`:
- Around line 49-52: Remove the hardcoded Grafana admin credentials in
docker-compose.dev.yml (GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD);
instead read them from an external secret source (e.g., a .env file or Docker
secrets) or omit them so Grafana generates a random admin password on first
start, and update any local docs/scripts to load ADMIN_USER/ADMIN_PASSWORD from
that secure source rather than committing values into the repo.
In `@docker-compose.yml`:
- Line 1: Fix the typo in the top-of-file comment in docker-compose.yml: change
the word "Intergrated" to "Integrated" in the comment line that currently reads
"Intergrated WARP Proxy." to correct the spelling.
In `@infra/common/cloud-init.template`:
- Around line 243-246: The fetch_url retry helper currently calls curl without
timeouts which can hang; update the fetch_url function (the curl invocation that
uses flags -sSfL "$1" -o "$2") to include explicit connect and overall timeouts
(e.g. --connect-timeout 10 and --max-time 30) so each attempt cannot block
indefinitely while keeping the existing retry loop intact.
- Around line 59-66: Replace the mutable :latest tags for the Prometheus and
Grafana services in cloud-init.template by pinning to explicit versioned image
tags or using environment variables; specifically update the image lines
referencing "prom/prometheus:latest" and "grafana/grafana:latest" to either
hard-versioned tags (e.g., prom/prometheus:vX.Y.Z and grafana/grafana:vA.B.C) or
to variables like ${PROMETHEUS_IMAGE} and ${GRAFANA_IMAGE} and
document/propagate those environment variables for controlled version
management.
In `@Makefile`:
- Around line 183-193: The tr -d '\\r' usages are inconsistently escaped;
standardize all tr -d invocations to use a single backslash escape (tr -d '\r')
inside the Makefile command substitutions so they behave the same in the bash
-lc double-quoted context. Update the tr calls associated with
TF_VAR_CADDY_CUSTOM_BROWSE_FILE_URL, TF_VAR_PROMETHEUS_CONFIG_URL,
TF_VAR_GRAFANA_DATASOURCE_URL, TF_VAR_GRAFANA_PROVIDER_URL,
TF_VAR_YTDLP_DASHBOARD_URL and TF_VAR_CAPTCHA_SECURITY_DASHBOARD_URL to use tr
-d '\r' (or document why any deviation is required).
---
Duplicate comments:
In `@infra/common/grafana/provisioning/dashboards/ytdlp-health.json`:
- Line 120: The dashboard query is incorrectly using label-based quantiles for
the emitted histogram metric ytdlp_job_duration_seconds; update the panels that
use ytdlp_job_duration_seconds{quantile="0.5"} and {quantile="0.99"} to use
histogram_quantile with the _bucket series and a rate window, e.g. replace those
expressions with histogram_quantile(0.5,
rate(ytdlp_job_duration_seconds_bucket[5m])) and histogram_quantile(0.99,
rate(ytdlp_job_duration_seconds_bucket[5m])); alternatively, if you intend to
expose quantile labels instead, change the metric definition in
src/services/ytdlp/mod.rs from histogram!() to summary!() so the quantile labels
exist.
---
Nitpick comments:
In `@Cargo.toml`:
- Around line 18-19: Update the Cargo.toml dependency entries for metrics and
metrics-exporter-prometheus by bumping the versions: set metrics = "0.24.5" and
metrics-exporter-prometheus = "0.18.3" (replace the existing metrics = "0.22"
and metrics-exporter-prometheus = "0.13" lines). Ensure Cargo.toml still
validates and run cargo update / cargo build to verify there are no new
compatibility issues.
In `@src/telemetry.rs`:
- Around line 45-57: The setup currently exposes /metrics unauthenticated in
setup_metrics_router (Router::new().route("/metrics", get(move ||
std::future::ready(handle.render())))); add an in-app, opt-in defense-in-depth
by reading an env toggle (e.g., METRICS_INTERNAL_ONLY or REQUIRE_METRICS_AUTH)
during setup_metrics_router and, when enabled, wrap the metrics handler with a
simple request validator that checks either the X-Forwarded-For/remote addr or a
shared internal header/token and returns 403 for disallowed requests; implement
the validation as a small async handler (or middleware) that calls
handle.render() only when the check passes so deployments without the Caddy
proxy remain protected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: c5af420d-d4e8-43b4-a133-2dc70d5040ab
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (29)
- .env.example
- .gemini/settings.json
- Caddyfile
- Caddyfile.local
- Cargo.toml
- Dockerfile
- Dockerfile.dev
- Makefile
- README.md
- docker-compose.dev.yml
- docker-compose.yml
- infra/common/cloud-init.template
- infra/common/grafana/provisioning/dashboards/captcha-security.json
- infra/common/grafana/provisioning/dashboards/default.yml
- infra/common/grafana/provisioning/dashboards/ytdlp-health.json
- infra/common/grafana/provisioning/datasources/prometheus.yml
- infra/common/prometheus.yml
- infra/digitalocean/accounts/naduns-team/main.tf
- infra/digitalocean/accounts/naduns-team/variables.tf
- infra/digitalocean/components/locals.tf
- infra/digitalocean/components/variables.tf
- postman/collections/Nadzu API/Metrics.request.yaml
- postman/collections/Nadzu API/YT-DLP Enqueue.request.yaml
- postman/environments/Nadzu Local.environment.yaml
- src/app.rs
- src/lib.rs
- src/middleware/captcha.rs
- src/services/ytdlp/mod.rs
- src/telemetry.rs
```sh
    if curl -sSfL "$1" -o "$2"; then return 0; fi
    sleep $((2**i))
  done
  return 1
```
Add curl timeouts in fetch_url to avoid bootstrap hangs.
Retries are present, but without connect/overall timeouts a single attempt can block for a long time and stall provisioning.
🔧 Suggested fix
```diff
-    if curl -sSfL "$1" -o "$2"; then return 0; fi
+    if curl -sSfL --connect-timeout 10 --max-time 120 "$1" -o "$2"; then return 0; fi
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
    if curl -sSfL --connect-timeout 10 --max-time 120 "$1" -o "$2"; then return 0; fi
    sleep $((2**i))
  done
  return 1
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@infra/common/cloud-init.template` around lines 243 - 246, The fetch_url retry
helper currently calls curl without timeouts which can hang; update the
fetch_url function (the curl invocation that uses flags -sSfL "$1" -o "$2") to
include explicit connect and overall timeouts (e.g. --connect-timeout 10 and
--max-time 30) so each attempt cannot block indefinitely while keeping the
existing retry loop intact.
```rust
counter!("ytdlp_jobs_enqueued_total").increment(1);
gauge!("ytdlp_active_jobs").increment(1.0);
```
Finalize job metrics on all exit paths.
`run_job` returns early on setup/spawn failures, so `ytdlp_active_jobs` is never decremented and failure duration/completion metrics are skipped for those jobs.
Suggested patch
```diff
@@
async fn run_job(
@@
- let start_time = Instant::now();
+ let start_time = Instant::now();
@@
let temp_dir = PathBuf::from(&output_dir).join("tmp").join(&id);
if let Err(err) = fs::create_dir_all(&temp_dir).await {
self.mark_job_failed(&id, format!("failed to create temp directory: {err}"));
+ self.record_job_metrics(start_time, false);
return;
}
@@
let mut child = match cmd.spawn() {
Ok(child) => child,
Err(err) => {
self.mark_job_failed(&id, format!("failed to spawn yt-dlp: {err}"));
error!("failed ytdlp job id={id}: {err}");
let _ = fs::remove_dir_all(&temp_dir).await;
+ self.record_job_metrics(start_time, false);
return;
}
};
@@
- histogram!("ytdlp_job_duration_seconds").record(start_time.elapsed().as_secs_f64());
- if job_success {
- counter!("ytdlp_jobs_completed_total", "status" => "success").increment(1);
- } else {
- counter!("ytdlp_jobs_completed_total", "status" => "error").increment(1);
- }
- gauge!("ytdlp_active_jobs").decrement(1.0);
+ self.record_job_metrics(start_time, job_success);
}
+
+ fn record_job_metrics(&self, start_time: Instant, job_success: bool) {
+ histogram!("ytdlp_job_duration_seconds").record(start_time.elapsed().as_secs_f64());
+ if job_success {
+ counter!("ytdlp_jobs_completed_total", "status" => "success").increment(1);
+ } else {
+ counter!("ytdlp_jobs_completed_total", "status" => "error").increment(1);
+ }
+ gauge!("ytdlp_active_jobs").decrement(1.0);
+  }
```

Also applies to: 228-239, 309-316, 400-407
Description
Introduces a full observability layer using Prometheus and Grafana across local dev and production environments. This PR adds deep instrumentation to both core services (ytdlp) and security middleware (captcha), exposed via a protected `/metrics` endpoint.

Key additions:

- `metrics` and `metrics-exporter-prometheus` crates; extracted telemetry setup into `src/telemetry.rs`.
- yt-dlp job lifecycle metrics (`enqueued`, `active`, `completed`) and processing duration.
- `captcha_check_total` (labels: `status`, `reason`) and `captcha_verify_duration_seconds` to monitor hits, API key bypasses, and validation failures (a sketch of this instrumentation pattern follows this list).
- `Makefile`, `Terraform`, and `Cloud-init` updates to handle S3 uploads and automated fetching of all observability artifacts (configs + dashboards).
- `@internal_only` matcher to protect `/nadun/fs/*`, `/nadun/grafana/*`, and `/metrics` across all environments.
- `docker-compose.dev.yml` updated to use production-style image builds, resolving previous binary staleness and linking issues in Alpine/musl environments.
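The CAPTCHA counters follow the same `metrics` macro pattern shown for the yt-dlp job metrics above. A minimal sketch, assuming a simplified verification function; the label values, control flow, and function shape are illustrative assumptions, not the PR's middleware:

```rust
use metrics::{counter, histogram};
use std::time::Instant;

// Illustrative instrumentation: time the verification call, then count the
// outcome with the status/reason labels described above.
async fn verify_and_record(token: Option<&str>) -> bool {
    let start = Instant::now();
    let (status, reason, ok) = match token {
        None => ("failure", "missing_token", false),
        Some(t) if t.is_empty() => ("failure", "empty_token", false),
        Some(_) => ("success", "verified", true), // real code would call the CAPTCHA API here
    };
    histogram!("captcha_verify_duration_seconds").record(start.elapsed().as_secs_f64());
    counter!("captcha_check_total", "status" => status, "reason" => reason).increment(1);
    ok
}
```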
Checklist
captcha_check_totalmetrics are correctly ingested and labeled in Prometheus./metricsscrape endpoint is reachable by Prometheus and blocked externally via@internal_only.Summary by CodeRabbit
New Features
/metricsendpoint for monitoring system healthInfrastructure & DevOps