
Phase/observability incident reliability#9

Merged
Ryanakml merged 8 commits into main from phase/observability-incident-reliability
Mar 8, 2026

Conversation

Owner

@Ryanakml Ryanakml commented Mar 8, 2026

Summary by CodeRabbit

  • New Features

    • Added centralized telemetry, structured logging, and metrics across services plus a Grafana observability dashboard.
  • Infrastructure

    • Local Prometheus and Grafana services added and metrics exported for local visualization.
    • CI workflow to validate Prometheus config and alert rules.
  • Alerts

    • New alerting rules for webhook failures, latency breaches, DLQ spikes, provider fallbacks, and escalation anomalies.
  • Tests

    • Tests added to verify trace-context propagation and telemetry instruments.
  • Chores

    • Added runtime telemetry/logging dependencies and a metrics simulation script.


coderabbitai bot commented Mar 8, 2026

Warning

Rate limit exceeded

@Ryanakml has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes and 4 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c5023e66-eddd-473a-8e9e-c9a7dc4d3133

📥 Commits

Reviewing files that changed from the base of the PR and between e7064bf and 356cbcc.

📒 Files selected for processing (2)
  • docker-compose.local.yml
  • docker/prometheus.yml
📝 Walkthrough

Walkthrough

Adds centralized telemetry and structured logging (OpenTelemetry + Pino); instruments the API, worker, LLM, tools, and simulator with app-level metrics; provisions Prometheus and Grafana (scrape/export pipeline, dashboards, and alerts); propagates traceContext on ingress jobs; and adds a CI check for Prometheus rules.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Shared telemetry & logging**: `packages/shared/src/telemetry.ts`, `packages/shared/src/telemetry.js`, `packages/shared/src/telemetry.d.ts`, `packages/shared/src/logger.ts`, `packages/shared/src/logger.js`, `packages/shared/src/logger.d.ts` | New OpenTelemetry setup (NodeSDK, OTLP exporters, meter), appMetrics instruments, and a Pino-based structured logger exported from the shared package. |
| **Shared public surface**: `packages/shared/src/index.ts`, `packages/shared/src/index.js`, `packages/shared/src/index.d.ts`, `packages/shared/package.json` | Re-exports telemetry and logger; adds runtime dependencies; extends IngressJobPayload/createIngressJobPayload to optionally include traceContext. |
| **API service**: `apps/api/src/index.ts`, `apps/api/src/index.js`, `apps/api/src/index.test.ts`, `apps/api/src/index.test.js` | Initializes telemetry/logger, adds request-count/latency/error metrics middleware, swaps console logs for the structured logger, and attaches traceContext to enqueued webhook ingress payloads; tests updated to assert traceContext. |
| **Worker service**: `apps/worker/src/index.ts`, `apps/worker/src/index.test.ts`, `apps/worker/src/queue/consumer.ts` | Initializes telemetry/logger, structured logs across the lifecycle, queue-depth tracking (with delta), processing latency, retry/DLQ metrics; tests updated to validate traceContext and permanent-error behavior. |
| **LLM & tools instrumentation**: `packages/llm/src/langchain/chains/router.ts`, `packages/llm/src/langchain/pipeline.ts`, `packages/llm/src/router/model-router.ts`, `packages/llm/src/tools/index.ts` | Instruments agent path counts, parse-failure counts, provider fallback callbacks, and tool execution outcomes (success/timeout/error) using appMetrics. |
| **Prometheus & Grafana provisioning**: `docker/prometheus.yml`, `docker/prometheus/alert.rules.yml`, `docker/grafana/provisioning/datasources/prometheus.yml`, `docker/grafana/provisioning/dashboards/dashboards.yml`, `docker/grafana/dashboards/k2-observability.json` | Adds Prometheus config and alert rules (5 alerts), Grafana datasource and dashboard provisioning, and a new observability dashboard with multiple panels/queries. |
| **Local compose & OTEL config**: `docker-compose.local.yml`, `docker/otel-collector-config.yml`, `docker/grafana/...`, `docker/prometheus/...` | Adds Prometheus and Grafana services to local compose, exposes the OTEL collector's Prometheus exporter endpoint (8889), and mounts Grafana provisioning/dashboards. |
| **CI check for Prometheus rules**: `.github/workflows/check-prometheus-rules.yml` | New GitHub Actions workflow that downloads Prometheus binaries and validates docker/prometheus.yml and docker/prometheus/alert.rules.yml with promtool. |
| **Simulator & metrics script**: `scripts/simulate-alerts.ts`, `docker/grafana/dashboards/*` | New Express-based metrics simulator exposing synthetic Prometheus metrics on :9091 to exercise alerts and dashboards. |
| **Typing & minor changes**: `apps/dashboard/lib/supabase/server.ts`, `apps/dashboard/middleware.ts`, `package.json`, `packages/shared/test-resource.ts`, various tests | Tightened cookie typing, added dev deps (express, prom-client) and telemetry tests, small test helper/resource file. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant API as API Service
    participant Worker as Worker
    participant Shared as Shared (telemetry/logger)
    participant OTEL as OTEL Collector
    participant Prom as Prometheus
    participant Graf as Grafana

    API->>Shared: setupTelemetry('wa-chat-api') & use logger/appMetrics
    API->>Worker: enqueue ingress job (payload + traceContext)
    Worker->>Shared: setupTelemetry('wa-chat-worker') & use logger/appMetrics
    Shared->>OTEL: SDK exports traces/metrics (OTLP HTTP)
    OTEL->>Prom: exposes /metrics for Prometheus (8889)
    Prom->>Graf: Grafana queries Prometheus datasource for dashboards/alerts
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hop on traces and sprinkle logs so bright,

Counters tick and histograms take flight,
Prom scrapes, Grafana lights the rooms,
Jobs carry traceContext in their booms,
Hop-hop — observability blooms at night. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.44%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title 'Phase/observability incident reliability' is vague and doesn't clearly summarize the main changes. While it references observability, it's overly generic and doesn't convey what specific work was done. | Use a more descriptive title that summarizes the main changes, such as 'Add comprehensive observability infrastructure with telemetry, logging, and monitoring dashboards' or similar. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch phase/observability-incident-reliability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/api/src/index.test.ts (1)

336-344: ⚠️ Potential issue | 🟡 Minor

Assert propagation, not just shape.

This still passes if enqueueing generates a second random trace ID, because the only trace assertion is “looks like hex”. Compare the queued value with ingressEvents[0].context.traceId so the test actually verifies end-to-end propagation.

✅ Tighten the assertion
       const enqueuedTraceContext = deps.enqueuedJobs[0]?.traceContext;
       assert.ok(enqueuedTraceContext, 'Enqueued job must contain traceContext');
       assert.equal(enqueuedTraceContext.correlationId, correlationId);
       assert.match(enqueuedTraceContext.traceId, /^[a-f0-9]{32}$/);

       const ingressEvents = getObservabilityEvents(deps.observabilityEvents, 'ingress_start');
       assert.equal(ingressEvents.length, 1);
       assert.equal(ingressEvents[0]?.context.correlationId, correlationId);
       assert.match(ingressEvents[0]?.context.traceId || '', /^[a-f0-9]{32}$/);
+      assert.equal(enqueuedTraceContext.traceId, ingressEvents[0]?.context.traceId);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.test.ts` around lines 336 - 344, The test currently only
asserts traceId shape; instead verify end-to-end propagation by asserting the
queued trace equals the ingress event trace: after computing
enqueuedTraceContext from deps.enqueuedJobs and ingressEvents from
getObservabilityEvents(deps.observabilityEvents, 'ingress_start'), add an
assertion that enqueuedTraceContext.traceId === ingressEvents[0].context.traceId
(and keep the existing correlationId equality checks), so the test uses the
actual values (enqueuedTraceContext, ingressEvents, correlationId) rather than
only pattern-matching the traceId.
🧹 Nitpick comments (6)
packages/shared/src/telemetry.js (2)

48-56: SIGTERM handler calls process.exit(0) in finally block unconditionally.

The finally block on lines 53-55 always calls process?.exit?.(0), even when shutdown fails. Exit with code 0 only when shutdown succeeds, or use distinct exit codes:

♻️ Differentiate exit codes based on shutdown result
     processRef?.on?.('SIGTERM', () => {
         sdk
             .shutdown()
-            .then(() => logger.info('Telemetry shut down successfully'))
-            .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
-            .finally(() => {
-            processRef?.exit?.(0);
-        });
+            .then(() => {
+                logger.info('Telemetry shut down successfully');
+                processRef?.exit?.(0);
+            })
+            .catch((error) => {
+                logger.error({ error }, 'Error shutting down telemetry');
+                processRef?.exit?.(1);
+            });
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/telemetry.js` around lines 48 - 56, The SIGTERM handler
currently calls processRef?.exit?.(0) unconditionally in the finally block after
sdk.shutdown(), which exits with success even on shutdown failure; update the
handler so that sdk.shutdown().then(...) calls processRef?.exit?.(0) on success
and the .catch(...) calls processRef?.exit?.(1) (or omit exiting on failure if
desired), and remove the unconditional exit from the .finally() block so exit
codes reflect shutdown result; refer to processRef, sdk.shutdown(), and logger
in your changes.

13-24: URL construction logic is overly complex and potentially confusing.

The metric exporter URL construction (lines 18-24) checks if the endpoint ends with /v1/traces and replaces it, which is an unusual scenario. While the logic technically works for common cases, it's confusing to read.

Consider simplifying to always construct paths from a base endpoint:

♻️ Simplified URL construction
+    // Normalize base endpoint (strip any path suffix)
+    const baseEndpoint = otlpEndpoint?.replace(/\/v1\/(traces|metrics)$/, '') || '';
+    
     const traceExporter = otlpEndpoint
-        ? new OTLPTraceExporter({
-            url: otlpEndpoint.endsWith('/v1/traces') ? otlpEndpoint : `${otlpEndpoint}/v1/traces`,
-        })
+        ? new OTLPTraceExporter({ url: `${baseEndpoint}/v1/traces` })
         : new OTLPTraceExporter();
     const metricExporter = otlpEndpoint
-        ? new OTLPMetricExporter({
-            url: otlpEndpoint.endsWith('/v1/traces')
-                ? otlpEndpoint.replace('/v1/traces', '/v1/metrics')
-                : `${otlpEndpoint}/v1/metrics`,
-        })
+        ? new OTLPMetricExporter({ url: `${baseEndpoint}/v1/metrics` })
         : new OTLPMetricExporter();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/telemetry.js` around lines 13 - 24, The metric/exporter
URL logic is confusing by special-casing endpoints that end with "/v1/traces";
simplify by normalizing otlpEndpoint into a single base URL (trim any trailing
slash and strip any trailing "/v1/traces" or "/v1/metrics" if present) and then
build trace and metric URLs by appending "/v1/traces" and "/v1/metrics"
respectively before passing to OTLPTraceExporter and OTLPMetricExporter (keep
fallback to new OTLPTraceExporter()/new OTLPMetricExporter() when otlpEndpoint
is falsy). Ensure you update the code around traceExporter, metricExporter, and
the otlpEndpoint normalization so both exporters are constructed from a single,
predictable base.
packages/shared/package.json (1)

19-20: Consider making pino-pretty a devDependency or optional.

pino-pretty is only used when NODE_ENV !== 'production'. In production builds, this dependency adds unnecessary weight. Consider either:

  1. Moving it to devDependencies and using dynamic import() with error handling
  2. Making it an optional peer dependency
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/package.json` around lines 19 - 20, package.json currently
lists "pino-pretty" as a regular dependency which inflates production bundles;
move "pino-pretty" to devDependencies and update runtime initialization code
that configures the logger (where you reference "pino" and configure
pretty-printing) to use a dynamic import() for "pino-pretty" only when NODE_ENV
!== 'production', wrapping the import in try/catch and falling back to plain
pino if the import fails; alternatively document and/or configure "pino-pretty"
as an optional peerDependency so consumers can opt-in—ensure the changes
reference the "pino-pretty" entry in package.json and the logger setup code that
currently imports or requires it.
packages/shared/src/logger.js (1)

1-20: Compiled JS file in src/ directory is unusual.

This appears to be TypeScript compilation output (note the sourceMappingURL comment). Typically, compiled .js files belong in dist/ while only source .ts files reside in src/. Having both in src/ can cause confusion and may lead to the compiled files getting out of sync with the source.

Consider:

  1. Adding *.js and *.js.map to .gitignore for the src/ directory
  2. Ensuring the build process outputs to dist/ (which aligns with package.json's "main": "dist/index.js")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/logger.js` around lines 1 - 20, The compiled JavaScript
file (contains sourceMappingURL) was committed into src/ (logger.js with export
logger and pino/pino-pretty usage); remove the compiled artifact from the source
tree and ensure builds emit to dist/ instead: delete logger.js and logger.js.map
from src/, remove them from git with git rm --cached if already tracked, add a
rule to ignore src/*.js and src/*.js.map in .gitignore, and adjust the
build/config (tsconfig/outDir or build script) so TypeScript/tsc outputs to
dist/ and source .ts logger file remains in src/ (ensure exported symbol logger
remains implemented in src/logger.ts).
packages/llm/src/tools/index.ts (1)

13-37: Route tool failures through the shared logger as well.

This path now emits metrics, but it still falls back to raw console.error. Using the shared logger here keeps tool failures structured and trace-correlated like the rest of the observability work in this PR.

Suggested refactor
-import { appMetrics } from '@wa-chat/shared';
+import { appMetrics, logger } from '@wa-chat/shared';
@@
   } catch (error) {
     const isTimeout =
       error instanceof Error &&
       (error.message.includes('timeout') || error.name === 'TimeoutError');
     appMetrics.toolExecutionCount.add(1, {
       tool: toolName,
       status: isTimeout ? 'timeout' : 'error',
     });
-    console.error(
-      `[Tool Error] ${toolName} failed:`,
-      error instanceof Error ? error.message : error,
-    );
+    logger.error({ err: error, tool: toolName, status: isTimeout ? 'timeout' : 'error' }, 'Tool execution failed');
     return `I'm currently unable to complete this action due to a technical issue. Please try again later or type 'escalate' to speak with a human agent.`;
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/llm/src/tools/index.ts` around lines 13 - 37, The catch block in
withSafeFallback logs errors with console.error instead of the shared structured
logger; update packages/llm/src/tools/index.ts to import the shared logger
(e.g., appLogger) from '@wa-chat/shared' and replace the console.error call
inside withSafeFallback with appLogger.error, passing a descriptive message, the
error object (not just error.message), and context tags (tool: toolName, status:
isTimeout ? 'timeout' : 'error') so failures are structured and trace-correlated
while preserving the existing appMetrics calls.
packages/llm/src/router/model-router.ts (1)

28-36: Extract the fallback metric wrapper once.

The same callback block is duplicated in both router builders. Pulling it into one shared config/helper will keep fallback instrumentation from drifting the next time this metric changes.

♻️ Small cleanup
+const fallbackMetricConfig = {
+  callbacks: [
+    {
+      handleLLMStart: () => {
+        appMetrics.agentProviderFallbackCount.add(1);
+      },
+    },
+  ],
+};
+
-  const fallbackModel = createGeminiAdapter(config?.fallbackConfig).withConfig({
-    callbacks: [
-      {
-        handleLLMStart: () => {
-          appMetrics.agentProviderFallbackCount.add(1);
-        },
-      },
-    ],
-  });
+  const fallbackModel = createGeminiAdapter(config?.fallbackConfig).withConfig(
+    fallbackMetricConfig,
+  );
...
-    .withConfig({
-      callbacks: [
-        {
-          handleLLMStart: () => {
-            appMetrics.agentProviderFallbackCount.add(1);
-          },
-        },
-      ],
-    });
+    .withConfig(fallbackMetricConfig);

Also applies to: 52-62

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/llm/src/router/model-router.ts` around lines 28 - 36, The duplicated
fallback callback that increments appMetrics.agentProviderFallbackCount
(currently embedded in the createGeminiAdapter(...).withConfig({ callbacks: [{
handleLLMStart: () => { appMetrics.agentProviderFallbackCount.add(1); } }] })
call used for the fallbackModel and the other router builder) should be
extracted into a single shared helper/config (e.g., a getFallbackCallbacks or
fallbackInstrumentationConfig) and reused by both router builders; locate the
two createGeminiAdapter(...).withConfig(...) usages (the fallbackModel variable
and the second builder where the same callbacks are repeated) and replace the
inline callbacks with a reference to the shared helper to avoid duplication and
drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/check-prometheus-rules.yml:
- Line 2: Remove the invalid top-level key "description" from the workflow (the
YAML root must only use allowed keys such as name, on, jobs, env, permissions,
defaults, concurrency, run-name); either delete the "description" entry entirely
or convert its text to a YAML comment at the top of the file so the workflow
validates.

In `@apps/api/src/index.js`:
- Around line 280-296: The middleware that builds metric labels (inside the
app.use(...) block that sets startTime and constructs attributes) currently
falls back to req.path when req.route is undefined, creating high-cardinality
time series; change the route label to use a bounded fallback (e.g.,
'unmatched') or a normalized route template instead of req.path by replacing the
expression req.route?.path || req.path with something like req.route?.path ||
'unmatched' (or retrieve and stash a normalized route template) so
appMetrics.apiRequestCount, appMetrics.apiLatency, and appMetrics.apiErrorCount
receive a stable route label.
- Around line 473-476: The traceContext currently uses a synthetic traceId from
buildIngressTraceContext; instead capture the active OpenTelemetry span and
propagate its full span context downstream instead of the synthetic ID. In
buildIngressTraceContext (and where traceContext is assigned to
ingressContext/traceContext) call trace.getSpan(context.active())?.spanContext()
and construct a W3C traceparent (or include the full spanContext object) and set
that value on ingressContext.traceContext (or a traceparent field) so the worker
receives the real span information; ensure
setupTelemetry/getNodeAutoInstrumentations remains enabled and use the
spanContext rather than randomBytes-generated IDs.
- Around line 1-3: The telemetry setup call setupTelemetry('wa-chat-api') must
run before any app module imports so auto-instrumentation can patch
Express/Redis/BullMQ; move this initialization into a tiny bootstrap entry point
(e.g., create a new bootstrap file that imports and calls
setupTelemetry('wa-chat-api') and then dynamically imports the existing
index.js/app), update your start/dev scripts to run the bootstrap (or use tsx
for dev) so the bootstrap executes first, and remove or replace the direct
setupTelemetry call from apps/api/src/index.js to avoid double-initialization.
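The span-context propagation described above reduces to W3C trace-context formatting. A minimal sketch, assuming `@opentelemetry/api` supplies the span context via `trace.getSpan(context.active())?.spanContext()`; the helper name is hypothetical:

```typescript
// Sketch only: in the API this would consume the active span's context, e.g.
//   const spanCtx = trace.getSpan(context.active())?.spanContext();
// Here just the W3C traceparent formatting is shown as a pure helper.
function formatTraceparent(
  traceId: string,
  spanId: string,
  traceFlags: number,
): string {
  // W3C trace-context: version "00", then trace-id, parent-id, trace-flags.
  const flags = traceFlags.toString(16).padStart(2, '0');
  return `00-${traceId}-${spanId}-${flags}`;
}
```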

In `@apps/api/src/index.ts`:
- Around line 1-3: The module-level call to setupTelemetry('wa-chat-api') runs
before environment variables are loaded; move this call out of top-level
execution and into a bootstrap/initialization routine that runs after
dotenv.config() (or call it after the env-loading code), e.g., remove the direct
setupTelemetry(...) invocation from the top of apps/api/src/index.ts and instead
invoke setupTelemetry inside your existing bootstrap function or immediately
after dotenv.config() so telemetry initialization uses the proper env values;
reference the setupTelemetry symbol and the module-level invocation in index.ts
when making the change.
- Around line 402-418: Middleware is currently using req.route?.path or req.path
which either drops mounted prefixes or yields high-cardinality user-controlled
paths; change the label to a low-cardinality route identifier by building it
from the mounting prefix and route pattern (e.g., use req.baseUrl +
req.route?.path when req.route exists) and avoid falling back to raw req.path
for unmatched routes—use a stable sentinel like "unmatched_route" or
"unknown_route" for 404s; update the app.use middleware that records
appMetrics.apiRequestCount, appMetrics.apiLatency, and appMetrics.apiErrorCount
to compute this routeLabel and pass it in attributes instead of req.path.
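The bounded route label suggested above can be sketched as a pure helper; `routeLabel` is an assumed name, and the arguments mirror Express's `req.baseUrl` and `req.route?.path`:

```typescript
// Hypothetical sketch of a low-cardinality metric label.
function routeLabel(baseUrl: string, routePath: string | undefined): string {
  // A matched route yields its template (a bounded set of label values);
  // unmatched requests get a stable sentinel instead of the raw user path.
  return routePath ? `${baseUrl}${routePath}` : 'unmatched_route';
}
```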

In `@apps/worker/src/index.ts`:
- Around line 486-514: The retry/failure/DLQ logs emitted via
logger.warn/logger.error (events 'worker.job.retried', 'worker.job.failed', and
'worker.dlq.inflow.alert') are missing trace correlation and eventKey; update
the payloads to merge the ingress payload's traceContext and eventKey into
baseLogData before logging (i.e., extend baseLogData with trace ids and eventKey
from the incoming traceContext/eventKey fields used earlier in this module),
then pass that enriched object to logger.warn/logger.error for the retry/failure
and the DLQ branch (routeResult.routedToDlq) so those logs include the trace
identifiers and eventKey for incident correlation.
- Around line 1-3: The telemetry SDK is being initialized too early
(setupTelemetry('wa-chat-worker') runs before dotenv.config()) and causes SDK
and SIGTERM handlers to register during module import (affecting tests); move
the setupTelemetry call to after dotenv.config() and ensure initialization
(including any SIGTERM handler registration) only runs on application startup by
either exporting a start function (e.g., startWorker/startApp) that calls
dotenv.config() then setupTelemetry, or by guarding top-level startup with a
runtime check (e.g., require.main === module) so imports (like in index.test.ts)
do not trigger SDK/handler registration. Use the symbols setupTelemetry and any
existing SIGTERM handler registration code to locate and relocate the
initialization logic.

In `@apps/worker/src/queue/consumer.ts`:
- Around line 136-151: The code is incrementing queueDlqCount/queueRetryCount
too early in runIngressJob() before routeFailedIngressJobToDlq() determines
final outcome, and it also misses counting permanent retries when
permanentMaxAttempts > 1; remove the early metric increments from the
failure-handling branch that checks classifiedError.errorClass and
currentAttempt, and instead increment appMetrics.queueDlqCount.add(...) or
appMetrics.queueRetryCount.add(...) inside routeFailedIngressJobToDlq() after
the DLQ enqueue result is known (or in the final retry decision path), ensuring
you use classifiedError.errorClass and currentAttempt vs.
policies.retry.transientMaxAttempts / permanentAttempts to decide DLQ vs retry
and still preserve throwing UnrecoverableError for permanent errors (as in the
UnrecoverableError handling).

In `@docker/prometheus.yml`:
- Around line 1-5: Add an evaluation_interval under the global Prometheus config
to match the fast scrape cadence: in the global block (where scrape_interval: 1s
is set) add evaluation_interval: 1s so alert rules (rule_files
/etc/prometheus/alert.rules.yml) are evaluated every second instead of the 60s
default.
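The resulting global block would look like this (a sketch of the suggested change using the rule_files path cited above, not the repo file verbatim):

```yaml
global:
  scrape_interval: 1s
  evaluation_interval: 1s  # match the scrape cadence so rules fire promptly

rule_files:
  - /etc/prometheus/alert.rules.yml
```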

In `@packages/shared/src/index.ts`:
- Around line 22-25: isIngressJobPayload currently ignores traceContext and
assertIngressJobPayload forwards any truthy object, allowing malformed
traceContext like { traceId: 123 }; update isIngressJobPayload to validate that
if payload.traceContext exists it is an object with both traceId and
correlationId as strings, and then update assertIngressJobPayload to call this
stricter check (rejecting payloads where traceContext is missing keys or types),
ensuring any place that forwards the payload (the code path that passes
traceContext along) only accepts payloads that pass the new validation.
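The stricter validation described above can be sketched as a type guard; the shapes and names below are assumptions based on the PR summary, not the package's real types:

```typescript
// Hypothetical sketch of the stricter traceContext check.
interface TraceContext {
  traceId: string;
  correlationId: string;
}

function isValidTraceContext(value: unknown): value is TraceContext {
  if (typeof value !== 'object' || value === null) return false;
  const ctx = value as Record<string, unknown>;
  return (
    typeof ctx.traceId === 'string' && typeof ctx.correlationId === 'string'
  );
}

// traceContext stays optional, but when present it must be well-formed.
function hasValidOptionalTraceContext(payload: { traceContext?: unknown }): boolean {
  return (
    payload.traceContext === undefined || isValidTraceContext(payload.traceContext)
  );
}
```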

In `@packages/shared/src/telemetry.ts`:
- Around line 60-67: The SIGTERM handler in the shared telemetry helper (the
processRef?.on?('SIGTERM', ...) block that calls sdk.shutdown()) must not call
processRef?.exit?.(0) or otherwise force process exit; remove the forced exit
and avoid owning process lifecycle here. Modify the handler so it only calls
sdk.shutdown() (and logs success or error via logger) but does not exit the
process or swallow failures—allow the API/worker entrypoints to perform their
own shutdown sequencing (e.g., worker.close()/queue.close()) and then call
sdk.shutdown() or exit as appropriate.
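The non-exiting handler described above can be sketched with the SDK and process injected, so the behavior is testable; the interface and function names are assumptions, not the package's real API:

```typescript
// Hypothetical sketch: flush telemetry on SIGTERM without owning the
// process lifecycle (no process.exit here).
interface ShutdownTarget {
  shutdown(): Promise<void>;
}
interface ProcessLike {
  on(event: string, handler: () => void): void;
  exit?(code: number): void;
}

function registerTelemetryShutdown(sdk: ShutdownTarget, proc: ProcessLike): void {
  proc.on('SIGTERM', () => {
    // Only flush; the API/worker entrypoints sequence their own shutdown
    // (queue/worker close, then exit) after telemetry has flushed.
    sdk.shutdown().catch((error) => {
      console.error('Error shutting down telemetry', error);
    });
  });
}
```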

In `@scripts/simulate-alerts.ts`:
- Line 91: Fix the typo in the shutdown log message: replace the misspelled
string "Sutdown requested, stopping metrics server..." used in the console.log
call with "Shutdown requested, stopping metrics server..." so the shutdown
handler / log output (the console.log invocation in scripts/simulate-alerts.ts)
prints the correct word.

---

Outside diff comments:
In `@apps/api/src/index.test.ts`:
- Around line 336-344: The test currently only asserts traceId shape; instead
verify end-to-end propagation by asserting the queued trace equals the ingress
event trace: after computing enqueuedTraceContext from deps.enqueuedJobs and
ingressEvents from getObservabilityEvents(deps.observabilityEvents,
'ingress_start'), add an assertion that enqueuedTraceContext.traceId ===
ingressEvents[0].context.traceId (and keep the existing correlationId equality
checks), so the test uses the actual values (enqueuedTraceContext,
ingressEvents, correlationId) rather than only pattern-matching the traceId.

---

Nitpick comments:
In `@packages/llm/src/router/model-router.ts`:
- Around line 28-36: The duplicated fallback callback that increments
appMetrics.agentProviderFallbackCount (currently embedded in the
createGeminiAdapter(...).withConfig({ callbacks: [{ handleLLMStart: () => {
appMetrics.agentProviderFallbackCount.add(1); } }] }) call used for the
fallbackModel and the other router builder) should be extracted into a single
shared helper/config (e.g., a getFallbackCallbacks or
fallbackInstrumentationConfig) and reused by both router builders; locate the
two createGeminiAdapter(...).withConfig(...) usages (the fallbackModel variable
and the second builder where the same callbacks are repeated) and replace the
inline callbacks with a reference to the shared helper to avoid duplication and
drift.

In `@packages/llm/src/tools/index.ts`:
- Around line 13-37: The catch block in withSafeFallback logs errors with
console.error instead of the shared structured logger; update
packages/llm/src/tools/index.ts to import the shared logger (e.g., appLogger)
from '@wa-chat/shared' and replace the console.error call inside
withSafeFallback with appLogger.error, passing a descriptive message, the error
object (not just error.message), and context tags (tool: toolName, status:
isTimeout ? 'timeout' : 'error') so failures are structured and trace-correlated
while preserving the existing appMetrics calls.

In `@packages/shared/package.json`:
- Around line 19-20: package.json currently lists "pino-pretty" as a regular
dependency which inflates production bundles; move "pino-pretty" to
devDependencies and update runtime initialization code that configures the
logger (where you reference "pino" and configure pretty-printing) to use a
dynamic import() for "pino-pretty" only when NODE_ENV !== 'production', wrapping
the import in try/catch and falling back to plain pino if the import fails;
alternatively document and/or configure "pino-pretty" as an optional
peerDependency so consumers can opt-in—ensure the changes reference the
"pino-pretty" entry in package.json and the logger setup code that currently
imports or requires it.

In `@packages/shared/src/logger.js`:
- Around line 1-20: The compiled JavaScript file (contains sourceMappingURL) was
committed into src/ (logger.js with export logger and pino/pino-pretty usage);
remove the compiled artifact from the source tree and ensure builds emit to
dist/ instead: delete logger.js and logger.js.map from src/, remove them from
git with git rm --cached if already tracked, add a rule to ignore src/*.js and
src/*.js.map in .gitignore, and adjust the build/config (tsconfig/outDir or
build script) so TypeScript/tsc outputs to dist/ and source .ts logger file
remains in src/ (ensure exported symbol logger remains implemented in
src/logger.ts).

In `@packages/shared/src/telemetry.js`:
- Around line 48-56: The SIGTERM handler currently calls processRef?.exit?.(0)
unconditionally in the finally block after sdk.shutdown(), which exits with
success even on shutdown failure; update the handler so that
sdk.shutdown().then(...) calls processRef?.exit?.(0) on success and the
.catch(...) calls processRef?.exit?.(1) (or omit exiting on failure if desired),
and remove the unconditional exit from the .finally() block so exit codes
reflect shutdown result; refer to processRef, sdk.shutdown(), and logger in your
changes.
- Around line 13-24: The metric/exporter URL logic is confusing by
special-casing endpoints that end with "/v1/traces"; simplify by normalizing
otlpEndpoint into a single base URL (trim any trailing slash and strip any
trailing "/v1/traces" or "/v1/metrics" if present) and then build trace and
metric URLs by appending "/v1/traces" and "/v1/metrics" respectively before
passing to OTLPTraceExporter and OTLPMetricExporter (keep fallback to new
OTLPTraceExporter()/new OTLPMetricExporter() when otlpEndpoint is falsy). Ensure
you update the code around traceExporter, metricExporter, and the otlpEndpoint
normalization so both exporters are constructed from a single, predictable base.
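The single-base normalization the comment asks for can be sketched as below; `normalizeOtlpBase` and `otlpUrls` are hypothetical helper names, and the regexes assume only trailing slashes and `/v1/traces` or `/v1/metrics` suffixes need stripping.

```typescript
// Hypothetical sketch: reduce any configured OTLP endpoint to one base URL,
// then derive both signal-specific URLs from it.
function normalizeOtlpBase(endpoint: string): string {
  return endpoint
    .replace(/\/+$/, '') // drop trailing slashes
    .replace(/\/v1\/(traces|metrics)$/, ''); // drop a signal-specific suffix
}

function otlpUrls(endpoint: string): { traces: string; metrics: string } {
  const base = normalizeOtlpBase(endpoint);
  return { traces: `${base}/v1/traces`, metrics: `${base}/v1/metrics` };
}
```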

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1cda0560-5943-4e6d-afe1-b1496200c51f

📥 Commits

Reviewing files that changed from the base of the PR and between 60abff3 and 75ac7f8.

⛔ Files ignored due to path filters (10)
  • apps/api/src/index.d.ts.map is excluded by !**/*.map
  • apps/api/src/index.js.map is excluded by !**/*.map
  • apps/api/src/index.test.js.map is excluded by !**/*.map
  • package-lock.json is excluded by !**/package-lock.json
  • packages/shared/src/index.d.ts.map is excluded by !**/*.map
  • packages/shared/src/index.js.map is excluded by !**/*.map
  • packages/shared/src/logger.d.ts.map is excluded by !**/*.map
  • packages/shared/src/logger.js.map is excluded by !**/*.map
  • packages/shared/src/telemetry.d.ts.map is excluded by !**/*.map
  • packages/shared/src/telemetry.js.map is excluded by !**/*.map
📒 Files selected for processing (35)
  • .github/workflows/check-prometheus-rules.yml
  • apps/api/src/index.js
  • apps/api/src/index.test.js
  • apps/api/src/index.test.ts
  • apps/api/src/index.ts
  • apps/dashboard/lib/supabase/server.ts
  • apps/dashboard/middleware.ts
  • apps/worker/src/index.test.ts
  • apps/worker/src/index.ts
  • apps/worker/src/queue/consumer.ts
  • docker-compose.local.yml
  • docker/grafana/dashboards/k2-observability.json
  • docker/grafana/provisioning/dashboards/dashboards.yml
  • docker/grafana/provisioning/datasources/prometheus.yml
  • docker/otel-collector-config.yml
  • docker/prometheus.yml
  • docker/prometheus/alert.rules.yml
  • package.json
  • packages/llm/src/langchain/chains/router.ts
  • packages/llm/src/langchain/pipeline.ts
  • packages/llm/src/router/model-router.ts
  • packages/llm/src/tools/index.ts
  • packages/shared/package.json
  • packages/shared/src/index.d.ts
  • packages/shared/src/index.js
  • packages/shared/src/index.ts
  • packages/shared/src/logger.d.ts
  • packages/shared/src/logger.js
  • packages/shared/src/logger.ts
  • packages/shared/src/telemetry.d.ts
  • packages/shared/src/telemetry.js
  • packages/shared/src/telemetry.test.ts
  • packages/shared/src/telemetry.ts
  • packages/shared/test-resource.ts
  • scripts/simulate-alerts.ts

@@ -0,0 +1,21 @@
name: K3 Alerting Rules Check
description: Validates Prometheus alert rules

⚠️ Potential issue | 🟡 Minor

Invalid workflow key: description is not allowed at the workflow level.

GitHub Actions workflows do not support a description key at the top level. Valid keys are: name, on, jobs, env, permissions, defaults, concurrency, run-name. This will cause the workflow to fail validation.

🔧 Proposed fix: Remove the invalid key
 name: K3 Alerting Rules Check
-description: Validates Prometheus alert rules

 on: [push, pull_request]

If you want to retain the description, add it as a YAML comment:

+# Validates Prometheus alert rules
 name: K3 Alerting Rules Check

 on: [push, pull_request]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
description: Validates Prometheus alert rules
name: K3 Alerting Rules Check
on: [push, pull_request]
🧰 Tools
🪛 actionlint (1.7.11)

[error] 2-2: unexpected key "description" for "workflow" section. expected one of "concurrency", "defaults", "env", "jobs", "name", "on", "permissions", "run-name"

(syntax-check)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check-prometheus-rules.yml at line 2, Remove the invalid
top-level key "description" from the workflow (the YAML root must only use
allowed keys such as name, on, jobs, env, permissions, defaults, concurrency,
run-name); either delete the "description" entry entirely or convert its text to
a YAML comment at the top of the file so the workflow validates.

Comment on lines +1 to +3
import { setupTelemetry, logger } from '@wa-chat/shared';
// Initialize telemetry before other imports if possible, or right at the top
setupTelemetry('wa-chat-api');

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether the API starts directly from apps/api/src/index.* without a bootstrap wrapper.

echo "=== package scripts ==="
fd '^package\.json$' . -x sh -c 'echo "\n## $1"; jq -r ".scripts // {}" "$1"' sh {}

echo "\n=== setupTelemetry call sites ==="
rg -n "setupTelemetry\(" apps packages

echo "\n=== API entrypoint candidates ==="
fd 'index\.(ts|js)$' apps/api

Repository: Ryanakml/ChatFuse

Length of output: 3386


🏁 Script executed:

#!/bin/bash
# Extract package.json scripts correctly
echo "=== Root package.json scripts ==="
cat package.json | jq -r '.scripts // {}' 2>/dev/null || echo "Failed to parse"

echo "\n=== Apps/api package.json scripts ==="
cat apps/api/package.json | jq -r '.scripts // {}' 2>/dev/null || echo "Failed to parse"

echo "\n=== Full index.js content ==="
wc -l apps/api/src/index.js
head -50 apps/api/src/index.js

echo "\n=== Full index.ts content ==="
wc -l apps/api/src/index.ts
head -50 apps/api/src/index.ts

echo "\n=== Check for bootstrap files ==="
fd 'bootstrap' apps/api/src

echo "\n=== Type field in package.json ==="
grep -A2 '"type"' package.json apps/api/package.json 2>/dev/null

Repository: Ryanakml/ChatFuse

Length of output: 4987


🏁 Script executed:

#!/bin/bash
echo "=== setupTelemetry implementation ==="
cat packages/shared/src/telemetry.ts | head -100

echo "\n=== Check for OpenTelemetry/instrumentation usage ==="
rg -n "instrumentation|NodeSDK" packages/shared/src/telemetry.ts | head -30

echo "\n=== Check for trace context reconstruction ==="
rg -n "traceContext|ingressContext" apps/api/src/index.ts | head -20

echo "\n=== Check worker queue processing ==="
rg -n "traceContext|ingressContext" apps/worker/src/index.ts | head -20

Repository: Ryanakml/ChatFuse

Length of output: 5075


🏁 Script executed:

#!/bin/bash
echo "=== Check for any entry point configuration ==="
cat apps/api/package.json | jq '.' | grep -A20 '"exports"\|"main"'

echo "\n=== Check for Node loader hooks or require hooks ==="
rg -n "register|loader|hook" apps/api/src

echo "\n=== Check if dist/index.js differs from src/index.js ==="
if [ -f dist/index.js ]; then
  echo "dist/index.js exists"
  head -10 dist/index.js
else
  echo "dist/index.js does not exist"
fi

echo "\n=== Check actual worker app startup pattern ==="
cat apps/worker/src/index.ts | head -30

Repository: Ryanakml/ChatFuse

Length of output: 15608


Move telemetry initialization to a separate bootstrap entry point.

Because this is an ESM module, static imports are hoisted and evaluated before the module body executes. This means setupTelemetry('wa-chat-api') runs after Express, Redis, and BullMQ are already loaded, preventing auto-instrumentation from patching them. Create a tiny bootstrap entrypoint that initializes telemetry before importing the app, or use a Node.js loader to prepend initialization.

🧭 Bootstrap pattern
// apps/api/src/bootstrap.ts
import { setupTelemetry } from '@wa-chat/shared';
setupTelemetry('wa-chat-api');
await import('./index.js');

Then update package.json:

{
  "start": "node --input-type=module --eval \"import('./dist/bootstrap.js')\""
}

Or with tsx for development:

{
  "dev": "tsx src/bootstrap.ts"
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.js` around lines 1 - 3, The telemetry setup call
setupTelemetry('wa-chat-api') must run before any app module imports so
auto-instrumentation can patch Express/Redis/BullMQ; move this initialization
into a tiny bootstrap entry point (e.g., create a new bootstrap file that
imports and calls setupTelemetry('wa-chat-api') and then dynamically imports the
existing index.js/app), update your start/dev scripts to run the bootstrap (or
use tsx for dev) so the bootstrap executes first, and remove or replace the
direct setupTelemetry call from apps/api/src/index.js to avoid
double-initialization.

Comment on lines +280 to +296
    app.use((req, res, next) => {
        const startTime = Date.now();
        res.on('finish', () => {
            const durationMs = Date.now() - startTime;
            const attributes = {
                method: req.method,
                route: req.route?.path || req.path,
                status: String(res.statusCode),
            };
            appMetrics.apiRequestCount.add(1, attributes);
            appMetrics.apiLatency.record(durationMs, attributes);
            if (res.statusCode >= 500) {
                appMetrics.apiErrorCount.add(1, attributes);
            }
        });
        next();
    });

⚠️ Potential issue | 🟠 Major

Don't fall back to raw request paths in metric labels.

When req.route is unset, this emits concrete URLs as separate time series. That is exactly the sort of cardinality spike that hurts Prometheus when you need it most. Use a bounded fallback like 'unmatched', or explicitly stash a normalized route template.

📉 Suggested fix
     app.use((req, res, next) => {
         const startTime = Date.now();
         res.on('finish', () => {
             const durationMs = Date.now() - startTime;
+            const route = req.route?.path ?? 'unmatched';
             const attributes = {
                 method: req.method,
-                route: req.route?.path || req.path,
+                route,
                 status: String(res.statusCode),
             };
             appMetrics.apiRequestCount.add(1, attributes);
             appMetrics.apiLatency.record(durationMs, attributes);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.js` around lines 280 - 296, The middleware that builds
metric labels (inside the app.use(...) block that sets startTime and constructs
attributes) currently falls back to req.path when req.route is undefined,
creating high-cardinality time series; change the route label to use a bounded
fallback (e.g., 'unmatched') or a normalized route template instead of req.path
by replacing the expression req.route?.path || req.path with something like
req.route?.path || 'unmatched' (or retrieve and stash a normalized route
template) so appMetrics.apiRequestCount, appMetrics.apiLatency, and
appMetrics.apiErrorCount receive a stable route label.

Comment on lines +473 to +476
    traceContext: {
        traceId: ingressContext.traceId,
        correlationId: ingressContext.correlationId,
    },

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether traceContext is used as actual distributed-tracing state anywhere downstream.

echo "=== trace propagation usage ==="
rg -n "traceContext|traceId|correlationId|traceparent|propagation|context\.active|getSpan|spanContext|startActiveSpan" apps packages

Repository: Ryanakml/ChatFuse

Length of output: 5182


🏁 Script executed:

#!/bin/bash
# Check the context around lines 473-476 in apps/api/src/index.js
echo "=== API code around lines 473-476 ==="
sed -n '420,490p' apps/api/src/index.js

echo ""
echo "=== Check ingressContext creation ==="
sed -n '400,435p' apps/api/src/index.js

echo ""
echo "=== Check imports at top of file ==="
head -40 apps/api/src/index.js

Repository: Ryanakml/ChatFuse

Length of output: 6261


🏁 Script executed:

#!/bin/bash
# Check how worker uses traceContext
echo "=== Worker code using traceContext ==="
sed -n '330,360p' apps/worker/src/index.ts

echo ""
echo "=== Check if OpenTelemetry is used in worker ==="
head -50 apps/worker/src/index.ts | grep -i "otel\|opentelemetry\|trace\|span"

Repository: Ryanakml/ChatFuse

Length of output: 1042


🏁 Script executed:

#!/bin/bash
# Look for how ingressContext.traceId is created - check for randomBytes or createTraceId
echo "=== Search for createTraceId function ==="
rg -n "createTraceId|randomBytes" apps/api/src/index.js -A 3

echo ""
echo "=== Check IngressContext interface ==="
rg -n "IngressContext|ingressContext" apps/api/src/index.js | head -20

Repository: Ryanakml/ChatFuse

Length of output: 1835


🏁 Script executed:

#!/bin/bash
# Check the attachIngressTraceContext middleware
echo "=== Looking for attachIngressTraceContext middleware ==="
rg -n "attachIngressTraceContext|const attachIngressTraceContext|function attachIngressTraceContext" apps/api/src/index.js -A 10

echo ""
echo "=== Check getIngressTraceContext function ==="
rg -n "getIngressTraceContext|const getIngressTraceContext" apps/api/src/index.js -B 2 -A 8

Repository: Ryanakml/ChatFuse

Length of output: 3325


🏁 Script executed:

#!/bin/bash
# Check setupTelemetry in the shared package
echo "=== Check setupTelemetry implementation ==="
rg -n "setupTelemetry|export.*setupTelemetry" packages/shared/src/ -A 5

echo ""
echo "=== Check if there's any OpenTelemetry configuration ==="
rg -n "@opentelemetry|otel|OpenTelemetry" packages/shared/src/ -l

Repository: Ryanakml/ChatFuse

Length of output: 2931


🏁 Script executed:

#!/bin/bash
# Check full setupTelemetry implementation
echo "=== Full setupTelemetry implementation ==="
cat packages/shared/src/telemetry.ts

Repository: Ryanakml/ChatFuse

Length of output: 4277


🏁 Script executed:

#!/bin/bash
# Check buildIngressTraceContext function
echo "=== buildIngressTraceContext function ==="
rg -n "buildIngressTraceContext|const buildIngressTraceContext" apps/api/src/index.js -B 2 -A 15

Repository: Ryanakml/ChatFuse

Length of output: 1995


Propagate the active OpenTelemetry span context, not a synthetic trace ID.

The code generates a new trace ID with randomBytes at line 302 of buildIngressTraceContext, which creates a trace-shaped string unrelated to the actual HTTP request span. Since setupTelemetry enables getNodeAutoInstrumentations(), the webhook POST request already has an active span from automatic HTTP instrumentation. However, this span context is never captured. The worker receives only the synthetic ID and cannot reconstruct or join the distributed trace.

Pass the active span context downstream so the worker can continue the trace. Capture it with trace.getSpan(context.active())?.spanContext() and pass the full span context (not just traceId), ideally as traceparent or equivalent W3C format for proper trace propagation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.js` around lines 473 - 476, The traceContext currently
uses a synthetic traceId from buildIngressTraceContext; instead capture the
active OpenTelemetry span and propagate its full span context downstream instead
of the synthetic ID. In buildIngressTraceContext (and where traceContext is
assigned to ingressContext/traceContext) call
trace.getSpan(context.active())?.spanContext() and construct a W3C traceparent
(or include the full spanContext object) and set that value on
ingressContext.traceContext (or a traceparent field) so the worker receives the
real span information; ensure setupTelemetry/getNodeAutoInstrumentations remains
enabled and use the spanContext rather than randomBytes-generated IDs.
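For reference, the W3C `traceparent` value has a fixed shape; the sketch below serializes a span context into the version-00 format. In real code the `@opentelemetry/api` propagation API would normally handle this, so treat this as an illustration of the format rather than the recommended implementation.

```typescript
// Sketch: serialize an OpenTelemetry-style span context as a W3C traceparent
// header (version 00): 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>.
function toTraceparent(span: {
  traceId: string;
  spanId: string;
  traceFlags: number;
}): string {
  const flags = (span.traceFlags & 0xff).toString(16).padStart(2, '0');
  return `00-${span.traceId}-${span.spanId}-${flags}`;
}
```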

Comment on lines +1 to +3
import { setupTelemetry, logger } from '@wa-chat/shared';
// Initialize telemetry before other imports if possible, or right at the top
setupTelemetry('wa-chat-api');

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the main entry point and understand the project structure
git ls-files apps/api/src/ | head -20

Repository: Ryanakml/ChatFuse

Length of output: 588


🏁 Script executed:

# Look at the specific file mentioned in the review
cat -n apps/api/src/index.ts

Repository: Ryanakml/ChatFuse

Length of output: 27383


🏁 Script executed:

# Search for dotenv.config() calls to understand when env is loaded
rg "dotenv\.config|config()" apps/api/src/ -A 2 -B 2

Repository: Ryanakml/ChatFuse

Length of output: 2432


🏁 Script executed:

# Check if there are other entry points or bootstrap files
fd -e ts -e js apps/api/src/ | grep -E "(main|bootstrap|entry|server)" | head -10

Repository: Ryanakml/ChatFuse

Length of output: 418


🏁 Script executed:

# Search for where apps/api/src/index.ts is imported from
rg "from.*['\"].*apps/api/src['\"]|from.*['\"]\.\./" apps/ --type ts --type js | head -20

Repository: Ryanakml/ChatFuse

Length of output: 987


Move telemetry initialization behind env loading or into a function.

Module-level setupTelemetry() at line 3 executes before dotenv.config() at line 732. This means any code importing apps/api/src/index.ts starts the SDK before environment variables are loaded—creating test pollution and potential misconfiguration in consuming processes. Move this call into a bootstrap function or after dotenv.config().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.ts` around lines 1 - 3, The module-level call to
setupTelemetry('wa-chat-api') runs before environment variables are loaded; move
this call out of top-level execution and into a bootstrap/initialization routine
that runs after dotenv.config() (or call it after the env-loading code), e.g.,
remove the direct setupTelemetry(...) invocation from the top of
apps/api/src/index.ts and instead invoke setupTelemetry inside your existing
bootstrap function or immediately after dotenv.config() so telemetry
initialization uses the proper env values; reference the setupTelemetry symbol
and the module-level invocation in index.ts when making the change.

Comment on lines 136 to +151
  if (classifiedError.errorClass === 'permanent' && currentAttempt >= permanentAttempts) {
    appMetrics.queueDlqCount.add(1, { error_class: 'permanent' });
    if (classifiedError.error instanceof UnrecoverableError) {
      throw classifiedError.error;
    }

    throw new UnrecoverableError(classifiedError.error.message);
  }

  if (classifiedError.errorClass === 'transient') {
    if (currentAttempt >= policies.retry.transientMaxAttempts) {
      appMetrics.queueDlqCount.add(1, { error_class: 'transient' });
    } else {
      appMetrics.queueRetryCount.add(1);
    }
  }

⚠️ Potential issue | 🟠 Major

Emit retry/DLQ metrics only after the final outcome is known.

runIngressJob() increments these counters before routeFailedIngressJobToDlq() knows whether the job will actually retry and before the DLQ enqueue succeeds. That makes queueDlqCount over-report failed DLQ routes, and it also misses permanent retries entirely when permanentMaxAttempts > 1.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/queue/consumer.ts` around lines 136 - 151, The code is
incrementing queueDlqCount/queueRetryCount too early in runIngressJob() before
routeFailedIngressJobToDlq() determines final outcome, and it also misses
counting permanent retries when permanentMaxAttempts > 1; remove the early
metric increments from the failure-handling branch that checks
classifiedError.errorClass and currentAttempt, and instead increment
appMetrics.queueDlqCount.add(...) or appMetrics.queueRetryCount.add(...) inside
routeFailedIngressJobToDlq() after the DLQ enqueue result is known (or in the
final retry decision path), ensuring you use classifiedError.errorClass and
currentAttempt vs. policies.retry.transientMaxAttempts / permanentAttempts to
decide DLQ vs retry and still preserve throwing UnrecoverableError for permanent
errors (as in the UnrecoverableError handling).
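One way to make the timing explicit is to compute the final outcome first and emit metrics only after that outcome has actually happened; `finalFailureOutcome` and its parameters are hypothetical names mirroring the review's references, not the actual consumer.ts API.

```typescript
// Hypothetical sketch: decide retry vs DLQ once, so the matching metric can be
// emitted after the DLQ enqueue (or retry scheduling) succeeds.
type FailureOutcome = 'dlq' | 'retry';

function finalFailureOutcome(
  errorClass: 'permanent' | 'transient',
  currentAttempt: number,
  maxAttempts: { permanent: number; transient: number },
): FailureOutcome {
  const limit = errorClass === 'permanent' ? maxAttempts.permanent : maxAttempts.transient;
  return currentAttempt >= limit ? 'dlq' : 'retry';
}
```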

Comment on lines +1 to +5
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast

rule_files:
  - /etc/prometheus/alert.rules.yml

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Prometheus, if global.evaluation_interval is omitted, what default is used, and does changing scrape_interval also change alert/rule evaluation frequency?

💡 Result:

  • Default when global.evaluation_interval is omitted: 1m (1 minute). (prometheus.io)
  • Does changing scrape_interval change alert/rule evaluation frequency? No. Rule/alert evaluation runs on global.evaluation_interval (or a rule-group’s own interval override). Changing scrape_interval only changes how often targets are scraped, not how often rules are evaluated. (prometheus.io)



Add evaluation_interval to match the fast scrape cadence.

With scrape_interval: 1s but no evaluation_interval configured, Prometheus will scrape metrics every second but evaluate alert rules every 60 seconds (the default). This mismatch means incident alerts can fire much later than expected, defeating the purpose of fast scraping in simulation.

Suggested fix
 global:
   scrape_interval: 1s # Set small scrape interval since we're simulating fast
+  evaluation_interval: 1s
📝 Committable suggestion


Suggested change
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast
rule_files:
  - /etc/prometheus/alert.rules.yml
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast
  evaluation_interval: 1s
rule_files:
  - /etc/prometheus/alert.rules.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/prometheus.yml` around lines 1 - 5, Add an evaluation_interval under
the global Prometheus config to match the fast scrape cadence: in the global
block (where scrape_interval: 1s is set) add evaluation_interval: 1s so alert
rules (rule_files /etc/prometheus/alert.rules.yml) are evaluated every second
instead of the 60s default.

Comment on lines +22 to +25
  traceContext?: {
    traceId: string;
    correlationId: string;
  };

⚠️ Potential issue | 🟠 Major

Validate traceContext at runtime as well.

IngressJobPayload now exposes traceContext as two required strings, but Line 188 forwards any truthy object and isIngressJobPayload() still ignores the field entirely. That means malformed payloads like { traceContext: { traceId: 123 } } can be created and then pass assertIngressJobPayload(), which defeats the correlation guarantees this PR is adding.

Suggested fix
+const isTraceContext = (
+  value: unknown,
+): value is { traceId: string; correlationId: string } =>
+  isRecord(value) &&
+  isNonEmptyString(value.traceId) &&
+  isNonEmptyString(value.correlationId);
+
 export const isIngressJobPayload = (value: unknown): value is IngressJobPayload => {
   if (!isRecord(value)) {
     return false;
   }
@@
   if (typeof value.receivedAt !== 'string' || Number.isNaN(Date.parse(value.receivedAt))) {
     return false;
   }
+
+  if (!(value.traceContext === undefined || isTraceContext(value.traceContext))) {
+    return false;
+  }

   return isJsonValue(value.payload);
 };
@@
 export const createIngressJobPayload = (input: {
   eventKey: string;
   payload: JsonValue;
   receivedAt?: string;
   traceContext?: { traceId: string; correlationId: string };
 }): IngressJobPayload => {
@@
   if (Number.isNaN(Date.parse(receivedAt))) {
     throw new Error('Ingress job receivedAt must be a valid ISO timestamp');
   }
+
+  if (input.traceContext && !isTraceContext(input.traceContext)) {
+    throw new Error(
+      'Ingress job traceContext must contain non-empty traceId and correlationId',
+    );
+  }

   return {
     schemaVersion: INGRESS_JOB_SCHEMA_VERSION,
     eventKey,
     payload: input.payload,

Also applies to: 167-189

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/index.ts` around lines 22 - 25, isIngressJobPayload
currently ignores traceContext and assertIngressJobPayload forwards any truthy
object, allowing malformed traceContext like { traceId: 123 }; update
isIngressJobPayload to validate that if payload.traceContext exists it is an
object with both traceId and correlationId as strings, and then update
assertIngressJobPayload to call this stricter check (rejecting payloads where
traceContext is missing keys or types), ensuring any place that forwards the
payload (the code path that passes traceContext along) only accepts payloads
that pass the new validation.
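A runnable version of that guard might look like the following; it matches the shape of the suggested fix above but is a standalone sketch, so `isTraceContext` here is illustrative rather than the committed helper.

```typescript
// Sketch: runtime type guard for traceContext, rejecting wrong types and
// empty strings so malformed payloads fail validation.
const isNonEmptyString = (value: unknown): value is string =>
  typeof value === 'string' && value.length > 0;

function isTraceContext(
  value: unknown,
): value is { traceId: string; correlationId: string } {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  return isNonEmptyString(record.traceId) && isNonEmptyString(record.correlationId);
}
```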

Comment on lines +60 to +67
  processRef?.on?.('SIGTERM', () => {
    sdk
      .shutdown()
      .then(() => logger.info('Telemetry shut down successfully'))
      .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
      .finally(() => {
        processRef?.exit?.(0);
      });

⚠️ Potential issue | 🟠 Major

Don't own process exit from the shared telemetry helper.

This SIGTERM handler always calls process.exit(0), even when sdk.shutdown() fails. In the worker, that races the existing shutdown hook and can cut off worker.close()/queue.close() while still reporting a clean exit.

✂️ Suggested direction
-  processRef?.on?.('SIGTERM', () => {
-    sdk
-      .shutdown()
-      .then(() => logger.info('Telemetry shut down successfully'))
-      .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
-      .finally(() => {
-        processRef?.exit?.(0);
-      });
-  });
-
   return sdk;

Let the API/worker entrypoints call sdk.shutdown() from their own signal handlers after their service-specific cleanup completes.

📝 Committable suggestion


Suggested change
  processRef?.on?.('SIGTERM', () => {
    sdk
      .shutdown()
      .then(() => logger.info('Telemetry shut down successfully'))
      .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
      .finally(() => {
        processRef?.exit?.(0);
      });
  return sdk;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/telemetry.ts` around lines 60 - 67, The SIGTERM handler
in the shared telemetry helper (the processRef?.on?('SIGTERM', ...) block that
calls sdk.shutdown()) must not call processRef?.exit?.(0) or otherwise force
process exit; remove the forced exit and avoid owning process lifecycle here.
Modify the handler so it only calls sdk.shutdown() (and logs success or error
via logger) but does not exit the process or swallow failures—allow the
API/worker entrypoints to perform their own shutdown sequencing (e.g.,
worker.close()/queue.close()) and then call sdk.shutdown() or exit as
appropriate.
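The entrypoint-owned ordering can be sketched as a simple sequence runner: service-specific cleanup first, telemetry flush last, exit code reflecting any failure. `shutdownSequence` is a hypothetical helper for illustration, not code from this repo.

```typescript
// Sketch: run shutdown steps in order (e.g. worker.close, queue.close,
// sdk.shutdown) and report a non-zero exit code if any step fails.
async function shutdownSequence(
  steps: Array<[string, () => Promise<void>]>,
): Promise<{ order: string[]; exitCode: number }> {
  const order: string[] = [];
  let exitCode = 0;
  for (const [name, step] of steps) {
    try {
      await step();
      order.push(name);
    } catch {
      exitCode = 1; // remember the failure but keep shutting down
      order.push(`${name}:failed`);
    }
  }
  return { order, exitCode };
}
```

An entrypoint would then call `process.exit(result.exitCode)` itself, keeping lifecycle ownership out of the shared helper.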

simulate();

process.on('SIGINT', () => {
  console.log('\nSutdown requested, stopping metrics server...');

⚠️ Potential issue | 🟡 Minor

Typo: "Sutdown" should be "Shutdown".

🔧 Fix typo
 process.on('SIGINT', () => {
-  console.log('\nSutdown requested, stopping metrics server...');
+  console.log('\nShutdown requested, stopping metrics server...');
   server.close(() => process.exit(0));
 });
📝 Committable suggestion


Suggested change
console.log('\nSutdown requested, stopping metrics server...');
process.on('SIGINT', () => {
  console.log('\nShutdown requested, stopping metrics server...');
  server.close(() => process.exit(0));
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/simulate-alerts.ts` at line 91, Fix the typo in the shutdown log
message: replace the misspelled string "Sutdown requested, stopping metrics
server..." used in the console.log call with "Shutdown requested, stopping
metrics server..." so the shutdown handler / log output (the console.log
invocation in scripts/simulate-alerts.ts) prints the correct word.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/check-prometheus-rules.yml (1)

2-2: ⚠️ Potential issue | 🔴 Critical

Remove the unsupported workflow-level description key.

Line 2 is still invalid GitHub Actions syntax, so the workflow will fail validation before this job can run. This was already flagged on an earlier revision and still applies here.

🔧 Proposed fix
 name: K3 Alerting Rules Check
-description: Validates Prometheus alert rules
+# Validates Prometheus alert rules
 
 on: [push, pull_request]
#!/bin/bash
python - <<'PY'
import sys
from pathlib import Path

try:
    import yaml
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "PyYAML"])
    import yaml

path = Path(".github/workflows/check-prometheus-rules.yml")
data = yaml.safe_load(path.read_text())
allowed = {"name", "on", "jobs", "env", "permissions", "defaults", "concurrency", "run-name"}

keys = list(data.keys())
invalid = [k for k in keys if k not in allowed]

print("top-level keys:", keys)
print("invalid keys:", invalid)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check-prometheus-rules.yml at line 2, Remove the
unsupported top-level workflow key "description" from the GitHub Actions YAML
(the workflow-level "description" present in check-prometheus-rules.yml); locate
the "description:" entry at the top of the file and delete it (or move its text
into the "name:" field or into a job-level step/comment) so the workflow only
contains valid top-level keys like name, on, jobs, env, permissions, defaults,
concurrency, and run-name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/check-prometheus-rules.yml:
- Around line 14-16: Download the published sha256sums file for Prometheus
alongside the prometheus-2.51.1.linux-amd64.tar.gz and verify the archive before
extracting and moving promtool; specifically, fetch sha256sums.txt from the same
release, run a checksum verification (e.g., use sha256sum -c or grep the
prometheus-2.51.1.linux-amd64.tar.gz line and pipe to sha256sum -c) and abort
the job if verification fails, then only tar xvf
prometheus-2.51.1.linux-amd64.tar.gz and sudo mv
prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/ when the checksum check
succeeds.

---

Duplicate comments:
In @.github/workflows/check-prometheus-rules.yml:
- Line 2: Remove the unsupported top-level workflow key "description" from the
GitHub Actions YAML (the workflow-level "description" present in
check-prometheus-rules.yml); locate the "description:" entry at the top of the
file and delete it (or move its text into the "name:" field or into a job-level
step/comment) so the workflow only contains valid top-level keys like name, on,
jobs, env, permissions, defaults, concurrency, and run-name.
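The checksum-gating pattern the review asks for can be sketched as a standalone shell step. This sketch exercises the `grep`-then-`sha256sum -c` logic against a locally generated stand-in file rather than the real Prometheus release; the archive name is reused from the comment purely for illustration:

```shell
#!/bin/sh
# Sketch of the verify-before-extract pattern from the review comment,
# run against a locally generated file instead of a network download.
set -eu
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the downloaded release archive (name from the review comment).
ARCHIVE="prometheus-2.51.1.linux-amd64.tar.gz"
printf 'fake archive contents\n' > "$ARCHIVE"

# Stand-in for the published sums file: "<sha256>  <filename>" per line,
# the same format sha256sums.txt uses on the GitHub release.
sha256sum "$ARCHIVE" > sha256sums.txt

# The verification step: select this archive's line and feed it to
# `sha256sum -c -`, which re-hashes the file and compares. With `set -e`,
# a mismatch aborts the script before any extraction would run.
grep " ${ARCHIVE}\$" sha256sums.txt | sha256sum -c -
echo "checksum ok"
```

In the real workflow the stand-in file would be replaced by the two `wget` downloads, with `set -euo pipefail` ensuring a failed check aborts the job before `tar xvf` and the `sudo mv` to PATH ever execute.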

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a3d953e7-b361-48d1-8674-e952374eacf5

📥 Commits

Reviewing files that changed from the base of the PR and between 75ac7f8 and 8ece4d3.

📒 Files selected for processing (1)
  • .github/workflows/check-prometheus-rules.yml

Comment on lines +14 to +16
wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz
tar xvf prometheus-2.51.1.linux-amd64.tar.gz
sudo mv prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import json
import urllib.request

url = "https://api.github.com/repos/prometheus/prometheus/releases/tags/v2.51.1"
with urllib.request.urlopen(url) as r:
    data = json.load(r)

assets = sorted(asset["name"] for asset in data.get("assets", []))
print("\n".join(assets))
print("\nsha256sums.txt present:", "sha256sums.txt" in assets)
PY

Repository: Ryanakml/ChatFuse

Length of output: 2642


🏁 Script executed:

# Check if the workflow file exists and examine current state
git ls-files .github/workflows/ | head -20
cat .github/workflows/check-prometheus-rules.yml 2>/dev/null || echo "File not found"

Repository: Ryanakml/ChatFuse

Length of output: 858


🌐 Web query:

Prometheus v2.51.1 release assets sha256sums GitHub

💡 Result:

For the Prometheus v2.51.1 GitHub release (dated 2024-03-27), the SHA256 checksums are published as release assets alongside the binaries. [1]

Release assets (direct download paths on GitHub Releases):

https://github.com/prometheus/prometheus/releases/download/v2.51.1/sha256sums.txt
https://github.com/prometheus/prometheus/releases/download/v2.51.1/sha256sums.txt.asc

The binaries themselves are under the same releases/download/v2.51.1/ path (e.g., prometheus-2.51.1.linux-amd64.tar.gz). [1][2]

Sources: [1] [2]


Add integrity verification for Prometheus archive before installation.

The workflow downloads an executable from the network and moves it to PATH without any integrity check. The Prometheus v2.51.1 release includes sha256sums.txt as a published asset—use it to verify the archive before extraction.

🔒 Proposed hardening
       - name: Install Promtool
         run: |
-          wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz
-          tar xvf prometheus-2.51.1.linux-amd64.tar.gz
-          sudo mv prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/
+          set -euo pipefail
+          VERSION=2.51.1
+          ARCHIVE="prometheus-${VERSION}.linux-amd64.tar.gz"
+          DIR="prometheus-${VERSION}.linux-amd64"
+          wget "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/${ARCHIVE}"
+          wget "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/sha256sums.txt"
+          grep " ${ARCHIVE}\$" sha256sums.txt | sha256sum -c -
+          tar xvf "${ARCHIVE}"
+          sudo mv "${DIR}/promtool" /usr/local/bin/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check-prometheus-rules.yml around lines 14 - 16, Download
the published sha256sums file for Prometheus alongside the
prometheus-2.51.1.linux-amd64.tar.gz and verify the archive before extracting
and moving promtool; specifically, fetch sha256sums.txt from the same release,
run a checksum verification (e.g., use sha256sum -c or grep the
prometheus-2.51.1.linux-amd64.tar.gz line and pipe to sha256sum -c) and abort
the job if verification fails, then only tar xvf
prometheus-2.51.1.linux-amd64.tar.gz and sudo mv
prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/ when the checksum check
succeeds.

@Ryanakml Ryanakml merged commit 7ccb9a1 into main Mar 8, 2026
4 checks passed