
Phase/observability incident reliability#9

Merged
Ryanakml merged 8 commits into main from phase/observability-incident-reliability
Mar 8, 2026

Conversation

Owner

@Ryanakml Ryanakml commented Mar 8, 2026

Summary by CodeRabbit

  • New Features

    • Added centralized telemetry, structured logging, and metrics across services plus a Grafana observability dashboard.
  • Infrastructure

    • Local Prometheus and Grafana services added and metrics exported for local visualization.
    • CI workflow to validate Prometheus config and alert rules.
  • Alerts

    • New alerting rules for webhook failures, latency breaches, DLQ spikes, provider fallbacks, and escalation anomalies.
  • Tests

    • Tests added to verify trace-context propagation and telemetry instruments.
  • Chores

    • Added runtime telemetry/logging dependencies and a metrics simulation script.


coderabbitai bot commented Mar 8, 2026

Warning

Rate limit exceeded

@Ryanakml has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes and 4 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c5023e66-eddd-473a-8e9e-c9a7dc4d3133

📥 Commits

Reviewing files that changed from the base of the PR and between e7064bf and 356cbcc.

📒 Files selected for processing (2)
  • docker-compose.local.yml
  • docker/prometheus.yml
📝 Walkthrough

Walkthrough

Adds centralized telemetry and structured logging (OpenTelemetry + Pino); instruments the API, worker, LLM, tools, and simulator with app-level metrics; provisions Prometheus and Grafana (scrape/export pipeline, dashboards, and alerts); propagates traceContext on ingress jobs; and adds a CI check for Prometheus rules.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Shared telemetry & logging**: `packages/shared/src/telemetry.ts`, `packages/shared/src/telemetry.js`, `packages/shared/src/telemetry.d.ts`, `packages/shared/src/logger.ts`, `packages/shared/src/logger.js`, `packages/shared/src/logger.d.ts` | New OpenTelemetry setup (NodeSDK, OTLP exporters, meter), appMetrics instruments, and a Pino-based structured logger exported from the shared package. |
| **Shared public surface**: `packages/shared/src/index.ts`, `packages/shared/src/index.js`, `packages/shared/src/index.d.ts`, `packages/shared/package.json` | Re-exports telemetry and logger; adds runtime dependencies; extends IngressJobPayload/createIngressJobPayload to optionally include traceContext. |
| **API service**: `apps/api/src/index.ts`, `apps/api/src/index.js`, `apps/api/src/index.test.ts`, `apps/api/src/index.test.js` | Initializes telemetry/logger, adds request-count/latency/error metrics middleware, swaps console logs for the structured logger, and attaches traceContext to enqueued webhook ingress payloads; tests updated to assert traceContext. |
| **Worker service**: `apps/worker/src/index.ts`, `apps/worker/src/index.test.ts`, `apps/worker/src/queue/consumer.ts` | Initializes telemetry/logger, structured logs across the lifecycle, queue-depth tracking (with delta), processing latency, retry/DLQ metrics; tests updated to validate traceContext and permanent-error behavior. |
| **LLM & tools instrumentation**: `packages/llm/src/langchain/chains/router.ts`, `packages/llm/src/langchain/pipeline.ts`, `packages/llm/src/router/model-router.ts`, `packages/llm/src/tools/index.ts` | Instruments agent path counts, parse-failure counts, provider fallback callbacks, and tool execution outcomes (success/timeout/error) using appMetrics. |
| **Prometheus & Grafana provisioning**: `docker/prometheus.yml`, `docker/prometheus/alert.rules.yml`, `docker/grafana/provisioning/datasources/prometheus.yml`, `docker/grafana/provisioning/dashboards/dashboards.yml`, `docker/grafana/dashboards/k2-observability.json` | Adds Prometheus config and alert rules (5 alerts), Grafana datasource and dashboard provisioning, and a new observability dashboard with multiple panels/queries. |
| **Local compose & OTEL config**: `docker-compose.local.yml`, `docker/otel-collector-config.yml`, `docker/grafana/...`, `docker/prometheus/...` | Adds Prometheus and Grafana services to local compose, exposes the OTEL collector's Prometheus exporter endpoint (8889), and mounts Grafana provisioning/dashboards. |
| **CI check for Prometheus rules**: `.github/workflows/check-prometheus-rules.yml` | New GitHub Actions workflow that downloads Prometheus binaries and validates docker/prometheus.yml and docker/prometheus/alert.rules.yml with promtool. |
| **Simulator & metrics script**: `scripts/simulate-alerts.ts`, `docker/grafana/dashboards/*` | New Express-based metrics simulator exposing synthetic Prometheus metrics on :9091 to exercise alerts and dashboards. |
| **Typing & minor changes**: `apps/dashboard/lib/supabase/server.ts`, `apps/dashboard/middleware.ts`, `package.json`, `packages/shared/test-resource.ts`, various tests | Tightened cookie typing, added dev deps (express, prom-client) and telemetry tests, small test helper/resource file. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant API as API Service
    participant Worker as Worker
    participant Shared as Shared (telemetry/logger)
    participant OTEL as OTEL Collector
    participant Prom as Prometheus
    participant Graf as Grafana

    API->>Shared: setupTelemetry('wa-chat-api') & use logger/appMetrics
    API->>Worker: enqueue ingress job (payload + traceContext)
    Worker->>Shared: setupTelemetry('wa-chat-worker') & use logger/appMetrics
    Shared->>OTEL: SDK exports traces/metrics (OTLP HTTP)
    OTEL->>Prom: exposes /metrics for Prometheus (8889)
    Prom->>Graf: Grafana queries Prometheus datasource for dashboards/alerts
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hop on traces and sprinkle logs so bright,

Counters tick and histograms take flight,
Prom scrapes, Grafana lights the rooms,
Jobs carry traceContext in their booms,
Hop-hop — observability blooms at night. 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.44%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title 'Phase/observability incident reliability' is vague and doesn't clearly summarize the main changes. While it references observability, it's overly generic and doesn't convey what specific work was done. | Use a more descriptive title that summarizes the main changes, such as 'Add comprehensive observability infrastructure with telemetry, logging, and monitoring dashboards' or similar. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch phase/observability-incident-reliability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/api/src/index.test.ts (1)

336-344: ⚠️ Potential issue | 🟡 Minor

Assert propagation, not just shape.

This still passes if enqueueing generates a second random trace ID, because the only trace assertion is “looks like hex”. Compare the queued value with ingressEvents[0].context.traceId so the test actually verifies end-to-end propagation.

✅ Tighten the assertion
       const enqueuedTraceContext = deps.enqueuedJobs[0]?.traceContext;
       assert.ok(enqueuedTraceContext, 'Enqueued job must contain traceContext');
       assert.equal(enqueuedTraceContext.correlationId, correlationId);
       assert.match(enqueuedTraceContext.traceId, /^[a-f0-9]{32}$/);

       const ingressEvents = getObservabilityEvents(deps.observabilityEvents, 'ingress_start');
       assert.equal(ingressEvents.length, 1);
       assert.equal(ingressEvents[0]?.context.correlationId, correlationId);
       assert.match(ingressEvents[0]?.context.traceId || '', /^[a-f0-9]{32}$/);
+      assert.equal(enqueuedTraceContext.traceId, ingressEvents[0]?.context.traceId);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.test.ts` around lines 336 - 344, The test currently only
asserts traceId shape; instead verify end-to-end propagation by asserting the
queued trace equals the ingress event trace: after computing
enqueuedTraceContext from deps.enqueuedJobs and ingressEvents from
getObservabilityEvents(deps.observabilityEvents, 'ingress_start'), add an
assertion that enqueuedTraceContext.traceId === ingressEvents[0].context.traceId
(and keep the existing correlationId equality checks), so the test uses the
actual values (enqueuedTraceContext, ingressEvents, correlationId) rather than
only pattern-matching the traceId.
🧹 Nitpick comments (6)
packages/shared/src/telemetry.js (2)

48-56: SIGTERM handler calls process.exit(0) in finally block unconditionally.

The finally block on lines 53-55 always calls process?.exit?.(0), even when shutdown fails. Exit with code 0 only when shutdown succeeds, or use distinct exit codes:

♻️ Differentiate exit codes based on shutdown result
     processRef?.on?.('SIGTERM', () => {
         sdk
             .shutdown()
-            .then(() => logger.info('Telemetry shut down successfully'))
-            .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
-            .finally(() => {
-            processRef?.exit?.(0);
-        });
+            .then(() => {
+                logger.info('Telemetry shut down successfully');
+                processRef?.exit?.(0);
+            })
+            .catch((error) => {
+                logger.error({ error }, 'Error shutting down telemetry');
+                processRef?.exit?.(1);
+            });
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/telemetry.js` around lines 48 - 56, The SIGTERM handler
currently calls processRef?.exit?.(0) unconditionally in the finally block after
sdk.shutdown(), which exits with success even on shutdown failure; update the
handler so that sdk.shutdown().then(...) calls processRef?.exit?.(0) on success
and the .catch(...) calls processRef?.exit?.(1) (or omit exiting on failure if
desired), and remove the unconditional exit from the .finally() block so exit
codes reflect shutdown result; refer to processRef, sdk.shutdown(), and logger
in your changes.

13-24: URL construction logic is overly complex and potentially confusing.

The metric exporter URL construction (lines 18-24) checks if the endpoint ends with /v1/traces and replaces it, which is an unusual scenario. While the logic technically works for common cases, it's confusing to read.

Consider simplifying to always construct paths from a base endpoint:

♻️ Simplified URL construction
+    // Normalize base endpoint (strip any path suffix)
+    const baseEndpoint = otlpEndpoint?.replace(/\/v1\/(traces|metrics)$/, '') || '';
+    
     const traceExporter = otlpEndpoint
-        ? new OTLPTraceExporter({
-            url: otlpEndpoint.endsWith('/v1/traces') ? otlpEndpoint : `${otlpEndpoint}/v1/traces`,
-        })
+        ? new OTLPTraceExporter({ url: `${baseEndpoint}/v1/traces` })
         : new OTLPTraceExporter();
     const metricExporter = otlpEndpoint
-        ? new OTLPMetricExporter({
-            url: otlpEndpoint.endsWith('/v1/traces')
-                ? otlpEndpoint.replace('/v1/traces', '/v1/metrics')
-                : `${otlpEndpoint}/v1/metrics`,
-        })
+        ? new OTLPMetricExporter({ url: `${baseEndpoint}/v1/metrics` })
         : new OTLPMetricExporter();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/telemetry.js` around lines 13 - 24, The metric/exporter
URL logic is confusing by special-casing endpoints that end with "/v1/traces";
simplify by normalizing otlpEndpoint into a single base URL (trim any trailing
slash and strip any trailing "/v1/traces" or "/v1/metrics" if present) and then
build trace and metric URLs by appending "/v1/traces" and "/v1/metrics"
respectively before passing to OTLPTraceExporter and OTLPMetricExporter (keep
fallback to new OTLPTraceExporter()/new OTLPMetricExporter() when otlpEndpoint
is falsy). Ensure you update the code around traceExporter, metricExporter, and
the otlpEndpoint normalization so both exporters are constructed from a single,
predictable base.
packages/shared/package.json (1)

19-20: Consider making pino-pretty a devDependency or optional.

pino-pretty is only used when NODE_ENV !== 'production'. In production builds, this dependency adds unnecessary weight. Consider either:

  1. Moving it to devDependencies and using dynamic import() with error handling
  2. Making it an optional peer dependency
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/package.json` around lines 19 - 20, package.json currently
lists "pino-pretty" as a regular dependency which inflates production bundles;
move "pino-pretty" to devDependencies and update runtime initialization code
that configures the logger (where you reference "pino" and configure
pretty-printing) to use a dynamic import() for "pino-pretty" only when NODE_ENV
!== 'production', wrapping the import in try/catch and falling back to plain
pino if the import fails; alternatively document and/or configure "pino-pretty"
as an optional peerDependency so consumers can opt-in—ensure the changes
reference the "pino-pretty" entry in package.json and the logger setup code that
currently imports or requires it.
packages/shared/src/logger.js (1)

1-20: Compiled JS file in src/ directory is unusual.

This appears to be TypeScript compilation output (note the sourceMappingURL comment). Typically, compiled .js files belong in dist/ while only source .ts files reside in src/. Having both in src/ can cause confusion and may lead to the compiled files getting out of sync with the source.

Consider:

  1. Adding *.js and *.js.map to .gitignore for the src/ directory
  2. Ensuring the build process outputs to dist/ (which aligns with package.json's "main": "dist/index.js")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/logger.js` around lines 1 - 20, The compiled JavaScript
file (contains sourceMappingURL) was committed into src/ (logger.js with export
logger and pino/pino-pretty usage); remove the compiled artifact from the source
tree and ensure builds emit to dist/ instead: delete logger.js and logger.js.map
from src/, remove them from git with git rm --cached if already tracked, add a
rule to ignore src/*.js and src/*.js.map in .gitignore, and adjust the
build/config (tsconfig/outDir or build script) so TypeScript/tsc outputs to
dist/ and source .ts logger file remains in src/ (ensure exported symbol logger
remains implemented in src/logger.ts).
packages/llm/src/tools/index.ts (1)

13-37: Route tool failures through the shared logger as well.

This path now emits metrics, but it still falls back to raw console.error. Using the shared logger here keeps tool failures structured and trace-correlated like the rest of the observability work in this PR.

Suggested refactor
-import { appMetrics } from '@wa-chat/shared';
+import { appMetrics, logger } from '@wa-chat/shared';
@@
   } catch (error) {
     const isTimeout =
       error instanceof Error &&
       (error.message.includes('timeout') || error.name === 'TimeoutError');
     appMetrics.toolExecutionCount.add(1, {
       tool: toolName,
       status: isTimeout ? 'timeout' : 'error',
     });
-    console.error(
-      `[Tool Error] ${toolName} failed:`,
-      error instanceof Error ? error.message : error,
-    );
+    logger.error({ err: error, tool: toolName, status: isTimeout ? 'timeout' : 'error' }, 'Tool execution failed');
     return `I'm currently unable to complete this action due to a technical issue. Please try again later or type 'escalate' to speak with a human agent.`;
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/llm/src/tools/index.ts` around lines 13 - 37, The catch block in
withSafeFallback logs errors with console.error instead of the shared structured
logger; update packages/llm/src/tools/index.ts to import the shared logger
(e.g., appLogger) from '@wa-chat/shared' and replace the console.error call
inside withSafeFallback with appLogger.error, passing a descriptive message, the
error object (not just error.message), and context tags (tool: toolName, status:
isTimeout ? 'timeout' : 'error') so failures are structured and trace-correlated
while preserving the existing appMetrics calls.
packages/llm/src/router/model-router.ts (1)

28-36: Extract the fallback metric wrapper once.

The same callback block is duplicated in both router builders. Pulling it into one shared config/helper will keep fallback instrumentation from drifting the next time this metric changes.

♻️ Small cleanup
+const fallbackMetricConfig = {
+  callbacks: [
+    {
+      handleLLMStart: () => {
+        appMetrics.agentProviderFallbackCount.add(1);
+      },
+    },
+  ],
+};
+
-  const fallbackModel = createGeminiAdapter(config?.fallbackConfig).withConfig({
-    callbacks: [
-      {
-        handleLLMStart: () => {
-          appMetrics.agentProviderFallbackCount.add(1);
-        },
-      },
-    ],
-  });
+  const fallbackModel = createGeminiAdapter(config?.fallbackConfig).withConfig(
+    fallbackMetricConfig,
+  );
...
-    .withConfig({
-      callbacks: [
-        {
-          handleLLMStart: () => {
-            appMetrics.agentProviderFallbackCount.add(1);
-          },
-        },
-      ],
-    });
+    .withConfig(fallbackMetricConfig);

Also applies to: 52-62

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/llm/src/router/model-router.ts` around lines 28 - 36, The duplicated
fallback callback that increments appMetrics.agentProviderFallbackCount
(currently embedded in the createGeminiAdapter(...).withConfig({ callbacks: [{
handleLLMStart: () => { appMetrics.agentProviderFallbackCount.add(1); } }] })
call used for the fallbackModel and the other router builder) should be
extracted into a single shared helper/config (e.g., a getFallbackCallbacks or
fallbackInstrumentationConfig) and reused by both router builders; locate the
two createGeminiAdapter(...).withConfig(...) usages (the fallbackModel variable
and the second builder where the same callbacks are repeated) and replace the
inline callbacks with a reference to the shared helper to avoid duplication and
drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/check-prometheus-rules.yml:
- Line 2: Remove the invalid top-level key "description" from the workflow (the
YAML root must only use allowed keys such as name, on, jobs, env, permissions,
defaults, concurrency, run-name); either delete the "description" entry entirely
or convert its text to a YAML comment at the top of the file so the workflow
validates.

In `@apps/api/src/index.js`:
- Around line 280-296: The middleware that builds metric labels (inside the
app.use(...) block that sets startTime and constructs attributes) currently
falls back to req.path when req.route is undefined, creating high-cardinality
time series; change the route label to use a bounded fallback (e.g.,
'unmatched') or a normalized route template instead of req.path by replacing the
expression req.route?.path || req.path with something like req.route?.path ||
'unmatched' (or retrieve and stash a normalized route template) so
appMetrics.apiRequestCount, appMetrics.apiLatency, and appMetrics.apiErrorCount
receive a stable route label.
- Around line 473-476: The traceContext currently uses a synthetic traceId from
buildIngressTraceContext; instead capture the active OpenTelemetry span and
propagate its full span context downstream instead of the synthetic ID. In
buildIngressTraceContext (and where traceContext is assigned to
ingressContext/traceContext) call trace.getSpan(context.active())?.spanContext()
and construct a W3C traceparent (or include the full spanContext object) and set
that value on ingressContext.traceContext (or a traceparent field) so the worker
receives the real span information; ensure
setupTelemetry/getNodeAutoInstrumentations remains enabled and use the
spanContext rather than randomBytes-generated IDs.
- Around line 1-3: The telemetry setup call setupTelemetry('wa-chat-api') must
run before any app module imports so auto-instrumentation can patch
Express/Redis/BullMQ; move this initialization into a tiny bootstrap entry point
(e.g., create a new bootstrap file that imports and calls
setupTelemetry('wa-chat-api') and then dynamically imports the existing
index.js/app), update your start/dev scripts to run the bootstrap (or use tsx
for dev) so the bootstrap executes first, and remove or replace the direct
setupTelemetry call from apps/api/src/index.js to avoid double-initialization.
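The span-context propagation described above reduces to W3C trace-context formatting. A minimal sketch, assuming `@opentelemetry/api` supplies the span context via `trace.getSpan(context.active())?.spanContext()`; the helper name is hypothetical:

```typescript
// Sketch only: in the API this would consume the active span's context, e.g.
//   const spanCtx = trace.getSpan(context.active())?.spanContext();
// Here just the W3C traceparent formatting is shown as a pure helper.
function formatTraceparent(
  traceId: string,
  spanId: string,
  traceFlags: number,
): string {
  // W3C trace-context: version "00", then trace-id, parent-id, trace-flags.
  const flags = traceFlags.toString(16).padStart(2, '0');
  return `00-${traceId}-${spanId}-${flags}`;
}
```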

In `@apps/api/src/index.ts`:
- Around line 1-3: The module-level call to setupTelemetry('wa-chat-api') runs
before environment variables are loaded; move this call out of top-level
execution and into a bootstrap/initialization routine that runs after
dotenv.config() (or call it after the env-loading code), e.g., remove the direct
setupTelemetry(...) invocation from the top of apps/api/src/index.ts and instead
invoke setupTelemetry inside your existing bootstrap function or immediately
after dotenv.config() so telemetry initialization uses the proper env values;
reference the setupTelemetry symbol and the module-level invocation in index.ts
when making the change.
- Around line 402-418: Middleware is currently using req.route?.path or req.path
which either drops mounted prefixes or yields high-cardinality user-controlled
paths; change the label to a low-cardinality route identifier by building it
from the mounting prefix and route pattern (e.g., use req.baseUrl +
req.route?.path when req.route exists) and avoid falling back to raw req.path
for unmatched routes—use a stable sentinel like "unmatched_route" or
"unknown_route" for 404s; update the app.use middleware that records
appMetrics.apiRequestCount, appMetrics.apiLatency, and appMetrics.apiErrorCount
to compute this routeLabel and pass it in attributes instead of req.path.
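The bounded route label suggested above can be sketched as a pure helper; `routeLabel` is an assumed name, and the arguments mirror Express's `req.baseUrl` and `req.route?.path`:

```typescript
// Hypothetical sketch of a low-cardinality metric label.
function routeLabel(baseUrl: string, routePath: string | undefined): string {
  // A matched route yields its template (a bounded set of label values);
  // unmatched requests get a stable sentinel instead of the raw user path.
  return routePath ? `${baseUrl}${routePath}` : 'unmatched_route';
}
```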

In `@apps/worker/src/index.ts`:
- Around line 486-514: The retry/failure/DLQ logs emitted via
logger.warn/logger.error (events 'worker.job.retried', 'worker.job.failed', and
'worker.dlq.inflow.alert') are missing trace correlation and eventKey; update
the payloads to merge the ingress payload's traceContext and eventKey into
baseLogData before logging (i.e., extend baseLogData with trace ids and eventKey
from the incoming traceContext/eventKey fields used earlier in this module),
then pass that enriched object to logger.warn/logger.error for the retry/failure
and the DLQ branch (routeResult.routedToDlq) so those logs include the trace
identifiers and eventKey for incident correlation.
- Around line 1-3: The telemetry SDK is being initialized too early
(setupTelemetry('wa-chat-worker') runs before dotenv.config()) and causes SDK
and SIGTERM handlers to register during module import (affecting tests); move
the setupTelemetry call to after dotenv.config() and ensure initialization
(including any SIGTERM handler registration) only runs on application startup by
either exporting a start function (e.g., startWorker/startApp) that calls
dotenv.config() then setupTelemetry, or by guarding top-level startup with a
runtime check (e.g., require.main === module) so imports (like in index.test.ts)
do not trigger SDK/handler registration. Use the symbols setupTelemetry and any
existing SIGTERM handler registration code to locate and relocate the
initialization logic.

In `@apps/worker/src/queue/consumer.ts`:
- Around line 136-151: The code is incrementing queueDlqCount/queueRetryCount
too early in runIngressJob() before routeFailedIngressJobToDlq() determines
final outcome, and it also misses counting permanent retries when
permanentMaxAttempts > 1; remove the early metric increments from the
failure-handling branch that checks classifiedError.errorClass and
currentAttempt, and instead increment appMetrics.queueDlqCount.add(...) or
appMetrics.queueRetryCount.add(...) inside routeFailedIngressJobToDlq() after
the DLQ enqueue result is known (or in the final retry decision path), ensuring
you use classifiedError.errorClass and currentAttempt vs.
policies.retry.transientMaxAttempts / permanentAttempts to decide DLQ vs retry
and still preserve throwing UnrecoverableError for permanent errors (as in the
UnrecoverableError handling).

In `@docker/prometheus.yml`:
- Around line 1-5: Add an evaluation_interval under the global Prometheus config
to match the fast scrape cadence: in the global block (where scrape_interval: 1s
is set) add evaluation_interval: 1s so alert rules (rule_files
/etc/prometheus/alert.rules.yml) are evaluated every second instead of the 60s
default.
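The resulting global block would look like this (a sketch of the suggested change using the rule_files path cited above, not the repo file verbatim):

```yaml
global:
  scrape_interval: 1s
  evaluation_interval: 1s  # match the scrape cadence so rules fire promptly

rule_files:
  - /etc/prometheus/alert.rules.yml
```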

In `@packages/shared/src/index.ts`:
- Around line 22-25: isIngressJobPayload currently ignores traceContext and
assertIngressJobPayload forwards any truthy object, allowing malformed
traceContext like { traceId: 123 }; update isIngressJobPayload to validate that
if payload.traceContext exists it is an object with both traceId and
correlationId as strings, and then update assertIngressJobPayload to call this
stricter check (rejecting payloads where traceContext is missing keys or types),
ensuring any place that forwards the payload (the code path that passes
traceContext along) only accepts payloads that pass the new validation.
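The stricter validation described above can be sketched as a type guard; the shapes and names below are assumptions based on the PR summary, not the package's real types:

```typescript
// Hypothetical sketch of the stricter traceContext check.
interface TraceContext {
  traceId: string;
  correlationId: string;
}

function isValidTraceContext(value: unknown): value is TraceContext {
  if (typeof value !== 'object' || value === null) return false;
  const ctx = value as Record<string, unknown>;
  return (
    typeof ctx.traceId === 'string' && typeof ctx.correlationId === 'string'
  );
}

// traceContext stays optional, but when present it must be well-formed.
function hasValidOptionalTraceContext(payload: { traceContext?: unknown }): boolean {
  return (
    payload.traceContext === undefined || isValidTraceContext(payload.traceContext)
  );
}
```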

In `@packages/shared/src/telemetry.ts`:
- Around line 60-67: The SIGTERM handler in the shared telemetry helper (the
processRef?.on?('SIGTERM', ...) block that calls sdk.shutdown()) must not call
processRef?.exit?.(0) or otherwise force process exit; remove the forced exit
and avoid owning process lifecycle here. Modify the handler so it only calls
sdk.shutdown() (and logs success or error via logger) but does not exit the
process or swallow failures—allow the API/worker entrypoints to perform their
own shutdown sequencing (e.g., worker.close()/queue.close()) and then call
sdk.shutdown() or exit as appropriate.
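The non-exiting handler described above can be sketched with the SDK and process injected, so the behavior is testable; the interface and function names are assumptions, not the package's real API:

```typescript
// Hypothetical sketch: flush telemetry on SIGTERM without owning the
// process lifecycle (no process.exit here).
interface ShutdownTarget {
  shutdown(): Promise<void>;
}
interface ProcessLike {
  on(event: string, handler: () => void): void;
  exit?(code: number): void;
}

function registerTelemetryShutdown(sdk: ShutdownTarget, proc: ProcessLike): void {
  proc.on('SIGTERM', () => {
    // Only flush; the API/worker entrypoints sequence their own shutdown
    // (queue/worker close, then exit) after telemetry has flushed.
    sdk.shutdown().catch((error) => {
      console.error('Error shutting down telemetry', error);
    });
  });
}
```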

In `@scripts/simulate-alerts.ts`:
- Line 91: Fix the typo in the shutdown log message: replace the misspelled
string "Sutdown requested, stopping metrics server..." used in the console.log
call with "Shutdown requested, stopping metrics server..." so the shutdown
handler / log output (the console.log invocation in scripts/simulate-alerts.ts)
prints the correct word.

---

Outside diff comments:
In `@apps/api/src/index.test.ts`:
- Around line 336-344: The test currently only asserts traceId shape; instead
verify end-to-end propagation by asserting the queued trace equals the ingress
event trace: after computing enqueuedTraceContext from deps.enqueuedJobs and
ingressEvents from getObservabilityEvents(deps.observabilityEvents,
'ingress_start'), add an assertion that enqueuedTraceContext.traceId ===
ingressEvents[0].context.traceId (and keep the existing correlationId equality
checks), so the test uses the actual values (enqueuedTraceContext,
ingressEvents, correlationId) rather than only pattern-matching the traceId.

---

Nitpick comments:
In `@packages/llm/src/router/model-router.ts`:
- Around line 28-36: The duplicated fallback callback that increments
appMetrics.agentProviderFallbackCount (currently embedded in the
createGeminiAdapter(...).withConfig({ callbacks: [{ handleLLMStart: () => {
appMetrics.agentProviderFallbackCount.add(1); } }] }) call used for the
fallbackModel and the other router builder) should be extracted into a single
shared helper/config (e.g., a getFallbackCallbacks or
fallbackInstrumentationConfig) and reused by both router builders; locate the
two createGeminiAdapter(...).withConfig(...) usages (the fallbackModel variable
and the second builder where the same callbacks are repeated) and replace the
inline callbacks with a reference to the shared helper to avoid duplication and
drift.

In `@packages/llm/src/tools/index.ts`:
- Around line 13-37: The catch block in withSafeFallback logs errors with
console.error instead of the shared structured logger; update
packages/llm/src/tools/index.ts to import the shared logger (e.g., appLogger)
from '@wa-chat/shared' and replace the console.error call inside
withSafeFallback with appLogger.error, passing a descriptive message, the error
object (not just error.message), and context tags (tool: toolName, status:
isTimeout ? 'timeout' : 'error') so failures are structured and trace-correlated
while preserving the existing appMetrics calls.

In `@packages/shared/package.json`:
- Around line 19-20: package.json currently lists "pino-pretty" as a regular
dependency which inflates production bundles; move "pino-pretty" to
devDependencies and update runtime initialization code that configures the
logger (where you reference "pino" and configure pretty-printing) to use a
dynamic import() for "pino-pretty" only when NODE_ENV !== 'production', wrapping
the import in try/catch and falling back to plain pino if the import fails;
alternatively document and/or configure "pino-pretty" as an optional
peerDependency so consumers can opt-in—ensure the changes reference the
"pino-pretty" entry in package.json and the logger setup code that currently
imports or requires it.

In `@packages/shared/src/logger.js`:
- Around line 1-20: The compiled JavaScript file (contains sourceMappingURL) was
committed into src/ (logger.js with export logger and pino/pino-pretty usage);
remove the compiled artifact from the source tree and ensure builds emit to
dist/ instead: delete logger.js and logger.js.map from src/, remove them from
git with git rm --cached if already tracked, add a rule to ignore src/*.js and
src/*.js.map in .gitignore, and adjust the build/config (tsconfig/outDir or
build script) so TypeScript/tsc outputs to dist/ and source .ts logger file
remains in src/ (ensure exported symbol logger remains implemented in
src/logger.ts).

In `@packages/shared/src/telemetry.js`:
- Around line 48-56: The SIGTERM handler currently calls processRef?.exit?.(0)
unconditionally in the finally block after sdk.shutdown(), which exits with
success even on shutdown failure; update the handler so that
sdk.shutdown().then(...) calls processRef?.exit?.(0) on success and the
.catch(...) calls processRef?.exit?.(1) (or omit exiting on failure if desired),
and remove the unconditional exit from the .finally() block so exit codes
reflect shutdown result; refer to processRef, sdk.shutdown(), and logger in your
changes.
- Around line 13-24: The metric/exporter URL logic is confusing by
special-casing endpoints that end with "/v1/traces"; simplify by normalizing
otlpEndpoint into a single base URL (trim any trailing slash and strip any
trailing "/v1/traces" or "/v1/metrics" if present) and then build trace and
metric URLs by appending "/v1/traces" and "/v1/metrics" respectively before
passing to OTLPTraceExporter and OTLPMetricExporter (keep fallback to new
OTLPTraceExporter()/new OTLPMetricExporter() when otlpEndpoint is falsy). Ensure
you update the code around traceExporter, metricExporter, and the otlpEndpoint
normalization so both exporters are constructed from a single, predictable base.
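The single-base normalization the comment asks for can be sketched as below; `normalizeOtlpBase` and `otlpUrls` are hypothetical helper names, and the regexes assume only trailing slashes and `/v1/traces` or `/v1/metrics` suffixes need stripping.

```typescript
// Hypothetical sketch: reduce any configured OTLP endpoint to one base URL,
// then derive both signal-specific URLs from it.
function normalizeOtlpBase(endpoint: string): string {
  return endpoint
    .replace(/\/+$/, '') // drop trailing slashes
    .replace(/\/v1\/(traces|metrics)$/, ''); // drop a signal-specific suffix
}

function otlpUrls(endpoint: string): { traces: string; metrics: string } {
  const base = normalizeOtlpBase(endpoint);
  return { traces: `${base}/v1/traces`, metrics: `${base}/v1/metrics` };
}
```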

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1cda0560-5943-4e6d-afe1-b1496200c51f

📥 Commits

Reviewing files that changed from the base of the PR and between 60abff3 and 75ac7f8.

⛔ Files ignored due to path filters (10)
  • apps/api/src/index.d.ts.map is excluded by !**/*.map
  • apps/api/src/index.js.map is excluded by !**/*.map
  • apps/api/src/index.test.js.map is excluded by !**/*.map
  • package-lock.json is excluded by !**/package-lock.json
  • packages/shared/src/index.d.ts.map is excluded by !**/*.map
  • packages/shared/src/index.js.map is excluded by !**/*.map
  • packages/shared/src/logger.d.ts.map is excluded by !**/*.map
  • packages/shared/src/logger.js.map is excluded by !**/*.map
  • packages/shared/src/telemetry.d.ts.map is excluded by !**/*.map
  • packages/shared/src/telemetry.js.map is excluded by !**/*.map
📒 Files selected for processing (35)
  • .github/workflows/check-prometheus-rules.yml
  • apps/api/src/index.js
  • apps/api/src/index.test.js
  • apps/api/src/index.test.ts
  • apps/api/src/index.ts
  • apps/dashboard/lib/supabase/server.ts
  • apps/dashboard/middleware.ts
  • apps/worker/src/index.test.ts
  • apps/worker/src/index.ts
  • apps/worker/src/queue/consumer.ts
  • docker-compose.local.yml
  • docker/grafana/dashboards/k2-observability.json
  • docker/grafana/provisioning/dashboards/dashboards.yml
  • docker/grafana/provisioning/datasources/prometheus.yml
  • docker/otel-collector-config.yml
  • docker/prometheus.yml
  • docker/prometheus/alert.rules.yml
  • package.json
  • packages/llm/src/langchain/chains/router.ts
  • packages/llm/src/langchain/pipeline.ts
  • packages/llm/src/router/model-router.ts
  • packages/llm/src/tools/index.ts
  • packages/shared/package.json
  • packages/shared/src/index.d.ts
  • packages/shared/src/index.js
  • packages/shared/src/index.ts
  • packages/shared/src/logger.d.ts
  • packages/shared/src/logger.js
  • packages/shared/src/logger.ts
  • packages/shared/src/telemetry.d.ts
  • packages/shared/src/telemetry.js
  • packages/shared/src/telemetry.test.ts
  • packages/shared/src/telemetry.ts
  • packages/shared/test-resource.ts
  • scripts/simulate-alerts.ts

@@ -0,0 +1,21 @@
name: K3 Alerting Rules Check
description: Validates Prometheus alert rules

⚠️ Potential issue | 🟡 Minor

Invalid workflow key: description is not allowed at the workflow level.

GitHub Actions workflows do not support a description key at the top level. Valid keys are: name, on, jobs, env, permissions, defaults, concurrency, run-name. This will cause the workflow to fail validation.

🔧 Proposed fix: Remove the invalid key
 name: K3 Alerting Rules Check
-description: Validates Prometheus alert rules

 on: [push, pull_request]

If you want to retain the description, add it as a YAML comment:

+# Validates Prometheus alert rules
 name: K3 Alerting Rules Check

 on: [push, pull_request]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
description: Validates Prometheus alert rules
name: K3 Alerting Rules Check
on: [push, pull_request]
🧰 Tools
🪛 actionlint (1.7.11)

[error] 2-2: unexpected key "description" for "workflow" section. expected one of "concurrency", "defaults", "env", "jobs", "name", "on", "permissions", "run-name"

(syntax-check)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check-prometheus-rules.yml at line 2, Remove the invalid
top-level key "description" from the workflow (the YAML root must only use
allowed keys such as name, on, jobs, env, permissions, defaults, concurrency,
run-name); either delete the "description" entry entirely or convert its text to
a YAML comment at the top of the file so the workflow validates.

Comment on lines +1 to +3
import { setupTelemetry, logger } from '@wa-chat/shared';
// Initialize telemetry before other imports if possible, or right at the top
setupTelemetry('wa-chat-api');

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether the API starts directly from apps/api/src/index.* without a bootstrap wrapper.

echo "=== package scripts ==="
fd '^package\.json$' . -x sh -c 'echo "\n## $1"; jq -r ".scripts // {}" "$1"' sh {}

echo "\n=== setupTelemetry call sites ==="
rg -n "setupTelemetry\(" apps packages

echo "\n=== API entrypoint candidates ==="
fd 'index\.(ts|js)$' apps/api

Repository: Ryanakml/ChatFuse

Length of output: 3386


🏁 Script executed:

#!/bin/bash
# Extract package.json scripts correctly
echo "=== Root package.json scripts ==="
cat package.json | jq -r '.scripts // {}' 2>/dev/null || echo "Failed to parse"

echo "\n=== Apps/api package.json scripts ==="
cat apps/api/package.json | jq -r '.scripts // {}' 2>/dev/null || echo "Failed to parse"

echo "\n=== Full index.js content ==="
wc -l apps/api/src/index.js
head -50 apps/api/src/index.js

echo "\n=== Full index.ts content ==="
wc -l apps/api/src/index.ts
head -50 apps/api/src/index.ts

echo "\n=== Check for bootstrap files ==="
fd 'bootstrap' apps/api/src

echo "\n=== Type field in package.json ==="
grep -A2 '"type"' package.json apps/api/package.json 2>/dev/null

Repository: Ryanakml/ChatFuse

Length of output: 4987


🏁 Script executed:

#!/bin/bash
echo "=== setupTelemetry implementation ==="
cat packages/shared/src/telemetry.ts | head -100

echo "\n=== Check for OpenTelemetry/instrumentation usage ==="
rg -n "instrumentation|NodeSDK" packages/shared/src/telemetry.ts | head -30

echo "\n=== Check for trace context reconstruction ==="
rg -n "traceContext|ingressContext" apps/api/src/index.ts | head -20

echo "\n=== Check worker queue processing ==="
rg -n "traceContext|ingressContext" apps/worker/src/index.ts | head -20

Repository: Ryanakml/ChatFuse

Length of output: 5075


🏁 Script executed:

#!/bin/bash
echo "=== Check for any entry point configuration ==="
cat apps/api/package.json | jq '.' | grep -A20 '"exports"\|"main"'

echo "\n=== Check for Node loader hooks or require hooks ==="
rg -n "register|loader|hook" apps/api/src

echo "\n=== Check if dist/index.js differs from src/index.js ==="
if [ -f dist/index.js ]; then
  echo "dist/index.js exists"
  head -10 dist/index.js
else
  echo "dist/index.js does not exist"
fi

echo "\n=== Check actual worker app startup pattern ==="
cat apps/worker/src/index.ts | head -30

Repository: Ryanakml/ChatFuse

Length of output: 15608


Move telemetry initialization to a separate bootstrap entry point.

Because this is an ESM module, static imports are hoisted and evaluated before the module body executes. This means setupTelemetry('wa-chat-api') runs after Express, Redis, and BullMQ are already loaded, preventing auto-instrumentation from patching them. Create a tiny bootstrap entrypoint that initializes telemetry before importing the app, or use a Node.js loader to prepend initialization.

🧭 Bootstrap pattern
// apps/api/src/bootstrap.ts
import { setupTelemetry } from '@wa-chat/shared';
setupTelemetry('wa-chat-api');
await import('./index.js');

Then update package.json:

{
  "start": "node --input-type=module --eval \"import('./dist/bootstrap.js')\""
}

Or with tsx for development:

{
  "dev": "tsx src/bootstrap.ts"
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.js` around lines 1 - 3, The telemetry setup call
setupTelemetry('wa-chat-api') must run before any app module imports so
auto-instrumentation can patch Express/Redis/BullMQ; move this initialization
into a tiny bootstrap entry point (e.g., create a new bootstrap file that
imports and calls setupTelemetry('wa-chat-api') and then dynamically imports the
existing index.js/app), update your start/dev scripts to run the bootstrap (or
use tsx for dev) so the bootstrap executes first, and remove or replace the
direct setupTelemetry call from apps/api/src/index.js to avoid
double-initialization.

Comment on lines +280 to +296
    app.use((req, res, next) => {
        const startTime = Date.now();
        res.on('finish', () => {
            const durationMs = Date.now() - startTime;
            const attributes = {
                method: req.method,
                route: req.route?.path || req.path,
                status: String(res.statusCode),
            };
            appMetrics.apiRequestCount.add(1, attributes);
            appMetrics.apiLatency.record(durationMs, attributes);
            if (res.statusCode >= 500) {
                appMetrics.apiErrorCount.add(1, attributes);
            }
        });
        next();
    });

⚠️ Potential issue | 🟠 Major

Don't fall back to raw request paths in metric labels.

When req.route is unset, this emits concrete URLs as separate time series. That is exactly the sort of cardinality spike that hurts Prometheus when you need it most. Use a bounded fallback like 'unmatched', or explicitly stash a normalized route template.

📉 Suggested fix
     app.use((req, res, next) => {
         const startTime = Date.now();
         res.on('finish', () => {
             const durationMs = Date.now() - startTime;
+            const route = req.route?.path ?? 'unmatched';
             const attributes = {
                 method: req.method,
-                route: req.route?.path || req.path,
+                route,
                 status: String(res.statusCode),
             };
             appMetrics.apiRequestCount.add(1, attributes);
             appMetrics.apiLatency.record(durationMs, attributes);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.js` around lines 280 - 296, The middleware that builds
metric labels (inside the app.use(...) block that sets startTime and constructs
attributes) currently falls back to req.path when req.route is undefined,
creating high-cardinality time series; change the route label to use a bounded
fallback (e.g., 'unmatched') or a normalized route template instead of req.path
by replacing the expression req.route?.path || req.path with something like
req.route?.path || 'unmatched' (or retrieve and stash a normalized route
template) so appMetrics.apiRequestCount, appMetrics.apiLatency, and
appMetrics.apiErrorCount receive a stable route label.

Comment on lines +473 to +476
    traceContext: {
        traceId: ingressContext.traceId,
        correlationId: ingressContext.correlationId,
    },

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether traceContext is used as actual distributed-tracing state anywhere downstream.

echo "=== trace propagation usage ==="
rg -n "traceContext|traceId|correlationId|traceparent|propagation|context\.active|getSpan|spanContext|startActiveSpan" apps packages

Repository: Ryanakml/ChatFuse

Length of output: 5182


🏁 Script executed:

#!/bin/bash
# Check the context around lines 473-476 in apps/api/src/index.js
echo "=== API code around lines 473-476 ==="
sed -n '420,490p' apps/api/src/index.js

echo ""
echo "=== Check ingressContext creation ==="
sed -n '400,435p' apps/api/src/index.js

echo ""
echo "=== Check imports at top of file ==="
head -40 apps/api/src/index.js

Repository: Ryanakml/ChatFuse

Length of output: 6261


🏁 Script executed:

#!/bin/bash
# Check how worker uses traceContext
echo "=== Worker code using traceContext ==="
sed -n '330,360p' apps/worker/src/index.ts

echo ""
echo "=== Check if OpenTelemetry is used in worker ==="
head -50 apps/worker/src/index.ts | grep -i "otel\|opentelemetry\|trace\|span"

Repository: Ryanakml/ChatFuse

Length of output: 1042


🏁 Script executed:

#!/bin/bash
# Look for how ingressContext.traceId is created - check for randomBytes or createTraceId
echo "=== Search for createTraceId function ==="
rg -n "createTraceId|randomBytes" apps/api/src/index.js -A 3

echo ""
echo "=== Check IngressContext interface ==="
rg -n "IngressContext|ingressContext" apps/api/src/index.js | head -20

Repository: Ryanakml/ChatFuse

Length of output: 1835


🏁 Script executed:

#!/bin/bash
# Check the attachIngressTraceContext middleware
echo "=== Looking for attachIngressTraceContext middleware ==="
rg -n "attachIngressTraceContext|const attachIngressTraceContext|function attachIngressTraceContext" apps/api/src/index.js -A 10

echo ""
echo "=== Check getIngressTraceContext function ==="
rg -n "getIngressTraceContext|const getIngressTraceContext" apps/api/src/index.js -B 2 -A 8

Repository: Ryanakml/ChatFuse

Length of output: 3325


🏁 Script executed:

#!/bin/bash
# Check setupTelemetry in the shared package
echo "=== Check setupTelemetry implementation ==="
rg -n "setupTelemetry|export.*setupTelemetry" packages/shared/src/ -A 5

echo ""
echo "=== Check if there's any OpenTelemetry configuration ==="
rg -n "@opentelemetry|otel|OpenTelemetry" packages/shared/src/ -l

Repository: Ryanakml/ChatFuse

Length of output: 2931


🏁 Script executed:

#!/bin/bash
# Check full setupTelemetry implementation
echo "=== Full setupTelemetry implementation ==="
cat packages/shared/src/telemetry.ts

Repository: Ryanakml/ChatFuse

Length of output: 4277


🏁 Script executed:

#!/bin/bash
# Check buildIngressTraceContext function
echo "=== buildIngressTraceContext function ==="
rg -n "buildIngressTraceContext|const buildIngressTraceContext" apps/api/src/index.js -B 2 -A 15

Repository: Ryanakml/ChatFuse

Length of output: 1995


Propagate the active OpenTelemetry span context, not a synthetic trace ID.

The code generates a new trace ID with randomBytes at line 302 of buildIngressTraceContext, which creates a trace-shaped string unrelated to the actual HTTP request span. Since setupTelemetry enables getNodeAutoInstrumentations(), the webhook POST request already has an active span from automatic HTTP instrumentation. However, this span context is never captured. The worker receives only the synthetic ID and cannot reconstruct or join the distributed trace.

Pass the active span context downstream so the worker can continue the trace. Capture it with trace.getSpan(context.active())?.spanContext() and pass the full span context (not just traceId), ideally as traceparent or equivalent W3C format for proper trace propagation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.js` around lines 473 - 476, The traceContext currently
uses a synthetic traceId from buildIngressTraceContext; instead capture the
active OpenTelemetry span and propagate its full span context downstream instead
of the synthetic ID. In buildIngressTraceContext (and where traceContext is
assigned to ingressContext/traceContext) call
trace.getSpan(context.active())?.spanContext() and construct a W3C traceparent
(or include the full spanContext object) and set that value on
ingressContext.traceContext (or a traceparent field) so the worker receives the
real span information; ensure setupTelemetry/getNodeAutoInstrumentations remains
enabled and use the spanContext rather than randomBytes-generated IDs.
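For reference, the W3C `traceparent` value has a fixed shape; the sketch below serializes a span context into the version-00 format. In real code the `@opentelemetry/api` propagation API would normally handle this, so treat this as an illustration of the format rather than the recommended implementation.

```typescript
// Sketch: serialize an OpenTelemetry-style span context as a W3C traceparent
// header (version 00): 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>.
function toTraceparent(span: {
  traceId: string;
  spanId: string;
  traceFlags: number;
}): string {
  const flags = (span.traceFlags & 0xff).toString(16).padStart(2, '0');
  return `00-${span.traceId}-${span.spanId}-${flags}`;
}
```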

Comment on lines +1 to +3
import { setupTelemetry, logger } from '@wa-chat/shared';
// Initialize telemetry before other imports if possible, or right at the top
setupTelemetry('wa-chat-api');

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the main entry point and understand the project structure
git ls-files apps/api/src/ | head -20

Repository: Ryanakml/ChatFuse

Length of output: 588


🏁 Script executed:

# Look at the specific file mentioned in the review
cat -n apps/api/src/index.ts

Repository: Ryanakml/ChatFuse

Length of output: 27383


🏁 Script executed:

# Search for dotenv.config() calls to understand when env is loaded
rg "dotenv\.config|config()" apps/api/src/ -A 2 -B 2

Repository: Ryanakml/ChatFuse

Length of output: 2432


🏁 Script executed:

# Check if there are other entry points or bootstrap files
fd -e ts -e js apps/api/src/ | grep -E "(main|bootstrap|entry|server)" | head -10

Repository: Ryanakml/ChatFuse

Length of output: 418


🏁 Script executed:

# Search for where apps/api/src/index.ts is imported from
rg "from.*['\"].*apps/api/src['\"]|from.*['\"]\.\./" apps/ --type ts --type js | head -20

Repository: Ryanakml/ChatFuse

Length of output: 987


Move telemetry initialization behind env loading or into a function.

Module-level setupTelemetry() at line 3 executes before dotenv.config() at line 732. This means any code importing apps/api/src/index.ts starts the SDK before environment variables are loaded—creating test pollution and potential misconfiguration in consuming processes. Move this call into a bootstrap function or after dotenv.config().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/index.ts` around lines 1 - 3, The module-level call to
setupTelemetry('wa-chat-api') runs before environment variables are loaded; move
this call out of top-level execution and into a bootstrap/initialization routine
that runs after dotenv.config() (or call it after the env-loading code), e.g.,
remove the direct setupTelemetry(...) invocation from the top of
apps/api/src/index.ts and instead invoke setupTelemetry inside your existing
bootstrap function or immediately after dotenv.config() so telemetry
initialization uses the proper env values; reference the setupTelemetry symbol
and the module-level invocation in index.ts when making the change.

Comment on lines 136 to +151
  if (classifiedError.errorClass === 'permanent' && currentAttempt >= permanentAttempts) {
    appMetrics.queueDlqCount.add(1, { error_class: 'permanent' });
    if (classifiedError.error instanceof UnrecoverableError) {
      throw classifiedError.error;
    }

    throw new UnrecoverableError(classifiedError.error.message);
  }

  if (classifiedError.errorClass === 'transient') {
    if (currentAttempt >= policies.retry.transientMaxAttempts) {
      appMetrics.queueDlqCount.add(1, { error_class: 'transient' });
    } else {
      appMetrics.queueRetryCount.add(1);
    }
  }

⚠️ Potential issue | 🟠 Major

Emit retry/DLQ metrics only after the final outcome is known.

runIngressJob() increments these counters before routeFailedIngressJobToDlq() knows whether the job will actually retry and before the DLQ enqueue succeeds. That makes queueDlqCount over-report failed DLQ routes, and it also misses permanent retries entirely when permanentMaxAttempts > 1.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/queue/consumer.ts` around lines 136 - 151, The code is
incrementing queueDlqCount/queueRetryCount too early in runIngressJob() before
routeFailedIngressJobToDlq() determines final outcome, and it also misses
counting permanent retries when permanentMaxAttempts > 1; remove the early
metric increments from the failure-handling branch that checks
classifiedError.errorClass and currentAttempt, and instead increment
appMetrics.queueDlqCount.add(...) or appMetrics.queueRetryCount.add(...) inside
routeFailedIngressJobToDlq() after the DLQ enqueue result is known (or in the
final retry decision path), ensuring you use classifiedError.errorClass and
currentAttempt vs. policies.retry.transientMaxAttempts / permanentAttempts to
decide DLQ vs retry and still preserve throwing UnrecoverableError for permanent
errors (as in the UnrecoverableError handling).
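One way to make the timing explicit is to compute the final outcome first and emit metrics only after that outcome has actually happened; `finalFailureOutcome` and its parameters are hypothetical names mirroring the review's references, not the actual consumer.ts API.

```typescript
// Hypothetical sketch: decide retry vs DLQ once, so the matching metric can be
// emitted after the DLQ enqueue (or retry scheduling) succeeds.
type FailureOutcome = 'dlq' | 'retry';

function finalFailureOutcome(
  errorClass: 'permanent' | 'transient',
  currentAttempt: number,
  maxAttempts: { permanent: number; transient: number },
): FailureOutcome {
  const limit = errorClass === 'permanent' ? maxAttempts.permanent : maxAttempts.transient;
  return currentAttempt >= limit ? 'dlq' : 'retry';
}
```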

Comment on lines +1 to +5
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast

rule_files:
  - /etc/prometheus/alert.rules.yml

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Prometheus, if global.evaluation_interval is omitted, what default is used, and does changing scrape_interval also change alert/rule evaluation frequency?

💡 Result:

  • Default when global.evaluation_interval is omitted: 1m (1 minute). (prometheus.io)
  • Does changing scrape_interval change alert/rule evaluation frequency? No. Rule/alert evaluation runs on global.evaluation_interval (or a rule-group’s own interval override). Changing scrape_interval only changes how often targets are scraped, not how often rules are evaluated. (prometheus.io)



Add evaluation_interval to match the fast scrape cadence.

With scrape_interval: 1s but no evaluation_interval configured, Prometheus will scrape metrics every second but evaluate alert rules every 60 seconds (the default). This mismatch means incident alerts can fire much later than expected, defeating the purpose of fast scraping in simulation.

Suggested fix
 global:
   scrape_interval: 1s # Set small scrape interval since we're simulating fast
+  evaluation_interval: 1s
📝 Committable suggestion


Suggested change
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast
rule_files:
  - /etc/prometheus/alert.rules.yml
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast
  evaluation_interval: 1s
rule_files:
  - /etc/prometheus/alert.rules.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/prometheus.yml` around lines 1 - 5, Add an evaluation_interval under
the global Prometheus config to match the fast scrape cadence: in the global
block (where scrape_interval: 1s is set) add evaluation_interval: 1s so alert
rules (rule_files /etc/prometheus/alert.rules.yml) are evaluated every second
instead of the 60s default.

Comment on lines +22 to +25
  traceContext?: {
    traceId: string;
    correlationId: string;
  };

⚠️ Potential issue | 🟠 Major

Validate traceContext at runtime as well.

IngressJobPayload now exposes traceContext as two required strings, but Line 188 forwards any truthy object and isIngressJobPayload() still ignores the field entirely. That means malformed payloads like { traceContext: { traceId: 123 } } can be created and then pass assertIngressJobPayload(), which defeats the correlation guarantees this PR is adding.

Suggested fix
+const isTraceContext = (
+  value: unknown,
+): value is { traceId: string; correlationId: string } =>
+  isRecord(value) &&
+  isNonEmptyString(value.traceId) &&
+  isNonEmptyString(value.correlationId);
+
 export const isIngressJobPayload = (value: unknown): value is IngressJobPayload => {
   if (!isRecord(value)) {
     return false;
   }
@@
   if (typeof value.receivedAt !== 'string' || Number.isNaN(Date.parse(value.receivedAt))) {
     return false;
   }
+
+  if (!(value.traceContext === undefined || isTraceContext(value.traceContext))) {
+    return false;
+  }

   return isJsonValue(value.payload);
 };
@@
 export const createIngressJobPayload = (input: {
   eventKey: string;
   payload: JsonValue;
   receivedAt?: string;
   traceContext?: { traceId: string; correlationId: string };
 }): IngressJobPayload => {
@@
   if (Number.isNaN(Date.parse(receivedAt))) {
     throw new Error('Ingress job receivedAt must be a valid ISO timestamp');
   }
+
+  if (input.traceContext && !isTraceContext(input.traceContext)) {
+    throw new Error(
+      'Ingress job traceContext must contain non-empty traceId and correlationId',
+    );
+  }

   return {
     schemaVersion: INGRESS_JOB_SCHEMA_VERSION,
     eventKey,
     payload: input.payload,

Also applies to: 167-189

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/index.ts` around lines 22 - 25, isIngressJobPayload
currently ignores traceContext and assertIngressJobPayload forwards any truthy
object, allowing malformed traceContext like { traceId: 123 }; update
isIngressJobPayload to validate that if payload.traceContext exists it is an
object with both traceId and correlationId as strings, and then update
assertIngressJobPayload to call this stricter check (rejecting payloads where
traceContext is missing keys or types), ensuring any place that forwards the
payload (the code path that passes traceContext along) only accepts payloads
that pass the new validation.
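A runnable version of that guard might look like the following; it matches the shape of the suggested fix above but is a standalone sketch, so `isTraceContext` here is illustrative rather than the committed helper.

```typescript
// Sketch: runtime type guard for traceContext, rejecting wrong types and
// empty strings so malformed payloads fail validation.
const isNonEmptyString = (value: unknown): value is string =>
  typeof value === 'string' && value.length > 0;

function isTraceContext(
  value: unknown,
): value is { traceId: string; correlationId: string } {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  return isNonEmptyString(record.traceId) && isNonEmptyString(record.correlationId);
}
```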

Comment on lines +60 to +67
  processRef?.on?.('SIGTERM', () => {
    sdk
      .shutdown()
      .then(() => logger.info('Telemetry shut down successfully'))
      .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
      .finally(() => {
        processRef?.exit?.(0);
      });

⚠️ Potential issue | 🟠 Major

Don't own process exit from the shared telemetry helper.

This SIGTERM handler always calls process.exit(0), even when sdk.shutdown() fails. In the worker, that races the existing shutdown hook and can cut off worker.close()/queue.close() while still reporting a clean exit.

✂️ Suggested direction
-  processRef?.on?.('SIGTERM', () => {
-    sdk
-      .shutdown()
-      .then(() => logger.info('Telemetry shut down successfully'))
-      .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
-      .finally(() => {
-        processRef?.exit?.(0);
-      });
-  });
-
   return sdk;

Let the API/worker entrypoints call sdk.shutdown() from their own signal handlers after their service-specific cleanup completes.

📝 Committable suggestion


Suggested change
  processRef?.on?.('SIGTERM', () => {
    sdk
      .shutdown()
      .then(() => logger.info('Telemetry shut down successfully'))
      .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
      .finally(() => {
        processRef?.exit?.(0);
      });
  return sdk;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/telemetry.ts` around lines 60 - 67, The SIGTERM handler
in the shared telemetry helper (the processRef?.on?('SIGTERM', ...) block that
calls sdk.shutdown()) must not call processRef?.exit?.(0) or otherwise force
process exit; remove the forced exit and avoid owning process lifecycle here.
Modify the handler so it only calls sdk.shutdown() (and logs success or error
via logger) but does not exit the process or swallow failures—allow the
API/worker entrypoints to perform their own shutdown sequencing (e.g.,
worker.close()/queue.close()) and then call sdk.shutdown() or exit as
appropriate.
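The entrypoint-owned ordering can be sketched as a simple sequence runner: service-specific cleanup first, telemetry flush last, exit code reflecting any failure. `shutdownSequence` is a hypothetical helper for illustration, not code from this repo.

```typescript
// Sketch: run shutdown steps in order (e.g. worker.close, queue.close,
// sdk.shutdown) and report a non-zero exit code if any step fails.
async function shutdownSequence(
  steps: Array<[string, () => Promise<void>]>,
): Promise<{ order: string[]; exitCode: number }> {
  const order: string[] = [];
  let exitCode = 0;
  for (const [name, step] of steps) {
    try {
      await step();
      order.push(name);
    } catch {
      exitCode = 1; // remember the failure but keep shutting down
      order.push(`${name}:failed`);
    }
  }
  return { order, exitCode };
}
```

An entrypoint would then call `process.exit(result.exitCode)` itself, keeping lifecycle ownership out of the shared helper.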

simulate();

process.on('SIGINT', () => {
  console.log('\nSutdown requested, stopping metrics server...');

⚠️ Potential issue | 🟡 Minor

Typo: "Sutdown" should be "Shutdown".

🔧 Fix typo
 process.on('SIGINT', () => {
-  console.log('\nSutdown requested, stopping metrics server...');
+  console.log('\nShutdown requested, stopping metrics server...');
   server.close(() => process.exit(0));
 });
📝 Committable suggestion


Suggested change
console.log('\nSutdown requested, stopping metrics server...');
process.on('SIGINT', () => {
  console.log('\nShutdown requested, stopping metrics server...');
  server.close(() => process.exit(0));
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/simulate-alerts.ts` at line 91, Fix the typo in the shutdown log
message: replace the misspelled string "Sutdown requested, stopping metrics
server..." used in the console.log call with "Shutdown requested, stopping
metrics server..." so the shutdown handler / log output (the console.log
invocation in scripts/simulate-alerts.ts) prints the correct word.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/check-prometheus-rules.yml (1)

2-2: ⚠️ Potential issue | 🔴 Critical

Remove the unsupported workflow-level description key.

Line 2 is still invalid GitHub Actions syntax, so the workflow will fail validation before this job can run. This was already flagged on an earlier revision and still applies here.

🔧 Proposed fix
 name: K3 Alerting Rules Check
-description: Validates Prometheus alert rules
+# Validates Prometheus alert rules
 
 on: [push, pull_request]
#!/bin/bash
python - <<'PY'
import sys
from pathlib import Path

try:
    import yaml
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "PyYAML"])
    import yaml

path = Path(".github/workflows/check-prometheus-rules.yml")
data = yaml.safe_load(path.read_text())
allowed = {"name", "on", "jobs", "env", "permissions", "defaults", "concurrency", "run-name"}

keys = list(data.keys())
invalid = [k for k in keys if k not in allowed]

print("top-level keys:", keys)
print("invalid keys:", invalid)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check-prometheus-rules.yml at line 2, Remove the
unsupported top-level workflow key "description" from the GitHub Actions YAML
(the workflow-level "description" present in check-prometheus-rules.yml); locate
the "description:" entry at the top of the file and delete it (or move its text
into the "name:" field or into a job-level step/comment) so the workflow only
contains valid top-level keys like name, on, jobs, env, permissions, defaults,
concurrency, and run-name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/check-prometheus-rules.yml:
- Around line 14-16: Download the published sha256sums file for Prometheus
alongside the prometheus-2.51.1.linux-amd64.tar.gz and verify the archive before
extracting and moving promtool; specifically, fetch sha256sums.txt from the same
release, run a checksum verification (e.g., use sha256sum -c or grep the
prometheus-2.51.1.linux-amd64.tar.gz line and pipe to sha256sum -c) and abort
the job if verification fails, then only tar xvf
prometheus-2.51.1.linux-amd64.tar.gz and sudo mv
prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/ when the checksum check
succeeds.

---

Duplicate comments:
In @.github/workflows/check-prometheus-rules.yml:
- Line 2: Remove the unsupported top-level workflow key "description" from the
GitHub Actions YAML (the workflow-level "description" present in
check-prometheus-rules.yml); locate the "description:" entry at the top of the
file and delete it (or move its text into the "name:" field or into a job-level
step/comment) so the workflow only contains valid top-level keys like name, on,
jobs, env, permissions, defaults, concurrency, and run-name.
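The checksum-gating pattern the review asks for can be sketched as a standalone shell step. This sketch exercises the `grep`-then-`sha256sum -c` logic against a locally generated stand-in file rather than the real Prometheus release; the archive name is reused from the comment purely for illustration:

```shell
#!/bin/sh
# Sketch of the verify-before-extract pattern from the review comment,
# run against a locally generated file instead of a network download.
set -eu
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the downloaded release archive (name from the review comment).
ARCHIVE="prometheus-2.51.1.linux-amd64.tar.gz"
printf 'fake archive contents\n' > "$ARCHIVE"

# Stand-in for the published sums file: "<sha256>  <filename>" per line,
# the same format sha256sums.txt uses on the GitHub release.
sha256sum "$ARCHIVE" > sha256sums.txt

# The verification step: select this archive's line and feed it to
# `sha256sum -c -`, which re-hashes the file and compares. With `set -e`,
# a mismatch aborts the script before any extraction would run.
grep " ${ARCHIVE}\$" sha256sums.txt | sha256sum -c -
echo "checksum ok"
```

In the real workflow the stand-in file would be replaced by the two `wget` downloads, with `set -euo pipefail` ensuring a failed check aborts the job before `tar xvf` and the `sudo mv` to PATH ever execute.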

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a3d953e7-b361-48d1-8674-e952374eacf5

📥 Commits

Reviewing files that changed from the base of the PR and between 75ac7f8 and 8ece4d3.

📒 Files selected for processing (1)
  • .github/workflows/check-prometheus-rules.yml

Comment on lines +14 to +16
wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz
tar xvf prometheus-2.51.1.linux-amd64.tar.gz
sudo mv prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import json
import urllib.request

url = "https://api.github.com/repos/prometheus/prometheus/releases/tags/v2.51.1"
with urllib.request.urlopen(url) as r:
    data = json.load(r)

assets = sorted(asset["name"] for asset in data.get("assets", []))
print("\n".join(assets))
print("\nsha256sums.txt present:", "sha256sums.txt" in assets)
PY

Repository: Ryanakml/ChatFuse

Length of output: 2642


🏁 Script executed:

# Check if the workflow file exists and examine current state
git ls-files .github/workflows/ | head -20
cat .github/workflows/check-prometheus-rules.yml 2>/dev/null || echo "File not found"

Repository: Ryanakml/ChatFuse

Length of output: 858


🌐 Web query:

Prometheus v2.51.1 release assets sha256sums GitHub

💡 Result:

For the Prometheus v2.51.1 GitHub release (dated 2024-03-27), the SHA256 checksums are published as release assets alongside the binaries. [1]

Release assets (direct download paths on GitHub Releases):

https://github.com/prometheus/prometheus/releases/download/v2.51.1/sha256sums.txt
https://github.com/prometheus/prometheus/releases/download/v2.51.1/sha256sums.txt.asc

The binaries themselves are under the same releases/download/v2.51.1/ path (e.g., prometheus-2.51.1.linux-amd64.tar.gz). [1][2]

Sources: [1] [2]


Add integrity verification for Prometheus archive before installation.

The workflow downloads an executable from the network and moves it to PATH without any integrity check. The Prometheus v2.51.1 release includes sha256sums.txt as a published asset—use it to verify the archive before extraction.

🔒 Proposed hardening
       - name: Install Promtool
         run: |
-          wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz
-          tar xvf prometheus-2.51.1.linux-amd64.tar.gz
-          sudo mv prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/
+          set -euo pipefail
+          VERSION=2.51.1
+          ARCHIVE="prometheus-${VERSION}.linux-amd64.tar.gz"
+          DIR="prometheus-${VERSION}.linux-amd64"
+          wget "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/${ARCHIVE}"
+          wget "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/sha256sums.txt"
+          grep " ${ARCHIVE}\$" sha256sums.txt | sha256sum -c -
+          tar xvf "${ARCHIVE}"
+          sudo mv "${DIR}/promtool" /usr/local/bin/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/check-prometheus-rules.yml around lines 14 - 16, Download
the published sha256sums file for Prometheus alongside the
prometheus-2.51.1.linux-amd64.tar.gz and verify the archive before extracting
and moving promtool; specifically, fetch sha256sums.txt from the same release,
run a checksum verification (e.g., use sha256sum -c or grep the
prometheus-2.51.1.linux-amd64.tar.gz line and pipe to sha256sum -c) and abort
the job if verification fails, then only tar xvf
prometheus-2.51.1.linux-amd64.tar.gz and sudo mv
prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/ when the checksum check
succeeds.

@Ryanakml Ryanakml merged commit 7ccb9a1 into main Mar 8, 2026
4 checks passed