📝 Walkthrough

Adds centralized telemetry and structured logging (OpenTelemetry + Pino), instruments the API, worker, LLM, tools, and simulator with app-level metrics, provisions Prometheus/Grafana (scrape/export pipeline + dashboards + alerts), adds traceContext propagation for ingress jobs, and adds a CI check for Prometheus rules.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant API as API Service
    participant Worker as Worker
    participant Shared as Shared (telemetry/logger)
    participant OTEL as OTEL Collector
    participant Prom as Prometheus
    participant Graf as Grafana
    API->>Shared: setupTelemetry('wa-chat-api') & use logger/appMetrics
    API->>Worker: enqueue ingress job (payload + traceContext)
    Worker->>Shared: setupTelemetry('wa-chat-worker') & use logger/appMetrics
    Shared->>OTEL: SDK exports traces/metrics (OTLP HTTP)
    OTEL->>Prom: exposes /metrics for Prometheus (8889)
    Prom->>Graf: Grafana queries Prometheus datasource for dashboards/alerts
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/api/src/index.test.ts (1)
336-344: ⚠️ Potential issue | 🟡 Minor — Assert propagation, not just shape.

This still passes if enqueueing generates a second random trace ID, because the only trace assertion is "looks like hex". Compare the queued value with `ingressEvents[0].context.traceId` so the test actually verifies end-to-end propagation.

✅ Tighten the assertion
```diff
 const enqueuedTraceContext = deps.enqueuedJobs[0]?.traceContext;
 assert.ok(enqueuedTraceContext, 'Enqueued job must contain traceContext');
 assert.equal(enqueuedTraceContext.correlationId, correlationId);
 assert.match(enqueuedTraceContext.traceId, /^[a-f0-9]{32}$/);
 const ingressEvents = getObservabilityEvents(deps.observabilityEvents, 'ingress_start');
 assert.equal(ingressEvents.length, 1);
 assert.equal(ingressEvents[0]?.context.correlationId, correlationId);
 assert.match(ingressEvents[0]?.context.traceId || '', /^[a-f0-9]{32}$/);
+assert.equal(enqueuedTraceContext.traceId, ingressEvents[0]?.context.traceId);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/index.test.ts` around lines 336 - 344, The test currently only asserts traceId shape; instead verify end-to-end propagation by asserting the queued trace equals the ingress event trace: after computing enqueuedTraceContext from deps.enqueuedJobs and ingressEvents from getObservabilityEvents(deps.observabilityEvents, 'ingress_start'), add an assertion that enqueuedTraceContext.traceId === ingressEvents[0].context.traceId (and keep the existing correlationId equality checks), so the test uses the actual values (enqueuedTraceContext, ingressEvents, correlationId) rather than only pattern-matching the traceId.
🧹 Nitpick comments (6)
packages/shared/src/telemetry.js (2)
48-56: SIGTERM handler calls `process.exit(0)` in the finally block unconditionally.

The `finally` block on lines 53-55 always calls `process?.exit?.(0)`, even if shutdown failed. Consider only exiting successfully if shutdown succeeds, or using different exit codes:

♻️ Differentiate exit codes based on shutdown result

```diff
 processRef?.on?.('SIGTERM', () => {
   sdk
     .shutdown()
-    .then(() => logger.info('Telemetry shut down successfully'))
-    .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
-    .finally(() => {
-      processRef?.exit?.(0);
-    });
+    .then(() => {
+      logger.info('Telemetry shut down successfully');
+      processRef?.exit?.(0);
+    })
+    .catch((error) => {
+      logger.error({ error }, 'Error shutting down telemetry');
+      processRef?.exit?.(1);
+    });
 });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/shared/src/telemetry.js` around lines 48 - 56, The SIGTERM handler currently calls processRef?.exit?.(0) unconditionally in the finally block after sdk.shutdown(), which exits with success even on shutdown failure; update the handler so that sdk.shutdown().then(...) calls processRef?.exit?.(0) on success and the .catch(...) calls processRef?.exit?.(1) (or omit exiting on failure if desired), and remove the unconditional exit from the .finally() block so exit codes reflect shutdown result; refer to processRef, sdk.shutdown(), and logger in your changes.
13-24: URL construction logic is overly complex and potentially confusing.

The metric exporter URL construction (lines 18-24) checks whether the endpoint ends with `/v1/traces` and replaces it, which is an unusual scenario. While the logic technically works for common cases, it's confusing to read. Consider simplifying to always construct paths from a base endpoint:

♻️ Simplified URL construction

```diff
+// Normalize base endpoint (strip any path suffix)
+const baseEndpoint = otlpEndpoint?.replace(/\/v1\/(traces|metrics)$/, '') || '';
+
 const traceExporter = otlpEndpoint
-  ? new OTLPTraceExporter({
-      url: otlpEndpoint.endsWith('/v1/traces') ? otlpEndpoint : `${otlpEndpoint}/v1/traces`,
-    })
+  ? new OTLPTraceExporter({ url: `${baseEndpoint}/v1/traces` })
   : new OTLPTraceExporter();

 const metricExporter = otlpEndpoint
-  ? new OTLPMetricExporter({
-      url: otlpEndpoint.endsWith('/v1/traces')
-        ? otlpEndpoint.replace('/v1/traces', '/v1/metrics')
-        : `${otlpEndpoint}/v1/metrics`,
-    })
+  ? new OTLPMetricExporter({ url: `${baseEndpoint}/v1/metrics` })
   : new OTLPMetricExporter();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/shared/src/telemetry.js` around lines 13 - 24, The metric/exporter URL logic is confusing by special-casing endpoints that end with "/v1/traces"; simplify by normalizing otlpEndpoint into a single base URL (trim any trailing slash and strip any trailing "/v1/traces" or "/v1/metrics" if present) and then build trace and metric URLs by appending "/v1/traces" and "/v1/metrics" respectively before passing to OTLPTraceExporter and OTLPMetricExporter (keep fallback to new OTLPTraceExporter()/new OTLPMetricExporter() when otlpEndpoint is falsy). Ensure you update the code around traceExporter, metricExporter, and the otlpEndpoint normalization so both exporters are constructed from a single, predictable base.

packages/shared/package.json (1)
19-20: Consider making `pino-pretty` a devDependency or optional.

`pino-pretty` is only used when `NODE_ENV !== 'production'`. In production builds, this dependency adds unnecessary weight. Consider either:

- Moving it to `devDependencies` and using dynamic `import()` with error handling
- Making it an optional peer dependency
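A minimal sketch of the dynamic-import approach (the helper name `loadOptional` is illustrative, not part of this PR; the logger setup would only enable the pretty transport when the optional module resolves):

```javascript
// Illustrative pattern for an optional dev-only dependency: try a dynamic
// import and fall back when the module is not installed (e.g. in production,
// where pino-pretty would live in devDependencies).
async function loadOptional(moduleName, fallback) {
  try {
    const mod = await import(moduleName);
    return mod.default ?? mod;
  } catch {
    // Module not installed; use the fallback (e.g. plain JSON pino output).
    return fallback;
  }
}
```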
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/shared/package.json` around lines 19 - 20, package.json currently lists "pino-pretty" as a regular dependency which inflates production bundles; move "pino-pretty" to devDependencies and update runtime initialization code that configures the logger (where you reference "pino" and configure pretty-printing) to use a dynamic import() for "pino-pretty" only when NODE_ENV !== 'production', wrapping the import in try/catch and falling back to plain pino if the import fails; alternatively document and/or configure "pino-pretty" as an optional peerDependency so consumers can opt-in—ensure the changes reference the "pino-pretty" entry in package.json and the logger setup code that currently imports or requires it.

packages/shared/src/logger.js (1)
1-20: Compiled JS file in `src/` directory is unusual.

This appears to be TypeScript compilation output (note the `sourceMappingURL` comment). Typically, compiled `.js` files belong in `dist/` while only source `.ts` files reside in `src/`. Having both in `src/` can cause confusion and may lead to the compiled files getting out of sync with the source.

Consider:

- Adding `*.js` and `*.js.map` to `.gitignore` for the `src/` directory
- Ensuring the build process outputs to `dist/` (which aligns with `package.json`'s `"main": "dist/index.js"`)
Verify each finding against the current code and only fix it if needed. In `@packages/shared/src/logger.js` around lines 1 - 20, The compiled JavaScript file (contains sourceMappingURL) was committed into src/ (logger.js with export logger and pino/pino-pretty usage); remove the compiled artifact from the source tree and ensure builds emit to dist/ instead: delete logger.js and logger.js.map from src/, remove them from git with git rm --cached if already tracked, add a rule to ignore src/*.js and src/*.js.map in .gitignore, and adjust the build/config (tsconfig/outDir or build script) so TypeScript/tsc outputs to dist/ and source .ts logger file remains in src/ (ensure exported symbol logger remains implemented in src/logger.ts).packages/llm/src/tools/index.ts (1)
packages/llm/src/tools/index.ts (1)

13-37: Route tool failures through the shared logger as well.

This path now emits metrics, but it still falls back to raw `console.error`. Using the shared logger here keeps tool failures structured and trace-correlated like the rest of the observability work in this PR.

Suggested refactor
```diff
-import { appMetrics } from '@wa-chat/shared';
+import { appMetrics, logger } from '@wa-chat/shared';
@@
   } catch (error) {
     const isTimeout =
       error instanceof Error &&
       (error.message.includes('timeout') || error.name === 'TimeoutError');
     appMetrics.toolExecutionCount.add(1, {
       tool: toolName,
       status: isTimeout ? 'timeout' : 'error',
     });
-    console.error(
-      `[Tool Error] ${toolName} failed:`,
-      error instanceof Error ? error.message : error,
-    );
+    logger.error(
+      { err: error, tool: toolName, status: isTimeout ? 'timeout' : 'error' },
+      'Tool execution failed',
+    );
     return `I'm currently unable to complete this action due to a technical issue. Please try again later or type 'escalate' to speak with a human agent.`;
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/llm/src/tools/index.ts` around lines 13 - 37, The catch block in withSafeFallback logs errors with console.error instead of the shared structured logger; update packages/llm/src/tools/index.ts to import the shared logger (e.g., appLogger) from '@wa-chat/shared' and replace the console.error call inside withSafeFallback with appLogger.error, passing a descriptive message, the error object (not just error.message), and context tags (tool: toolName, status: isTimeout ? 'timeout' : 'error') so failures are structured and trace-correlated while preserving the existing appMetrics calls.packages/llm/src/router/model-router.ts (1)
packages/llm/src/router/model-router.ts (1)

28-36: Extract the fallback metric wrapper once.

The same callback block is duplicated in both router builders. Pulling it into one shared config/helper will keep fallback instrumentation from drifting the next time this metric changes.
♻️ Small cleanup
```diff
+const fallbackMetricConfig = {
+  callbacks: [
+    {
+      handleLLMStart: () => {
+        appMetrics.agentProviderFallbackCount.add(1);
+      },
+    },
+  ],
+};
+
-const fallbackModel = createGeminiAdapter(config?.fallbackConfig).withConfig({
-  callbacks: [
-    {
-      handleLLMStart: () => {
-        appMetrics.agentProviderFallbackCount.add(1);
-      },
-    },
-  ],
-});
+const fallbackModel = createGeminiAdapter(config?.fallbackConfig).withConfig(
+  fallbackMetricConfig,
+);
 ...
-  .withConfig({
-    callbacks: [
-      {
-        handleLLMStart: () => {
-          appMetrics.agentProviderFallbackCount.add(1);
-        },
-      },
-    ],
-  });
+  .withConfig(fallbackMetricConfig);
```

Also applies to: 52-62
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/llm/src/router/model-router.ts` around lines 28 - 36, The duplicated fallback callback that increments appMetrics.agentProviderFallbackCount (currently embedded in the createGeminiAdapter(...).withConfig({ callbacks: [{ handleLLMStart: () => { appMetrics.agentProviderFallbackCount.add(1); } }] }) call used for the fallbackModel and the other router builder) should be extracted into a single shared helper/config (e.g., a getFallbackCallbacks or fallbackInstrumentationConfig) and reused by both router builders; locate the two createGeminiAdapter(...).withConfig(...) usages (the fallbackModel variable and the second builder where the same callbacks are repeated) and replace the inline callbacks with a reference to the shared helper to avoid duplication and drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/check-prometheus-rules.yml:
- Line 2: Remove the invalid top-level key "description" from the workflow (the
YAML root must only use allowed keys such as name, on, jobs, env, permissions,
defaults, concurrency, run-name); either delete the "description" entry entirely
or convert its text to a YAML comment at the top of the file so the workflow
validates.
In `@apps/api/src/index.js`:
- Around line 280-296: The middleware that builds metric labels (inside the
app.use(...) block that sets startTime and constructs attributes) currently
falls back to req.path when req.route is undefined, creating high-cardinality
time series; change the route label to use a bounded fallback (e.g.,
'unmatched') or a normalized route template instead of req.path by replacing the
expression req.route?.path || req.path with something like req.route?.path ||
'unmatched' (or retrieve and stash a normalized route template) so
appMetrics.apiRequestCount, appMetrics.apiLatency, and appMetrics.apiErrorCount
receive a stable route label.
- Around line 473-476: The traceContext currently uses a synthetic traceId from
buildIngressTraceContext; instead capture the active OpenTelemetry span and
propagate its full span context downstream instead of the synthetic ID. In
buildIngressTraceContext (and where traceContext is assigned to
ingressContext/traceContext) call trace.getSpan(context.active())?.spanContext()
and construct a W3C traceparent (or include the full spanContext object) and set
that value on ingressContext.traceContext (or a traceparent field) so the worker
receives the real span information; ensure
setupTelemetry/getNodeAutoInstrumentations remains enabled and use the
spanContext rather than randomBytes-generated IDs.
- Around line 1-3: The telemetry setup call setupTelemetry('wa-chat-api') must
run before any app module imports so auto-instrumentation can patch
Express/Redis/BullMQ; move this initialization into a tiny bootstrap entry point
(e.g., create a new bootstrap file that imports and calls
setupTelemetry('wa-chat-api') and then dynamically imports the existing
index.js/app), update your start/dev scripts to run the bootstrap (or use tsx
for dev) so the bootstrap executes first, and remove or replace the direct
setupTelemetry call from apps/api/src/index.js to avoid double-initialization.
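The span-context propagation suggested above for ingress jobs can be sketched as a small helper that serializes an OpenTelemetry-style span context (`{ traceId, spanId, traceFlags }`, the shape returned by `spanContext()`) into a W3C `traceparent` value; the function name is illustrative:

```javascript
// Serialize a span context into the W3C traceparent format:
// version "00" - 32-hex traceId - 16-hex spanId - 2-hex flags.
function toTraceparent(spanContext) {
  const flags = (spanContext.traceFlags ?? 0).toString(16).padStart(2, '0');
  return `00-${spanContext.traceId}-${spanContext.spanId}-${flags}`;
}
```

Putting this string on the enqueued payload lets the worker restore the real upstream span instead of a synthetic random trace ID.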
In `@apps/api/src/index.ts`:
- Around line 1-3: The module-level call to setupTelemetry('wa-chat-api') runs
before environment variables are loaded; move this call out of top-level
execution and into a bootstrap/initialization routine that runs after
dotenv.config() (or call it after the env-loading code), e.g., remove the direct
setupTelemetry(...) invocation from the top of apps/api/src/index.ts and instead
invoke setupTelemetry inside your existing bootstrap function or immediately
after dotenv.config() so telemetry initialization uses the proper env values;
reference the setupTelemetry symbol and the module-level invocation in index.ts
when making the change.
- Around line 402-418: Middleware is currently using req.route?.path or req.path
which either drops mounted prefixes or yields high-cardinality user-controlled
paths; change the label to a low-cardinality route identifier by building it
from the mounting prefix and route pattern (e.g., use req.baseUrl +
req.route?.path when req.route exists) and avoid falling back to raw req.path
for unmatched routes—use a stable sentinel like "unmatched_route" or
"unknown_route" for 404s; update the app.use middleware that records
appMetrics.apiRequestCount, appMetrics.apiLatency, and appMetrics.apiErrorCount
to compute this routeLabel and pass it in attributes instead of req.path.
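The bounded route label described above can be sketched as a pure helper over an Express-style request object (a sketch; `unmatched_route` is one of the sentinel names suggested in the comment):

```javascript
// Derive a bounded-cardinality metric label: prefer the matched route
// template (with its mount prefix) and fall back to a sentinel for
// unmatched requests, never the raw, user-controlled path.
function routeLabel(req) {
  return req.route?.path ? `${req.baseUrl || ''}${req.route.path}` : 'unmatched_route';
}
```

With this, a flood of 404 probes collapses into a single `unmatched_route` time series instead of one series per probed path.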
In `@apps/worker/src/index.ts`:
- Around line 486-514: The retry/failure/DLQ logs emitted via
logger.warn/logger.error (events 'worker.job.retried', 'worker.job.failed', and
'worker.dlq.inflow.alert') are missing trace correlation and eventKey; update
the payloads to merge the ingress payload's traceContext and eventKey into
baseLogData before logging (i.e., extend baseLogData with trace ids and eventKey
from the incoming traceContext/eventKey fields used earlier in this module),
then pass that enriched object to logger.warn/logger.error for the retry/failure
and the DLQ branch (routeResult.routedToDlq) so those logs include the trace
identifiers and eventKey for incident correlation.
- Around line 1-3: The telemetry SDK is being initialized too early
(setupTelemetry('wa-chat-worker') runs before dotenv.config()) and causes SDK
and SIGTERM handlers to register during module import (affecting tests); move
the setupTelemetry call to after dotenv.config() and ensure initialization
(including any SIGTERM handler registration) only runs on application startup by
either exporting a start function (e.g., startWorker/startApp) that calls
dotenv.config() then setupTelemetry, or by guarding top-level startup with a
runtime check (e.g., require.main === module) so imports (like in index.test.ts)
do not trigger SDK/handler registration. Use the symbols setupTelemetry and any
existing SIGTERM handler registration code to locate and relocate the
initialization logic.
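The log enrichment described for the retry/failure/DLQ branches can be sketched as a merge helper (field names mirror the ingress payload shape discussed above but are illustrative):

```javascript
// Merge trace identifiers and the event key from the ingress payload into
// the base log fields so retry/failure/DLQ logs are correlatable in incidents.
function enrichLogData(baseLogData, payload) {
  return {
    ...baseLogData,
    traceId: payload?.traceContext?.traceId,
    correlationId: payload?.traceContext?.correlationId,
    eventKey: payload?.eventKey,
  };
}
```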
In `@apps/worker/src/queue/consumer.ts`:
- Around line 136-151: The code is incrementing queueDlqCount/queueRetryCount
too early in runIngressJob() before routeFailedIngressJobToDlq() determines
final outcome, and it also misses counting permanent retries when
permanentMaxAttempts > 1; remove the early metric increments from the
failure-handling branch that checks classifiedError.errorClass and
currentAttempt, and instead increment appMetrics.queueDlqCount.add(...) or
appMetrics.queueRetryCount.add(...) inside routeFailedIngressJobToDlq() after
the DLQ enqueue result is known (or in the final retry decision path), ensuring
you use classifiedError.errorClass and currentAttempt vs.
policies.retry.transientMaxAttempts / permanentAttempts to decide DLQ vs retry
and still preserve throwing UnrecoverableError for permanent errors (as in the
UnrecoverableError handling).
In `@docker/prometheus.yml`:
- Around line 1-5: Add an evaluation_interval under the global Prometheus config
to match the fast scrape cadence: in the global block (where scrape_interval: 1s
is set) add evaluation_interval: 1s so alert rules (rule_files
/etc/prometheus/alert.rules.yml) are evaluated every second instead of the 60s
default.
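Under that suggestion the global block would look something like this (a sketch; the 1s cadence comes from the existing `scrape_interval`):

```yaml
global:
  scrape_interval: 1s
  evaluation_interval: 1s  # match scrape cadence; the default would be 1m

rule_files:
  - /etc/prometheus/alert.rules.yml
```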
In `@packages/shared/src/index.ts`:
- Around line 22-25: isIngressJobPayload currently ignores traceContext and
assertIngressJobPayload forwards any truthy object, allowing malformed
traceContext like { traceId: 123 }; update isIngressJobPayload to validate that
if payload.traceContext exists it is an object with both traceId and
correlationId as strings, and then update assertIngressJobPayload to call this
stricter check (rejecting payloads where traceContext is missing keys or types),
ensuring any place that forwards the payload (the code path that passes
traceContext along) only accepts payloads that pass the new validation.
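The stricter check can be sketched as follows (illustrative name; this version keeps `traceContext` optional when absent, which you would tighten if every payload must carry one):

```javascript
// Stricter guard: when traceContext is present it must be an object carrying
// string traceId and correlationId; malformed shapes like { traceId: 123 }
// are rejected.
function hasValidTraceContext(payload) {
  const tc = payload?.traceContext;
  if (tc === undefined) return true;
  return (
    typeof tc === 'object' &&
    tc !== null &&
    typeof tc.traceId === 'string' &&
    typeof tc.correlationId === 'string'
  );
}
```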
In `@packages/shared/src/telemetry.ts`:
- Around line 60-67: The SIGTERM handler in the shared telemetry helper (the
processRef?.on?('SIGTERM', ...) block that calls sdk.shutdown()) must not call
processRef?.exit?.(0) or otherwise force process exit; remove the forced exit
and avoid owning process lifecycle here. Modify the handler so it only calls
sdk.shutdown() (and logs success or error via logger) but does not exit the
process or swallow failures—allow the API/worker entrypoints to perform their
own shutdown sequencing (e.g., worker.close()/queue.close()) and then call
sdk.shutdown() or exit as appropriate.
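The entrypoint-owned shutdown sequencing described above can be sketched like this (all names are illustrative; the point is that the shared telemetry helper never calls `process.exit`, the entrypoint does, after closing its own resources):

```javascript
// Entrypoint-owned shutdown order: close app resources first, then flush
// telemetry, and only then decide the exit code.
async function shutdownSequence({ worker, queue, sdk }) {
  const order = [];
  try {
    await worker.close();
    order.push('worker');
    await queue.close();
    order.push('queue');
    await sdk.shutdown();
    order.push('telemetry');
    return { code: 0, order };
  } catch (error) {
    return { code: 1, order, error };
  }
}
```

The caller would then do `process.exit(result.code)`, keeping lifecycle ownership out of the shared helper.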
In `@scripts/simulate-alerts.ts`:
- Line 91: Fix the typo in the shutdown log message: replace the misspelled
string "Sutdown requested, stopping metrics server..." used in the console.log
call with "Shutdown requested, stopping metrics server..." so the shutdown
handler / log output (the console.log invocation in scripts/simulate-alerts.ts)
prints the correct word.
---
Outside diff comments:
In `@apps/api/src/index.test.ts`:
- Around line 336-344: The test currently only asserts traceId shape; instead
verify end-to-end propagation by asserting the queued trace equals the ingress
event trace: after computing enqueuedTraceContext from deps.enqueuedJobs and
ingressEvents from getObservabilityEvents(deps.observabilityEvents,
'ingress_start'), add an assertion that enqueuedTraceContext.traceId ===
ingressEvents[0].context.traceId (and keep the existing correlationId equality
checks), so the test uses the actual values (enqueuedTraceContext,
ingressEvents, correlationId) rather than only pattern-matching the traceId.
---
Nitpick comments:
In `@packages/llm/src/router/model-router.ts`:
- Around line 28-36: The duplicated fallback callback that increments
appMetrics.agentProviderFallbackCount (currently embedded in the
createGeminiAdapter(...).withConfig({ callbacks: [{ handleLLMStart: () => {
appMetrics.agentProviderFallbackCount.add(1); } }] }) call used for the
fallbackModel and the other router builder) should be extracted into a single
shared helper/config (e.g., a getFallbackCallbacks or
fallbackInstrumentationConfig) and reused by both router builders; locate the
two createGeminiAdapter(...).withConfig(...) usages (the fallbackModel variable
and the second builder where the same callbacks are repeated) and replace the
inline callbacks with a reference to the shared helper to avoid duplication and
drift.
In `@packages/llm/src/tools/index.ts`:
- Around line 13-37: The catch block in withSafeFallback logs errors with
console.error instead of the shared structured logger; update
packages/llm/src/tools/index.ts to import the shared logger (e.g., appLogger)
from '@wa-chat/shared' and replace the console.error call inside
withSafeFallback with appLogger.error, passing a descriptive message, the error
object (not just error.message), and context tags (tool: toolName, status:
isTimeout ? 'timeout' : 'error') so failures are structured and trace-correlated
while preserving the existing appMetrics calls.
In `@packages/shared/package.json`:
- Around line 19-20: package.json currently lists "pino-pretty" as a regular
dependency which inflates production bundles; move "pino-pretty" to
devDependencies and update runtime initialization code that configures the
logger (where you reference "pino" and configure pretty-printing) to use a
dynamic import() for "pino-pretty" only when NODE_ENV !== 'production', wrapping
the import in try/catch and falling back to plain pino if the import fails;
alternatively document and/or configure "pino-pretty" as an optional
peerDependency so consumers can opt-in—ensure the changes reference the
"pino-pretty" entry in package.json and the logger setup code that currently
imports or requires it.
In `@packages/shared/src/logger.js`:
- Around line 1-20: The compiled JavaScript file (contains sourceMappingURL) was
committed into src/ (logger.js with export logger and pino/pino-pretty usage);
remove the compiled artifact from the source tree and ensure builds emit to
dist/ instead: delete logger.js and logger.js.map from src/, remove them from
git with git rm --cached if already tracked, add a rule to ignore src/*.js and
src/*.js.map in .gitignore, and adjust the build/config (tsconfig/outDir or
build script) so TypeScript/tsc outputs to dist/ and source .ts logger file
remains in src/ (ensure exported symbol logger remains implemented in
src/logger.ts).
In `@packages/shared/src/telemetry.js`:
- Around line 48-56: The SIGTERM handler currently calls processRef?.exit?.(0)
unconditionally in the finally block after sdk.shutdown(), which exits with
success even on shutdown failure; update the handler so that
sdk.shutdown().then(...) calls processRef?.exit?.(0) on success and the
.catch(...) calls processRef?.exit?.(1) (or omit exiting on failure if desired),
and remove the unconditional exit from the .finally() block so exit codes
reflect shutdown result; refer to processRef, sdk.shutdown(), and logger in your
changes.
- Around line 13-24: The metric/exporter URL logic is confusing by
special-casing endpoints that end with "/v1/traces"; simplify by normalizing
otlpEndpoint into a single base URL (trim any trailing slash and strip any
trailing "/v1/traces" or "/v1/metrics" if present) and then build trace and
metric URLs by appending "/v1/traces" and "/v1/metrics" respectively before
passing to OTLPTraceExporter and OTLPMetricExporter (keep fallback to new
OTLPTraceExporter()/new OTLPMetricExporter() when otlpEndpoint is falsy). Ensure
you update the code around traceExporter, metricExporter, and the otlpEndpoint
normalization so both exporters are constructed from a single, predictable base.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1cda0560-5943-4e6d-afe1-b1496200c51f
⛔ Files ignored due to path filters (10)
- `apps/api/src/index.d.ts.map` is excluded by `!**/*.map`
- `apps/api/src/index.js.map` is excluded by `!**/*.map`
- `apps/api/src/index.test.js.map` is excluded by `!**/*.map`
- `package-lock.json` is excluded by `!**/package-lock.json`
- `packages/shared/src/index.d.ts.map` is excluded by `!**/*.map`
- `packages/shared/src/index.js.map` is excluded by `!**/*.map`
- `packages/shared/src/logger.d.ts.map` is excluded by `!**/*.map`
- `packages/shared/src/logger.js.map` is excluded by `!**/*.map`
- `packages/shared/src/telemetry.d.ts.map` is excluded by `!**/*.map`
- `packages/shared/src/telemetry.js.map` is excluded by `!**/*.map`
📒 Files selected for processing (35)
- `.github/workflows/check-prometheus-rules.yml`
- `apps/api/src/index.js`
- `apps/api/src/index.test.js`
- `apps/api/src/index.test.ts`
- `apps/api/src/index.ts`
- `apps/dashboard/lib/supabase/server.ts`
- `apps/dashboard/middleware.ts`
- `apps/worker/src/index.test.ts`
- `apps/worker/src/index.ts`
- `apps/worker/src/queue/consumer.ts`
- `docker-compose.local.yml`
- `docker/grafana/dashboards/k2-observability.json`
- `docker/grafana/provisioning/dashboards/dashboards.yml`
- `docker/grafana/provisioning/datasources/prometheus.yml`
- `docker/otel-collector-config.yml`
- `docker/prometheus.yml`
- `docker/prometheus/alert.rules.yml`
- `package.json`
- `packages/llm/src/langchain/chains/router.ts`
- `packages/llm/src/langchain/pipeline.ts`
- `packages/llm/src/router/model-router.ts`
- `packages/llm/src/tools/index.ts`
- `packages/shared/package.json`
- `packages/shared/src/index.d.ts`
- `packages/shared/src/index.js`
- `packages/shared/src/index.ts`
- `packages/shared/src/logger.d.ts`
- `packages/shared/src/logger.js`
- `packages/shared/src/logger.ts`
- `packages/shared/src/telemetry.d.ts`
- `packages/shared/src/telemetry.js`
- `packages/shared/src/telemetry.test.ts`
- `packages/shared/src/telemetry.ts`
- `packages/shared/test-resource.ts`
- `scripts/simulate-alerts.ts`
```diff
@@ -0,0 +1,21 @@
+name: K3 Alerting Rules Check
+description: Validates Prometheus alert rules
```
Invalid workflow key: description is not allowed at the workflow level.
GitHub Actions workflows do not support a description key at the top level. Valid keys are: name, on, jobs, env, permissions, defaults, concurrency, run-name. This will cause the workflow to fail validation.
🔧 Proposed fix: Remove the invalid key
```diff
 name: K3 Alerting Rules Check
-description: Validates Prometheus alert rules
 on: [push, pull_request]
```

If you want to retain the description, add it as a YAML comment:
```diff
+# Validates Prometheus alert rules
 name: K3 Alerting Rules Check
 on: [push, pull_request]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-description: Validates Prometheus alert rules
 name: K3 Alerting Rules Check
 on: [push, pull_request]
```
🧰 Tools
🪛 actionlint (1.7.11)
[error] 2-2: unexpected key "description" for "workflow" section. expected one of "concurrency", "defaults", "env", "jobs", "name", "on", "permissions", "run-name"
(syntax-check)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/check-prometheus-rules.yml at line 2, Remove the invalid
top-level key "description" from the workflow (the YAML root must only use
allowed keys such as name, on, jobs, env, permissions, defaults, concurrency,
run-name); either delete the "description" entry entirely or convert its text to
a YAML comment at the top of the file so the workflow validates.
```js
import { setupTelemetry, logger } from '@wa-chat/shared';
// Initialize telemetry before other imports if possible, or right at the top
setupTelemetry('wa-chat-api');
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Verify whether the API starts directly from apps/api/src/index.* without a bootstrap wrapper.
echo "=== package scripts ==="
fd '^package\.json$' . -x sh -c 'echo "\n## $1"; jq -r ".scripts // {}" "$1"' sh {}
echo "\n=== setupTelemetry call sites ==="
rg -n "setupTelemetry\(" apps packages
echo "\n=== API entrypoint candidates ==="
fd 'index\.(ts|js)$' apps/api
```

Repository: Ryanakml/ChatFuse
Length of output: 3386
🏁 Script executed:

```bash
#!/bin/bash
# Extract package.json scripts correctly
echo "=== Root package.json scripts ==="
cat package.json | jq -r '.scripts // {}' 2>/dev/null || echo "Failed to parse"
echo "\n=== Apps/api package.json scripts ==="
cat apps/api/package.json | jq -r '.scripts // {}' 2>/dev/null || echo "Failed to parse"
echo "\n=== Full index.js content ==="
wc -l apps/api/src/index.js
head -50 apps/api/src/index.js
echo "\n=== Full index.ts content ==="
wc -l apps/api/src/index.ts
head -50 apps/api/src/index.ts
echo "\n=== Check for bootstrap files ==="
fd 'bootstrap' apps/api/src
echo "\n=== Type field in package.json ==="
grep -A2 '"type"' package.json apps/api/package.json 2>/dev/null
```

Repository: Ryanakml/ChatFuse
Length of output: 4987
🏁 Script executed:

```bash
#!/bin/bash
echo "=== setupTelemetry implementation ==="
cat packages/shared/src/telemetry.ts | head -100
echo "\n=== Check for OpenTelemetry/instrumentation usage ==="
rg -n "instrumentation|NodeSDK" packages/shared/src/telemetry.ts | head -30
echo "\n=== Check for trace context reconstruction ==="
rg -n "traceContext|ingressContext" apps/api/src/index.ts | head -20
echo "\n=== Check worker queue processing ==="
rg -n "traceContext|ingressContext" apps/worker/src/index.ts | head -20
```

Repository: Ryanakml/ChatFuse
Length of output: 5075
🏁 Script executed:
#!/bin/bash
echo "=== Check for any entry point configuration ==="
cat apps/api/package.json | jq '.' | grep -A20 '"exports"\|"main"'
echo "\n=== Check for Node loader hooks or require hooks ==="
rg -n "register|loader|hook" apps/api/src
echo "\n=== Check if dist/index.js differs from src/index.js ==="
if [ -f dist/index.js ]; then
echo "dist/index.js exists"
head -10 dist/index.js
else
echo "dist/index.js does not exist"
fi
echo "\n=== Check actual worker app startup pattern ==="
cat apps/worker/src/index.ts | head -30
Repository: Ryanakml/ChatFuse
Length of output: 15608
Move telemetry initialization to a separate bootstrap entry point.
Because this is an ESM module, static imports are hoisted and evaluated before the module body executes. This means setupTelemetry('wa-chat-api') runs after Express, Redis, and BullMQ are already loaded, preventing auto-instrumentation from patching them. Create a tiny bootstrap entrypoint that initializes telemetry before importing the app, or use a Node.js loader to prepend initialization.
🧭 Bootstrap pattern
// apps/api/src/bootstrap.ts
import { setupTelemetry } from '@wa-chat/shared';
setupTelemetry('wa-chat-api');
await import('./index.js');

Then update package.json:
{
"start": "node --input-type=module --eval \"import('./dist/bootstrap.js')\""
}

Or with tsx for development:
{
"dev": "tsx src/bootstrap.ts"
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/index.js` around lines 1 - 3, The telemetry setup call
setupTelemetry('wa-chat-api') must run before any app module imports so
auto-instrumentation can patch Express/Redis/BullMQ; move this initialization
into a tiny bootstrap entry point (e.g., create a new bootstrap file that
imports and calls setupTelemetry('wa-chat-api') and then dynamically imports the
existing index.js/app), update your start/dev scripts to run the bootstrap (or
use tsx for dev) so the bootstrap executes first, and remove or replace the
direct setupTelemetry call from apps/api/src/index.js to avoid
double-initialization.
app.use((req, res, next) => {
  const startTime = Date.now();
  res.on('finish', () => {
    const durationMs = Date.now() - startTime;
    const attributes = {
      method: req.method,
      route: req.route?.path || req.path,
      status: String(res.statusCode),
    };
    appMetrics.apiRequestCount.add(1, attributes);
    appMetrics.apiLatency.record(durationMs, attributes);
    if (res.statusCode >= 500) {
      appMetrics.apiErrorCount.add(1, attributes);
    }
  });
  next();
});
Don't fall back to raw request paths in metric labels.
When req.route is unset, this emits concrete URLs as separate time series. That is exactly the sort of cardinality spike that hurts Prometheus when you need it most. Use a bounded fallback like 'unmatched', or explicitly stash a normalized route template.
📉 Suggested fix
app.use((req, res, next) => {
const startTime = Date.now();
res.on('finish', () => {
const durationMs = Date.now() - startTime;
+ const route = req.route?.path ?? 'unmatched';
const attributes = {
method: req.method,
- route: req.route?.path || req.path,
+ route,
status: String(res.statusCode),
};
appMetrics.apiRequestCount.add(1, attributes);
    appMetrics.apiLatency.record(durationMs, attributes);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/index.js` around lines 280 - 296, The middleware that builds
metric labels (inside the app.use(...) block that sets startTime and constructs
attributes) currently falls back to req.path when req.route is undefined,
creating high-cardinality time series; change the route label to use a bounded
fallback (e.g., 'unmatched') or a normalized route template instead of req.path
by replacing the expression req.route?.path || req.path with something like
req.route?.path || 'unmatched' (or retrieve and stash a normalized route
template) so appMetrics.apiRequestCount, appMetrics.apiLatency, and
appMetrics.apiErrorCount receive a stable route label.
traceContext: {
  traceId: ingressContext.traceId,
  correlationId: ingressContext.correlationId,
},
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether traceContext is used as actual distributed-tracing state anywhere downstream.
echo "=== trace propagation usage ==="
rg -n "traceContext|traceId|correlationId|traceparent|propagation|context\.active|getSpan|spanContext|startActiveSpan" apps packages
Repository: Ryanakml/ChatFuse
Length of output: 5182
🏁 Script executed:
#!/bin/bash
# Check the context around lines 473-476 in apps/api/src/index.js
echo "=== API code around lines 473-476 ==="
sed -n '420,490p' apps/api/src/index.js
echo ""
echo "=== Check ingressContext creation ==="
sed -n '400,435p' apps/api/src/index.js
echo ""
echo "=== Check imports at top of file ==="
head -40 apps/api/src/index.js
Repository: Ryanakml/ChatFuse
Length of output: 6261
🏁 Script executed:
#!/bin/bash
# Check how worker uses traceContext
echo "=== Worker code using traceContext ==="
sed -n '330,360p' apps/worker/src/index.ts
echo ""
echo "=== Check if OpenTelemetry is used in worker ==="
head -50 apps/worker/src/index.ts | grep -i "otel\|opentelemetry\|trace\|span"
Repository: Ryanakml/ChatFuse
Length of output: 1042
🏁 Script executed:
#!/bin/bash
# Look for how ingressContext.traceId is created - check for randomBytes or createTraceId
echo "=== Search for createTraceId function ==="
rg -n "createTraceId|randomBytes" apps/api/src/index.js -A 3
echo ""
echo "=== Check IngressContext interface ==="
rg -n "IngressContext|ingressContext" apps/api/src/index.js | head -20
Repository: Ryanakml/ChatFuse
Length of output: 1835
🏁 Script executed:
#!/bin/bash
# Check the attachIngressTraceContext middleware
echo "=== Looking for attachIngressTraceContext middleware ==="
rg -n "attachIngressTraceContext|const attachIngressTraceContext|function attachIngressTraceContext" apps/api/src/index.js -A 10
echo ""
echo "=== Check getIngressTraceContext function ==="
rg -n "getIngressTraceContext|const getIngressTraceContext" apps/api/src/index.js -B 2 -A 8
Repository: Ryanakml/ChatFuse
Length of output: 3325
🏁 Script executed:
#!/bin/bash
# Check setupTelemetry in the shared package
echo "=== Check setupTelemetry implementation ==="
rg -n "setupTelemetry|export.*setupTelemetry" packages/shared/src/ -A 5
echo ""
echo "=== Check if there's any OpenTelemetry configuration ==="
rg -n "@opentelemetry|otel|OpenTelemetry" packages/shared/src/ -l
Repository: Ryanakml/ChatFuse
Length of output: 2931
🏁 Script executed:
#!/bin/bash
# Check full setupTelemetry implementation
echo "=== Full setupTelemetry implementation ==="
cat packages/shared/src/telemetry.ts
Repository: Ryanakml/ChatFuse
Length of output: 4277
🏁 Script executed:
#!/bin/bash
# Check buildIngressTraceContext function
echo "=== buildIngressTraceContext function ==="
rg -n "buildIngressTraceContext|const buildIngressTraceContext" apps/api/src/index.js -B 2 -A 15
Repository: Ryanakml/ChatFuse
Length of output: 1995
Propagate the active OpenTelemetry span context, not a synthetic trace ID.
The code generates a new trace ID with randomBytes at line 302 of buildIngressTraceContext, which creates a trace-shaped string unrelated to the actual HTTP request span. Since setupTelemetry enables getNodeAutoInstrumentations(), the webhook POST request already has an active span from automatic HTTP instrumentation. However, this span context is never captured. The worker receives only the synthetic ID and cannot reconstruct or join the distributed trace.
Pass the active span context downstream so the worker can continue the trace. Capture it with trace.getSpan(context.active())?.spanContext() and pass the full span context (not just traceId), ideally as traceparent or equivalent W3C format for proper trace propagation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/index.js` around lines 473 - 476, The traceContext currently
uses a synthetic traceId from buildIngressTraceContext; instead capture the
active OpenTelemetry span and propagate its full span context downstream instead
of the synthetic ID. In buildIngressTraceContext (and where traceContext is
assigned to ingressContext/traceContext) call
trace.getSpan(context.active())?.spanContext() and construct a W3C traceparent
(or include the full spanContext object) and set that value on
ingressContext.traceContext (or a traceparent field) so the worker receives the
real span information; ensure setupTelemetry/getNodeAutoInstrumentations remains
enabled and use the spanContext rather than randomBytes-generated IDs.
import { setupTelemetry, logger } from '@wa-chat/shared';
// Initialize telemetry before other imports if possible, or right at the top
setupTelemetry('wa-chat-api');
🧩 Analysis chain
🏁 Script executed:
# First, find the main entry point and understand the project structure
git ls-files apps/api/src/ | head -20
Repository: Ryanakml/ChatFuse
Length of output: 588
🏁 Script executed:
# Look at the specific file mentioned in the review
cat -n apps/api/src/index.ts
Repository: Ryanakml/ChatFuse
Length of output: 27383
🏁 Script executed:
# Search for dotenv.config() calls to understand when env is loaded
rg "dotenv\.config|config()" apps/api/src/ -A 2 -B 2
Repository: Ryanakml/ChatFuse
Length of output: 2432
🏁 Script executed:
# Check if there are other entry points or bootstrap files
fd -e ts -e js apps/api/src/ | grep -E "(main|bootstrap|entry|server)" | head -10
Repository: Ryanakml/ChatFuse
Length of output: 418
🏁 Script executed:
# Search for where apps/api/src/index.ts is imported from
rg "from.*['\"].*apps/api/src['\"]|from.*['\"]\.\./" apps/ --type ts --type js | head -20
Repository: Ryanakml/ChatFuse
Length of output: 987
Move telemetry initialization behind env loading or into a function.
Module-level setupTelemetry() at line 3 executes before dotenv.config() at line 732. This means any code importing apps/api/src/index.ts starts the SDK before environment variables are loaded—creating test pollution and potential misconfiguration in consuming processes. Move this call into a bootstrap function or after dotenv.config().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/api/src/index.ts` around lines 1 - 3, The module-level call to
setupTelemetry('wa-chat-api') runs before environment variables are loaded; move
this call out of top-level execution and into a bootstrap/initialization routine
that runs after dotenv.config() (or call it after the env-loading code), e.g.,
remove the direct setupTelemetry(...) invocation from the top of
apps/api/src/index.ts and instead invoke setupTelemetry inside your existing
bootstrap function or immediately after dotenv.config() so telemetry
initialization uses the proper env values; reference the setupTelemetry symbol
and the module-level invocation in index.ts when making the change.
if (classifiedError.errorClass === 'permanent' && currentAttempt >= permanentAttempts) {
  appMetrics.queueDlqCount.add(1, { error_class: 'permanent' });
  if (classifiedError.error instanceof UnrecoverableError) {
    throw classifiedError.error;
  }

  throw new UnrecoverableError(classifiedError.error.message);
}

if (classifiedError.errorClass === 'transient') {
  if (currentAttempt >= policies.retry.transientMaxAttempts) {
    appMetrics.queueDlqCount.add(1, { error_class: 'transient' });
  } else {
    appMetrics.queueRetryCount.add(1);
  }
}
Emit retry/DLQ metrics only after the final outcome is known.
runIngressJob() increments these counters before routeFailedIngressJobToDlq() knows whether the job will actually retry and before the DLQ enqueue succeeds. That makes queueDlqCount over-report failed DLQ routes, and it also misses permanent retries entirely when permanentMaxAttempts > 1.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/worker/src/queue/consumer.ts` around lines 136 - 151, The code is
incrementing queueDlqCount/queueRetryCount too early in runIngressJob() before
routeFailedIngressJobToDlq() determines final outcome, and it also misses
counting permanent retries when permanentMaxAttempts > 1; remove the early
metric increments from the failure-handling branch that checks
classifiedError.errorClass and currentAttempt, and instead increment
appMetrics.queueDlqCount.add(...) or appMetrics.queueRetryCount.add(...) inside
routeFailedIngressJobToDlq() after the DLQ enqueue result is known (or in the
final retry decision path), ensuring you use classifiedError.errorClass and
currentAttempt vs. policies.retry.transientMaxAttempts / permanentAttempts to
decide DLQ vs retry and still preserve throwing UnrecoverableError for permanent
errors (as in the UnrecoverableError handling).
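One way to keep the counters honest is to separate the decision from the emission: compute the final outcome first, then emit the metric only after the corresponding side effect (DLQ enqueue or retry scheduling) succeeds. A sketch under assumed names (`decideFailureOutcome` and the outcome strings are illustrative, not this project's API):

```typescript
type ErrorClass = 'permanent' | 'transient';
type Outcome = 'retry' | 'dlq';

// Pure decision: no metrics are touched here, so the caller can wait for
// the DLQ enqueue (or retry scheduling) to actually succeed before
// incrementing queueDlqCount / queueRetryCount.
function decideFailureOutcome(
  errorClass: ErrorClass,
  attempt: number,
  maxAttempts: { permanent: number; transient: number },
): Outcome {
  const limit = errorClass === 'permanent' ? maxAttempts.permanent : maxAttempts.transient;
  return attempt >= limit ? 'dlq' : 'retry';
}

const limits = { permanent: 1, transient: 5 };
console.log(decideFailureOutcome('permanent', 1, limits)); // 'dlq'
console.log(decideFailureOutcome('transient', 3, limits)); // 'retry'
```

Note that with this shape a permanent error below its attempt limit counts as a retry, which also covers the missed permanent-retry case when permanentMaxAttempts > 1.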
docker/prometheus.yml
Outdated
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast

rule_files:
  - /etc/prometheus/alert.rules.yml
🧩 Analysis chain
🌐 Web query:
In Prometheus, if global.evaluation_interval is omitted, what default is used, and does changing scrape_interval also change alert/rule evaluation frequency?
💡 Result:
- Default when global.evaluation_interval is omitted: 1m (1 minute). (prometheus.io)
- Does changing scrape_interval change alert/rule evaluation frequency? No. Rule/alert evaluation runs on global.evaluation_interval (or a rule group's own interval override). Changing scrape_interval only changes how often targets are scraped, not how often rules are evaluated. (prometheus.io)
Citations:
- 1: https://prometheus.io/docs/prometheus/3.4/configuration/configuration/?utm_source=openai
- 2: https://prometheus.io/docs/prometheus/3.4/configuration/configuration/?utm_source=openai
Add evaluation_interval to match the fast scrape cadence.
With scrape_interval: 1s but no evaluation_interval configured, Prometheus will scrape metrics every second but evaluate alert rules every 60 seconds (the default). This mismatch means incident alerts can fire much later than expected, defeating the purpose of fast scraping in simulation.
Suggested fix
global:
scrape_interval: 1s # Set small scrape interval since we're simulating fast
+ evaluation_interval: 1s
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
global:
  scrape_interval: 1s # Set small scrape interval since we're simulating fast
  evaluation_interval: 1s

rule_files:
  - /etc/prometheus/alert.rules.yml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/prometheus.yml` around lines 1 - 5, Add an evaluation_interval under
the global Prometheus config to match the fast scrape cadence: in the global
block (where scrape_interval: 1s is set) add evaluation_interval: 1s so alert
rules (rule_files /etc/prometheus/alert.rules.yml) are evaluated every second
instead of the 60s default.
traceContext?: {
  traceId: string;
  correlationId: string;
};
Validate traceContext at runtime as well.
IngressJobPayload now exposes traceContext as two required strings, but Line 188 forwards any truthy object and isIngressJobPayload() still ignores the field entirely. That means malformed payloads like { traceContext: { traceId: 123 } } can be created and then pass assertIngressJobPayload(), which defeats the correlation guarantees this PR is adding.
Suggested fix
+const isTraceContext = (
+ value: unknown,
+): value is { traceId: string; correlationId: string } =>
+ isRecord(value) &&
+ isNonEmptyString(value.traceId) &&
+ isNonEmptyString(value.correlationId);
+
export const isIngressJobPayload = (value: unknown): value is IngressJobPayload => {
if (!isRecord(value)) {
return false;
}
@@
if (typeof value.receivedAt !== 'string' || Number.isNaN(Date.parse(value.receivedAt))) {
return false;
}
+
+ if (!(value.traceContext === undefined || isTraceContext(value.traceContext))) {
+ return false;
+ }
return isJsonValue(value.payload);
};
@@
export const createIngressJobPayload = (input: {
eventKey: string;
payload: JsonValue;
receivedAt?: string;
traceContext?: { traceId: string; correlationId: string };
}): IngressJobPayload => {
@@
if (Number.isNaN(Date.parse(receivedAt))) {
throw new Error('Ingress job receivedAt must be a valid ISO timestamp');
}
+
+ if (input.traceContext && !isTraceContext(input.traceContext)) {
+ throw new Error(
+ 'Ingress job traceContext must contain non-empty traceId and correlationId',
+ );
+ }
return {
schemaVersion: INGRESS_JOB_SCHEMA_VERSION,
eventKey,
    payload: input.payload,

Also applies to: 167-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/shared/src/index.ts` around lines 22 - 25, isIngressJobPayload
currently ignores traceContext and assertIngressJobPayload forwards any truthy
object, allowing malformed traceContext like { traceId: 123 }; update
isIngressJobPayload to validate that if payload.traceContext exists it is an
object with both traceId and correlationId as strings, and then update
assertIngressJobPayload to call this stricter check (rejecting payloads where
traceContext is missing keys or types), ensuring any place that forwards the
payload (the code path that passes traceContext along) only accepts payloads
that pass the new validation.
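A self-contained version of the guard sketched above, for illustration only (a real change would reuse the package's existing `isRecord`/`isNonEmptyString` helpers rather than inlining the checks):

```typescript
interface TraceContext {
  traceId: string;
  correlationId: string;
}

// Runtime guard matching the IngressJobPayload.traceContext type: both
// fields must be present, strings, and non-empty.
function isTraceContext(value: unknown): value is TraceContext {
  if (typeof value !== 'object' || value === null) return false;
  const rec = value as Record<string, unknown>;
  return (
    typeof rec.traceId === 'string' && rec.traceId.length > 0 &&
    typeof rec.correlationId === 'string' && rec.correlationId.length > 0
  );
}

console.log(isTraceContext({ traceId: 'abc', correlationId: 'req-1' })); // true
console.log(isTraceContext({ traceId: 123 }));                           // false
```

With this guard, `{ traceContext: { traceId: 123 } }` is rejected at the same place the rest of the payload schema is validated, instead of slipping through assertIngressJobPayload().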
processRef?.on?.('SIGTERM', () => {
  sdk
    .shutdown()
    .then(() => logger.info('Telemetry shut down successfully'))
    .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
    .finally(() => {
      processRef?.exit?.(0);
    });
Don't own process exit from the shared telemetry helper.
This SIGTERM handler always calls process.exit(0), even when sdk.shutdown() fails. In the worker, that races the existing shutdown hook and can cut off worker.close()/queue.close() while still reporting a clean exit.
✂️ Suggested direction
- processRef?.on?.('SIGTERM', () => {
- sdk
- .shutdown()
- .then(() => logger.info('Telemetry shut down successfully'))
- .catch((error) => logger.error({ error }, 'Error shutting down telemetry'))
- .finally(() => {
- processRef?.exit?.(0);
- });
- });
-
  return sdk;

Let the API/worker entrypoints call sdk.shutdown() from their own signal handlers after their service-specific cleanup completes.
📝 Committable suggestion
return sdk;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/shared/src/telemetry.ts` around lines 60 - 67, The SIGTERM handler
in the shared telemetry helper (the processRef?.on?('SIGTERM', ...) block that
calls sdk.shutdown()) must not call processRef?.exit?.(0) or otherwise force
process exit; remove the forced exit and avoid owning process lifecycle here.
Modify the handler so it only calls sdk.shutdown() (and logs success or error
via logger) but does not exit the process or swallow failures—allow the
API/worker entrypoints to perform their own shutdown sequencing (e.g.,
worker.close()/queue.close()) and then call sdk.shutdown() or exit as
appropriate.
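The sequencing the comment asks for can be sketched as an entrypoint-owned handler. This is a standalone illustration: `closeWorker`/`closeQueue` are placeholders for the service's own cleanup, and `sdk` is a stand-in for the NodeSDK instance returned by setupTelemetry:

```typescript
// Entrypoint-owned shutdown: the shared helper only exposes sdk.shutdown()
// and never exits the process; sequencing lives in the service entrypoint.
const order: string[] = [];

async function closeWorker() { order.push('worker'); }   // placeholder cleanup
async function closeQueue() { order.push('queue'); }     // placeholder cleanup
const sdk = { shutdown: async () => { order.push('telemetry'); } }; // NodeSDK stand-in

async function onSigterm(): Promise<number> {
  try {
    await closeWorker();   // stop taking jobs first
    await closeQueue();    // then flush/close queue connections
    await sdk.shutdown();  // export remaining spans/metrics last
    return 0;
  } catch {
    return 1;              // report failure instead of a blanket exit(0)
  }
}

onSigterm().then((code) => console.log(order.join(' -> '), 'exit', code));
// worker -> queue -> telemetry exit 0
```

Because the exit code depends on the whole sequence, a failed telemetry flush no longer masquerades as a clean shutdown, and worker.close()/queue.close() can never be cut off by an early process.exit(0).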
simulate();

process.on('SIGINT', () => {
  console.log('\nSutdown requested, stopping metrics server...');
Typo: "Sutdown" should be "Shutdown".
🔧 Fix typo
process.on('SIGINT', () => {
- console.log('\nSutdown requested, stopping metrics server...');
+ console.log('\nShutdown requested, stopping metrics server...');
server.close(() => process.exit(0));
});
📝 Committable suggestion
process.on('SIGINT', () => {
  console.log('\nShutdown requested, stopping metrics server...');
  server.close(() => process.exit(0));
});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/simulate-alerts.ts` at line 91, Fix the typo in the shutdown log
message: replace the misspelled string "Sutdown requested, stopping metrics
server..." used in the console.log call with "Shutdown requested, stopping
metrics server..." so the shutdown handler / log output (the console.log
invocation in scripts/simulate-alerts.ts) prints the correct word.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/check-prometheus-rules.yml (1)
2-2: ⚠️ Potential issue | 🔴 Critical
Remove the unsupported workflow-level description key.
Line 2 is still invalid GitHub Actions syntax, so the workflow will fail validation before this job can run. This was already flagged on an earlier revision and still applies here.
🔧 Proposed fix
 name: K3 Alerting Rules Check
-description: Validates Prometheus alert rules
+# Validates Prometheus alert rules
 on: [push, pull_request]
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import sys
from pathlib import Path
try:
    import yaml
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "PyYAML"])
    import yaml
path = Path(".github/workflows/check-prometheus-rules.yml")
data = yaml.safe_load(path.read_text())
allowed = {"name", "on", "jobs", "env", "permissions", "defaults", "concurrency", "run-name"}
keys = list(data.keys())
invalid = [k for k in keys if k not in allowed]
print("top-level keys:", keys)
print("invalid keys:", invalid)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/check-prometheus-rules.yml at line 2, Remove the unsupported top-level workflow key "description" from the GitHub Actions YAML (the workflow-level "description" present in check-prometheus-rules.yml); locate the "description:" entry at the top of the file and delete it (or move its text into the "name:" field or into a job-level step/comment) so the workflow only contains valid top-level keys like name, on, jobs, env, permissions, defaults, concurrency, and run-name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/check-prometheus-rules.yml:
- Around line 14-16: Download the published sha256sums file for Prometheus
alongside the prometheus-2.51.1.linux-amd64.tar.gz and verify the archive before
extracting and moving promtool; specifically, fetch sha256sums.txt from the same
release, run a checksum verification (e.g., use sha256sum -c or grep the
prometheus-2.51.1.linux-amd64.tar.gz line and pipe to sha256sum -c) and abort
the job if verification fails, then only tar xvf
prometheus-2.51.1.linux-amd64.tar.gz and sudo mv
prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/ when the checksum check
succeeds.
---
Duplicate comments:
In @.github/workflows/check-prometheus-rules.yml:
- Line 2: Remove the unsupported top-level workflow key "description" from the
GitHub Actions YAML (the workflow-level "description" present in
check-prometheus-rules.yml); locate the "description:" entry at the top of the
file and delete it (or move its text into the "name:" field or into a job-level
step/comment) so the workflow only contains valid top-level keys like name, on,
jobs, env, permissions, defaults, concurrency, and run-name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a3d953e7-b361-48d1-8674-e952374eacf5
📒 Files selected for processing (1)
.github/workflows/check-prometheus-rules.yml
wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz
tar xvf prometheus-2.51.1.linux-amd64.tar.gz
sudo mv prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import json
import urllib.request
url = "https://api.github.com/repos/prometheus/prometheus/releases/tags/v2.51.1"
with urllib.request.urlopen(url) as r:
data = json.load(r)
assets = sorted(asset["name"] for asset in data.get("assets", []))
print("\n".join(assets))
print("\nsha256sums.txt present:", "sha256sums.txt" in assets)
PY
Repository: Ryanakml/ChatFuse
Length of output: 2642
🏁 Script executed:
# Check if the workflow file exists and examine current state
git ls-files .github/workflows/ | head -20
cat .github/workflows/check-prometheus-rules.yml 2>/dev/null || echo "File not found"
Repository: Ryanakml/ChatFuse
Length of output: 858
🌐 Web query:
Prometheus v2.51.1 release assets sha256sums GitHub
💡 Result:
For the Prometheus v2.51.1 GitHub release (dated 2024-03-27), the SHA256 checksums are published as release assets alongside the binaries. [1]
Release assets (direct download paths on GitHub Releases):
https://github.com/prometheus/prometheus/releases/download/v2.51.1/sha256sums.txt
https://github.com/prometheus/prometheus/releases/download/v2.51.1/sha256sums.txt.asc
The binaries themselves are under the same releases/download/v2.51.1/ path (e.g., prometheus-2.51.1.linux-amd64.tar.gz). [1][2]
Sources: [1] [2]
Add integrity verification for Prometheus archive before installation.
The workflow downloads an executable from the network and moves it to PATH without any integrity check. The Prometheus v2.51.1 release includes sha256sums.txt as a published asset—use it to verify the archive before extraction.
🔒 Proposed hardening
- name: Install Promtool
run: |
- wget https://github.com/prometheus/prometheus/releases/download/v2.51.1/prometheus-2.51.1.linux-amd64.tar.gz
- tar xvf prometheus-2.51.1.linux-amd64.tar.gz
- sudo mv prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/
+ set -euo pipefail
+ VERSION=2.51.1
+ ARCHIVE="prometheus-${VERSION}.linux-amd64.tar.gz"
+ DIR="prometheus-${VERSION}.linux-amd64"
+ wget "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/${ARCHIVE}"
+ wget "https://github.com/prometheus/prometheus/releases/download/v${VERSION}/sha256sums.txt"
+ grep " ${ARCHIVE}\$" sha256sums.txt | sha256sum -c -
+ tar xvf "${ARCHIVE}"
+ sudo mv "${DIR}/promtool" /usr/local/bin/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/check-prometheus-rules.yml around lines 14 - 16, Download
the published sha256sums file for Prometheus alongside the
prometheus-2.51.1.linux-amd64.tar.gz and verify the archive before extracting
and moving promtool; specifically, fetch sha256sums.txt from the same release,
run a checksum verification (e.g., use sha256sum -c or grep the
prometheus-2.51.1.linux-amd64.tar.gz line and pipe to sha256sum -c) and abort
the job if verification fails, then only tar xvf
prometheus-2.51.1.linux-amd64.tar.gz and sudo mv
prometheus-2.51.1.linux-amd64/promtool /usr/local/bin/ when the checksum check
succeeds.
Summary by CodeRabbit
New Features
Infrastructure
Alerts
Tests
Chores