Follow-up from PR #88 — observability blind spot in the metrics publisher path.
Functional description
The ApprovalMetricsPublisher Lambda emits CloudWatch metrics via the EMF (Embedded Metric Format) pattern: instead of calling PutMetricData directly, it logs structured JSON that CloudWatch Logs auto-extracts into metrics. EMF has its own per-account throttle ceiling — 100 EMF metric writes per second per account. Cross that ceiling and CloudWatch silently drops metrics with no error visible to the Lambda.
The publisher already self-rate-limits and emits an internal MetricEmitSkipped count when it self-limits, but there's no alarm on MetricEmitSkipped > 0. So if approvals scale to where the publisher hits the ceiling, the dashboard quietly underreports without anyone noticing.
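For readers unfamiliar with EMF, the following is a minimal sketch of what one such structured log line might look like. The `_aws` envelope follows the published Embedded Metric Format layout, and the namespace and `ApprovalRequestCount` come from this issue. The `buildEmfRecord` helper and the `Service` dimension are hypothetical, not taken from the publisher's code.

```typescript
// Shape of a single EMF log record: an `_aws` metadata envelope plus
// top-level keys holding the metric values and dimension values.
interface EmfRecord {
  _aws: {
    Timestamp: number; // epoch milliseconds
    CloudWatchMetrics: Array<{
      Namespace: string;
      Dimensions: string[][];
      Metrics: Array<{ Name: string; Unit: string }>;
    }>;
  };
  [key: string]: unknown;
}

// Hypothetical helper: builds one EMF record for a single count metric.
function buildEmfRecord(
  metricName: string,
  value: number,
  timestampMs: number,
): EmfRecord {
  return {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: 'ABCA/Cedar-HITL',
          Dimensions: [['Service']], // assumed dimension set
          Metrics: [{ Name: metricName, Unit: 'Count' }],
        },
      ],
    },
    Service: 'ApprovalMetricsPublisher', // assumed dimension value
    [metricName]: value,
  };
}

// Writing this JSON string to stdout is what lets CloudWatch Logs extract the metric.
const emfLine = JSON.stringify(
  buildEmfRecord('ApprovalRequestCount', 3, 1700000000000),
);
console.log(emfLine);
```

Because the metric travels as a log line, a dropped or skipped write produces no exception in the Lambda, which is why the skip counter is the only available signal.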
This pairs with issue #4 (DLQ alarms): both are about "the metrics path is broken but the dashboard still shows old data." Filing as separate issues since the threshold-tuning conversation will be different (DLQ count = 1 is the right alarm; EMF skipped count needs more thought — burst spikes can be normal).
User-visible impact:
- Approval-volume metrics under-report at high load. Operators see lower ApprovalRequestCount than reality.
- No signal to "scale up the Lambda's batch size" or "adjust EMF emission cadence."
- Capacity-planning conversations rely on CloudWatch numbers that are silently lower than actual.
Technical context
Where the metric is already emitted:
cdk/src/handlers/approval-metrics-publisher.ts — search for MetricEmitSkipped (or similar; the exact name needs verification). The publisher's self-rate-limit logic increments this when it skips an EMF write to stay under the ceiling.
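The self-rate-limit behaviour described above might look roughly like the sketch below. This is an illustration only, not the publisher's actual code: the `EmfWriteLimiter` class, its method names, and the fixed one-second window are all assumptions. The limit of 100 is the per-second ceiling quoted in this issue.

```typescript
// Hypothetical fixed-window limiter: allows up to `maxWritesPerSecond`
// EMF writes per one-second window and counts every write it refuses.
class EmfWriteLimiter {
  private readonly maxWritesPerSecond: number;
  private windowStartMs = Number.NEGATIVE_INFINITY;
  private writesInWindow = 0;
  private skipped = 0;

  constructor(maxWritesPerSecond: number) {
    this.maxWritesPerSecond = maxWritesPerSecond;
  }

  // Returns true if an EMF line may be written now; otherwise records a skip.
  tryAcquire(nowMs: number): boolean {
    if (nowMs - this.windowStartMs >= 1000) {
      // A new one-second window starts; reset the write budget.
      this.windowStartMs = nowMs;
      this.writesInWindow = 0;
    }
    if (this.writesInWindow < this.maxWritesPerSecond) {
      this.writesInWindow += 1;
      return true;
    }
    this.skipped += 1;
    return false;
  }

  // Returns the skips accumulated since the last call and resets the counter.
  // The returned value is what would be emitted as MetricEmitSkipped.
  drainSkipped(): number {
    const count = this.skipped;
    this.skipped = 0;
    return count;
  }
}

// 120 attempted writes inside one second: 100 pass, 20 are skipped.
const limiter = new EmfWriteLimiter(100);
let allowed = 0;
for (let i = 0; i < 120; i += 1) {
  if (limiter.tryAcquire(5000)) allowed += 1;
}
const skippedThisFlush = limiter.drainSkipped();
console.log({ allowed, skippedThisFlush });
```

Whatever the real implementation looks like, the point for this issue is the same: the skip count exists, and the alarm only needs to read it.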
What's missing:
- cloudwatch.Alarm on MetricEmitSkipped > <threshold> over a sensible window.
- Threshold tuning: a low threshold (1 over 5 min) will fire on legitimate burst spikes; a high threshold (>100 over 1 hour) misses sustained problems. Recommend starting at "10 over a 15-minute window" as a conservative baseline; tune after observing real traffic.
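To make the threshold trade-off concrete, here is a small sketch that replays a made-up hour of per-minute skip counts against the three candidate settings mentioned above. The traffic data is invented for illustration, and fixed non-overlapping windows are a simplification of how CloudWatch evaluates alarm periods.

```typescript
// Sums per-minute skip counts into fixed windows and returns the indexes
// of the windows whose sum meets or exceeds the threshold.
function breachingWindows(
  perMinuteSkips: number[],
  windowMinutes: number,
  threshold: number,
): number[] {
  const breaches: number[] = [];
  for (let start = 0; start < perMinuteSkips.length; start += windowMinutes) {
    const sum = perMinuteSkips
      .slice(start, start + windowMinutes)
      .reduce((acc, n) => acc + n, 0);
    if (sum >= threshold) breaches.push(start / windowMinutes);
  }
  return breaches;
}

// Invented traffic: one short burst (3 skips at minute 2), then a sustained
// problem of 1 skip per minute from minute 15 to the end of the hour.
const skipsPerMinute: number[] = new Array<number>(60).fill(0);
skipsPerMinute[2] = 3;
for (let minute = 15; minute < 60; minute += 1) skipsPerMinute[minute] = 1;

const lowThreshold = breachingWindows(skipsPerMinute, 5, 1); // "1 over 5 min"
const baseline = breachingWindows(skipsPerMinute, 15, 10); // "10 over 15 min"
const highThreshold = breachingWindows(skipsPerMinute, 60, 101); // ">100 over 1 hour"
console.log({ lowThreshold, baseline, highThreshold });
```

With this sample, the low threshold fires on the isolated burst as well as the sustained problem, the one-hour threshold never fires, and the 15-minute baseline fires only for the windows covering the sustained problem.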
Why EMF instead of PutMetricData:
- With batching, EMF is roughly 100x cheaper than PutMetricData for the same observability.
- The trade-off is the global per-account ceiling. PutMetricData has its own throttle but it's per-region per-account at a much higher number.
- The publisher's choice of EMF is the right one; this issue isn't asking to change that.
Proposed fix
Add the alarm in cdk/src/constructs/approval-metrics-publisher-consumer.ts:
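import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Metric name is assumed; verify it against approval-metrics-publisher.ts first.
new cloudwatch.Alarm(this, 'MetricEmitSkippedAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'ABCA/Cedar-HITL',
    metricName: 'MetricEmitSkipped',
    period: Duration.minutes(15),
    statistic: 'Sum',
  }),
  // Baseline of 10 skips per 15-minute window; tune after observing real traffic.
  threshold: 10,
  evaluationPeriods: 1,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  alarmDescription: 'ApprovalMetricsPublisher hit the EMF rate ceiling and dropped metrics; consider increasing batch size or reducing per-event metric volume',
  // No datapoints means nothing was skipped, so missing data is treated as healthy.
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});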
Acceptance criteria
- ... MetricEmitSkipped (or whatever the existing skipped-count metric is named; verify the name in approval-metrics-publisher.ts first)
- ... alarmDescription links to it
Out of scope
- Refactoring the publisher to use PutMetricData (different cost/perf profile, separate decision).
- Adaptive batch-size tuning (publisher could observe its own skip rate and adjust; that's a separate enhancement).
- Multi-region failover for metric publishing (out of scope for ABCA today).
References
- cdk/src/handlers/approval-metrics-publisher.ts
- cdk/src/constructs/approval-metrics-publisher-consumer.ts