Add CloudWatch alarm for ApprovalMetricsPublisher EMF rate ceiling #114

@scoropeza

Follow-up from PR #88 — observability blind spot in the metrics publisher path.

Functional description

The ApprovalMetricsPublisher Lambda emits CloudWatch metrics via the EMF (Embedded Metric Format) pattern: instead of calling PutMetricData directly, it logs structured JSON that CloudWatch Logs auto-extracts into metrics. EMF has its own throttle ceiling: 100 EMF metric writes per second per account. Once that ceiling is crossed, CloudWatch silently drops metrics, and the Lambda sees no error.
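
For readers unfamiliar with EMF, here is a minimal sketch of the kind of JSON record the pattern relies on. It is not the publisher's actual code: the function name buildEmfRecord, the Service dimension, and the pairing of ApprovalRequestCount with the ABCA/Cedar-HITL namespace are assumptions made for illustration. Only the _aws envelope layout follows the EMF specification.

```typescript
// Illustrative EMF record builder (hypothetical helper, not from the repo).
interface EmfMetricDefinition {
  Name: string;
  Unit?: string;
}

function buildEmfRecord(
  namespace: string,
  metricName: string,
  value: number,
  dimensions: Record<string, string>,
  timestampMs: number = Date.now(),
): Record<string, unknown> {
  const metrics: EmfMetricDefinition[] = [{ Name: metricName, Unit: 'Count' }];
  return {
    _aws: {
      Timestamp: timestampMs, // epoch milliseconds
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // One dimension set. Every key listed here must also appear
          // as a top-level string property of the record.
          Dimensions: [Object.keys(dimensions)],
          Metrics: metrics,
        },
      ],
    },
    ...dimensions,
    // The metric value lives at the top level, keyed by the metric name.
    [metricName]: value,
  };
}

// Assumed namespace/metric pairing and a hypothetical dimension value.
const exampleRecord = buildEmfRecord(
  'ABCA/Cedar-HITL',
  'ApprovalRequestCount',
  1,
  { Service: 'ApprovalMetricsPublisher' },
);

// In a Lambda, writing the JSON line to stdout sends it to CloudWatch Logs,
// which extracts the metric from the _aws envelope.
console.log(JSON.stringify(exampleRecord));
```

Because the metric is just a log line, a dropped write produces no exception in the handler, which is why a separate skipped-count signal is needed.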

The publisher already self-rate-limits and emits an internal MetricEmitSkipped count when it self-limits, but there's no alarm on MetricEmitSkipped > 0. So if approvals scale to where the publisher hits the ceiling, the dashboard quietly underreports without anyone noticing.

This pairs with issue #4 (DLQ alarms): both are about "the metrics path is broken but the dashboard still shows old data." Filing as separate issues since the threshold-tuning conversation will be different (DLQ count = 1 is the right alarm; EMF skipped count needs more thought — burst spikes can be normal).

User-visible impact:

  • Approval-volume metrics under-report at high load. Operators see lower ApprovalRequestCount than reality.
  • No signal to "scale up the Lambda's batch size" or "adjust EMF emission cadence."
  • Capacity-planning conversations rely on CloudWatch numbers that are silently lower than actual.

Technical context

Where the metric is already emitted:

  • cdk/src/handlers/approval-metrics-publisher.ts — search for MetricEmitSkipped (or similar; the exact name needs verification). The publisher's self-rate-limit logic increments this when it skips an EMF write to stay under the ceiling.
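
The self-rate-limit behavior described above could look roughly like the sketch below. This is a hypothetical fixed-window limiter, not the code in approval-metrics-publisher.ts: the class name EmfWriteLimiter, its methods, and the one-second window are all assumptions. It only shows how a skipped counter can be kept alongside the limit check.

```typescript
// Hypothetical fixed-window limiter that counts skipped EMF writes.
class EmfWriteLimiter {
  private readonly maxWritesPerSecond: number;
  private readonly now: () => number;
  private windowStartMs = 0;
  private writesInWindow = 0;
  private skipped = 0;

  constructor(maxWritesPerSecond: number, now: () => number = () => Date.now()) {
    this.maxWritesPerSecond = maxWritesPerSecond;
    this.now = now;
  }

  // Returns true if an EMF record may be written now; otherwise records a skip.
  tryAcquire(): boolean {
    const t = this.now();
    if (t - this.windowStartMs >= 1000) {
      // A new one-second window starts; reset the write budget.
      this.windowStartMs = t;
      this.writesInWindow = 0;
    }
    if (this.writesInWindow < this.maxWritesPerSecond) {
      this.writesInWindow += 1;
      return true;
    }
    this.skipped += 1;
    return false;
  }

  // Returns the skips accumulated since the last call and resets the counter.
  drainSkipped(): number {
    const count = this.skipped;
    this.skipped = 0;
    return count;
  }
}

// Demo with an injected clock so the behavior is deterministic.
let fakeNowMs = 0;
const demoLimiter = new EmfWriteLimiter(2, () => fakeNowMs);
const demoResults = [demoLimiter.tryAcquire(), demoLimiter.tryAcquire(), demoLimiter.tryAcquire()];
console.log(demoResults); // [ true, true, false ]
console.log(demoLimiter.drainSkipped()); // 1
```

One design point worth checking in the real handler: the drained skip count has to reach CloudWatch through a write that is not itself skipped (for example, by folding it into the next permitted EMF record), otherwise the alarm metric can be lost along with the metrics it is meant to protect.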

What's missing:

  • cloudwatch.Alarm on MetricEmitSkipped > <threshold> over a sensible window.
  • Threshold tuning: a low threshold (1 over 5 min) will fire on legitimate burst spikes; a high threshold (>100 over 1 hour) misses sustained problems. Recommend starting at "10 over a 15-minute window" as a conservative baseline; tune after observing real traffic.
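
To make the threshold trade-off concrete, the sketch below evaluates the three candidate settings against two made-up traffic shapes: a single burst of 3 skips in one minute, and a sustained 1 skip per minute for an hour. The sample data and the helper anyPeriodBreaches are invented for illustration. The helper approximates a Sum alarm with one evaluation period by summing aligned buckets, and ">100" is expressed as ">= 101".

```typescript
// Returns true if any aligned period's Sum meets or exceeds the threshold.
function anyPeriodBreaches(
  perMinuteSkips: number[],
  periodMinutes: number,
  threshold: number,
): boolean {
  for (let start = 0; start < perMinuteSkips.length; start += periodMinutes) {
    const sum = perMinuteSkips
      .slice(start, start + periodMinutes)
      .reduce((total, n) => total + n, 0);
    if (sum >= threshold) return true;
  }
  return false;
}

// Hypothetical one-hour traces, one entry per minute.
const burstTrace = Array.from({ length: 60 }, (_, minute) => (minute === 7 ? 3 : 0));
const sustainedTrace = new Array<number>(60).fill(1);

// [periodMinutes, threshold] pairs from the discussion above.
const candidates: Array<[number, number]> = [
  [5, 1],    // low threshold: 1 over 5 min
  [15, 10],  // proposed baseline: 10 over 15 min
  [60, 101], // high threshold: >100 over 1 hour
];

const burstOutcome = candidates.map(([p, t]) => anyPeriodBreaches(burstTrace, p, t));
const sustainedOutcome = candidates.map(([p, t]) => anyPeriodBreaches(sustainedTrace, p, t));

console.log(burstOutcome);     // [ true, false, false ]
console.log(sustainedOutcome); // [ true, true, false ]
```

On these traces the low threshold fires on the harmless burst, the high threshold misses the sustained problem, and the 15-minute baseline separates the two. Real traffic may look different, which is why the threshold should be revisited after observation.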

Why EMF instead of PutMetricData:

  • With batching, EMF is roughly 100x cheaper than PutMetricData for the same observability.
  • The trade-off is the global per-account ceiling. PutMetricData has its own throttle, but it applies per region per account and at a much higher limit.
  • The publisher's choice of EMF is the right one; this issue isn't asking to change that.

Proposed fix

Add the alarm in cdk/src/constructs/approval-metrics-publisher-consumer.ts:

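import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Inside the construct's constructor.
new cloudwatch.Alarm(this, 'MetricEmitSkippedAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'ABCA/Cedar-HITL',
    // Verify this name against approval-metrics-publisher.ts before merging.
    metricName: 'MetricEmitSkipped',
    // If the publisher emits this metric with dimensions, add a matching dimensionsMap here.
    period: Duration.minutes(15),
    statistic: 'Sum',
  }),
  // Baseline: 10 skipped writes in one 15-minute period; tune after observing real traffic.
  threshold: 10,
  evaluationPeriods: 1,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  alarmDescription: 'ApprovalMetricsPublisher skipped >= 10 EMF writes in 15 minutes to stay under the EMF rate ceiling, so dashboard metrics are under-reported; consider increasing batch size or reducing per-event metric volume',
  // The metric may only be emitted when a write is skipped, so missing data is treated as healthy.
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});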

Acceptance criteria

  • Alarm exists on MetricEmitSkipped (or whatever the existing skipped-count metric is named — verify the name in approval-metrics-publisher.ts first)
  • Threshold is documented in the alarm description with rationale
  • Construct test verifies alarm presence
  • If a runbook exists for "publisher hit EMF ceiling," alarmDescription links to it

Out of scope

  • Refactoring the publisher to use PutMetricData (different cost/perf profile, separate decision).
  • Adaptive batch-size tuning (publisher could observe its own skip rate and adjust; that's a separate enhancement).
  • Multi-region failover for metric publishing (out of scope for ABCA today).
