feat(telemetry): Add squad.decisions.* OTel metrics #29

@diberry

Problem

No squad.decisions.* metrics exist in the codebase. The decisions subsystem is completely dark: no gauges, no counters, no span attributes. As a result, there is no way to detect when archival stops working or to measure the token cost of a bloated decisions.md.

From

Telemetry reviews of #20 and #21. Telemetry: 'The first PR should not be the archival fix — it should be the metrics.'

Proposed Metrics

Gauges (current state)

  • squad.decisions.size_bytes — decisions.md file size
  • squad.decisions.entry_count — number of decision entries
  • squad.decisions.age_oldest_days — age of oldest active entry
  • squad.decisions.inbox_depth — unmerged inbox files
  • squad.decisions.archive_size_bytes — archive file size
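
A minimal sketch of how the five gauge values could be computed in one pass. It is not the project's implementation: the function name, the `### YYYY-MM-DD` entry heading format, and passing the inbox file count in as a number are all assumptions made for illustration. The real decisions.md layout may differ.

```python
import re
from datetime import date

# ASSUMPTION: each decision entry starts with a heading such as
# "### 2025-01-15: Some decision". The real entry format may differ.
ENTRY_HEADING = re.compile(r"^### (\d{4}-\d{2}-\d{2})\b", re.MULTILINE)


def decision_gauges(decisions_text: str, archive_text: str,
                    inbox_file_count: int, today: date) -> dict:
    """Return a snapshot of the five proposed gauges (hypothetical helper)."""
    entry_dates = [date.fromisoformat(d) for d in ENTRY_HEADING.findall(decisions_text)]
    # Age of the oldest active entry; 0 when the file has no dated entries.
    oldest_days = (today - min(entry_dates)).days if entry_dates else 0
    return {
        "squad.decisions.size_bytes": len(decisions_text.encode("utf-8")),
        "squad.decisions.entry_count": len(entry_dates),
        "squad.decisions.age_oldest_days": oldest_days,
        "squad.decisions.inbox_depth": inbox_file_count,
        "squad.decisions.archive_size_bytes": len(archive_text.encode("utf-8")),
    }
```

Gauges report current state, so a snapshot like this would be re-read on every collection rather than accumulated.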

Counters (operations)

  • squad.decisions.archive_runs — Scribe archival executions
  • squad.decisions.entries_archived — entries moved per run
  • squad.decisions.bytes_archived — bytes recovered per run
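
The counters could behave roughly as below. This is an in-memory stand-in written only to show the semantics (monotonic totals, one increment per Scribe run); the class and method names are hypothetical, and a real implementation would presumably use OTel counter instruments instead.

```python
from collections import Counter


class ArchiveCounters:
    """In-memory stand-in for the three proposed counters (hypothetical)."""

    def __init__(self) -> None:
        self.totals = Counter()

    def record_run(self, entries_archived: int, bytes_archived: int) -> None:
        # Counters only go up, so negative increments are rejected.
        if entries_archived < 0 or bytes_archived < 0:
            raise ValueError("counter increments must be non-negative")
        self.totals["squad.decisions.archive_runs"] += 1
        self.totals["squad.decisions.entries_archived"] += entries_archived
        self.totals["squad.decisions.bytes_archived"] += bytes_archived
```

The per-run values from the list above are the increments; the exported counter holds the running total, and a backend can recover per-run or per-interval figures from the deltas.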

Span Attributes (per spawn)

  • agent.decisions_size_bytes on every agent spawn span
  • context_utilization_pct — context window usage
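
One way the two span attributes might be assembled before being attached to a spawn span. The helper name and the token inputs (`prompt_tokens`, `context_window_tokens`) are assumptions; the issue does not say how context usage is measured.

```python
def spawn_span_attributes(decisions_size_bytes: int, prompt_tokens: int,
                          context_window_tokens: int) -> dict:
    """Build the per-spawn span attributes (hypothetical helper)."""
    if context_window_tokens <= 0:
        raise ValueError("context_window_tokens must be positive")
    # ASSUMPTION: utilization is prompt tokens as a share of the context window.
    utilization = round(100.0 * prompt_tokens / context_window_tokens, 1)
    return {
        "agent.decisions_size_bytes": decisions_size_bytes,
        "context_utilization_pct": utilization,
    }
```

Recording both values on the same span makes it possible to correlate decisions.md growth with context window pressure per agent spawn.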

Collection Points

  • Coordinator session start (baseline)
  • Every agent spawn (span attribute)
  • Scribe run (pre/post archival)
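
For the Scribe collection point, the pre/post measurements could be turned into counter increments along these lines. This is a sketch under the assumption that each snapshot is a dict with `entry_count` and `size_bytes` keys; the function name is hypothetical.

```python
def archival_increments(pre: dict, post: dict) -> dict:
    """Derive one run's counter increments from pre/post snapshots."""
    entries_moved = pre["entry_count"] - post["entry_count"]
    bytes_recovered = pre["size_bytes"] - post["size_bytes"]
    return {
        "squad.decisions.archive_runs": 1,
        # Clamp at zero: the file may have grown during the run,
        # and counters must never decrease.
        "squad.decisions.entries_archived": max(entries_moved, 0),
        "squad.decisions.bytes_archived": max(bytes_recovered, 0),
    }
```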

Alerting Thresholds

  • size_bytes >20KB: warn | >50KB: error
  • inbox_depth >10: warn | >25: error
  • archive_runs stale + size_bytes rising: error
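
The thresholds above could be evaluated roughly as follows. Two points are assumptions, not part of the proposal: KB is taken as 1024 bytes, and "stale" is read as "archive_runs did not increase between two consecutive samples". All names are hypothetical.

```python
KB = 1024  # ASSUMPTION: the issue does not say whether KB means 1000 or 1024 bytes


def severity(value: float, warn_above: float, error_above: float) -> str:
    """Map a value to 'ok', 'warn', or 'error' using strict > comparisons."""
    if value > error_above:
        return "error"
    if value > warn_above:
        return "warn"
    return "ok"


def evaluate_alerts(prev: dict, curr: dict) -> dict:
    """Compare two consecutive samples against the proposed thresholds."""
    # ASSUMPTION: "stale" means no new archive runs since the previous sample.
    stalled = (curr["archive_runs"] == prev["archive_runs"]
               and curr["size_bytes"] > prev["size_bytes"])
    return {
        "size_bytes": severity(curr["size_bytes"], 20 * KB, 50 * KB),
        "inbox_depth": severity(curr["inbox_depth"], 10, 25),
        "archival_stalled": "error" if stalled else "ok",
    }
```

In practice the stalled check would likely need a longer window than two samples to avoid firing between scheduled Scribe runs.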

Owner

Telemetry (Aspire & Observability)

Metadata

Labels

  • enhancement - New feature or request
  • go:needs-research - Needs investigation
  • squad - Squad triage inbox, Lead will assign to a member
  • squad:fido - Assigned to FIDO (Quality Owner)
