feat: add distributed-coordination skills from Part 4 blog (auth isolation, stale locks, message serialization, notification routing) #452

Closed
tamirdresher wants to merge 1 commit into bradygaster:main from tamirdresher:tamirdresher/feat/distributed-coordination-skills

@tamirdresher
Collaborator

Summary

Four skills earned from running 8 Ralph instances in production, documented in the blog post "When Eight Ralphs Fight Over One Login".

Each skill maps 1:1 to a textbook distributed systems pattern.

Skills Added

`gh-auth-isolation`

Problem: Multiple Ralphs on the same machine fight over ~/.config/gh/hosts.yml. One Ralph calls `gh auth switch` and clobbers every other Ralph's auth. 37 consecutive failures in 3 hours.
Fix: Per-process `GH_CONFIG_DIR`, so each process gets its own isolated gh config.
DS pattern: State partitioning
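
A minimal sketch of the per-process isolation, assuming each Ralph is spawned from a Node launcher. `GH_CONFIG_DIR` is the gh CLI's own override for its config location; the helper name `isolatedGhEnv`, the root directory, and the instance IDs are hypothetical and not taken from the skill file.

```typescript
import * as path from "node:path";

type Env = Record<string, string | undefined>;

// Hypothetical helper (not from the PR): build the environment for one agent
// process so that gh reads and writes its own config directory.
export function isolatedGhEnv(baseEnv: Env, rootDir: string, instanceId: string): Env {
  // Copy the base environment and point gh at a directory unique to this instance.
  return { ...baseEnv, GH_CONFIG_DIR: path.join(rootDir, `gh-${instanceId}`) };
}

// Two instances get disjoint directories, so `gh auth switch` in one
// cannot rewrite the hosts.yml the other is reading.
const envA = isolatedGhEnv({ PATH: "/usr/bin" }, "/tmp/ralph", "ralph-1");
const envB = isolatedGhEnv({ PATH: "/usr/bin" }, "/tmp/ralph", "ralph-2");
console.log(envA.GH_CONFIG_DIR !== envB.GH_CONFIG_DIR); // true
```

The returned object can be passed as the `env` option of `child_process.spawn`. Each isolated directory still needs its own login; how Ralph provisions that is not covered by this sketch.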

`stale-lock-detection`

Problem: A crash leaves a lock file with a dead PID. The next Ralph start refuses to run, guarding against a process that no longer exists.
Fix: Three-layer guard: named OS mutex (auto-released on crash) + PID validation + lockfile with registered cleanup.
DS pattern: Lease-based locking with failure detection (same as ZooKeeper ephemeral nodes)
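
A rough sketch of two of the three layers (PID validation and a lockfile with registered cleanup), assuming a JSON lockfile that records the owner's PID. The named OS mutex layer is omitted because Node's standard library has no equivalent, so this version is not race-free when two processes start at the same instant. The record shape and the helper names are hypothetical.

```typescript
import * as fs from "node:fs";

interface LockRecord {
  pid: number;
  startedAt: string;
}

// Signal 0 delivers nothing; Node uses it purely as an existence check.
export function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it. Anything else: treat as gone.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Hypothetical helper: returns true if this process now owns the lock.
export function acquireLock(
  lockPath: string,
  isAlive: (pid: number) => boolean = isProcessAlive,
): boolean {
  if (fs.existsSync(lockPath)) {
    let owner: LockRecord | undefined;
    try {
      owner = JSON.parse(fs.readFileSync(lockPath, "utf8")) as LockRecord;
    } catch {
      owner = undefined; // unreadable or corrupt lock: treat as stale
    }
    if (owner && isAlive(owner.pid)) return false; // live owner, back off
    fs.rmSync(lockPath, { force: true }); // stale: the owner is gone
  }
  const record: LockRecord = { pid: process.pid, startedAt: new Date().toISOString() };
  try {
    // "wx" fails if another process recreated the file in the meantime.
    fs.writeFileSync(lockPath, JSON.stringify(record), { flag: "wx" });
  } catch {
    return false;
  }
  // Registered cleanup for normal exits; crashes are covered by the PID check.
  process.once("exit", () => fs.rmSync(lockPath, { force: true }));
  return true;
}
```

A bare PID check can be fooled when the OS reuses a dead process's PID, which is one reason to pair it with a crash-released primitive such as the named mutex listed in the fix.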

`message-serialization`

Problem: A 7KB multiline prompt passed via `Start-Process -ArgumentList` gets interpreted by Windows as the command name. Result: "command not found: Ralph, Go! MAXIMIZE PARALLELISM..."
Fix: Write prompt to temp file, pass file path as argument.
DS pattern: Message indirection (same as gRPC + protobuf)
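
The temp-file indirection could look roughly like this on a Node launcher. The `--prompt-file` flag is hypothetical: the PR does not say how the agent CLI accepts a path, so substitute whatever argument the real CLI expects.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Sender side: write the prompt to a fresh temp directory and return the file path.
export function writePromptFile(prompt: string): string {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "ralph-prompt-"));
  const file = path.join(dir, "prompt.txt");
  fs.writeFileSync(file, prompt, "utf8");
  return file;
}

// Only a short, single-line path crosses the process boundary.
// "--prompt-file" is an assumed flag name, used here for illustration.
export function buildArgs(promptFile: string): string[] {
  return ["--prompt-file", promptFile];
}

// Receiver side: load the full prompt back, newlines and quoting intact.
export function readPromptFile(file: string): string {
  return fs.readFileSync(file, "utf8");
}
```

Because the argument list now carries a path instead of the payload, shell quoting rules and command-line length limits no longer touch the prompt text itself.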

`notification-routing`

Problem: 20+ notifications/day flooding one Teams channel. Failure alerts buried in tech news. Everyone stops reading.
Fix: `teams-channels.json` routing config + channel-tag convention. Each notification type goes to the right channel.
DS pattern: Pub-sub topic routing (same as Kafka topics / RabbitMQ routing keys)
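
As an illustration of tag-based routing, here is a small sketch. The schema of `teams-channels.json` and the tag convention are not shown in this PR, so the config shape, the leading `[tag]` convention, and the channel names below are all assumptions.

```typescript
interface ChannelConfig {
  defaultChannel: string;
  routes: Record<string, string>;
}

// Hypothetical contents of teams-channels.json; the real schema may differ.
const exampleConfig: ChannelConfig = {
  defaultChannel: "general",
  routes: { failure: "ralph-alerts", news: "tech-news", pr: "code-review" },
};

// Assumed convention: a leading "[tag]" on the message selects the channel.
export function routeNotification(message: string, config: ChannelConfig): string {
  const match = /^\s*\[([\w-]+)\]/.exec(message);
  const tag = match?.[1]?.toLowerCase() ?? "";
  // Own-property check so tags like "constructor" cannot hit inherited keys.
  const channel = Object.prototype.hasOwnProperty.call(config.routes, tag)
    ? config.routes[tag]
    : undefined;
  // Untagged or unknown tags fall back to the default channel.
  return channel ?? config.defaultChannel;
}

console.log(routeNotification("[failure] auth check failed", exampleConfig)); // ralph-alerts
```

Keeping the mapping in a config file means a new notification type only needs a new route entry, not a code change in each agent.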

What's Included

  • `.squad/skills/gh-auth-isolation/SKILL.md`
  • `.squad/skills/stale-lock-detection/SKILL.md`
  • `.squad/skills/message-serialization/SKILL.md`
  • `.squad/skills/notification-routing/SKILL.md`
  • `docs/src/content/docs/features/distributed-coordination.md`: feature guide
  • `docs/src/content/blog/029-distributed-coordination-skills.md`: blog post
  • `test/skills.test.ts`: 24 new tests (structure + content validation for each skill)
  • `test/docs-build.test.ts`: `distributed-coordination` added to `EXPECTED_FEATURES`

Tests

```
✓ test/skills.test.ts (51 tests) — all pass
```

Relates to #858 (tamirdresher/tamresearch1)

Four skills earned from running 8 Ralph instances in production:

- gh-auth-isolation: per-process GH_CONFIG_DIR to prevent multi-Ralph
  auth races (the 37-consecutive-failure incident)
- stale-lock-detection: three-layer guard (OS mutex + PID scan + lockfile)
  to detect and clear locks from crashed agents
- message-serialization: temp-file indirection for large CLI prompts that
  break shell argument parsing (7KB prompt-as-command-name bug)
- notification-routing: pub-sub topic routing config to prevent notification
  firehose when running many concurrent agents

Each skill maps to a textbook distributed systems pattern:
state partitioning, lease-based locking, message indirection, pub-sub routing.

Includes:
- 4 SKILL.md files in .squad/skills/
- Feature doc: docs/features/distributed-coordination.md
- Blog post: docs/blog/029-distributed-coordination-skills.md
- 24 new tests validating each skill's structure and content

Refs: tamirdresher/tamresearch1#858
Blog: https://tamirdresher.github.io/blog/2026/03/17/scaling-ai-part4-distributed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tamirdresher
Collaborator Author

Hey @bradygaster, this PR contains four skills from my Part 4 blog post ("When Eight Ralphs Fight Over One Login").

All four are earned-in-production patterns from running 8 concurrent Ralphs on the same machine. Each maps to a classic distributed systems pattern that others will hit when scaling Squad beyond a single agent loop:

| Skill | Problem Solved | Pattern |
| --- | --- | --- |
| `gh-auth-isolation` | Multi-Ralph auth race (37 consecutive failures) | State partitioning |
| `stale-lock-detection` | Stale PID lockfile blocking restart after crash | Lease-based locking |
| `message-serialization` | 7KB prompt treated as command name by Windows shell | Message indirection |
| `notification-routing` | Notification firehose: all alerts to one channel | Pub-sub topic routing |

All 51 tests pass. Feature doc + blog post included.

@bradygaster
Owner

Closing — distributed coordination is shelved (see #332). Will reopen when unblocked.

chrislomonico pushed a commit to clomonico/squad that referenced this pull request Mar 26, 2026
…er#449, bradygaster#452) (bradygaster#467)

- System messages: change prefix from '▸ system:' to '[system]' bracket convention
- Welcome hint: prefer lead/coordinator/architect agent over first agent

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>