Description
Is your feature request related to a problem? Please describe.
As APM adoption grows and packages develop deeper transitive dependency trees, several algorithmic bottlenecks in the install and compile paths become significant:
| Bottleneck | Location | Current Complexity | Impact |
|---|---|---|---|
| Primitive conflict detection | `primitives/models.py` `_add_with_conflict_detection()` | O(m²) — linear list scan per add | Scales quadratically with primitive count |
| Cycle detection path tracking | `deps/apm_resolver.py` `detect_circular_dependencies()` | O(V×D) — `in` operator on `List[str]` | Degrades with tree depth |
| `get_nodes_at_depth()` during flatten | `deps/dependency_graph.py` | O(V × max_depth) — full node scan per depth level | Repeated scans across 50 depth levels |
| `from_apm_yml()` YAML parsing | `models/apm_package.py` | 25 call sites, 0 caching — same files parsed 20-50× per operation | Unnecessary disk I/O and CPU |
| `get_config()` reads | `config.py` | 5-10 disk reads per operation, no caching | Redundant file I/O |
| `read_constitution()` | `compilation/constitution.py` | Read 3× per compile from `claude_formatter.py` and `injector.py` | Redundant file I/O |
| Inheritance chain computation | `compilation/context_optimizer.py` `_get_inheritance_chain()` | Recomputed every call, no caching | Redundant O(D) walks per instruction |
| Package downloads | `cli.py` install loop | Fully sequential — O(N×T) wall-clock | 10 pkgs × 5s = 50s vs ~10s parallel |
| GitHub API resilience | `deps/github_downloader.py` | No 429/rate-limit handling | Fails immediately under throttling |
What works well today (and should be preserved):
- BFS tree building uses `deque` + `Set` for O(1) dedup (`queued_keys`)
- Glob pattern results are cached in `ContextOptimizer`
- Flattening uses the Dict-based `FlatDependencyMap` with O(1) lookups
- Output assembly uses `list.append()` + `str.join()` (not string concatenation)
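For reference, the preserved BFS dedup pattern looks roughly like the sketch below. The names `bfs_collect` and `children_of` are illustrative stand-ins, not APM APIs; only the `deque` + set-of-queued-keys shape matches the resolver's actual approach:

```python
from collections import deque

def bfs_collect(root, children_of):
    """Breadth-first walk that dedups with a set while preserving FIFO
    order via a deque -- the pattern the resolver already uses.
    `children_of(node)` returns that node's direct dependencies."""
    queue = deque([root])
    queued_keys = {root}          # O(1) membership checks
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children_of(node):
            if child not in queued_keys:   # set lookup, not a list scan
                queued_keys.add(child)
                queue.append(child)
    return order
```

Note that a diamond dependency (two parents sharing one child) is enqueued exactly once, which is what keeps the walk linear in the number of edges.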
Describe the solution you'd like
A phased performance improvement plan with no breaking changes. All optimizations are internal — same inputs produce identical outputs. Lock file format v1 preserved.
Phase 1 — Performance Baseline (Benchmarks & Fixtures)
Entry gate — must complete before optimization work begins.
- Add a `pytest-benchmark` test suite in `tests/benchmarks/` covering: dependency resolution (50+ packages, 5 depth levels), compilation (200+ primitives, 100+ directories), primitive conflict detection (500+ primitives), `from_apm_yml()` parse throughput, and inheritance chain computation
- Add synthetic deep-dependency-tree fixtures in `tests/fixtures/`: 100 packages, 10 depth levels, diamond dependencies, large primitive counts
- Record baseline measurements in `tests/benchmarks/BASELINE.md`
Phase 2 — Data Structure Fixes (High Impact, Low Risk)
Can proceed in parallel with Phase 3.
- **Conflict detection O(m²) → O(m):** Add a `Dict[str, int]` name→index map alongside each primitive list in `PrimitiveCollection`. Check the dict (O(1)) instead of scanning with `for i, existing in enumerate(collection)`. File: `src/apm_cli/primitives/models.py`
- **Cycle detection O(V×D) → O(V+E):** Add a companion `current_path_set: Set[str]` in `detect_circular_dependencies()`. Use the set for O(1) `in` checks; keep the list only for cycle-path reporting. File: `src/apm_cli/deps/apm_resolver.py`
- **Depth index for flatten:** Pre-compute `Dict[int, List[DependencyNode]]` in `DependencyTree` on node insertion so `get_nodes_at_depth()` becomes O(1). File: `src/apm_cli/deps/dependency_graph.py`
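The cycle-detection fix could be sketched as below. This is a minimal illustration, not the resolver's real code: `graph` is assumed to map a package name to its dependency names, and a `done` set of fully explored nodes is included so the overall walk is O(V+E):

```python
def detect_cycles(graph, start):
    """DFS cycle detection keeping a companion set for O(1) membership
    alongside the ordered path list, which is kept only for reporting."""
    cycles = []
    path, path_set = [], set()
    done = set()  # fully explored nodes -- makes the walk O(V+E)

    def visit(node):
        if node in path_set:          # O(1) set check instead of a list scan
            cycles.append(path[path.index(node):] + [node])  # report the loop
            return
        if node in done:
            return
        path.append(node)
        path_set.add(node)
        for dep in graph.get(node, []):
            visit(dep)
        path.pop()
        path_set.discard(node)
        done.add(node)

    visit(start)
    return cycles
```

The list and set always hold the same members; the list exists purely so a reported cycle preserves discovery order.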
Phase 3 — Caching Layer (High Impact, Medium Risk)
Can proceed in parallel with Phase 2.
- **`from_apm_yml()` parse cache:** Module-level `Dict[Path, APMPackage]` keyed by resolved absolute path. No invalidation needed within a CLI invocation. Add `clear_apm_yml_cache()` for test isolation. File: `src/apm_cli/models/apm_package.py`
- **`get_config()` cache:** Read-once-per-process module-level cache. File: `src/apm_cli/config.py`
- **`read_constitution()` cache:** Cache after the first read. File: `src/apm_cli/compilation/constitution.py`
- **Inheritance chain cache:** `Dict[Path, List[Path]]` in `ContextOptimizer`, populated on first call per directory. File: `src/apm_cli/compilation/context_optimizer.py`
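The parse-cache pattern shared by these items could be sketched as follows. The names `parse_manifest` and `clear_manifest_cache` are illustrative stand-ins for `from_apm_yml()` and the proposed `clear_apm_yml_cache()`, and the cached value is a plain dict rather than a real `APMPackage`:

```python
from pathlib import Path
from typing import Dict

# Module-level cache keyed by resolved absolute path, so the same file
# reached via different relative paths hits one entry.
_manifest_cache: Dict[Path, dict] = {}

def parse_manifest(path: str) -> dict:
    key = Path(path).resolve()
    if key not in _manifest_cache:
        # Stand-in for the real YAML load in from_apm_yml()
        _manifest_cache[key] = {"path": str(key), "raw": key.read_text()}
    return _manifest_cache[key]

def clear_manifest_cache() -> None:
    """Test isolation hook, mirroring the proposed clear_apm_yml_cache()."""
    _manifest_cache.clear()
```

Because the cache lives for one process only, there is no invalidation logic; tests call the clear hook between cases.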
Phase 4 — Network Parallelization (Highest Wall-Clock Impact, Higher Risk)
Depends on Phases 2-3 stability.
- **Parallel package downloads:** Use `concurrent.futures.ThreadPoolExecutor` (default 4 workers) on the flat install list produced by dependency resolution. Thread-safe Rich progress bars, collect-all-errors pattern, cap at 4-6 concurrent clones. Files: `src/apm_cli/cli.py`, `src/apm_cli/deps/github_downloader.py`
- **Git sparse checkout for subdirectory packages:** Use `git sparse-checkout` (git 2.25+) with a full-clone fallback for virtual subdirectory packages. File: `src/apm_cli/deps/github_downloader.py`
- **(Stretch)** Pre-fetch the next BFS depth level in parallel during transitive resolution
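The collect-all-errors executor shape could look like this minimal sketch; `download_all` and `download_one` are hypothetical names standing in for the real clone routine, and progress-bar wiring is omitted:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(packages, download_one, max_workers=4):
    """Run downloads concurrently, collecting every failure instead of
    aborting on the first one, so the user sees all errors at once."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, pkg): pkg for pkg in packages}
        for fut in as_completed(futures):
            pkg = futures[fut]
            try:
                results[pkg] = fut.result()
            except Exception as exc:       # collect-all-errors pattern
                errors[pkg] = exc
    return results, errors
```

Capping `max_workers` at 4-6 keeps concurrent git clones (and GitHub API pressure) bounded while still collapsing most of the sequential wall-clock cost.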
Phase 5 — Resilience & Robustness (Medium Impact, Low Risk)
Independent of other phases.
- **Rate limit handling:** Inspect the `X-RateLimit-Remaining` and `Retry-After` headers. Exponential backoff with a maximum of 3 retries for 429/503 responses. File: `src/apm_cli/deps/github_downloader.py`
- **Skip-if-exists optimization:** When the lockfile has a commit SHA and the local clone's `HEAD` matches, skip the re-download. Files: `src/apm_cli/deps/github_downloader.py`, `src/apm_cli/cli.py`
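The backoff policy could be sketched as below. The `(status, headers, body)` request shape and the function names are illustrative assumptions, not the downloader's real interface; `sleep` is injectable so tests don't actually wait:

```python
import time

def fetch_with_backoff(do_request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry 429/503 responses with exponential backoff, honoring a
    server-sent Retry-After header when present."""
    for attempt in range(max_retries + 1):
        status, headers, body = do_request()
        if status not in (429, 503):
            return status, body
        if attempt == max_retries:
            break
        retry_after = headers.get("Retry-After")
        # Prefer the server's hint; otherwise back off 1s, 2s, 4s, ...
        delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
        sleep(delay)
    raise RuntimeError(f"rate limited after {max_retries} retries (status {status})")
```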
Describe alternatives you've considered
- **Async/await rewrite:** Full migration to `asyncio` for network I/O. Rejected as too invasive for the current stage — `ThreadPoolExecutor` achieves most of the parallel-download benefit with far less refactoring.
- **Database-backed caching (SQLite):** Persistent cross-invocation cache for parsed manifests. Rejected as over-engineering — CLI invocations are short-lived, and simple in-memory per-process caches are sufficient.
- **CDN-based package hosting:** Pre-built package tarballs on a CDN instead of git clones. Deferred — requires infrastructure outside APM's scope, and git-based resolution works well at the current package scale.
- **npm-style global cache directory:** A `~/.apm/cache/` shared across all projects. Considered for the Phase 5 skip-if-exists work but deferred — lockfile-based local dedup handles the common case.
Additional context
Scope boundaries:
- All changes are internal — no CLI flag changes (except potentially `--parallel-downloads N` in Phase 4), no API changes, no lock file format changes.
- Backward compatibility is a hard requirement: identical inputs must produce byte-identical outputs.
- Phase 1 benchmarks serve as the regression gate for all subsequent phases.
Dependency graph for phases:
```
Phase 1 (baseline) ──┬──> Phase 2 (data structures) ──┬──> Phase 4 (parallelization)
                     └──> Phase 3 (caching) ──────────┘
Phase 5 (resilience) [independent]
```
Key files involved:
| File | Concern |
|---|---|
| `src/apm_cli/primitives/models.py` | O(m²) conflict detection |
| `src/apm_cli/deps/apm_resolver.py` | Cycle detection, BFS builder, flatten |
| `src/apm_cli/deps/dependency_graph.py` | `get_nodes_at_depth()`, tree structures |
| `src/apm_cli/models/apm_package.py` | Uncached `from_apm_yml()` (25 call sites) |
| `src/apm_cli/config.py` | Uncached `get_config()` |
| `src/apm_cli/compilation/constitution.py` | Uncached `read_constitution()` |
| `src/apm_cli/compilation/context_optimizer.py` | Uncached inheritance chains |
| `src/apm_cli/cli.py` | Sequential install loop |
| `src/apm_cli/deps/github_downloader.py` | No parallelism, no rate limits |
Design constraint — cache invalidation: The "no invalidation within a process" strategy is intentional. APM is a short-lived CLI tool; files don't change mid-operation. If APM ever becomes a long-running service, this must be revisited.
Thread safety note for Phase 4: Rich progress bars and GitPython may need thread-safe wrappers. A spike/prototype is recommended before committing to ThreadPoolExecutor. Alternative approach: asyncio + subprocess for git operations.