Skip to content

CI: Cache image on main branch#7

Closed
saturley-hall wants to merge 2 commits into
mainfrom
harrison/main_branch_cache
Closed

CI: Cache image on main branch#7
saturley-hall wants to merge 2 commits into
mainfrom
harrison/main_branch_cache

Conversation

@saturley-hall
Copy link
Copy Markdown
Member

No description provided.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 4, 2025

Test Results

 2 files   2 suites   26s ⏱️
71 tests 71 ✅ 0 💤 0 ❌
89 runs  88 ✅ 1 💤 0 ❌

Results for commit e688847.

@saturley-hall
Copy link
Copy Markdown
Member Author

Closing to reopen

ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tanmayv25 added a commit that referenced this pull request May 16, 2026
…cycle

Address senior-architect review of the metrics infrastructure:

#1 (two parallel Prometheus paths): merge `register_prometheus` +
`setup_component_metrics` into one `setup_metrics(ctx) -> MetricsBindings`
method on the Rust `LLMEngine` trait. The trait surface shrinks from
7 → 6 methods; both expfmt bridging and the structured component
publisher go through one place. The Python ABC keeps its two methods —
the PyO3 bridge adapts (engine-author API unchanged for Python).

#3 + #8 (framework lifecycle gauges gated on engine opt-in):
- `cleanup_time_seconds` and `drain_time_seconds` move to Rust-side
  `LifecycleGauges` (prometheus crate gauges registered via the
  runtime's `MetricsRegistry`). Worker constructs them after
  `engine.start()` succeeds; observes during cleanup/drain. Emits
  regardless of engine opt-in.
- Drop `set_cleanup_time` / `set_drain_time` from the
  `ComponentMetricsPublisher` trait — engine no longer responsible.
- Drop the corresponding methods from `PyComponentMetricsPublisher`.
- `model_load_time_seconds` STAYS on Python `LLMBackendMetrics` for
  parity with the legacy entry points (both legacy `main.py` and the
  unified bridge populate it). The unified bridge now constructs
  `LLMBackendMetrics` UNCONDITIONALLY so the gauge emits even when
  the engine returns no per-rank sources.

#5 (kv_cache_hit_rate Optional semantics): clearer docstring on both
Rust `ComponentSnapshot` and the Python ABC — tri-state explicitly
documented (`None` = no data, `0.0` = legitimate zero-hit).

Deferred with rationale (explained in the post-review architect's note):
- #2 (HashMap<MetricKey, f64> ComponentSnapshot): real future-proofing
  but no current bug; 5 fields for 5 signals isn't worth the refactor
  cost yet. Revisit at ~10+ signals.
- #4 (per-source cadence): explicit defer in original review.
- #7 (conformance in CI): workflow change, separate from trait surface.

Conformance kit collapses `check_register_prometheus` +
`check_component_metrics` into `check_setup_metrics`.

All 68 backend-common unit tests pass. Mocker example updated.
PyO3 bridge wheel rebuilt successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant