feat: nixl metadata store and retrieved from etcd#6
Conversation
|
For the |
meaning we should expose the url that was eventually used - like a property? |
I think Ryan in the previous repo wanted to use the |
Yes. There doesn't seem much value in writing our own etcd python library to wrap a rust one. AFAICT our bindings just do normal etcd things. That would be less to maintain. Once users start using this they will want leases, locks, everything etcd offers. |
I think eventually we will rename this to something like KvStore() on the namespace object instead of etcd_client on the runtime ... to give it a thin abstraction and hide the etcd detail here - |
initially we used the plain python library - but direction was to reuse the rust implementation - agree - we will want to move it to already match with the lease for the component - and have the namespace applied automatically - so it'll be more of a KVStore in the namespace then an etcd client ... |
You mean like nim-nvllm has had since mid January? (update internal URL) You could choose to do discovery with etcd or nats. Ryan O dropped it in his furious weekend re-write that became the initial commit to triton distributed. I'm stuck in a time loop. |
|
We should maybe warn people not to rely on the etcd bindings if we're going to put them behind an abstraction eventually. |
Co-authored-by: hongkuanz <hongkuanz@nvidia.com> Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com> Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local> Co-authored-by: Neelay Shah <neelays@ipp2-0493.ipp2u1.colossus.nvidia.com> Co-authored-by: Neelay Shah <neelays@ipp1-1941.ipp1a1.colossus.nvidia.com> Co-authored-by: ishandhanani <ishandhanani@gmail.com> Co-authored-by: Neelay Shah <neelays@4u8g-gen-0078.ipp3a2.colossus.nvidia.com> Co-authored-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
Fixes all actionable items from the second review: Bug fixes: - #1: Change returncode=4 → returncode=2 in pytest_configure exit (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED) - #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original) Test quality: - #7: Add missing assertions to test_apply_hf_home_layout (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE) - #8: Use monkeypatch in tests 3 & 4 for proper env isolation (prevents pre-existing env vars from leaking on test failure) Design / correctness: - #3: Fix _models_dir_env docstring ("exactly once" → "once per worker") - #4: Add comment noting TRANSFORMERS_CACHE deprecation - #5: Update --models-dir help text and docs to reflect both supported layouts (bare HF_HUB_CACHE and HF_HOME), not just bare - #10: Restore pytest.skip() in download_lora() (test-only infra); remove now-redundant guard from minio_lora_service fixture - #11: Raise hub/ detection log to WARNING with guidance - #12: Replace shutil.rmtree(ignore_errors=True) with try/except so cleanup failures are logged rather than silently swallowed Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester test deferred — complex due to conftest dependencies, low severity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>
Bug fixes: - #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache - #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout (bare-layout path was missing the writable-dir check present in HF_HOME test) Minor cleanups: - #4: Move `import pytest` from inside download_lora() to module top-level (lora_utils.py is test-only infra; pytest is always available) - #5: Replace pytestconfig.getoption("--models-dir") re-checks in predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR") (_models_dir_env runs first and sets the var; single source of truth) New coverage tests: - #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test verifying pytest_configure exits with returncode=2 on bad path - #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora() raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set Not addressed: #3 (keep gpu_0 per project guidelines and previous review retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: rrubin <rrubin@nvidia.com>
…later Reader-first ordering. Old order buried Quick start at position 8 of 9; new order surfaces the runnable commands above all the reference sections (FSx vs EBS, label conventions, aiperf SHA rationale). New section order: 1. Title + 3-config table 2. Pre-requisites (was #3) 3. Quick start (was #8 — promoted) 4. Directory layout (was #7 — now serves as map for the rest) 5. Hardware targets (was #2 — now pure reference; invocation examples moved into Quick start) 6. Storage (was #5) 7. aiperf install (was #6) 8. Naming & ownership (was #4) 9. Notes (unchanged) Also drop the stray "We use ebs by default" sentence — it contradicted both the Storage section and the actual yaml (where the PVC block is fully commented out, no default storage class is set). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What does the PR do?
Checklist
<commit_type>: <Title>Commit Type:
Check the conventional commit type
box here and add the label to the github PR.
Related PRs:
Where should the reviewer start?
Test plan:
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)