Skip to content

feat: nixl metadata store and retrieved from etcd#6

Merged
nnshah1 merged 110 commits into
mainfrom
nnshah1-vllm-nixl-etcd
Mar 4, 2025
Merged

feat: nixl metadata store and retrieved from etcd#6
nnshah1 merged 110 commits into
mainfrom
nnshah1-vllm-nixl-etcd

Conversation

@nnshah1
Copy link
Copy Markdown
Contributor

@nnshah1 nnshah1 commented Mar 4, 2025

What does the PR do?

  • Adds ability to store and retrieve nixl metadata into etcd
  • Adds python bindings for runtime etcd client

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type
box here and add the label to the github PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Comment thread lib/bindings/python/rust/lib.rs
Copy link
Copy Markdown
Contributor

@ishandhanani ishandhanani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N🧊

Comment thread lib/bindings/python/rust/lib.rs
@grahamking
Copy link
Copy Markdown
Contributor

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

@nnshah1
Copy link
Copy Markdown
Contributor Author

nnshah1 commented Mar 4, 2025

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

@ishandhanani
Copy link
Copy Markdown
Contributor

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

I think Ryan in the previous repo wanted to use the DistributedRuntime

@grahamking
Copy link
Copy Markdown
Contributor

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

Yes. There doesn't seem much value in writing our own etcd python library to wrap a rust one. AFAICT our bindings just do normal etcd things. That would be less to maintain. Once users start using this they will want leases, locks, everything etcd offers.

@nnshah1
Copy link
Copy Markdown
Contributor Author

nnshah1 commented Mar 4, 2025

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

I think eventually we will rename this to something like KvStore() on the namespace object instead of etcd_client on the runtime ...

to give it a thin abstraction and hide the etcd detail here -

@nnshah1
Copy link
Copy Markdown
Contributor Author

nnshah1 commented Mar 4, 2025

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

Yes. There doesn't seem much value in writing our own etcd python library to wrap a rust one. AFAICT our bindings just do normal etcd things. That would be less to maintain. Once users start using this they will want leases, locks, everything etcd offers.

initially we used the plain python library - but direction was to reuse the rust implementation - agree - we will want to move it to already match with the lease for the component - and have the namespace applied automatically - so it'll be more of a KVStore in the namespace then an etcd client ...

@grahamking
Copy link
Copy Markdown
Contributor

grahamking commented Mar 4, 2025

For the etcd bindings specifically, wouldn't it be better to give the user the address / port, so they can use a standard etcd library?

meaning we should expose the url that was eventually used - like a property?

I think eventually we will rename this to something like KvStore() on the namespace object instead of etcd_client on the runtime ...

to give it a thin abstraction and hide the etcd detail here -

You mean like nim-nvllm has had since mid January? (update internal URL)

You could choose to do discovery with etcd or nats. --net-kv [nats|etc|<also urls>].

Ryan O dropped it in his furious weekend re-write that became the initial commit to triton distributed. I'm stuck in a time loop.

@grahamking
Copy link
Copy Markdown
Contributor

We should maybe warn people not to rely on the etcd bindings if we're going to put them behind an abstraction eventually.

@nnshah1 nnshah1 requested a review from statiraju March 4, 2025 20:47
@nnshah1 nnshah1 merged commit 57d19b5 into main Mar 4, 2025
@nnshah1 nnshah1 deleted the nnshah1-vllm-nixl-etcd branch March 4, 2025 21:10
kylehh pushed a commit to kylehh/dynamo that referenced this pull request Apr 11, 2025
Co-authored-by: hongkuanz <hongkuanz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com>
Co-authored-by: Piotr Tarasiewicz Nvidia <ptarasiewicznv@Piotrs-MacBook-Pro.local>
Co-authored-by: Neelay Shah <neelays@ipp2-0493.ipp2u1.colossus.nvidia.com>
Co-authored-by: Neelay Shah <neelays@ipp1-1941.ipp1a1.colossus.nvidia.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Co-authored-by: Neelay Shah <neelays@4u8g-gen-0078.ipp3a2.colossus.nvidia.com>
Co-authored-by: ptarasiewiczNV <104908264+ptarasiewiczNV@users.noreply.github.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Fixes all actionable items from the second review:

Bug fixes:
- #1: Change returncode=4 → returncode=2 in pytest_configure exit
  (4 is reserved by pytest for EXIT_NOTESTSCOLLECTED)
- #2: Add comment clarifying HF_HUB_OFFLINE double-clear is safe
  (already in _MODELS_DIR_ENV_KEYS; loop correctly restores original)

Test quality:
- #7: Add missing assertions to test_apply_hf_home_layout
  (HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE, DYNAMO_MODELS_DIR, TRANSFORMERS_CACHE)
- #8: Use monkeypatch in tests 3 & 4 for proper env isolation
  (prevents pre-existing env vars from leaking on test failure)

Design / correctness:
- #3: Fix _models_dir_env docstring ("exactly once" → "once per worker")
- #4: Add comment noting TRANSFORMERS_CACHE deprecation
- #5: Update --models-dir help text and docs to reflect both supported
  layouts (bare HF_HUB_CACHE and HF_HOME), not just bare
- #10: Restore pytest.skip() in download_lora() (test-only infra);
  remove now-redundant guard from minio_lora_service fixture
- #11: Raise hub/ detection log to WARNING with guidance
- #12: Replace shutil.rmtree(ignore_errors=True) with try/except
  so cleanup failures are logged rather than silently swallowed

Not addressed: #6 (keep gpu_0 per project marker policy), #9 (pytester
test deferred — complex due to conftest dependencies, low severity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
ranrubin added a commit that referenced this pull request Apr 20, 2026
Bug fixes:
- #1: Add monkeypatch env isolation to test_apply_sets_writable_transformers_cache
- #2: Add TRANSFORMERS_CACHE assertion to test_apply_bare_cache_layout
  (bare-layout path was missing the writable-dir check present in HF_HOME test)

Minor cleanups:
- #4: Move `import pytest` from inside download_lora() to module top-level
  (lora_utils.py is test-only infra; pytest is always available)
- #5: Replace pytestconfig.getoption("--models-dir") re-checks in
  predownload_models/predownload_tokenizers with os.environ.get("DYNAMO_MODELS_DIR")
  (_models_dir_env runs first and sets the var; single source of truth)

New coverage tests:
- #7: test_models_dir_nonexistent_exits_with_code_2 — subprocess test
  verifying pytest_configure exits with returncode=2 on bad path
- #8: test_download_lora_skips_in_models_dir_mode — verifies download_lora()
  raises pytest.skip.Exception when DYNAMO_MODELS_DIR is set

Not addressed: #3 (keep gpu_0 per project guidelines and previous review
retraction), #6 (hook ordering is guaranteed), #9 (complex, low priority)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: rrubin <rrubin@nvidia.com>
furionw added a commit that referenced this pull request May 12, 2026
…later

Reader-first ordering. Old order buried Quick start at position 8 of 9;
new order surfaces the runnable commands above all the reference
sections (FSx vs EBS, label conventions, aiperf SHA rationale).

New section order:
  1. Title + 3-config table
  2. Pre-requisites           (was #3)
  3. Quick start              (was #8 — promoted)
  4. Directory layout         (was #7 — now serves as map for the rest)
  5. Hardware targets         (was #2 — now pure reference; invocation
                                       examples moved into Quick start)
  6. Storage                  (was #5)
  7. aiperf install           (was #6)
  8. Naming & ownership       (was #4)
  9. Notes                    (unchanged)

Also drop the stray "We use ebs by default" sentence — it contradicted
both the Storage section and the actual yaml (where the PVC block is
fully commented out, no default storage class is set).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants