Add Hub.snapshot_download for HF Hub repo IDs #6

Open
duncanita wants to merge 3 commits into skryl:main from duncanita:feat/hf-hub-downloader

@duncanita

Summary

Resolves the cli.md caveat that --model <hf_repo_id> "does not work yet — current runtime loading expects a local model directory". After this change, mlx_lm generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt '…' resolves the repo, downloads the snapshot, and runs — no pre-downloaded local path required.

Design

  • New module MlxLm::Hub with one public API: snapshot_download(repo_id, revision:, allow_patterns:, token:, cache_dir:, endpoint:). Pure Ruby + stdlib net/http; no new gem deps.
  • Cache layout matches huggingface_hub Python: <cache>/models--<org>--<repo>/{refs/<rev>, snapshots/<sha>/<rel_path>}. Caches are mutually reusable — a model previously fetched via Python is picked up as-is, and vice versa.
  • Respects env vars the Python client uses: HF_HUB_CACHE, HF_HOME, HF_ENDPOINT, HF_TOKEN.
  • Handles HTTP redirects (HF serves file bodies from a CDN) and resolves relative Location headers against the current URL. Scopes Authorization: Bearer to huggingface.co hosts — redirected CDNs (S3/Cloudfront) reject bearer tokens.
  • LoadUtils.load now short-circuits File.directory?(path) for local paths; otherwise treats the argument as a repo id and routes through Hub.snapshot_download with allow_patterns: ["*.json", "*.txt", "*.safetensors", "*.jinja", "*.model"]. Added a revision: kwarg (defaults to "main").
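The cache-directory precedence implied by the env vars above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `hub_cache_dir` is a hypothetical helper name, and the order shown (explicit `cache_dir:`, then `HF_HUB_CACHE`, then `HF_HOME/hub`, then `~/.cache/huggingface/hub`) mirrors how the Python client resolves its default, with `XDG_CACHE_HOME` handling left out for brevity.

```ruby
# Hypothetical helper: resolve the hub cache directory.
# Precedence (assumed, modelled on the Python client):
#   1. explicit cache_dir: argument
#   2. HF_HUB_CACHE
#   3. HF_HOME + "/hub"
#   4. ~/.cache/huggingface/hub
def hub_cache_dir(env: ENV, cache_dir: nil)
  return File.expand_path(cache_dir) if cache_dir

  explicit = env["HF_HUB_CACHE"]
  return File.expand_path(explicit) if explicit && !explicit.empty?

  hf_home = env["HF_HOME"]
  if hf_home.nil? || hf_home.empty?
    hf_home = File.join(Dir.home, ".cache", "huggingface")
  end
  File.join(File.expand_path(hf_home), "hub")
end
```

Passing `env:` as a plain Hash keeps the helper easy to exercise without mutating the process environment.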

Dependency

This branch is based on top of #5 (the U32 weight-loading fix). Without that fix, the downloader would successfully fetch mlx-community 4-bit models but they'd still fail on the first QuantizedEmbedding forward. If #5 merges first, this branch rebases to a single Hub commit (+149/-5 across 3 files); otherwise the 3 commits can be merged together.

Test plan

  • Warm-cache smoke: mlx_lm generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt 'The capital of France is' --max-tokens 12 --temp 0.0 produces Paris. (identical to the pre-change local-path run).
  • Cold-cache smoke: Hub.snapshot_download(..., allow_patterns: ['config.json','tokenizer_config.json'], cache_dir: '/tmp/…') to a throwaway dir — API call succeeds, redirects are followed, 2 files land at the correct snapshots/<sha>/ path, refs/main is written.
  • Local-path back-compat: passing a directory still short-circuits without network.
  • CI.

Notes / possible follow-ups (not in this PR)

  • Blob dedup. Python huggingface_hub stores one copy in blobs/<etag> and symlinks into snapshots/<sha>/. This PR writes files directly into snapshots/<sha>/ for simplicity. Dedup adds a little code and modest disk savings for users who pin multiple revisions of the same repo — happy to add it if wanted.
  • ETag / integrity checks. Not verified in v1; happy to add.
  • Resume on interrupted downloads. Not in v1; a failed download restarts from byte 0 on the next run.
  • Progress output. None today; downloads run silently.

_tensor_to_mlx declared U32 and other integer dtypes in DTYPE_UNPACK
but never branched on them in the if/elsif chain. Packed 4-bit
quantized weights (stored as uint32 in mlx-community safetensors) fell
through to the F32 fallback and were decoded as garbage floats,
causing `[dequantize] The matrix should be given as a uint32` on the
first QuantizedEmbedding forward.

Reproduces on mlx-community/Llama-3.2-1B-Instruct-4bit and presumably
every 4-bit model.
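To see why the F32 fallback yields garbage, here is a small standalone illustration using only `String#unpack`. It does not touch MLX or the real `_tensor_to_mlx`; the byte values are made up, and the point is only that the same little-endian bytes read as uint32 and as float32 give unrelated numbers.

```ruby
# Two made-up packed words, serialized as little-endian uint32,
# the way a safetensors U32 tensor stores its raw bytes.
packed_bytes = [0x12345678, 0x0000000F].pack("L<*")

# Correct interpretation: little-endian unsigned 32-bit integers.
as_u32 = packed_bytes.unpack("L<*")

# What an F32 fallback effectively does: reinterpret the same
# bytes as little-endian single-precision floats.
as_f32 = packed_bytes.unpack("e*")

puts as_u32.inspect # the original integers
puts as_f32.inspect # tiny, meaningless floats
```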

The if/elsif chain in _tensor_to_mlx duplicated the DTYPE_UNPACK
constant declared at the top of the file — which is how the U16/U32/
I8/I16 branches went missing from the chain in the first place. Table-
driven lookup keeps the mapping in one place.
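A table-driven decode along these lines might look like the sketch below. The table name `UNPACK_FORMATS`, its contents, and `decode_values` are hypothetical stand-ins, since the actual DTYPE_UNPACK entries are not shown here; the formats are standard `String#unpack` directives.

```ruby
# Hypothetical dtype table: safetensors dtype name => unpack directive.
UNPACK_FORMATS = {
  "U8"  => "C*",  "I8"  => "c*",
  "U16" => "S<*", "I16" => "s<*",
  "U32" => "L<*", "I32" => "l<*",
  "F32" => "e*",  "F64" => "E*"
}.freeze

# Decode raw little-endian bytes with a single lookup.
# Unknown dtypes fall back to F32, mirroring the fallback
# behavior the commit message says is preserved.
def decode_values(dtype, bytes)
  format = UNPACK_FORMATS.fetch(dtype, "e*")
  bytes.unpack(format)
end
```

Adding a dtype then means adding one table row, so a branch cannot silently go missing.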

F16 and BF16 stay as explicit branches because they take a different
code path (uint16 stage + .view cast). Unknown-dtype F32 fallback is
preserved to match prior behavior.

Uses __send__ instead of send because MLX::Core defines a `send`
method (takes 2..4 args) that would shadow Object#send.
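The shadowing problem is easy to reproduce in isolation. The module below is a made-up stand-in (`FakeCore` is not MLX::Core, and its `send` body is invented); it only shows that a module-level `send` with a 2..4 argument signature hijacks dynamic dispatch, while `__send__` still works.

```ruby
# Stand-in for a module that defines its own `send`.
module FakeCore
  # Takes 2..4 arguments, like the method described above.
  def self.send(buffer, dest, tag = 0, group = nil)
    [:sent, buffer, dest, tag, group]
  end

  def self.uint32
    :uint32
  end
end

# __send__ is not overridden, so dynamic dispatch still works.
dtype = FakeCore.__send__(:uint32)

# Plain `send` now hits FakeCore.send and fails on arity.
begin
  FakeCore.send(:uint32)
  shadowed = false
rescue ArgumentError
  shadowed = true
end
```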

Resolves the cli.md caveat that `--model <hf_repo_id>` "does not work
yet — current runtime loading expects a local model directory".

MlxLm::Hub.snapshot_download fetches a model snapshot from
huggingface.co using pure Ruby + stdlib net/http. Cache layout matches
huggingface_hub Python's (models--<org>--<repo>/{refs,snapshots}/...)
so caches are mutually reusable — a model downloaded via Python
mlx_lm is used as-is, and vice-versa.
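For reference, the path layout described above can be sketched as a few pure functions. The helper names are hypothetical and the commit SHA in the usage is a placeholder; only the `models--<org>--<repo>/{refs,snapshots}` shape comes from the description.

```ruby
# Hypothetical helpers for the huggingface_hub-style cache layout.

# "org/repo" => "models--org--repo"
def repo_folder_name(repo_id)
  (["models"] + repo_id.split("/")).join("--")
end

# File that records which commit SHA a revision name points at.
def ref_path(cache_dir, repo_id, revision)
  File.join(cache_dir, repo_folder_name(repo_id), "refs", revision)
end

# Final location of one downloaded file inside a snapshot.
def snapshot_file_path(cache_dir, repo_id, sha, rel_path)
  File.join(cache_dir, repo_folder_name(repo_id), "snapshots", sha, rel_path)
end
```

For example, `ref_path("/cache", "mlx-community/Llama-3.2-1B-Instruct-4bit", "main")` gives `/cache/models--mlx-community--Llama-3.2-1B-Instruct-4bit/refs/main`.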

LoadUtils.load now accepts either a local path or an HF repo id like
"mlx-community/Llama-3.2-1B-Instruct-4bit", plus an optional revision:
parameter (default "main"). File.directory? short-circuits the local
path; otherwise the repo is resolved via the hub.
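The local-path-or-repo-id dispatch could be sketched like this. `resolve_model_path`, `wanted_file?`, and the injected `downloader:` callable are hypothetical names chosen so the example runs without network access; in the PR the download step is Hub.snapshot_download, and the pattern list is the one given in the PR description.

```ruby
# Patterns listed in the PR description for model loading.
MODEL_FILE_PATTERNS = ["*.json", "*.txt", "*.safetensors", "*.jinja", "*.model"].freeze

# True if a repo-relative path matches any allow pattern.
def wanted_file?(rel_path, patterns = MODEL_FILE_PATTERNS)
  patterns.any? { |pattern| File.fnmatch?(pattern, rel_path) }
end

# Hypothetical dispatch: an existing directory is returned untouched
# (no network); anything else is treated as a repo id and handed to
# the downloader, which is expected to return a local snapshot path.
def resolve_model_path(path_or_repo_id, revision: "main", downloader:)
  return path_or_repo_id if File.directory?(path_or_repo_id)

  downloader.call(path_or_repo_id,
                  revision: revision,
                  allow_patterns: MODEL_FILE_PATTERNS)
end
```

Injecting the downloader is a testing convenience in this sketch, not a claim about how LoadUtils.load is wired.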

Respects env vars the Python client uses: HF_HUB_CACHE, HF_HOME,
HF_ENDPOINT, HF_TOKEN. Handles HTTP redirects (HF CDN) and resolves
relative Location headers against the current URL. Scopes the Authorization
header to huggingface.co hosts (redirected CDNs reject bearer tokens).
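The two redirect rules above can be sketched with stdlib `uri` alone. Helper names are hypothetical, the URLs in the usage are illustrative, and treating `*.huggingface.co` subdomains as trusted is an assumption of this sketch (a custom HF_ENDPOINT would need its own host check).

```ruby
require "uri"

# Resolve a Location header against the URL that produced it.
# Absolute locations pass through; relative ones are joined.
def redirect_target(current_url, location)
  URI.join(current_url.to_s, location)
end

# Return the Authorization header value for a request, or nil.
# Assumption: only huggingface.co and its subdomains get the token.
def bearer_for(uri, token)
  return nil if token.nil? || token.empty?

  host = uri.host.to_s.downcase
  trusted = host == "huggingface.co" || host.end_with?(".huggingface.co")
  trusted ? "Bearer #{token}" : nil
end
```

A fetch loop would call `redirect_target` on each 3xx response, then `bearer_for` on the resulting URI before issuing the next request, so the token is dropped as soon as the chain leaves the hub.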

Tested against mlx-community/Llama-3.2-1B-Instruct-4bit:

    bundle exec exe/mlx_lm generate \
      --model mlx-community/Llama-3.2-1B-Instruct-4bit \
      --prompt 'The capital of France is'

produces `Paris.` on both a warm cache and a fresh cache (verified
with allow_patterns to a throwaway cache_dir).