Add Hub.snapshot_download for HF Hub repo IDs by duncanita · Pull Request #6 · skryl/mlx-ruby-lm

duncanita · 2026-04-19T14:35:42Z

Summary

Resolves the cli.md caveat that --model <hf_repo_id> "does not work yet — current runtime loading expects a local model directory". After this change, mlx_lm generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt '…' resolves the repo, downloads the snapshot, and runs — no pre-downloaded local path required.

Design

New module MlxLm::Hub with one public API: snapshot_download(repo_id, revision:, allow_patterns:, token:, cache_dir:, endpoint:). Pure Ruby + stdlib net/http; no new gem deps.
Cache layout matches huggingface_hub Python: <cache>/models--<org>--<repo>/{refs/<rev>, snapshots/<sha>/<rel_path>}. Caches are mutually reusable — a model previously fetched via Python is picked up as-is, and vice versa.
Respects env vars the Python client uses: HF_HUB_CACHE, HF_HOME, HF_ENDPOINT, HF_TOKEN.
Handles HTTP redirects (HF serves file bodies from a CDN) and resolves relative Location headers against the current URL. Scopes Authorization: Bearer to huggingface.co hosts, because redirected CDNs (S3/CloudFront) reject bearer tokens.
LoadUtils.load now short-circuits File.directory?(path) for local paths; otherwise treats the argument as a repo id and routes through Hub.snapshot_download with allow_patterns: ["*.json", "*.txt", "*.safetensors", "*.jinja", "*.model"]. Added a revision: kwarg (defaults to "main").
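A minimal sketch of the path and pattern logic described in the list above. The module and method names here (HubSketch, cache_root, repo_folder, and so on) are hypothetical and are not the PR's internals. The precedence HF_HUB_CACHE, then HF_HOME/hub, then ~/.cache/huggingface/hub follows the Python client's defaults; that the PR uses exactly this order is an assumption, and XDG_CACHE_HOME handling is left out.

```ruby
# Hypothetical helpers illustrating the cache layout; not the PR's actual code.
module HubSketch
  ALLOW_PATTERNS = ["*.json", "*.txt", "*.safetensors", "*.jinja", "*.model"].freeze

  module_function

  # Assumed precedence: HF_HUB_CACHE, then $HF_HOME/hub, then the default.
  def cache_root(env = ENV)
    explicit = env["HF_HUB_CACHE"]
    return explicit if explicit && !explicit.empty?

    home = env["HF_HOME"]
    return File.join(home, "hub") if home && !home.empty?

    File.join(Dir.home, ".cache", "huggingface", "hub")
  end

  # "org/repo" becomes "models--org--repo".
  def repo_folder(repo_id)
    "models--" + repo_id.split("/").join("--")
  end

  # <cache>/models--<org>--<repo>/refs/<rev> holds the resolved commit sha.
  def ref_path(cache, repo_id, revision)
    File.join(cache, repo_folder(repo_id), "refs", revision)
  end

  # <cache>/models--<org>--<repo>/snapshots/<sha>/<rel_path>
  def snapshot_path(cache, repo_id, sha, rel_path)
    File.join(cache, repo_folder(repo_id), "snapshots", sha, rel_path)
  end

  # Glob filter applied to each file listed in the repo.
  def wanted?(rel_path, patterns = ALLOW_PATTERNS)
    patterns.any? { |pattern| File.fnmatch(pattern, rel_path) }
  end

  # Local directories win; anything else is treated as a repo id.
  def local_model?(path_or_repo)
    File.directory?(path_or_repo)
  end
end
```

Passing an env hash instead of reading ENV directly keeps the resolution easy to exercise without touching the real environment, e.g. `HubSketch.cache_root("HF_HOME" => "/data/hf")`.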

Dependency

This branch is based on top of #5 (the U32 weight-loading fix). Without that fix, the downloader would successfully fetch mlx-community 4-bit models but they'd still fail on the first QuantizedEmbedding forward. If #5 merges first, this branch rebases to a single Hub commit (+149/-5 across 3 files); otherwise the 3 commits can be merged together.

Test plan

Warm-cache smoke: mlx_lm generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt 'The capital of France is' --max-tokens 12 --temp 0.0 → Paris. (identical to pre-change local-path run).
Cold-cache smoke: Hub.snapshot_download(..., allow_patterns: ['config.json','tokenizer_config.json'], cache_dir: '/tmp/…') to a throwaway dir — API call succeeds, redirects are followed, 2 files land at the correct snapshots/<sha>/ path, refs/main is written.
Local-path back-compat: passing a directory still short-circuits without network.
CI.

Notes / possible follow-ups (not in this PR)

Blob dedup. Python huggingface_hub stores one copy in blobs/<etag> and symlinks into snapshots/<sha>/. This PR writes files directly into snapshots/<sha>/ for simplicity. Dedup adds a little code and modest disk savings for users who pin multiple revisions of the same repo — happy to add it if wanted.
ETag / integrity checks. Not verified in v1; happy to add.
Resume on interrupted downloads. Not in v1; failed downloads re-fetch from 0 next run.
Progress output. None today; downloads run silently.
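For reference, the blob dedup follow-up could look roughly like the sketch below. The helper name link_into_snapshot is hypothetical, the etag is assumed to be already stripped of quotes, and symlink support on the host filesystem is assumed.

```ruby
require "fileutils"
require "pathname"
require "tmpdir"

# Hypothetical helper: write the body once under blobs/<etag>, then point
# snapshots/<sha>/<rel_path> at it with a relative symlink.
def link_into_snapshot(repo_dir, etag, sha, rel_path, body)
  blob = File.join(repo_dir, "blobs", etag)
  FileUtils.mkdir_p(File.dirname(blob))
  File.binwrite(blob, body) unless File.exist?(blob)

  link = File.join(repo_dir, "snapshots", sha, rel_path)
  link_dir = File.dirname(link)
  FileUtils.mkdir_p(link_dir)

  # A relative target keeps the cache relocatable as a whole directory.
  target = Pathname.new(blob).relative_path_from(Pathname.new(link_dir)).to_s
  File.symlink(target, link) unless File.symlink?(link)
  link
end

# Two revisions sharing one file end up sharing one blob on disk.
Dir.mktmpdir do |dir|
  link_into_snapshot(dir, "etag1", "sha_a", "config.json", "{}")
  link_into_snapshot(dir, "etag1", "sha_b", "config.json", "{}")
end
```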

_tensor_to_mlx declared U32 and other integer dtypes in DTYPE_UNPACK but never branched on them in the if/elsif chain. Packed 4-bit quantized weights (stored as uint32 in mlx-community safetensors) fell through to the F32 fallback and were decoded as garbage floats, causing `[dequantize] The matrix should be given as a uint32` on the first QuantizedEmbedding forward. Reproduces on mlx-community/Llama-3.2-1B-Instruct-4bit and presumably every 4-bit model.
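To illustrate the failure mode, the sketch below decodes the same four bytes both ways using only Ruby's pack/unpack. The packed value and the low-nibble-first order are illustrative assumptions, not taken from a real checkpoint.

```ruby
# One uint32 word holding eight 4-bit values (illustrative data).
packed_word = 0x87654321
bytes = [packed_word].pack("L<")   # little-endian uint32, as stored on disk

as_u32 = bytes.unpack("L<")        # correct: read back as uint32
as_f32 = bytes.unpack("e")         # the bug: same bytes read as float32

# Only the uint32 reading lets the 4-bit values be recovered.
nibbles = 8.times.map { |i| (as_u32[0] >> (4 * i)) & 0xF }

puts "uint32:  #{as_u32.inspect}"
puts "float32: #{as_f32.inspect}"
puts "nibbles: #{nibbles.inspect}"
```

The float32 reading is a valid Float, which is why nothing fails at load time and the error only surfaces later in dequantize.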

The if/elsif chain in _tensor_to_mlx duplicated the DTYPE_UNPACK constant declared at the top of the file, which is how the U16/U32/I8/I16 branches went missing from the chain in the first place. Table-driven lookup keeps the mapping in one place. F16 and BF16 stay as explicit branches because they take a different code path (uint16 stage + .view cast). Unknown-dtype F32 fallback is preserved to match prior behavior. Uses __send__ instead of send because MLX::Core defines a `send` method (takes 2..4 args) that would shadow Object#send.
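A rough sketch of the table-driven shape, under stated assumptions: the table contents below are illustrative rather than the file's real DTYPE_UNPACK, FakeCore is a stand-in for MLX::Core so the example runs without the gem, and F16/BF16 are omitted because they keep their own branches.

```ruby
# Illustrative table: safetensors dtype => [unpack directive, dtype accessor].
DTYPE_UNPACK = {
  "F32" => ["e*",  :float32],
  "U8"  => ["C*",  :uint8],
  "U16" => ["S<*", :uint16],
  "U32" => ["L<*", :uint32],
  "I8"  => ["c*",  :int8],
  "I16" => ["s<*", :int16],
  "I32" => ["l<*", :int32]
}.freeze

# Stand-in for MLX::Core. Like the real module (per the commit message), it
# defines its own 2..4-argument `send`, which shadows Object#send.
module FakeCore
  def self.send(a, b, c = nil, d = nil)
    [a, b, c, d].compact
  end

  %i[float32 uint8 uint16 uint32 int8 int16 int32].each do |name|
    define_singleton_method(name) { :"#{name}_dtype" }
  end
end

# One lookup replaces the if/elsif chain; unknown dtypes fall back to F32.
def decode(dtype, bytes, core: FakeCore)
  directive, accessor = DTYPE_UNPACK.fetch(dtype) { DTYPE_UNPACK.fetch("F32") }
  # __send__ always dispatches; core.send(accessor) would hit the shadowing
  # method and raise ArgumentError.
  [bytes.unpack(directive), core.__send__(accessor)]
end
```

With this shape, adding a dtype is a one-line table change, so the table and the dispatch cannot drift apart.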

Resolves the cli.md caveat that `--model <hf_repo_id>` "does not work yet — current runtime loading expects a local model directory".

MlxLm::Hub.snapshot_download fetches a model snapshot from huggingface.co using pure Ruby + stdlib net/http. Cache layout matches huggingface_hub Python's (models--<org>--<repo>/{refs,snapshots}/...) so caches are mutually reusable: a model downloaded via Python mlx_lm is used as-is, and vice versa.

LoadUtils.load now accepts either a local path or an HF repo id like "mlx-community/Llama-3.2-1B-Instruct-4bit", plus an optional revision: parameter (default "main"). File.directory? short-circuits the local path; otherwise the repo is resolved via the hub.

Respects env vars the Python client uses: HF_HUB_CACHE, HF_HOME, HF_ENDPOINT, HF_TOKEN. Handles HTTP redirects (HF CDN) and resolves relative Location headers against the current URL. Scopes Authorization to huggingface.co hosts (redirected CDNs reject bearer tokens).

Tested against mlx-community/Llama-3.2-1B-Instruct-4bit:

    bundle exec exe/mlx_lm generate \
      --model mlx-community/Llama-3.2-1B-Instruct-4bit \
      --prompt 'The capital of France is'

produces `Paris.` on both a warm cache and a fresh cache (verified with allow_patterns to a throwaway cache_dir).
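The redirect and auth-scoping rules can be sketched with stdlib URI alone. The helper names next_url and auth_headers are hypothetical, and treating subdomains of the hub host as trusted is an assumption rather than something the commit states.

```ruby
require "uri"

# Resolve a Location header (absolute or relative) against the current URL.
def next_url(current_url, location)
  URI.join(current_url, location).to_s
end

# Attach the bearer token only when the request targets the hub host itself
# (or, by assumption, one of its subdomains); CDN hosts get no Authorization.
def auth_headers(url, token, hub_host: "huggingface.co")
  host = URI(url).host.to_s
  on_hub = host == hub_host || host.end_with?(".#{hub_host}")
  token && on_hub ? { "Authorization" => "Bearer #{token}" } : {}
end
```

Recomputing the headers for every hop, rather than reusing the first request's headers, is what keeps the token from leaking to the CDN after a redirect.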

duncanita added 3 commits April 19, 2026 09:17

duncanita wants to merge 3 commits into skryl:main from duncanita:feat/hf-hub-downloader
