Add Hub.snapshot_download for HF Hub repo IDs #6
Open
duncanita wants to merge 3 commits into
_tensor_to_mlx declared U32 and other integer dtypes in DTYPE_UNPACK but never branched on them in the if/elsif chain. Packed 4-bit quantized weights (stored as uint32 in mlx-community safetensors) fell through to the F32 fallback and were decoded as garbage floats, causing `[dequantize] The matrix should be given as a uint32` on the first QuantizedEmbedding forward. Reproduces on mlx-community/Llama-3.2-1B-Instruct-4bit and presumably every 4-bit model.
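To see why the F32 fallback produced garbage, here is a minimal sketch in plain Ruby (no MLX required). It packs four uint32 words as little-endian bytes, the way a safetensors buffer stores them, then decodes the same bytes both ways. The data is made up, and the low-nibble-first order used to split a word into eight 4-bit values is an assumption for illustration, not something this commit specifies.

```ruby
# Four uint32 words holding packed 4-bit values (illustrative data, not real weights).
words = [0x76543210, 0xFEDCBA98, 0x00000001, 0x89ABCDEF]
bytes = words.pack("L<*") # little-endian, as in a safetensors buffer

as_u32 = bytes.unpack("L<*") # correct: reinterpret the bytes as uint32
as_f32 = bytes.unpack("e*")  # the old fallback: the same bytes read as float32

# Assumed packing order: eight 4-bit values per word, low nibble first.
nibbles = (0...8).map { |i| (as_u32[0] >> (4 * i)) & 0xF }

puts as_u32.inspect
puts as_f32.inspect
puts nibbles.inspect
```

The float32 reading yields values unrelated to the packed integers, which is why a later dequantize step that expects uint32 input rejects the tensor.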
The if/elsif chain in _tensor_to_mlx duplicated the DTYPE_UNPACK constant declared at the top of the file, which is how the U16/U32/I8/I16 branches went missing from the chain in the first place. Table-driven lookup keeps the mapping in one place. F16 and BF16 stay as explicit branches because they take a different code path (uint16 stage + .view cast). Unknown-dtype F32 fallback is preserved to match prior behavior. Uses __send__ instead of send because MLX::Core defines a `send` method (takes 2..4 args) that would shadow Object#send.
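The two ideas in this commit can be sketched without MLX. The table below is hypothetical: it maps safetensors dtype names to Ruby `String#unpack` directives, and the real DTYPE_UNPACK contents may differ. `FakeCore` is a stand-in module (not MLX::Core) that defines its own `send` with 2..4 arguments, to show why `__send__` is the safer choice for dynamic dispatch.

```ruby
# Hypothetical table: safetensors dtype name => String#unpack directive.
DTYPE_UNPACK = {
  "F32" => "e*",  "F64" => "E*",
  "U8"  => "C*",  "I8"  => "c*",
  "U16" => "S<*", "I16" => "s<*",
  "U32" => "L<*", "I32" => "l<*",
  "U64" => "Q<*", "I64" => "q<*"
}.freeze

# One lookup replaces the if/elsif chain; unknown dtypes fall back to F32.
def unpack_values(dtype, bytes)
  bytes.unpack(DTYPE_UNPACK.fetch(dtype, DTYPE_UNPACK["F32"]))
end

# Stand-in for a module that defines its own `send` taking 2..4 arguments.
module FakeCore
  def self.send(a, b, c = nil, d = nil)
    :not_dispatch
  end

  def self.uint32
    :uint32_dtype
  end
end

# __send__ still reaches Ruby's dynamic dispatch.
dtype = FakeCore.__send__(:uint32)

# The module's own `send` shadows Object#send and rejects a single argument.
shadowed =
  begin
    FakeCore.send(:uint32)
  rescue ArgumentError
    :argument_error
  end

puts dtype.inspect
puts shadowed.inspect
```

Because every dtype goes through the same `fetch`, adding a new integer type means adding one table entry rather than remembering to extend a branch chain.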
Resolves the cli.md caveat that `--model <hf_repo_id>` "does not work
yet — current runtime loading expects a local model directory".
MlxLm::Hub.snapshot_download fetches a model snapshot from
huggingface.co using pure Ruby + stdlib net/http. Cache layout matches
huggingface_hub Python's (models--<org>--<repo>/{refs,snapshots}/...)
so caches are mutually reusable — a model downloaded via Python
mlx_lm is used as-is, and vice-versa.
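The cache layout described above can be sketched as a few path helpers. The helper names are hypothetical and are not taken from this PR; only the `models--<org>--<repo>/{refs,snapshots}` shape comes from the text.

```ruby
# Folder name for a model repo: "org/repo" -> "models--org--repo".
def repo_folder_name(repo_id)
  "models--" + repo_id.split("/").join("--")
end

# Path of the ref file that maps a revision name to a commit sha.
def ref_path(cache_dir, repo_id, revision)
  File.join(cache_dir, repo_folder_name(repo_id), "refs", revision)
end

# Path of one file inside a resolved snapshot.
def snapshot_file_path(cache_dir, repo_id, sha, rel_path)
  File.join(cache_dir, repo_folder_name(repo_id), "snapshots", sha, rel_path)
end

repo = "mlx-community/Llama-3.2-1B-Instruct-4bit"
puts ref_path("/cache", repo, "main")
puts snapshot_file_path("/cache", repo, "abc123", "config.json")
```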
LoadUtils.load now accepts either a local path or an HF repo id like
"mlx-community/Llama-3.2-1B-Instruct-4bit", plus an optional revision:
parameter (default "main"). File.directory? short-circuits the local
path; otherwise the repo is resolved via the hub.
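A minimal sketch of that dispatch, assuming a hypothetical helper `resolve_model_dir` and a `downloader` callable that stands in for the hub call so the example runs offline. The real LoadUtils.load signature may differ.

```ruby
require "tmpdir"

# `downloader` is any callable taking (repo_id, revision:) and returning a directory.
def resolve_model_dir(path_or_repo, downloader:, revision: "main")
  # An existing local directory wins and never touches the network.
  return path_or_repo if File.directory?(path_or_repo)

  downloader.call(path_or_repo, revision: revision)
end

calls = []
fake_downloader = lambda do |repo_id, revision:|
  calls << [repo_id, revision]
  "/fake/cache/#{repo_id}/#{revision}"
end

local_dir = Dir.mktmpdir
from_local = resolve_model_dir(local_dir, downloader: fake_downloader)
from_hub = resolve_model_dir("mlx-community/Llama-3.2-1B-Instruct-4bit",
                             downloader: fake_downloader)

puts from_local
puts from_hub
```

One consequence of checking the directory first: a relative directory whose name happens to match a repo id takes precedence over the hub.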
Respects env vars the Python client uses: HF_HUB_CACHE, HF_HOME,
HF_ENDPOINT, HF_TOKEN. Handles HTTP redirects (HF CDN) and resolves
relative Location headers against the current URL. Scopes Authorization
to huggingface.co hosts (redirected CDNs reject bearer tokens).
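The behaviors in this paragraph can be sketched with stdlib `uri` alone. All helper names are hypothetical. The cache precedence (HF_HUB_CACHE, then HF_HOME/hub, then ~/.cache/huggingface/hub) follows the Python client's documented defaults in simplified form and is an assumption about this PR; the exact host comparison for the bearer token is likewise one plausible way to scope it.

```ruby
require "uri"

DEFAULT_ENDPOINT = "https://huggingface.co"

# Cache root: explicit HF_HUB_CACHE, else <HF_HOME>/hub, else ~/.cache/huggingface/hub.
def hub_cache_dir(env = ENV)
  explicit = env["HF_HUB_CACHE"]
  return explicit unless explicit.nil? || explicit.empty?

  hf_home = env["HF_HOME"]
  hf_home = File.join(Dir.home, ".cache", "huggingface") if hf_home.nil? || hf_home.empty?
  File.join(hf_home, "hub")
end

# Endpoint: HF_ENDPOINT if set, otherwise the public hub.
def hub_endpoint(env = ENV)
  value = env["HF_ENDPOINT"]
  value.nil? || value.empty? ? DEFAULT_ENDPOINT : value
end

# Resolve a Location header (absolute or relative) against the URL that returned it.
def resolve_redirect(current_url, location)
  URI.join(current_url, location).to_s
end

# Send the bearer token only to the endpoint's own host (assumed scoping rule).
def auth_headers(url, token, endpoint: DEFAULT_ENDPOINT)
  return {} if token.nil? || token.empty?

  URI(url).host == URI(endpoint).host ? { "Authorization" => "Bearer #{token}" } : {}
end

puts hub_cache_dir({ "HF_HOME" => "/data/hf" })
puts resolve_redirect("https://huggingface.co/org/repo/resolve/main/config.json", "/api/x/config.json")
puts auth_headers("https://cdn.example.com/blob", "secret").inspect
```

Taking `env` as a parameter keeps the helpers testable without mutating the process environment.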
Tested against mlx-community/Llama-3.2-1B-Instruct-4bit:

    bundle exec exe/mlx_lm generate \
      --model mlx-community/Llama-3.2-1B-Instruct-4bit \
      --prompt 'The capital of France is'

produces `Paris.` on both a warm cache and a fresh cache (verified
with allow_patterns to a throwaway cache_dir).
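The allow_patterns filter mentioned above can be sketched with stdlib `File.fnmatch`. Whether the PR uses `File.fnmatch` internally is an assumption; the pattern list is the one the PR description gives for LoadUtils.load, and `allowed?` is a hypothetical helper name.

```ruby
ALLOW_PATTERNS = ["*.json", "*.txt", "*.safetensors", "*.jinja", "*.model"].freeze

# A file is kept when no patterns are given or when any glob pattern matches it.
def allowed?(rel_path, patterns)
  return true if patterns.nil? || patterns.empty?

  patterns.any? { |pattern| File.fnmatch(pattern, rel_path) }
end

files = ["config.json", "model.safetensors", "README.md", "tokenizer.model", ".gitattributes"]
selected = files.select { |f| allowed?(f, ALLOW_PATTERNS) }

puts selected.inspect
```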
Summary
Resolves the cli.md caveat that `--model <hf_repo_id>` "does not work yet — current runtime loading expects a local model directory". After this change, `mlx_lm generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt '…'` resolves the repo, downloads the snapshot, and runs; no pre-downloaded local path is required.

Design
- Adds `MlxLm::Hub` with one public API: `snapshot_download(repo_id, revision:, allow_patterns:, token:, cache_dir:, endpoint:)`. Pure Ruby + stdlib `net/http`; no new gem deps.
- Cache layout matches `huggingface_hub` Python: `<cache>/models--<org>--<repo>/{refs/<rev>, snapshots/<sha>/<rel_path>}`. Caches are mutually reusable: a model previously fetched via Python is picked up as-is, and vice versa.
- Respects the env vars the Python client uses: `HF_HUB_CACHE`, `HF_HOME`, `HF_ENDPOINT`, `HF_TOKEN`.
- Handles HTTP redirects and resolves relative `Location` headers against the current URL. Scopes `Authorization: Bearer` to `huggingface.co` hosts, because redirected CDNs (S3/Cloudfront) reject bearer tokens.
- `LoadUtils.load` now short-circuits on `File.directory?(path)` for local paths; otherwise it treats the argument as a repo id and routes through `Hub.snapshot_download` with `allow_patterns: ["*.json", "*.txt", "*.safetensors", "*.jinja", "*.model"]`. Added a `revision:` kwarg (defaults to `"main"`).

Dependency
This branch is based on top of #5 (the U32 weight-loading fix). Without that fix, the downloader would successfully fetch mlx-community 4-bit models but they'd still fail on the first `QuantizedEmbedding` forward. If #5 merges first, this branch rebases to a single Hub commit (+149/-5 across 3 files); otherwise the 3 commits can be merged together.

Test plan
- `mlx_lm generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt 'The capital of France is' --max-tokens 12 --temp 0.0` → `Paris.` (identical to pre-change local-path run).
- `Hub.snapshot_download(..., allow_patterns: ['config.json','tokenizer_config.json'], cache_dir: '/tmp/…')` to a throwaway dir: API call succeeds, redirects are followed, 2 files land at the correct `snapshots/<sha>/` path, `refs/main` is written.

Notes / possible follow-ups (not in this PR)
- `huggingface_hub` stores one copy in `blobs/<etag>` and symlinks into `snapshots/<sha>/`. This PR writes files directly into `snapshots/<sha>/` for simplicity. Dedup adds a little code and modest disk savings for users who pin multiple revisions of the same repo; happy to add it if wanted.
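For reference, the blob-plus-symlink dedup described in that note could look roughly like the sketch below. It is not part of this PR; `store_with_dedup` is a hypothetical helper, the etag and sha values are placeholders, and it assumes a filesystem that supports symlinks.

```ruby
require "fileutils"
require "pathname"
require "tmpdir"

# Write content once under blobs/<etag>, then link it into snapshots/<sha>/<rel_path>.
def store_with_dedup(repo_dir, sha, rel_path, etag, content)
  blob = File.join(repo_dir, "blobs", etag)
  FileUtils.mkdir_p(File.dirname(blob))
  File.binwrite(blob, content) unless File.exist?(blob)

  link = File.join(repo_dir, "snapshots", sha, rel_path)
  FileUtils.mkdir_p(File.dirname(link))
  # A relative target keeps the cache directory relocatable.
  target = Pathname.new(blob).relative_path_from(Pathname.new(File.dirname(link))).to_s
  File.symlink(target, link) unless File.symlink?(link)
  link
end

repo_dir = Dir.mktmpdir
link_a = store_with_dedup(repo_dir, "sha_a", "config.json", "etag1", "{}")
link_b = store_with_dedup(repo_dir, "sha_b", "config.json", "etag1", "{}")

puts File.read(link_a)
puts Dir.children(File.join(repo_dir, "blobs")).inspect
```

Two revisions that share a file then cost one blob on disk plus two symlinks.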