feat(hf): support writing and reading from both http and xet #7185
Xuanwo merged 53 commits into apache:main
Conversation
I'm still working on a couple of things, like implementing oio::Write instead of OneShotWrite, which requires some additional changes in my xet-core PR.
Force-pushed b403fa5 to c8c3e8a
Xuanwo left a comment
Thank you very much for working on this! Great job, here is my first round of review.
        .with_context("service", HF_SCHEME)),
    None => Ok(RepoType::Model),
}?;
let repo_type = self.config.repo_type;
/// Read a middle range of a known file.
#[tokio::test]
#[ignore = "requires network access"]
I suggest adding a tests.rs file and guiding users to run tests in an environment such as OPENDAL_SERVICE_TEST_HF.
// When xet is disabled, preserve whatever HTTP client is already set
// on `info` (important for mock-based unit tests).
#[cfg(feature = "xet")]
let no_redirect_client = if xet_enabled {
Services shouldn't build their own HTTP clients. But we can figure this out later.
We need a client that does not follow redirects to retrieve the X-Xet-Hash header. If I'm not mistaken, we need another HTTP client for this; I'm not sure how else I could handle it.
Since a new revamped Xet client is in the works (which will be published, and on which this PR should eventually depend), we may be able to delegate this task to the Xet client's side. What do you think @hoytak?
Force-pushed 5cd5a17 to dc870a8
}

/// Build the XET token API URL for this repository.
pub fn xet_token_url(&self, endpoint: &str, token_type: &str) -> String {
If this fits OpenDAL's use cases, I'd recommend adding a `create_pr: bool` parameter to this function, so that if
- someone uses an HF_TOKEN with only read permission to a repo, and
- they want to submit a pull request to this repo with xet files,
they can obtain a Xet token with write permission by constructing the xet token URL below:
format!("{endpoint}/api/{repo_type}/{repo_id}/xet-write-token?create_pr=1")
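As a standalone sketch of this suggestion: the real `xet_token_url` is a method on the repo type, but the URL-building logic with the proposed `create_pr` flag could look like this. The free-function shape, the `repo_type`/`repo_id` parameters, and the `xet-{token_type}-token` path segment generalized from the write-token URL above are all assumptions for illustration.

```rust
// Hypothetical sketch of `xet_token_url` with the suggested `create_pr`
// flag; parameter names and the generalized path segment are assumptions,
// not the actual HfCore API.
fn xet_token_url(
    endpoint: &str,
    repo_type: &str,
    repo_id: &str,
    token_type: &str,
    create_pr: bool,
) -> String {
    let mut url = format!("{endpoint}/api/{repo_type}/{repo_id}/xet-{token_type}-token");
    if create_pr {
        // Ask for a token scoped to a pull-request write, per the comment above.
        url.push_str("?create_pr=1");
    }
    url
}
```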
Replace the custom HfBucketSinkNode (1,050 lines of XET-specific code) with a standard ObjectStore implementation backed by OpenDAL's HF service. HF URLs now flow through the same FileSink path as S3/GCS/Azure, requiring only a thin build_hf() builder in a new hf.rs module.

Key changes:
- Add crates/polars-io/src/cloud/hf.rs: HF URL parsing, token extraction, and OpenDAL ObjectStore construction (~175 lines)
- Wire CloudType::Hf in object_store_setup.rs to call build_hf(), matching the pattern used by build_aws/build_gcp/build_azure
- Delete custom sink: hf_bucket/ directory (4 files, 721 lines), HfBucketSinkNode (260 lines), IR lowering special-case, PhysNodeKind::HfBucketSink variant
- Rename feature flag hf_bucket_sink -> hf across 9 Cargo.toml files
- Bump object_store_opendal compatibility from object_store 0.12 to 0.13

Dependencies: opendal + object_store_opendal (local path deps for now, will switch to published crate versions once apache/opendal#7185 ships).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an ObjectStore implementation for `hf://` URLs backed by OpenDAL's
HF service, enabling `sink_parquet("hf://buckets/org/name/file.parquet")`
to stream directly to Hugging Face Storage Buckets.
The implementation follows the same pattern as existing cloud backends
(S3/GCS/Azure): a `build_hf()` function in a new `hf.rs` module
constructs the ObjectStore, and `object_store_setup.rs` calls it from
the `CloudType::Hf` match arm. HF URLs flow through the standard
FileSink path with no custom sink node or IR special-casing.
New files:
- crates/polars-io/src/cloud/hf.rs — URL parsing, token resolution,
OpenDAL ObjectStore construction
Feature flag: `hf` (opt-in, propagated through the workspace)
Dependencies: opendal + object_store_opendal (local path deps for now,
will switch to published crate versions once apache/opendal#7185 ships)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
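A minimal sketch of the `hf://` URL parsing this commit describes, assuming the `hf://{repo_type}/{owner}/{name}/{path}` layout shown in the example above. The function name and return shape are illustrative assumptions; the real `hf.rs` may differ.

```rust
// Hypothetical parser for hf:// URLs like "hf://buckets/org/name/file.parquet".
// Returns (repo_type, repo_id, path-within-repo), or None for non-hf URLs.
fn parse_hf_url(url: &str) -> Option<(String, String, String)> {
    let rest = url.strip_prefix("hf://")?;
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next()?.to_string();
    let owner = parts.next()?;
    let name = parts.next()?;
    // A URL may point at the repo root, in which case the path is empty.
    let path = parts.next().unwrap_or("").to_string();
    Some((repo_type, format!("{owner}/{name}"), path))
}
```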
…x 404 listing

- Switch hf-xet from pinned git rev to crates.io v1.5.0
- Remove per-service integration tests that duplicate behavior tests
- Keep XET read tests (mbpp_operator) and bucket roundtrip test
- Handle HF 404 as empty listing in HfLister
- Remove serial_test dependency
…isolation

- Remove regular (base64 inline) write path; always use XET protocol
- Remove base64 dependency (no longer needed)
- Remove preupload API and determine_upload_mode
- Remove create_dir capability (HF doesn't natively support it, same as S3)
- Fix bucket tree URL: remove revision segment for buckets
- Fix bucket stat: use HEAD on resolve URL instead of paths-info API
- Fix bucket delete: use bucket_batch(DeleteFile) instead of git commit
- Fix repo_path: convert operator-relative paths to repo-absolute for commit/delete/batch payloads (fixes behavior tests with random root)
- Create fresh XetSession per operation to avoid concurrent interference
- Make writer close() retry-safe with CloseState enum tracking progress
- Handle "Already completed" from commit.commit() gracefully on retry
- Restore prefix-based listing: when path doesn't end with '/', list parent directory and filter results by prefix. Fixes test_list_prefix, test_list_file_with_recursive, and test_list_dir_with_file_path.
- Add retry with exponential backoff for CAS propagation delays in register_file(). The XET commit can complete before data is visible to the HF API (bucket_batch or LFS commit). Retries on "not found in Xet storage", "LFS pointer", and "Invalid file change" errors.
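The prefix-based listing above can be sketched in isolation: derive the parent directory of a non-`/`-terminated path, list it, and keep only entries starting with the requested prefix. Function names are illustrative, not the real HfLister API.

```rust
// Hypothetical helpers for prefix-based listing.
// Given "dir/sub/pre", the directory to actually list is "dir/sub/".
fn parent_dir(path: &str) -> &str {
    match path.rfind('/') {
        Some(i) => &path[..=i],
        None => "", // no slash: list the root
    }
}

// Keep only entries under the requested prefix; a trailing '/' means the
// caller asked for a directory listing, so nothing is filtered out.
fn filter_by_prefix<'a>(entries: &[&'a str], path: &str) -> Vec<&'a str> {
    if path.ends_with('/') {
        return entries.to_vec();
    }
    entries
        .iter()
        .copied()
        .filter(|e| e.starts_with(path))
        .collect()
}
```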
- Don't set version on write metadata (stat doesn't return commit OID, causing assert_metadata mismatch in test_write_returns_metadata)
- Increase CAS propagation retry delays to 1s/2s/4s/8s (15s total) to handle slower LFS registration on datasets
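The retry shape these two commits describe is a fixed-schedule backoff. A generic sketch (the function name and error handling are assumptions; a later commit in this PR replaces this approach with OpenDAL's RetryLayer):

```rust
use std::{thread, time::Duration};

// Illustrative retry helper with an explicit delay schedule, e.g.
// &[1s, 2s, 4s, 8s] as in the commit message above. Retries the
// operation once per delay entry, returning the last result.
fn retry_with_backoff<T, E>(
    delays: &[Duration],
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last = op();
    for d in delays {
        if last.is_ok() {
            return last;
        }
        // Wait for CAS propagation before trying again.
        thread::sleep(*d);
        last = op();
    }
    last
}
```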
- Add write_can_empty capability so SimulateLayer can implement create_dir by writing a zero-byte file at the directory path
- Strip trailing '/' from bucket paths in register_file() since the bucket batch API rejects paths ending with '/'
write_can_empty doesn't help HF because the SimulateLayer writes zero-byte files at directory paths, but HF's directory concept comes from path hierarchy, not zero-byte objects. The create_dir tests remain unsupported (same as before write_can_empty was added).
…lity

8 behavior tests in async_list.rs call op.create_dir() without checking the create_dir capability, causing failures on backends like HF where directories are implicit from file paths. Add early-return guards matching the pattern already used by other tests (test_stat_dir, test_delete_empty_dir, test_read_with_dir_path).

Also clean up HF backend: remove .gitkeep/create_dir workarounds, remove trailing-slash stripping from bucket_batch, remove .gitkeep filtering from lister.
…nfigs

Rename .github/services/huggingface to .github/services/hf so the test framework uses OPENDAL_HF_* env vars (matching the hf scheme).

Add two new CI test variants with ephemeral repos:
- huggingface_bucket: creates a private bucket per run, deletes after
- huggingface_dataset: creates a private dataset per run, sequential

Both use hf CLI for repo lifecycle and 1Password for credentials. Repo names include GITHUB_RUN_ID for uniqueness.

Requires 1Password vault entries under services/hf:
- token (HF write token)
- owner (HF user or org, e.g., "opendal")
- Add xet_upload_commit() and xet_download_group() to HfCore, encapsulating session creation, token refresh URL and headers
- Restore XetSession as a field on HfCore (shared factory)
- Make xet_token_refresh_headers() private
- Simplify writer: remove CloseState enum, use Option<XetStreamUpload>
- Simplify reader: replace inline XET setup with core.xet_download_group()
- Rename upload_commit field to commit in HfWriter
…ucket

Align naming to clearly indicate both methods do the same thing (register/delete files) for different repo types:
- commit_git: git-based repos (model/dataset/space)
- commit_bucket: bucket repos via NDJSON batch API
Writer:
- Rename commit/stream fields to xet_commit/xet_stream for clarity
- Move XET CAS finalization into commit() method
- Rename register_file() to commit()
- Add docstrings explaining the XET-only upload flow

Deleter:
- Extract shared NotFound error handling from commit_git/commit_bucket

Config:
- Hoist test imports to module level
Remove retry_on_cas_delay helper — CAS propagation retries should be handled by OpenDAL's RetryLayer via temporary error marking, not by custom retry loops in the writer. Simplify commit() to call commit_bucket/commit_git directly.
Replace verbose `self.core.repo.repo_type == RepoType::Bucket` checks with `self.core.is_bucket()` across backend, writer, reader, deleter. Remove now-unused RepoType imports from those modules.
- Rename RepoType to HfRepoType for clarity and namespace consistency
- Import HfUri and HfRepoType at the top of core.rs
- Add docstrings to HfCore::is_bucket() and HfCore::uri()
- Remove HfListerWrapper — bucket tree API supports recursive=false and returns directory entries, so HierarchyLister is unnecessary
- Simplify list() to always use PageLister<HfLister>
- Pass recursive=false explicitly for buckets (API defaults to recursive)
- Move is_bucket() from HfCore to HfRepo, use repo.is_bucket() everywhere
- Default PathInfo.size to 0 (bucket directory entries omit it)
…Repo

- Add DeletedFolder type and deletedFolders field to git commit payload matching HF's API which distinguishes file vs folder deletions
- Split directory paths (trailing /) into deletedFolders, file paths into deletedFiles when committing to git repos
- Move git_commit_url() from HfUri to HfRepo (commit URL depends on repo, not on a specific file path)
- Make test_remove_all_with_prefix_exists skip gracefully when the service has real directory semantics
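The file-vs-folder split above can be sketched as a simple partition on the trailing slash. The function name and the returned tuple shape are assumptions, not the real commit payload types.

```rust
// Hypothetical partition of deletion paths for a git commit payload:
// trailing-'/' paths go to deletedFolders, the rest to deletedFiles.
// Paths are kept verbatim; whatever normalization the real API wants
// is out of scope here.
fn split_deletions(paths: &[&str]) -> (Vec<String>, Vec<String>) {
    let mut deleted_files = Vec::new();
    let mut deleted_folders = Vec::new();
    for p in paths {
        if p.ends_with('/') {
            deleted_folders.push(p.to_string());
        } else {
            deleted_files.push(p.to_string());
        }
    }
    (deleted_files, deleted_folders)
}
```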
@Xuanwo I updated the tests to use the behavior test suite properly, though we should set up the HF credentials for GH actions.
- Rename huggingface_bucket → hf_bucket, huggingface_dataset → hf_dataset
- Use pre-created repos with 1Password secrets (ephemeral cleanup can't run after tests in the current CI action architecture)
- Fix test_remove_all_with_prefix_exists: use a probe write to detect file/dir coexistence support, keep original test paths intact

1Password entries needed under services/hf:
- token, bucket_repo_id, dataset_repo_id
Point opendal and object_store_opendal at kszucs/opendal@4c70bd8 (hf-revamp branch), which uses published hf-xet 1.5.0 from crates.io. Once apache/opendal#7185 merges and publishes, these become simple version deps (e.g. opendal = "0.56").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Xuanwo left a comment
Thank you for working on this, let's rock!
apache/opendal#7185 merged today — HF backend XET support is now on upstream main. Pin to commit 8d3dbcc3ef until a release ships.

Next: once opendal publishes a release with services-hf, swap these git deps for a version number (opendal = "0.56" or whatever ships).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Xuanwo, I'm looking to leverage this HF storage buckets feature with Daft's OpenDAL integration and was wondering when the next release is planned?
Which issue does this PR close?
Work-in-progress, but generally:
I opened the PR for better visibility and early feedback.
Depends on huggingface/xet-core#642
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?
AI Usage Statement