Skip to content

feat(hf): support writing and reading from both http and xet#7185

Merged
Xuanwo merged 53 commits intoapache:mainfrom
kszucs:hf-revamp
Apr 8, 2026
Merged

feat(hf): support writing and reading from both http and xet#7185
Xuanwo merged 53 commits intoapache:mainfrom
kszucs:hf-revamp

Conversation

@kszucs
Copy link
Copy Markdown
Member

@kszucs kszucs commented Feb 9, 2026

Which issue does this PR close?

Work-in-progress, but generally:

  • add writing support
  • support reading using the xet protocol (opt-in)
  • support writing using the xet protocol (opt-in)

I opened the PR for better visibility and early feedback.

Depends on huggingface/xet-core#642

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

AI Usage Statement

@kszucs kszucs marked this pull request as ready for review February 10, 2026 09:12
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. releases-note/feat The PR implements a new feature or has a title that begins with "feat" labels Feb 10, 2026
Comment thread .github/workflows/ci_core.yml Outdated
Comment thread core/services/huggingface/src/core.rs Outdated
Comment thread core/services/huggingface/Cargo.toml Outdated
Comment thread core/Cargo.toml Outdated
Comment thread core/services/hf/src/config.rs
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Feb 11, 2026

I'm still working on a couple of things, like implementing oio::Write instead of OneShotWrite which requires some additional changes in my xet-core PR.

Copy link
Copy Markdown
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for working on this! Great job, here is my first round of review.

Comment thread core/services/hf/Cargo.toml Outdated
Comment thread core/services/hf/Cargo.toml Outdated
Comment thread core/services/hf/src/backend.rs Outdated
Comment thread core/services/hf/src/backend.rs Outdated
Comment thread core/services/hf/src/backend.rs Outdated
.with_context("service", HF_SCHEME)),
None => Ok(RepoType::Model),
}?;
let repo_type = self.config.repo_type;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this change.

Comment thread core/services/hf/src/backend.rs Outdated

/// Read a middle range of a known file.
#[tokio::test]
#[ignore = "requires network access"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding a tests.rs file and guiding users to run tests in an environment such as OPNEDAL_SERVICE_TEST_HF.

Comment thread core/services/hf/src/core.rs Outdated
// When xet is disabled, preserve whatever HTTP client is already set
// on `info` (important for mock-based unit tests).
#[cfg(feature = "xet")]
let no_redirect_client = if xet_enabled {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service shouldn't build their own http clients. But we can figure this out later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a client not following the redirects to retrieve the X-Xet-Hash header. If I'm not mistaken we need another http client for this, not sure how else could I handle it.

Since a new revamped Xet client is in the works (which will be publised and this PR should eventually depend on) we may be able to delegate this task to the Xet client's side. What do you think @hoytak?

Comment thread core/services/hf/src/core.rs Outdated
Comment thread core/services/hf/src/core.rs Outdated
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Mar 31, 2026

@Xuanwo I keep updating the PR with the new xet changes, @seanses and @hoytak are working on the xet package. I don't have an exact estimate when can we publish it though.

}

/// Build the XET token API URL for this repository.
pub fn xet_token_url(&self, endpoint: &str, token_type: &str) -> String {
Copy link
Copy Markdown

@seanses seanses Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fits in use cases of opendal, I'd recommend to add a create_pr: bool parameter to this function, so that if

  1. someone uses a HF_TOKEN with only read permission to a repo, and
  2. he wants to submit a pull request to this repo with xet files,
    he can obtain a Xet token with write permission by constructing the below xet token url:
    format!("{endpoint}/api/{repo_type}/{repo_id}/xet-write-token?create_pr=1")

davanstrien added a commit to davanstrien/polars that referenced this pull request Apr 2, 2026
Replace the custom HfBucketSinkNode (1,050 lines of XET-specific code)
with a standard ObjectStore implementation backed by OpenDAL's HF service.

HF URLs now flow through the same FileSink path as S3/GCS/Azure,
requiring only a thin build_hf() builder in a new hf.rs module.

Key changes:
- Add crates/polars-io/src/cloud/hf.rs: HF URL parsing, token
  extraction, and OpenDAL ObjectStore construction (~175 lines)
- Wire CloudType::Hf in object_store_setup.rs to call build_hf(),
  matching the pattern used by build_aws/build_gcp/build_azure
- Delete custom sink: hf_bucket/ directory (4 files, 721 lines),
  HfBucketSinkNode (260 lines), IR lowering special-case,
  PhysNodeKind::HfBucketSink variant
- Rename feature flag hf_bucket_sink -> hf across 9 Cargo.toml files
- Bump object_store_opendal compatibility from object_store 0.12 to 0.13

Dependencies: opendal + object_store_opendal (local path deps for now,
will switch to published crate versions once apache/opendal#7185 ships).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
davanstrien added a commit to davanstrien/polars that referenced this pull request Apr 2, 2026
Add an ObjectStore implementation for `hf://` URLs backed by OpenDAL's
HF service, enabling `sink_parquet("hf://buckets/org/name/file.parquet")`
to stream directly to Hugging Face Storage Buckets.

The implementation follows the same pattern as existing cloud backends
(S3/GCS/Azure): a `build_hf()` function in a new `hf.rs` module
constructs the ObjectStore, and `object_store_setup.rs` calls it from
the `CloudType::Hf` match arm. HF URLs flow through the standard
FileSink path with no custom sink node or IR special-casing.

New files:
- crates/polars-io/src/cloud/hf.rs — URL parsing, token resolution,
  OpenDAL ObjectStore construction

Feature flag: `hf` (opt-in, propagated through the workspace)

Dependencies: opendal + object_store_opendal (local path deps for now,
will switch to published crate versions once apache/opendal#7185 ships)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
kszucs added 17 commits April 4, 2026 19:08
…x 404 listing

- Switch hf-xet from pinned git rev to crates.io v1.5.0
- Remove per-service integration tests that duplicate behavior tests
- Keep XET read tests (mbpp_operator) and bucket roundtrip test
- Handle HF 404 as empty listing in HfLister
- Remove serial_test dependency
…isolation

- Remove regular (base64 inline) write path; always use XET protocol
- Remove base64 dependency (no longer needed)
- Remove preupload API and determine_upload_mode
- Remove create_dir capability (HF doesn't natively support it, same as S3)
- Fix bucket tree URL: remove revision segment for buckets
- Fix bucket stat: use HEAD on resolve URL instead of paths-info API
- Fix bucket delete: use bucket_batch(DeleteFile) instead of git commit
- Fix repo_path: convert operator-relative paths to repo-absolute for
  commit/delete/batch payloads (fixes behavior tests with random root)
- Create fresh XetSession per operation to avoid concurrent interference
- Make writer close() retry-safe with CloseState enum tracking progress
- Handle "Already completed" from commit.commit() gracefully on retry
- Restore prefix-based listing: when path doesn't end with '/', list
  parent directory and filter results by prefix. Fixes test_list_prefix,
  test_list_file_with_recursive, and test_list_dir_with_file_path.
- Add retry with exponential backoff for CAS propagation delays in
  register_file(). The XET commit can complete before data is visible
  to the HF API (bucket_batch or LFS commit). Retries on "not found
  in Xet storage", "LFS pointer", and "Invalid file change" errors.
- Don't set version on write metadata (stat doesn't return commit OID,
  causing assert_metadata mismatch in test_write_returns_metadata)
- Increase CAS propagation retry delays to 1s/2s/4s/8s (15s total)
  to handle slower LFS registration on datasets
- Add write_can_empty capability so SimulateLayer can implement
  create_dir by writing a zero-byte file at the directory path
- Strip trailing '/' from bucket paths in register_file() since
  the bucket batch API rejects paths ending with '/'
write_can_empty doesn't help HF because the SimulateLayer writes
zero-byte files at directory paths, but HF's directory concept comes
from path hierarchy, not zero-byte objects. The create_dir tests
remain unsupported (same as before write_can_empty was added).
…lity

8 behavior tests in async_list.rs call op.create_dir() without checking
the create_dir capability, causing failures on backends like HF where
directories are implicit from file paths. Add early-return guards
matching the pattern already used by other tests (test_stat_dir,
test_delete_empty_dir, test_read_with_dir_path).

Also clean up HF backend: remove .gitkeep/create_dir workarounds,
remove trailing-slash stripping from bucket_batch, remove .gitkeep
filtering from lister.
…nfigs

Rename .github/services/huggingface to .github/services/hf so the
test framework uses OPENDAL_HF_* env vars (matching the hf scheme).

Add two new CI test variants with ephemeral repos:
- huggingface_bucket: creates a private bucket per run, deletes after
- huggingface_dataset: creates a private dataset per run, sequential

Both use hf CLI for repo lifecycle and 1Password for credentials.
Repo names include GITHUB_RUN_ID for uniqueness.

Requires 1Password vault entries under services/hf:
- token (HF write token)
- owner (HF user or org, e.g., "opendal")
- Add xet_upload_commit() and xet_download_group() to HfCore,
  encapsulating session creation, token refresh URL and headers
- Restore XetSession as a field on HfCore (shared factory)
- Make xet_token_refresh_headers() private
- Simplify writer: remove CloseState enum, use Option<XetStreamUpload>
- Simplify reader: replace inline XET setup with core.xet_download_group()
- Rename upload_commit field to commit in HfWriter
…ucket

Align naming to clearly indicate both methods do the same thing
(register/delete files) for different repo types:
- commit_git: git-based repos (model/dataset/space)
- commit_bucket: bucket repos via NDJSON batch API
Writer:
- Rename commit/stream fields to xet_commit/xet_stream for clarity
- Move XET CAS finalization into commit() method
- Rename register_file() to commit()
- Add docstrings explaining the XET-only upload flow

Deleter:
- Extract shared NotFound error handling from commit_git/commit_bucket

Config:
- Hoist test imports to module level
Remove retry_on_cas_delay helper — CAS propagation retries should be
handled by OpenDAL's RetryLayer via temporary error marking, not by
custom retry loops in the writer. Simplify commit() to call
commit_bucket/commit_git directly.
Replace verbose `self.core.repo.repo_type == RepoType::Bucket` checks
with `self.core.is_bucket()` across backend, writer, reader, deleter.
Remove now-unused RepoType imports from those modules.
- Rename RepoType to HfRepoType for clarity and namespace consistency
- Import HfUri and HfRepoType at the top of core.rs
- Add docstrings to HfCore::is_bucket() and HfCore::uri()
- Remove HfListerWrapper — bucket tree API supports recursive=false
  and returns directory entries, so HierarchyLister is unnecessary
- Simplify list() to always use PageLister<HfLister>
- Pass recursive=false explicitly for buckets (API defaults to recursive)
- Move is_bucket() from HfCore to HfRepo, use repo.is_bucket() everywhere
- Default PathInfo.size to 0 (bucket directory entries omit it)
…Repo

- Add DeletedFolder type and deletedFolders field to git commit payload
  matching HF's API which distinguishes file vs folder deletions
- Split directory paths (trailing /) into deletedFolders, file paths
  into deletedFiles when committing to git repos
- Move git_commit_url() from HfUri to HfRepo (commit URL depends on
  repo, not on a specific file path)
- Make test_remove_all_with_prefix_exists skip gracefully when the
  service has real directory semantics
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Apr 6, 2026

@Xuanwo I updated the tests to use the behavior test suite properly though we should set up the HF credentials for GH actions.

kszucs added 2 commits April 6, 2026 16:04
- Rename huggingface_bucket → hf_bucket, huggingface_dataset → hf_dataset
- Use pre-created repos with 1Password secrets (ephemeral cleanup can't
  run after tests in the current CI action architecture)
- Fix test_remove_all_with_prefix_exists: use a probe write to detect
  file/dir coexistence support, keep original test paths intact

1Password entries needed under services/hf:
- token, bucket_repo_id, dataset_repo_id
davanstrien added a commit to davanstrien/polars that referenced this pull request Apr 7, 2026
Point opendal and object_store_opendal at kszucs/opendal@4c70bd8
(hf-revamp branch) which uses published hf-xet 1.5.0 from crates.io.

Once apache/opendal#7185 merges and publishes, these become simple
version deps (e.g. opendal = "0.56").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank yor for working on this, let's rock!

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Apr 8, 2026
@Xuanwo Xuanwo merged commit 8d3dbcc into apache:main Apr 8, 2026
382 of 383 checks passed
davanstrien added a commit to davanstrien/polars that referenced this pull request Apr 14, 2026
apache/opendal#7185 merged today — HF backend XET support is now on
upstream main. Pin to commit 8d3dbcc3ef until a release ships.

Next: once opendal publishes a release with services-hf, swap these
git deps for a version number (opendal = "0.56" or whatever ships).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@everettVT
Copy link
Copy Markdown

@Xuanwo , I'm looking to leverage this HF storage buckets feat with Daft's OpenDAL integration and was wondering when the next release is planned for?

@Xuanwo
Copy link
Copy Markdown
Member

Xuanwo commented Apr 20, 2026

@Xuanwo , I'm looking to leverage this HF storage buckets feat with Daft's OpenDAL integration and was wondering when the next release is planned for?

Running! #7396

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer releases-note/feat The PR implements a new feature or has a title that begins with "feat" size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants