Skip to content

ci(hf): isolate CI repos and stabilize binding tests#7368

Merged
Xuanwo merged 23 commits intomainfrom
xuanwo/fix-hf-xet-runtime
Apr 9, 2026
Merged

ci(hf): isolate CI repos and stabilize binding tests#7368
Xuanwo merged 23 commits intomainfrom
xuanwo/fix-hf-xet-runtime

Conversation

@Xuanwo
Copy link
Copy Markdown
Member

@Xuanwo Xuanwo commented Apr 8, 2026

Which issue does this PR close?

Closes #.
Related: #7367

Rationale for this change

HF behavior tests are currently unstable in CI for two different reasons. The Java binding still crashes on Linux in the HF blocking path, and the shared HF CI repositories introduce cross-job interference for writable test cases.

What changes are included in this PR?

This PR switches HF CI jobs to create and clean up temporary repos or buckets per job so writable behavior tests do not race on the same remote state. It also keeps the Java HF behavior case disabled while #7367 is still open, and restores the missing create_dir capability gating in the Go and Node.js HF bucket list suites.

Are there any user-facing changes?

No.

AI Usage Statement

Used GPT-5 for CI log inspection, patch preparation, and PR description drafting.

@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. releases-note/fix The PR fixes a bug or has a title that begins with "fix" labels Apr 8, 2026
HF dataset tests were failing with 412 errors because multiple CI jobs
(core, Go, Node.js, Java) committed to the same shared git-based repo
simultaneously. Each job now creates its own temporary HF repo/bucket
and deletes it after the job via a node20 action post-run hook.
@Xuanwo Xuanwo changed the title fix(services/hf): bind xet session to active runtime ci(bindings/java): disable hf behavior tests Apr 8, 2026
@dosubot dosubot Bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Apr 8, 2026
@dosubot dosubot Bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Apr 8, 2026
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Apr 8, 2026
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Apr 8, 2026
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Apr 8, 2026
@Xuanwo Xuanwo changed the title ci(bindings/java): disable hf behavior tests ci(hf): isolate CI repos and stabilize binding tests Apr 8, 2026
@kszucs
Copy link
Copy Markdown
Member

kszucs commented Apr 9, 2026

Seems like within single jobs there are still commits races. @Xuanwo could you try making condition not match retryable?

diff --git a/core/services/hf/src/error.rs b/core/services/hf/src/error.rs
index c0e36616d..84c83f8e5 100644
--- a/core/services/hf/src/error.rs
+++ b/core/services/hf/src/error.rs
@@ -44,7 +44,7 @@ pub(super) fn parse_error(resp: Response<Buffer>) -> Error {
     let (kind, retryable) = match parts.status {
         StatusCode::NOT_FOUND => (ErrorKind::NotFound, false),
         StatusCode::UNAUTHORIZED | StatusCode::FORBIDDEN => (ErrorKind::PermissionDenied, false),
-        StatusCode::PRECONDITION_FAILED => (ErrorKind::ConditionNotMatch, false),
+        StatusCode::PRECONDITION_FAILED => (ErrorKind::ConditionNotMatch, true),
         StatusCode::INTERNAL_SERVER_ERROR
         | StatusCode::BAD_GATEWAY
         | StatusCode::SERVICE_UNAVAILABLE

@Xuanwo
Copy link
Copy Markdown
Member Author

Xuanwo commented Apr 9, 2026

Seems like within single jobs there are still commits races. @Xuanwo could you try making condition not match retryable?

Thanks! And updated.

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Apr 9, 2026
@kszucs
Copy link
Copy Markdown
Member

kszucs commented Apr 9, 2026

@Xuanwo this retry mechanism also tries to upload the xet blobs again raising an Already completed error. It would be better to enable retries just for the committing phase, though that would require custom retry logic with backon.

@kszucs
Copy link
Copy Markdown
Member

kszucs commented Apr 9, 2026

Could you try to cherry pick fa165d5 ?

@Xuanwo
Copy link
Copy Markdown
Member Author

Xuanwo commented Apr 9, 2026

@Xuanwo this retry mechanism also tries to upload the xet blobs again raising an Already completed error. It would be better to enable retries just for the committing phase, though that would require custom retry logic with backon.

I believe the root cause here is our writer's close is not re-enter safe. So I think the better solution is to store the finished file info and retry the commit.

@kszucs
Copy link
Copy Markdown
Member

kszucs commented Apr 9, 2026

I can try to make it re-enter safe e.g. using a state enum. Let my try that.

@Xuanwo
Copy link
Copy Markdown
Member Author

Xuanwo commented Apr 9, 2026

I can try to make it re-enter safe e.g. using a state enum. Let my try that.

Thank you for that! But I will cover this 🤟

@Xuanwo
Copy link
Copy Markdown
Member Author

Xuanwo commented Apr 9, 2026

Only one test failure now, working on it

failures:

---- behavior::test_remove_all_basic ----
test panicked: all objects should be removed


failures:
    behavior::test_remove_all_basic

test result: FAILED. 91 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 27.53s

@Xuanwo
Copy link
Copy Markdown
Member Author

Xuanwo commented Apr 9, 2026

Cool, let's go!

@Xuanwo Xuanwo merged commit 9595f8c into main Apr 9, 2026
441 of 444 checks passed
@Xuanwo Xuanwo deleted the xuanwo/fix-hf-xet-runtime branch April 9, 2026 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer releases-note/fix The PR fixes a bug or has a title that begins with "fix" size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants