[bugfix] fix hf token fetch and deletion logic by slin1237 · Pull Request #122 · ome-projects/ome

slin1237 · 2025-07-02T01:55:42Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes multiple critical issues in the model-agent that were preventing proper model downloads and management in production environments:

1. RBAC Permissions Issue

Added permission for model-agent to access secrets at cluster scope
Previously failing with "secrets is forbidden" errors

2. HTTP 429 Rate Limiting

Implemented proper rate limit handling with exponential backoff and jitter
Added Retry-After header parsing for server-suggested delays
Prevents thundering herd problem in large clusters with many agents

3. Nil Pointer Dereference

Fixed crashes in progress logging when progressManager.logger was nil
Added proper nil checks throughout the codebase

4. Model Deletion Path Mismatch

Fixed incorrect path construction during model deletion
Now uses consistent getDestPath() for both download and delete operations

5. Flexible Secret Key Configuration

Made secret key name configurable via model parameters
Previously hardcoded to "token", now supports custom keys
Updated documentation with examples

6. Download Cancellation Support

Added context cancellation for in-progress downloads
Prevents stuck downloads when resources are deleted
Tracks active downloads and cancels them appropriately

7. HuggingFace Token Authentication

Fixed critical bug where tokens were retrieved but not passed to downloads
Added WithDownloadToken option to pass tokens per download
Resolves HTTP 401 errors for gated models

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The HF token fix is the most critical - it was causing authentication failures for all gated models
The rate limiting improvements are essential for large-scale deployments
All changes maintain backward compatibility
Tests have been updated and are passing

Does this PR introduce a user-facing change?

- Fixed model-agent authentication failures for gated HuggingFace models
- Added configurable secret key names for authentication tokens (defaults to "token")
- Improved reliability with better rate limit handling and download cancellation
- Fixed model deletion not properly cleaning up downloaded files

gemini-code-assist

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a series of crucial bug fixes and reliability enhancements for the model-agent, primarily focusing on improving HuggingFace model download stability, authentication, and lifecycle management. The changes aim to prevent common issues like rate limiting, authentication failures, and stuck downloads, making the agent more robust for production environments.

Highlights

HuggingFace Token Authentication Fix: I've addressed a critical bug where HuggingFace tokens were retrieved but not properly passed to download operations, resolving HTTP 401 errors for gated models. This is now handled by a new WithDownloadToken option.
Enhanced Rate Limiting Handling: I've implemented robust rate limit handling for HuggingFace API calls, including exponential backoff with jitter, parsing of Retry-After headers, and increased retry attempts. A staggered start mechanism based on node names has also been added to prevent 'thundering herd' issues in large clusters.
Download Cancellation Support: I've added support for cancelling in-progress downloads. The model agent now tracks active downloads and can gracefully cancel them when a model resource is deleted, preventing stuck downloads and improving resource management.
Flexible Secret Key Configuration: The secret key name used to fetch HuggingFace tokens is now configurable via model parameters (defaults to 'token'), allowing for greater flexibility in secret management. Documentation has been updated to reflect this.
Model Deletion Path Consistency: I've fixed an issue where model deletion used an inconsistent path construction, ensuring that getDestPath() is used uniformly for both download and delete operations, leading to proper cleanup.
General Stability Improvements: I've included fixes for nil pointer dereferences in progress logging and added checks for incomplete files before renaming during downloads, enhancing overall stability.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces several important bug fixes and improvements for the model-agent, primarily focused on enhancing reliability in production environments. Key changes include fixing Hugging Face token authentication, adding robust rate-limiting with exponential backoff, and implementing download cancellation. The changes are well-structured and address critical issues.

My review has identified a few areas for improvement:

A potential bug in the startup jitter calculation that could cause it to not work for some nodes.
A race condition in the download cancellation logic due to the use of time.Sleep for synchronization.
A misleading comment in the download logic.

Addressing these points will further improve the robustness and maintainability of the agent.

gemini-code-assist · 2025-07-02T01:57:25Z

+		// Wait a bit for download to stop
+		time.Sleep(2 * time.Second)


Using a hardcoded time.Sleep is not a reliable way to ensure a download has been cancelled and has stopped writing to files. This creates a race condition where file deletion might start while the download is still in progress, potentially causing errors or leaving the filesystem in an inconsistent state.

A more robust approach would involve a proper synchronization mechanism, such as a sync.WaitGroup, to confirm that the download goroutine has fully terminated before proceeding with deletion.

gemini-code-assist · 2025-07-02T01:57:25Z

+		hash := 0
+		for _, c := range cfg.nodeName {
+			hash = hash*31 + int(c)
+		}


The current hash calculation for the jitter delay can result in a negative hash value due to integer overflow. In Go, the % operator with a negative number yields a negative result (e.g., -5 % 30 is -5). Consequently, time.Sleep will receive a negative duration and return immediately, defeating the purpose of the jitter for some node names.

To ensure the delay is always non-negative, the hash should be calculated using an unsigned integer type.

Suggested change

hash := 0

for _, c := range cfg.nodeName {

hash = hash*31 + int(c)

}

var hash uint32

for _, c := range cfg.nodeName {

hash = hash*31 + uint32(c)

}

gemini-code-assist · 2025-07-02T01:57:25Z

+
 	// Move the file to final destination
 	if err := os.Rename(incompletePath, destPath); err != nil {
+		// Check if it's a cross-device error and try copy instead


This comment is misleading. The code below it checks for os.IsNotExist and returns an error, but it does not handle cross-device move errors (which would be syscall.EXDEV on Unix) or attempt to copy the file as the comment suggests. The comment should be updated to accurately describe the code's behavior.

Suggested change

// Check if it's a cross-device error and try copy instead

// Add more context if the rename failed because the source file disappeared.

[bugfix] fix hf token fetch and deletion logic

ba179b1

slin1237 requested a review from pallasathena92 as a code owner July 2, 2025 01:55

github-actions Bot added the bug Something isn't working label Jul 2, 2025

gemini-code-assist Bot reviewed Jul 2, 2025

View reviewed changes

slin1237 merged commit 52edf70 into main Jul 2, 2025
25 checks passed

slin1237 deleted the slin/bug-fix-3 branch July 2, 2025 02:01

slin1237 added a commit that referenced this pull request Dec 22, 2025

[bugfix] fix hf token fetch and deletion logic (#122)

1e49caa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bugfix] fix hf token fetch and deletion logic#122

[bugfix] fix hf token fetch and deletion logic#122
slin1237 merged 1 commit into
mainfrom
slin/bug-fix-3

slin1237 commented Jul 2, 2025 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jul 2, 2025

Uh oh!

gemini-code-assist Bot Jul 2, 2025

Uh oh!

gemini-code-assist Bot Jul 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		// Wait a bit for download to stop
		time.Sleep(2 * time.Second)

	// Check if it's a cross-device error and try copy instead
	// Add more context if the rename failed because the source file disappeared.

Conversation

slin1237 commented Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

1. RBAC Permissions Issue

2. HTTP 429 Rate Limiting

3. Nil Pointer Dereference

4. Model Deletion Path Mismatch

5. Flexible Secret Key Configuration

6. Download Cancellation Support

7. HuggingFace Token Authentication

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

slin1237 commented Jul 2, 2025 •

edited

Loading