Skip to content

[bugfix] fix hf token fetch and deletion logic#122

Merged
slin1237 merged 1 commit into
mainfrom
slin/bug-fix-3
Jul 2, 2025
Merged

[bugfix] fix hf token fetch and deletion logic#122
slin1237 merged 1 commit into
mainfrom
slin/bug-fix-3

Conversation

@slin1237
Copy link
Copy Markdown
Collaborator

@slin1237 slin1237 commented Jul 2, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes multiple critical issues in the model-agent that were preventing proper model downloads and management in production environments:

1. RBAC Permissions Issue

  • Added permission for model-agent to access secrets at cluster scope
  • Previously failing with "secrets is forbidden" errors

2. HTTP 429 Rate Limiting

  • Implemented proper rate limit handling with exponential backoff and jitter
  • Added Retry-After header parsing for server-suggested delays
  • Prevents thundering herd problem in large clusters with many agents

3. Nil Pointer Dereference

  • Fixed crashes in progress logging when progressManager.logger was nil
  • Added proper nil checks throughout the codebase

4. Model Deletion Path Mismatch

  • Fixed incorrect path construction during model deletion
  • Now uses consistent getDestPath() for both download and delete operations

5. Flexible Secret Key Configuration

  • Made secret key name configurable via model parameters
  • Previously hardcoded to "token", now supports custom keys
  • Updated documentation with examples

6. Download Cancellation Support

  • Added context cancellation for in-progress downloads
  • Prevents stuck downloads when resources are deleted
  • Tracks active downloads and cancels them appropriately

7. HuggingFace Token Authentication

  • Fixed critical bug where tokens were retrieved but not passed to downloads
  • Added WithDownloadToken option to pass tokens per download
  • Resolves HTTP 401 errors for gated models

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • The HF token fix is the most critical - it was causing authentication failures for all gated models
  • The rate limiting improvements are essential for large-scale deployments
  • All changes maintain backward compatibility
  • Tests have been updated and are passing

Does this PR introduce a user-facing change?

- Fixed model-agent authentication failures for gated HuggingFace models
- Added configurable secret key names for authentication tokens (defaults to "token")
- Improved reliability with better rate limit handling and download cancellation
- Fixed model deletion not properly cleaning up downloaded files

@slin1237 slin1237 requested a review from pallasathena92 as a code owner July 2, 2025 01:55
@github-actions github-actions Bot added the bug Something isn't working label Jul 2, 2025
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a series of crucial bug fixes and reliability enhancements for the model-agent, primarily focusing on improving HuggingFace model download stability, authentication, and lifecycle management. The changes aim to prevent common issues like rate limiting, authentication failures, and stuck downloads, making the agent more robust for production environments.

Highlights

  • HuggingFace Token Authentication Fix: I've addressed a critical bug where HuggingFace tokens were retrieved but not properly passed to download operations, resolving HTTP 401 errors for gated models. This is now handled by a new WithDownloadToken option.
  • Enhanced Rate Limiting Handling: I've implemented robust rate limit handling for HuggingFace API calls, including exponential backoff with jitter, parsing of Retry-After headers, and increased retry attempts. A staggered start mechanism based on node names has also been added to prevent 'thundering herd' issues in large clusters.
  • Download Cancellation Support: I've added support for cancelling in-progress downloads. The model agent now tracks active downloads and can gracefully cancel them when a model resource is deleted, preventing stuck downloads and improving resource management.
  • Flexible Secret Key Configuration: The secret key name used to fetch HuggingFace tokens is now configurable via model parameters (defaults to 'token'), allowing for greater flexibility in secret management. Documentation has been updated to reflect this.
  • Model Deletion Path Consistency: I've fixed an issue where model deletion used an inconsistent path construction, ensuring that getDestPath() is used uniformly for both download and delete operations, leading to proper cleanup.
  • General Stability Improvements: I've included fixes for nil pointer dereferences in progress logging and added checks for incomplete files before renaming during downloads, enhancing overall stability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several important bug fixes and improvements for the model-agent, primarily focused on enhancing reliability in production environments. Key changes include fixing Hugging Face token authentication, adding robust rate-limiting with exponential backoff, and implementing download cancellation. The changes are well-structured and address critical issues.

My review has identified a few areas for improvement:

  • A potential bug in the startup jitter calculation that could cause it to not work for some nodes.
  • A race condition in the download cancellation logic due to the use of time.Sleep for synchronization.
  • A misleading comment in the download logic.

Addressing these points will further improve the robustness and maintainability of the agent.

Comment thread pkg/modelagent/gopher.go
Comment on lines +405 to +406
// Wait a bit for download to stop
time.Sleep(2 * time.Second)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a hardcoded time.Sleep is not a reliable way to ensure a download has been cancelled and has stopped writing to files. This creates a race condition where file deletion might start while the download is still in progress, potentially causing errors or leaving the filesystem in an inconsistent state.

A more robust approach would involve a proper synchronization mechanism, such as a sync.WaitGroup, to confirm that the download goroutine has fully terminated before proceeding with deletion.

Comment thread cmd/model-agent/main.go
Comment on lines +229 to +232
hash := 0
for _, c := range cfg.nodeName {
hash = hash*31 + int(c)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current hash calculation for the jitter delay can result in a negative hash value due to integer overflow. In Go, the % operator with a negative number yields a negative result (e.g., -5 % 30 is -5). Consequently, time.Sleep will receive a negative duration and return immediately, defeating the purpose of the jitter for some node names.

To ensure the delay is always non-negative, the hash should be calculated using an unsigned integer type.

Suggested change
hash := 0
for _, c := range cfg.nodeName {
hash = hash*31 + int(c)
}
var hash uint32
for _, c := range cfg.nodeName {
hash = hash*31 + uint32(c)
}


// Move the file to final destination
if err := os.Rename(incompletePath, destPath); err != nil {
// Check if it's a cross-device error and try copy instead
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is misleading. The code below it checks for os.IsNotExist and returns an error, but it does not handle cross-device move errors (which would be syscall.EXDEV on Unix) or attempt to copy the file as the comment suggests. The comment should be updated to accurately describe the code's behavior.

Suggested change
// Check if it's a cross-device error and try copy instead
// Add more context if the rename failed because the source file disappeared.

@slin1237 slin1237 merged commit 52edf70 into main Jul 2, 2025
25 checks passed
@slin1237 slin1237 deleted the slin/bug-fix-3 branch July 2, 2025 02:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant