Skip to content

[bugfix] fix: resolve model agent download loops and timeout issues#126

Merged
slin1237 merged 1 commit into
mainfrom
slin/model-agent-bug
Jul 3, 2025
Merged

[bugfix] fix: resolve model agent download loops and timeout issues#126
slin1237 merged 1 commit into
mainfrom
slin/model-agent-bug

Conversation

@slin1237
Copy link
Copy Markdown
Collaborator

@slin1237 slin1237 commented Jul 3, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes critical issues with the model agent that were causing infinite download loops and download failures for large models:

  1. Infinite Download Loop Fix: The model agent was updating ClusterBaseModel specs after parsing model configurations, which triggered other agents to re-download
    the same model. This created an infinite loop where models were continuously downloaded. Fixed by preventing model agents from updating the ClusterBaseModel CRD.

  2. Download Timeout Fix: Large model files (>10GB) were failing with "context deadline exceeded" errors because the HTTP client timeout was set to only 10
    minutes. Increased the timeout to 2 hours to accommodate large model downloads.

  3. Race Condition Fix: Multiple workers downloading the same model could encounter a race condition where the .incomplete file was missing when trying to rename
    it. Added proper handling to check if the destination file already exists with the correct size.

  4. Log Flooding Fix: Disabled detailed HuggingFace hub progress logging that was creating excessive log entries during downloads.

Which issue(s) this PR fixes:

Fixes # (Please link the issue if one exists)

Special notes for your reviewer:

  • The ClusterBaseModel spec update was happening in config_parser.go after model metadata extraction
  • The timeout increase is in pkg/hfutil/hub/constants.go - changed from 10 minutes to 2 hours
  • The race condition handling is in pkg/hfutil/hub/download.go in the downloadToTmpAndMove function
  • These changes have been tested with large model downloads (400GB+ models)

Does this PR introduce a user-facing change?

Fixed model agent infinite download loops and timeout errors when downloading large models. Model downloads are now more reliable with proper timeout handling and
race condition prevention.

@github-actions github-actions Bot added the bug Something isn't working label Jul 3, 2025
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several critical issues within the model agent's download mechanism, aiming to significantly enhance the reliability and stability of model downloads. It resolves problems such as infinite download loops, timeout failures for large files, and race conditions during concurrent downloads, while also reducing log verbosity.

Highlights

  • Infinite Download Loop Fix: Prevented the model agent from updating ClusterBaseModel CRD specs after parsing model configurations, which previously triggered continuous re-downloads and infinite loops. (Note: The specific code change for this fix was not included in the provided patch, but it is a stated intent of the PR.)
  • Download Timeout Increase: Increased the HTTP client timeout for model downloads from 10 minutes to 2 hours to accommodate large model files (e.g., >10GB) and prevent 'context deadline exceeded' errors.
  • Race Condition Handling: Improved the robustness of the download process by adding checks to handle race conditions during file renaming. The system now verifies if the destination file already exists with the correct size, preventing errors when multiple workers might attempt to download or move the same file concurrently.
  • Log Verbosity Reduction: Disabled detailed HuggingFace hub progress logging to reduce excessive log entries that were flooding the logs during model downloads.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses several critical issues related to model downloads, including infinite loops, timeouts, race conditions, and log flooding. The changes look good and align with the described fixes. I've made one suggestion in pkg/hfutil/hub/download.go to improve the implementation of the race condition fix by removing a redundant system call and making error handling more explicit. Overall, a solid set of fixes.

Comment on lines +285 to +293
if _, destErr := os.Stat(destPath); destErr == nil {
// Destination exists, verify size
destInfo, _ := os.Stat(destPath)
if metadata.Size > 0 && destInfo.Size() == metadata.Size {
// File already moved successfully, remove incomplete file if it exists
os.Remove(incompletePath)
return nil
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This is a good fix for the race condition. However, the implementation can be improved:

  • Redundant os.Stat: The os.Stat(destPath) call on line 287 is redundant. The FileInfo can be retrieved from the os.Stat call on line 285, which avoids an unnecessary system call.
  • Ignored Error: The error from the second os.Stat on line 287 is ignored. It's better practice to handle or at least check errors.
  • Implicitly Ignored Error: The error from os.Remove(incompletePath) on line 290 is also ignored. It's better to make this explicit using the blank identifier (_ = ...) to show this is intentional.

Here is a suggested change that addresses these points.

Suggested change
if _, destErr := os.Stat(destPath); destErr == nil {
// Destination exists, verify size
destInfo, _ := os.Stat(destPath)
if metadata.Size > 0 && destInfo.Size() == metadata.Size {
// File already moved successfully, remove incomplete file if it exists
os.Remove(incompletePath)
return nil
}
}
if destInfo, destErr := os.Stat(destPath); destErr == nil {
// Destination exists, verify size
if metadata.Size > 0 && destInfo.Size() == metadata.Size {
// File already moved successfully, remove incomplete file if it exists.
// We can ignore the error since another worker might have already removed it.
_ = os.Remove(incompletePath)
return nil
}
}

@slin1237 slin1237 merged commit 8e2356f into main Jul 3, 2025
24 checks passed
@slin1237 slin1237 deleted the slin/model-agent-bug branch July 3, 2025 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant