SwanLab: add id and resume support for resuming runs (fixes #43698) #43739
Rayyan-Oumlil wants to merge 1 commit into huggingface:main
Pull request overview
This PR adds support for resuming SwanLab runs when using Trainer with resume_from_checkpoint, fixing issue #43698. Previously, resuming training always created a new SwanLab experiment because there was no way to pass id and resume parameters to swanlab.init().
Changes:
- Added support for SWANLAB_RUN_ID / SWANLAB_ID and SWANLAB_RESUME environment variables
- Updated SwanLabCallback.setup() to read these env vars and pass them to swanlab.init()
- Added documentation for the new environment variables in the callback's docstring
    - **SWANLAB_RUN_ID** (`str`, *optional*, defaults to `None`):
        The SwanLab run ID (21-character string) to resume. When set together with `SWANLAB_RESUME`, enables
        resuming a previous run so that `trainer.train(resume_from_checkpoint=...)` continues the same
        experiment instead of creating a new one. The run ID can be found in the experiment's Environment tab
        or in the URL on the SwanLab dashboard.
The documentation mentions "SWANLAB_ID" as an alternative to "SWANLAB_RUN_ID", but this is only documented in the implementation code (line 2317), not in the docstring. For consistency and clarity, the docstring should document both environment variable names.
Consider updating the docstring at line 2281 to mention both names, for example:
"SWANLAB_RUN_ID (or SWANLAB_ID) (str, optional, defaults to None):"
This would make it clear to users that either environment variable name can be used.
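The two-name lookup discussed above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `get_swanlab_run_id` is hypothetical, and the precedence (SWANLAB_RUN_ID wins when both are set) is an assumption.

```python
import os


def get_swanlab_run_id(env=os.environ):
    """Return the run ID to resume, or None if neither variable is set.

    Hypothetical helper: the name and the precedence order are assumptions.
    """
    # Assumption: SWANLAB_RUN_ID takes precedence; SWANLAB_ID is the fallback.
    # Empty strings are treated the same as unset variables.
    return env.get("SWANLAB_RUN_ID") or env.get("SWANLAB_ID") or None
```

Passing the environment as a parameter keeps the helper easy to test with a plain dict instead of mutating os.environ.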
    if swanlab_resume is not None:
        if swanlab_resume.lower() in ("true", "1"):
            init_args["resume"] = True
        elif swanlab_resume.lower() in ("false", "0"):
            init_args["resume"] = False
        elif swanlab_resume.lower() in ("allow", "must", "never"):
            init_args["resume"] = swanlab_resume.lower()
        else:
            init_args["resume"] = swanlab_resume
The environment variable parsing for boolean values doesn't follow the established pattern in this codebase. Other integrations in this file use either ENV_VARS_TRUE_VALUES (e.g., MLflowCallback at lines 1300-1301) or utility functions like is_env_variable_true() for consistent parsing.
For consistency with the codebase, consider using the standard approach:
    swanlab_resume_env = os.getenv("SWANLAB_RESUME", None)
    if swanlab_resume_env is not None:
        resume_upper = swanlab_resume_env.upper()
        if resume_upper in ENV_VARS_TRUE_VALUES or resume_upper == "ALLOW":
            init_args["resume"] = "allow"  # or True, depending on SwanLab's API
        elif resume_upper in ("FALSE", "0", "NEVER"):
            init_args["resume"] = "never"  # or False
        elif resume_upper == "MUST":
            init_args["resume"] = "must"
        else:
            init_args["resume"] = swanlab_resume_env

This approach:
- Follows the established pattern (see src/transformers/integrations/integration_utils.py:1300-1301)
- Handles case-insensitivity consistently
- Avoids potential AttributeError if the value is not a string
- Uses the same values that other integrations recognize ("TRUE", "1", "FALSE", "0")
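As a standalone illustration of the parsing discussed in this thread, here is a minimal sketch that keeps the PR's mapping (boolean-like strings become `True`/`False`, mode strings are lowercased) while matching case-insensitively against a shared set of truthy values. The function name `normalize_swanlab_resume` is hypothetical. Both value sets are defined locally as assumptions rather than imported from transformers, and the falsy set is invented for symmetry.

```python
# Assumption: local stand-in for the truthy strings used by other integrations.
ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
# Hypothetical counterpart for falsy strings, defined here for symmetry.
ENV_VARS_FALSE_VALUES = {"0", "OFF", "NO", "FALSE"}
RESUME_MODES = {"allow", "must", "never"}


def normalize_swanlab_resume(raw):
    """Map a raw SWANLAB_RESUME string to the value passed as `resume`.

    Returns None when the variable is unset, so the caller can skip the argument.
    """
    if raw is None:
        return None
    value = raw.strip()
    upper = value.upper()
    if upper in ENV_VARS_TRUE_VALUES:
        return True
    if upper in ENV_VARS_FALSE_VALUES:
        return False
    if value.lower() in RESUME_MODES:
        return value.lower()
    # Unrecognized values are passed through unchanged, as in the PR's else branch.
    return value
```

Whether truthy strings should map to `True` or to `"allow"` depends on what swanlab.init() accepts, which this sketch does not settle.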
MekkCyber left a comment
Thank you @Rayyan-Oumlil! Sorry, this PR is superseded by #43719.
Fixes #43698
Summary
When using Trainer with SwanLab and resuming training (trainer.train(resume_from_checkpoint=...)), the integration previously had no way to pass id and resume to swanlab.init(), so a new experiment was always created instead of continuing the existing one.
This PR adds support for resuming a previous SwanLab run via environment variables (same pattern as MLflow's MLFLOW_RUN_ID):
- SWANLAB_RUN_ID (or SWANLAB_ID): the ID of the run to resume.
- SWANLAB_RESUME: "allow"/True (resume if run exists, else create new), "must" (resume only), or "never"/False (always create new).
Changes
- In SwanLabCallback.setup(), read SWANLAB_RUN_ID / SWANLAB_ID and SWANLAB_RESUME from the environment and pass them to swanlab.init().
- Normalize SWANLAB_RESUME for common values (true/false, allow/must/never) so both string and boolean-like env values work.
Usage
When resuming training, set the env vars before calling trainer.train(resume_from_checkpoint=...). This keeps metrics and history in a single SwanLab run across restarts.
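A minimal sketch of that usage, assuming the environment variables proposed in this PR are honored by the callback. The run ID below is a placeholder, not a real ID, and the Trainer lines are left as comments because they depend on your own model and dataset.

```python
import os

# Placeholder: substitute the 21-character ID of the run you want to continue.
PREVIOUS_RUN_ID = "replace-with-run-id"

# Set both variables before the Trainer initializes its SwanLab callback.
os.environ["SWANLAB_RUN_ID"] = PREVIOUS_RUN_ID
os.environ["SWANLAB_RESUME"] = "must"  # "allow" would create a new run if the ID is unknown

# Then resume as usual, for example:
# trainer = Trainer(model=model, args=TrainingArguments(output_dir="out", report_to="swanlab"), ...)
# trainer.train(resume_from_checkpoint=True)
```

Using "must" makes a mistyped run ID fail loudly, whereas "allow" would fall back to a new experiment.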