
[BUG] DecoupledCheckpointEngine hangs indefinitely due to missing timeouts and process health checks #7741

@Rakshit-gen

Describe the bug

The DecoupledCheckpointEngine has multiple critical issues that can cause training jobs to hang indefinitely:

  1. No timeout on process join: cleanup() calls self.ckpt_process.join() with no timeout - if the checkpoint process crashes or hangs, the training script hangs forever.

  2. No timeout on event wait: commit() calls self.save_event.wait() with no timeout - if the checkpoint process dies before setting the event, training hangs forever.

  3. Assertion used for runtime validation: Line 119 uses assert info == self.commit_info, which is stripped when Python runs with -O, so a commit_info mismatch passes silently and can corrupt checkpoint state in production.

  4. __del__ calls blocking cleanup: The destructor calls cleanup(), so any exception during training can cause the program to hang on exit.

  5. No process health checks: No verification that ckpt_process is still alive before waiting on events or queues.

To Reproduce

  1. Enable decoupled checkpointing in DeepSpeed config
  2. Start a training job that saves checkpoints
  3. Kill or crash the checkpoint subprocess (e.g., OOM, signal, or disk full)
  4. Observe that the main training process hangs indefinitely on save_event.wait() or ckpt_process.join()
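The hang in step 4 comes from waiting on an event that nobody is left to set. A minimal sketch of a bounded wait that also checks worker liveness is shown here; `wait_for_event` and `CheckpointWorkerError` are hypothetical names, not DeepSpeed APIs, and the demonstration uses a thread as a stand-in for the checkpoint subprocess because `threading` and `multiprocessing` expose the same `wait`, `is_set`, and `is_alive` methods.

```python
import threading
import time


class CheckpointWorkerError(RuntimeError):
    """Hypothetical error type: the worker died or the wait timed out."""


def wait_for_event(event, worker, timeout_s=300.0, poll_s=1.0):
    """Wait for `event`, failing fast if `worker` dies or `timeout_s` elapses.

    `event` needs wait(timeout) and is_set(); `worker` needs is_alive().
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise CheckpointWorkerError(f"no signal from worker after {timeout_s}s")
        if event.wait(timeout=min(poll_s, remaining)):
            return
        # Re-check the event so a worker that signals and then exits
        # is not misreported as a crash.
        if not worker.is_alive() and not event.is_set():
            raise CheckpointWorkerError("worker exited before signalling completion")


# Demonstration: a worker that exits without ever setting the event.
done = threading.Event()
dead_worker = threading.Thread(target=lambda: None)
dead_worker.start()
dead_worker.join()

try:
    wait_for_event(done, dead_worker, timeout_s=5.0, poll_s=0.05)
    outcome = "completed"
except CheckpointWorkerError as exc:
    outcome = str(exc)
print(outcome)  # worker exited before signalling completion
```

Polling in short slices keeps the failure latency close to `poll_s` when the worker dies, while the overall deadline still bounds the wait when the worker is alive but stuck.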

Alternatively:

  1. Run training with python -O (optimized mode)
  2. Trigger a checkpoint with mismatched commit_info
  3. The assert is skipped, potentially corrupting checkpoint state
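The effect of -O on the assertion can be seen without restarting the interpreter, because the built-in `compile()` accepts an `optimize` argument that mirrors the command-line flag. The sketch below is illustrative only: `run_check` and the two placeholder values `"tag-A"` and `"tag-B"` are made up for the demonstration and do not come from the DeepSpeed source.

```python
def run_check(source, optimize):
    """Compile `source` at the given optimization level and report what it raised."""
    code = compile(source, "<commit-check>", "exec", optimize=optimize)
    try:
        # Deliberately mismatched values, standing in for info and self.commit_info.
        exec(code, {"info": "tag-A", "commit_info": "tag-B"})
    except (AssertionError, ValueError) as exc:
        return type(exc).__name__
    return "no error"


ASSERT_CHECK = "assert info == commit_info"
EXPLICIT_CHECK = (
    "if info != commit_info:\n"
    "    raise ValueError(f'commit info mismatch: expected {commit_info!r}, got {info!r}')\n"
)

results = {
    "assert, normal": run_check(ASSERT_CHECK, optimize=0),
    "assert, -O": run_check(ASSERT_CHECK, optimize=1),
    "explicit, -O": run_check(EXPLICIT_CHECK, optimize=1),
}
for name, result in results.items():
    print(f"{name}: {result}")
# assert, normal: AssertionError
# assert, -O: no error
# explicit, -O: ValueError
```

Only the explicit `if` check still rejects the mismatch once assertions are compiled out.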

Expected behavior

  • Checkpoint operations should have timeouts and fail gracefully with clear error messages
  • Process health should be checked before blocking waits
  • Runtime validation should use proper if statements, not assertions
  • Training should be able to recover or exit cleanly if checkpointing fails

Affected Code

File: deepspeed/runtime/checkpoint_engine/decoupled_checkpoint_engine.py

| Line | Issue | Severity |
|------|-------|----------|
| 119 | assert info == self.commit_info - disabled with -O | Critical |
| 123 | self.save_event.wait() - no timeout | Critical |
| 155 | self.ckpt_process.join() - no timeout | Critical |
| 99-100 | __del__ calls cleanup() - can hang on destruction | High |
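For the `__del__` entry, one possible shape is to keep `cleanup()` strict when it is called explicitly, and have the destructor call it with a short timeout and swallow any failure. This is a sketch under assumptions: `WorkerOwner` and `FakeStuckWorker` are hypothetical classes written for this example, and the 1-second destructor timeout is an arbitrary illustrative value.

```python
import logging

logger = logging.getLogger(__name__)


class WorkerOwner:
    """Hypothetical owner of a background worker; not the DeepSpeed class."""

    def __init__(self, worker, join_timeout_s=30.0):
        self.worker = worker
        self.join_timeout_s = join_timeout_s
        self.closed = False

    def cleanup(self, timeout_s=None):
        """Explicit shutdown: bounded join, then terminate. Raises on failure."""
        if self.closed:
            return
        self.closed = True
        timeout_s = self.join_timeout_s if timeout_s is None else timeout_s
        self.worker.join(timeout=timeout_s)
        if self.worker.is_alive():
            self.worker.terminate()
            raise RuntimeError(f"worker did not exit within {timeout_s}s; terminated")

    def __del__(self):
        # Destructors can run during exception unwinding or interpreter exit,
        # so use a short timeout and never let an error escape.
        try:
            self.cleanup(timeout_s=1.0)
        except Exception:
            logger.warning("worker cleanup failed in __del__", exc_info=True)


class FakeStuckWorker:
    """Test double: join() returns at once and the worker stays alive until terminated."""

    def __init__(self):
        self.terminated = False

    def join(self, timeout=None):
        pass

    def is_alive(self):
        return not self.terminated

    def terminate(self):
        self.terminated = True


stuck = FakeStuckWorker()
owner = WorkerOwner(stuck)
owner.__del__()  # called directly here only to make the behaviour observable
print(stuck.terminated, owner.closed)  # True True
```

The `closed` flag makes cleanup idempotent, so a later garbage-collection pass that invokes `__del__` again does nothing.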

System info

  • Affects: All systems using DecoupledCheckpointEngine
  • OS: Any
  • GPU: Any
  • Python: Any version

Proposed Fix

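import time

TIMEOUT_SECONDS = 300  # 5 minutes
POLL_INTERVAL_SECONDS = 10

# Replace assert with proper validation (still enforced under python -O)
if info != self.commit_info:
    raise ValueError(f"Checkpoint commit info mismatch: expected {self.commit_info}, got {info}")

# Add timeout to event wait with process health check
deadline = time.monotonic() + TIMEOUT_SECONDS
while not self.save_event.wait(timeout=POLL_INTERVAL_SECONDS):
    if not self.ckpt_process.is_alive():
        raise RuntimeError("Checkpoint process died unexpectedly")
    # Without this check the loop never ends if the process is alive but stuck
    if time.monotonic() >= deadline:
        raise TimeoutError(f"Checkpoint commit did not complete within {TIMEOUT_SECONDS} seconds")

# Add timeout to process join
self.ckpt_process.join(timeout=TIMEOUT_SECONDS)
if self.ckpt_process.is_alive():
    self.ckpt_process.terminate()
    raise RuntimeError("Checkpoint process failed to exit within timeout")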
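A process that ignores `terminate()` would still be running after the join-with-timeout step above. One way to handle that, sketched here as an assumption rather than as part of the proposed patch, is to escalate from `terminate()` to `kill()` with a short grace period between them. `stop_worker` and `FakeWorker` are hypothetical names; the `join`, `is_alive`, `terminate`, and `kill` methods match `multiprocessing.Process` (`kill` requires Python 3.7+).

```python
def stop_worker(worker, timeout_s=300.0, grace_s=5.0):
    """Stop `worker`, escalating only as far as needed.

    Returns "exited", "terminated", or "killed".
    """
    worker.join(timeout=timeout_s)
    if not worker.is_alive():
        return "exited"
    worker.terminate()  # SIGTERM on POSIX
    worker.join(timeout=grace_s)
    if not worker.is_alive():
        return "terminated"
    worker.kill()  # SIGKILL on POSIX
    worker.join(timeout=grace_s)
    if worker.is_alive():
        raise RuntimeError("worker is still alive after kill()")
    return "killed"


class FakeWorker:
    """Test double: stays alive until the method named by `dies_on` is called."""

    def __init__(self, dies_on):
        self.dies_on = dies_on
        self.alive = True

    def join(self, timeout=None):
        if self.dies_on == "join":
            self.alive = False

    def is_alive(self):
        return self.alive

    def terminate(self):
        if self.dies_on == "terminate":
            self.alive = False

    def kill(self):
        self.alive = False


outcomes = [
    stop_worker(FakeWorker(dies_on), timeout_s=0.01, grace_s=0.01)
    for dies_on in ("join", "terminate", "kill")
]
print(outcomes)  # ['exited', 'terminated', 'killed']
```

Returning the escalation level lets the caller log how the shutdown went and decide whether a forced stop should fail the training run.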

Labels: bug, training
