Enhance DDP strategy and improve multiprocessing handling #61

Closed
romeokienzler wants to merge 13 commits into main from improve_iterate_support

@romeokienzler
Collaborator

  • Allow overriding batch_size via the CLI for all major commands.
  • Enhance distributed training initialization to avoid port collisions and backend issues.
  • Ensure DataLoader workers are spawned safely when CUDA is in use.
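
As a rough illustration of the first and third points, the sketch below shows one way a shared `--batch_size` flag could be wired into several subcommands and how a worker start method could be chosen when CUDA is involved. It is not the code from this PR: the command names, the `training.batch_size` config layout, and the helpers `build_parser`, `apply_batch_size_override`, and `worker_start_method` are all assumptions made for the example.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI in which every subcommand shares one optional --batch_size flag."""
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument(
        "--batch_size",
        type=int,
        default=None,
        help="Override the batch size read from the config file",
    )
    parser = argparse.ArgumentParser(prog="example_cli")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("train", "evaluate", "predict"):  # assumed command names
        sub.add_parser(name, parents=[shared])
    return parser


def apply_batch_size_override(config: dict, batch_size: Optional[int]) -> dict:
    """Return config unchanged, or a copy with training.batch_size replaced.

    The nested "training" key is an assumed layout, not the project's schema.
    """
    if batch_size is None:
        return config
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    updated = dict(config)
    updated["training"] = {**config.get("training", {}), "batch_size": batch_size}
    return updated


def worker_start_method(uses_cuda: bool) -> Optional[str]:
    """Return "spawn" when CUDA is in use, else None (keep the platform default).

    The CUDA runtime does not support the fork start method, so worker
    processes that touch CUDA need "spawn" or "forkserver".
    """
    return "spawn" if uses_cuda else None
```

The value returned by `worker_start_method` could then be passed as the `multiprocessing_context` argument of `torch.utils.data.DataLoader`, which accepts a start-method name or `None`. Returning a copy from `apply_batch_size_override` keeps the loaded config untouched, which makes the override easy to test.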

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
…k' globally on Linux

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
…er collision avoidance

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
@romeokienzler requested a review from albanpuech May 5, 2026 07:45
Signed-off-by: Romeo Kienzler <5694071+romeokienzler@users.noreply.github.com>
Comment thread gridfm_graphkit/cli.py
# Ensure MASTER_PORT is set to a free port to avoid EADDRINUSE
if not os.environ.get("MASTER_PORT"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as _s:
        _s.bind(("", 0))
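
The excerpt above is cut off after the `bind` call. For readers unfamiliar with the pattern, here is a self-contained sketch of how binding to port 0 is commonly used to obtain a free port and export it as `MASTER_PORT`. The helper names `find_free_port` and `ensure_master_port` are hypothetical and the rest of the PR's implementation is not shown in this thread, so treat this as an illustration under those assumptions.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the kernel pick any free port
        return s.getsockname()[1]


def ensure_master_port(env=os.environ) -> str:
    """Set MASTER_PORT only if it is unset or empty; return the value in effect."""
    if not env.get("MASTER_PORT"):
        env["MASTER_PORT"] = str(find_free_port())
    return env["MASTER_PORT"]
```

Two caveats apply to this approach. The socket is closed before the training process binds the port, so another process could in principle claim it in between. In a multi-process launch, the port also has to be chosen once in the parent and inherited by all ranks, otherwise each rank would pick a different value.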
romeokienzler and others added 4 commits May 5, 2026 13:02
Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
…s in main function

Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
Signed-off-by: Romeo Kienzler <romeo.kienzler1@ibm.com>
Signed-off-by: Romeo Kienzler <5694071+romeokienzler@users.noreply.github.com>
@romeokienzler
Collaborator Author

@albanpuech Closing this since the pre-commit hook made the diff unreviewable; I will recreate the PR.
