Adding torch accelerator to ddp-tutorial-series example #1376

dggaytan wants to merge 2 commits into pytorch:main
Conversation
```diff
- os.environ["MASTER_PORT"] = "12355"
- torch.cuda.set_device(rank)
- init_process_group(backend="nccl", rank=rank, world_size=world_size)
+ os.environ["MASTER_PORT"] = "12453"
```
It was an error on my side; I've changed the port.
```python
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator()
    torch.accelerator.set_device_idx(rank)
    device: torch.device = torch.device(f"{device_type}:{rank}")
```
Suggested change:

```diff
- device: torch.device = torch.device(f"{device_type}:{rank}")
+ device = torch.device(f"{device_type}:{rank}")
```
```python
device_type = torch.accelerator.current_accelerator()
torch.accelerator.set_device_idx(rank)
device: torch.device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
```
There is no such API as device_index() in 2.7: https://docs.pytorch.org/docs/stable/accelerator.html
What is it doing? You already set the index two lines above...
OK, device_index() will appear only in 2.8: https://docs.pytorch.org/docs/main/generated/torch.accelerator.device_index.html#torch.accelerator.device_index. It is also a context manager, i.e. you need to use it as with device_index(). I don't see why you are using it here. The recently merged #1375 attempts to do the same; I think it will need a fix as well.
It does not make sense to call a context manager without with. Did you intend to call set_device_index() instead?
Yes, I'm making the changes, thanks.
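For reference, a minimal sketch of the difference between the two APIs being discussed, assuming PyTorch 2.8+ for the `device_index()` context manager (the `rank` value is only illustrative):

```python
import torch

rank = 0  # illustrative local rank

if torch.accelerator.is_available():
    # set_device_index(): a plain call that binds this process to the given
    # accelerator index for everything that follows.
    torch.accelerator.set_device_index(rank)

    # device_index() (2.8+): a context manager, so it only makes sense under
    # `with`; the index is switched inside the block and restored on exit.
    with torch.accelerator.device_index(rank):
        x = torch.ones(2, device=torch.accelerator.current_accelerator())
```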
```python
# torch.cuda.set_device(rank)
# init_process_group(backend="xccl", rank=rank, world_size=world_size)
```
Remove comments:

```diff
- # torch.cuda.set_device(rank)
- # init_process_group(backend="xccl", rank=rank, world_size=world_size)
```
```diff
- world_size = torch.cuda.device_count()
+ world_size = torch.accelerator.device_count()
+ print(world_size)
```
Remove or convert to a descriptive message:

```diff
- print(world_size)
```
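For example, a descriptive message could look something like this (the wording is only an illustration, not taken from the PR):

```python
import torch

world_size = torch.accelerator.device_count()
print(f"Spawning {world_size} processes, one per available accelerator device")
```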
```python
device_type = torch.accelerator.current_accelerator()
device: torch.device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
print(f"Running on rank {rank} on device {device}")
```
I have a hard time understanding this code block; it does not make sense to me in several places. Why do you name the result of current_accelerator() device_type if you return it from ddp_setup() the same way you return device on the CPU path? Does ddp_setup() return different kinds of values? And then something is happening with the rank that is also not quite clear.
I think what you are trying to achieve is closer to this:
```diff
- device_type = torch.accelerator.current_accelerator()
- device: torch.device = torch.device(f"{device_type}:{rank}")
- torch.accelerator.device_index(rank)
- print(f"Running on rank {rank} on device {device}")
+ torch.accelerator.set_device_index(rank)
+ device = torch.accelerator.current_accelerator()
+ print(f"Running on rank {rank} on device {device}")
```
Yes, so... there is a function in this file called _load_snapshot that loads the snapshot directly on the device the code is running on, and in my first tests it was not loading the snapshot at all, so I changed it to device_type to get only the XPU variable.
Now I've tested again with only the device variable and it worked, sorry for the maze 🤓
I'm updating it with your suggestion, thanks.
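For context, the snapshot loading being described follows roughly the pattern below; the function name and snapshot keys are assumptions based on the tutorial's shape, the point is only that `map_location` should be the per-rank `device` rather than a bare device type:

```python
import torch


def load_snapshot(snapshot_path: str, model: torch.nn.Module, device: torch.device) -> int:
    # Map checkpoint tensors onto this rank's device when loading.
    snapshot = torch.load(snapshot_path, map_location=device)
    model.load_state_dict(snapshot["MODEL_STATE"])  # assumed snapshot key
    return snapshot["EPOCHS_RUN"]  # assumed snapshot key
```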
```python
print(f"Running on rank {rank} on device {device}")
backend = torch.distributed.get_default_backend_for_device(device)
torch.distributed.init_process_group(backend=backend)
return device_type
```
And, correspondingly to the above:
```diff
- return device_type
+ return device
```
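Putting the suggestions from these threads together, the torchrun-based `ddp_setup()` would end up roughly like the sketch below. The overall function shape is an assumption; the individual calls come from the diff and the review suggestions above:

```python
import os

import torch
from torch.distributed import init_process_group


def ddp_setup() -> torch.device:
    rank = int(os.environ["LOCAL_RANK"])
    if torch.accelerator.is_available():
        # Bind this process to its accelerator, then query the resulting device.
        torch.accelerator.set_device_index(rank)
        device = torch.accelerator.current_accelerator()
    else:
        device = torch.device("cpu")
    # Let torch.distributed pick the matching backend (nccl, xccl, gloo, ...).
    backend = torch.distributed.get_default_backend_for_device(device)
    init_process_group(backend=backend)
    print(f"Running on rank {rank} on device {device}")
    return device
```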
```python
init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator()
```
```diff
@@ -1 +1 @@
-torch>=1.11.0
\ No newline at end of file
+torch>=2.7
\ No newline at end of file
```
Add a newline at the end of the file.
```bash
# example.py
echo "Launching ${1:-example.py} with ${2:-2} gpus"
torchrun --nnodes=1 --nproc_per_node=${2:-2} --rdzv_id=101 --rdzv_endpoint="localhost:5972" ${1:-example.py}
```
Add a newline at the end of the file.
Force-pushed from 2c0eb8f to 2ca1a5c.
```diff
- os.environ["MASTER_PORT"] = "12355"
- torch.cuda.set_device(rank)
- init_process_group(backend="nccl", rank=rank, world_size=world_size)
+ os.environ["MASTER_PORT"] = "12455"
```
It's still a different port number.
```python
device_type = torch.accelerator.current_accelerator()
torch.accelerator.set_device_idx(rank)
device: torch.device = torch.device(f"{device_type}:{rank}")
torch.accelerator.device_index(rank)
```
It does not make sense to call a context manager without with. Did you intend to call set_device_index() instead?
```python
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator()
    device = torch.device(f"{device_type}:{rank}")
    torch.accelerator.device_index(rank)
```
```python
optimizer: torch.optim.Optimizer,
save_every: int,
snapshot_path: str,
device
```
It would be nice to have a type annotation here:
```diff
- device
+ device: torch.device,
```
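A sketch of how the annotated parameter fits into the `Trainer` constructor; the parameters before `optimizer` and the body lines are assumptions based on the tutorial, the annotation itself is the point:

```python
import torch
from torch.utils.data import DataLoader


class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_data: DataLoader,
        optimizer: torch.optim.Optimizer,
        save_every: int,
        snapshot_path: str,
        device: torch.device,
    ) -> None:
        self.device = device
        self.model = model.to(device)  # place the model on the per-rank device
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.snapshot_path = snapshot_path
```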
```python
init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
if torch.accelerator.is_available():
    device_type = torch.accelerator.current_accelerator()
```
Signed-off-by: dggaytan <diana.gaytan.munoz@intel.com>
Force-pushed from cc9f51e to 67b4a05.
Continuing in #1393 for clean comments and changes.
Adding accelerator to ddp tutorials examples

Support for multiple accelerators:

- Updated the `ddp_setup` functions in `multigpu.py`, `multigpu_torchrun.py`, and `multinode.py` to use `torch.accelerator` for device management. The initialization of process groups now dynamically selects the backend based on the device type, with a fallback to CPU if no accelerator is available.
- Updated the `Trainer` classes in `multigpu_torchrun.py` and `multinode.py` to accept a `device` parameter and use it for model placement and snapshot loading.

Improvements to example execution:

- Added `run_example.sh` to simplify running tutorial examples with configurable GPU counts and node settings.
- Updated `run_distributed_examples.sh` to include a new function for running all DDP tutorial series examples.

Dependency updates:

- Updated `requirements.txt` to `torch>=2.7` to ensure compatibility with the new `torch.accelerator` API.

CC: @msaroufim @malfet @dvrogozh