Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available for 30 days after the last update.

[For maintainers] Suggested jobs to run (before merge): run-slow: llama
| if "tp" not in device_mesh.mesh_dim_names: | ||
| raise ValueError( | ||
| "When using `tp_plan`, the `device_mesh` must contain a 'tp' dimension. " | ||
| "Please provide a valid `device_mesh`." | ||
| ) |
I don't think we should enforce 'tp' in the device mesh! For inference we never use that!
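For context, a minimal sketch (mine, not from the PR) of why a hard 'tp' requirement breaks the basic path: a mesh created without dimension names reports `mesh_dim_names` as `None`, so there is nothing to match against.

```python
# Hedged sketch: assumes torch.distributed is already initialized with 8 ranks.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (8,))  # no mesh_dim_names provided
print(mesh.mesh_dim_names)             # None for an unnamed mesh
# `"tp" in mesh.mesh_dim_names` would raise TypeError on None, and even a
# named inference mesh may not call its only dimension "tp".
```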
```python
device_mesh = device_mesh["tp"]
tp_size = device_mesh.size()  # the submesh is already 1-D, no need to re-index by "tp"
device_map = torch.device(f"{device_mesh.device_type}:{int(os.environ['LOCAL_RANK'])}")
```
Only do this if 'tp' exists in it!
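A minimal sketch of the guarded variant this comment asks for, assuming the same `device_mesh` name as the snippet above: take the 'tp' submesh only when the mesh actually names one.

```python
import os
import torch

# Guarded submesh selection: fall back to using the whole mesh as the TP mesh
# when no 'tp' dimension is named (e.g. the basic inference initialization).
if device_mesh.mesh_dim_names is not None and "tp" in device_mesh.mesh_dim_names:
    device_mesh = device_mesh["tp"]
tp_size = device_mesh.size()  # size of whichever mesh we ended up with
device_map = torch.device(f"{device_mesh.device_type}:{int(os.environ['LOCAL_RANK'])}")
```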
```python
# 'user_content.pt' indicates the model state_dict was saved with smp >= 1.10
Path(os.path.join(output_dir, "user_content.pt")).touch()
# We are in N-D parallelism if parallelism_config is set, so we ask accelerate whether this rank should save
elif getattr(self.accelerator, "parallelism_config", None) is not None:
```
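To illustrate the gating described in that comment, here is a hedged sketch of a Trainer-style save path; `should_save_on_this_rank` is a hypothetical stand-in for whatever accelerate exposes to mark the writer rank(s) under N-D parallelism.

```python
# Hedged sketch, not the PR's code. `should_save_on_this_rank` is hypothetical.
parallelism_config = getattr(self.accelerator, "parallelism_config", None)
if parallelism_config is not None:
    # N-D parallelism: defer to accelerate to decide which rank(s) write
    if should_save_on_this_rank(self.accelerator):  # hypothetical helper
        self._save(output_dir)
elif self.args.should_save:
    # Usual single-dimension case: only the designated rank writes
    self._save(output_dir)
```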
| if "tp" not in device_mesh.mesh_dim_names: | ||
| raise ValueError( | ||
| "When using `tp_plan`, the `device_mesh` must contain a 'tp' dimension. " | ||
| "Please provide a valid `device_mesh`." | ||
| ) | ||
| device_mesh = device_mesh["tp"] |
Can it be ndim > 1 but not have mesh dim names?

Nope, it can't IMO. How do you think we should take the correct submesh then?

No no, I just want to be sure, as the basic initialization we do is without providing a mesh name!

Oh, that works: we check for "tp" only if mesh.ndim > 1 AND the mesh is user-provided, so the basic initialization still works. If the check passes, we select the correct submesh ("tp") and use it as a 1-D mesh afterwards, as if it had been created by us in `initialize_tensor_parallelism`.
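Putting the thread's conclusion in one place, a hedged sketch of the agreed control flow (names follow the diff above):

```python
# Validate 'tp' only for multi-dimensional (i.e. user-provided) meshes;
# a 1-D mesh from the basic initialization is used as-is.
if device_mesh is not None and device_mesh.ndim > 1:
    if device_mesh.mesh_dim_names is None or "tp" not in device_mesh.mesh_dim_names:
        raise ValueError(
            "When using `tp_plan` with a multi-dimensional `device_mesh`, it must "
            "contain a 'tp' dimension. Please provide a valid `device_mesh`."
        )
    device_mesh = device_mesh["tp"]  # continue with the 1-D TP submesh
```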
Fails unrelated, merging
…e#39693)

* Feat: something
* Feat: initial changes
* tmp changes to unblock
* Refactor
* remove todo
* Feat: docstring
* Fix: saving of distributed model in trainer
* Fix: distributed saving with trainer
* Feat: add pure tp saving
* Only require tp dim if ndim > 1
* Fix: default to None
* Fix: better comments/errors
* Fix: properly check tp_size attribute
* Fix: properly check for None in tp_size

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
This lets `device_mesh` have multiple dims (#38949), which was by mistake reverted by "Add ep" (#39501); we need this for the upcoming accelerate/axolotl release. cc @ArthurZucker @SunMarc