From 4d9e5d7034e38041bc022b4dbc576bd8446cc21d Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Tue, 23 Mar 2021 22:26:23 -0700
Subject: [PATCH 1/2] [doc] pipeline

As @g-karthik flagged in
https://github.com/microsoft/DeepSpeed/pull/659#discussion_r600132598
my previous correction PR had one sentence that said the wrong thing.
So this PR attempts to rectify that.

Thank you!
---
 docs/_tutorials/pipeline.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/_tutorials/pipeline.md b/docs/_tutorials/pipeline.md
index 70790c82b301..dcd7666cfe1a 100644
--- a/docs/_tutorials/pipeline.md
+++ b/docs/_tutorials/pipeline.md
@@ -276,9 +276,9 @@ For example, a machine with 16 GPUs must have as much local CPU memory as 16 tim
 DeepSpeed provides a `LayerSpec` class that delays the construction of modules
 until the model layers have been partitioned across workers.
-Then each worker will allocate only the layers it's assigned to. So, continuing the
+Then each worker will allocate only the layers it's assigned to. So, comparing to the
 example from the previous paragraph, a machine with 16 GPUs will need to allocate a
-total of 1x model size on its CPU, compared to 16x in the LayerSpec example.
+total of 1x model size on its CPU memory and not 16x.
 Here is an example of the abbreviated AlexNet model, but expressed only with
 `LayerSpec`s.
 Note that the syntax is almost unchanged: `nn.ReLU(inplace=True)`

From 1ca1570dc49e8b2b85290ceac5a33d8768bcac6f Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Tue, 23 Mar 2021 22:27:55 -0700
Subject: [PATCH 2/2] tweak

---
 docs/_tutorials/pipeline.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/_tutorials/pipeline.md b/docs/_tutorials/pipeline.md
index dcd7666cfe1a..0d847ea18752 100644
--- a/docs/_tutorials/pipeline.md
+++ b/docs/_tutorials/pipeline.md
@@ -277,8 +277,8 @@ For example, a machine with 16 GPUs must have as much local CPU memory as 16 tim
 DeepSpeed provides a `LayerSpec` class that delays the construction of modules
 until the model layers have been partitioned across workers.
 Then each worker will allocate only the layers it's assigned to. So, comparing to the
-example from the previous paragraph, a machine with 16 GPUs will need to allocate a
-total of 1x model size on its CPU memory and not 16x.
+example from the previous paragraph, using `LayerSpec` a machine with 16 GPUs will need to
+allocate a total of 1x model size on its CPU memory and not 16x.
 Here is an example of the abbreviated AlexNet model, but expressed only with
 `LayerSpec`s.
 Note that the syntax is almost unchanged: `nn.ReLU(inplace=True)`
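For context on the doc text these patches touch: the delayed-construction idea behind `LayerSpec` can be sketched in plain Python. This is a toy illustration, not DeepSpeed's actual implementation (the real class lives in `deepspeed.pipe`); the `Linear` stand-in and the rank/slicing scheme here are invented for the example.

```python
class LayerSpec:
    """Sketch of delayed construction: record the layer class and its
    arguments now, but build the actual layer only when asked."""
    def __init__(self, typename, *args, **kwargs):
        self.typename = typename
        self.args = args
        self.kwargs = kwargs

    def build(self):
        # Only here is memory for the layer actually allocated.
        return self.typename(*self.args, **self.kwargs)


class Linear:
    """Stand-in for a real layer; a real one would allocate parameter
    tensors in __init__, which is exactly what LayerSpec postpones."""
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features


# Specs are cheap: no layer memory is allocated yet.
specs = [LayerSpec(Linear, 10, 10) for _ in range(16)]

# With 16 workers, each worker builds only its own slice of the model,
# so the machine holds ~1x the model size in total rather than 16x
# (one full copy per worker, as with eagerly constructed modules).
rank, world_size = 3, 16
per_worker = len(specs) // world_size
mine = [s.build() for s in specs[rank * per_worker:(rank + 1) * per_worker]]
print(len(mine))  # this worker materialized just its assigned layers
```

The point mirrors the sentence the patches fix: eager construction makes every one of the 16 workers hold the whole model in CPU memory (16x), while specs let each worker materialize only its partition (1x in total).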