[Pytorch] NVIDIA-DL-Framework-Inspect support – part 1 – core #1614

Merged
pggPL merged 39 commits into NVIDIA:main from pggPL:nvinspect_core on Apr 16, 2025

Conversation

@pggPL (Collaborator) commented Mar 25, 2025

Description

Core part of the NVIDIA-DL-Framework-Inspect support, affecting the core Transformer Engine (TE) files.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 3 commits March 25, 2025 14:22
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL (Collaborator Author) commented Mar 25, 2025

/te-ci pytorch L1

pggPL and others added 5 commits April 1, 2025 10:45
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL (Collaborator Author) commented Apr 1, 2025

/te-ci pytorch L1

tensor: torch.Tensor,
*,
out: Optional[Union[torch.Tensor, DebugQuantizedTensor]] = None,
dtype: Optional[torch.dtype] = None,
Member:

What is the use case for this?

Collaborator Author:

Weight caching for fake quantization, for example. We want to save the weights in the correct precision.

Member:

I don't think I understand this use case, TBH. Wouldn't you still want such a tensor to have the same dtype as the input tensor to the quantize call?

Collaborator Author:

Suppose the weight is stored in FP32, but the activation is in BF16. Then we want the weight after fake quantization to be in BF16, to avoid a cast on each forward pass.

Member:

OK, so that's the AMP scenario. The regular torch.nn.Linear just fails when given weight and activation in different types.

@pggPL (Collaborator Author), Apr 15, 2025:

We do it here anyway:

weightmat = cast_if_needed(weight, activation_dtype)
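To illustrate the use case from this thread, here is a minimal, hypothetical sketch (the `fake_quantize` helper and its scale handling are illustrative, not the TE API): when quantize() accepts an explicit `dtype`, the fake-quantized FP32 master weight can be materialized directly in the activation dtype and cached, instead of being cast on every forward pass.

```python
from typing import Optional

import torch


def fake_quantize(tensor: torch.Tensor, scale: float,
                  dtype: Optional[torch.dtype] = None) -> torch.Tensor:
    """Round-trip through a coarse grid, emitting the result in `dtype`."""
    out_dtype = dtype if dtype is not None else tensor.dtype
    # Clamp to an FP8-E4M3-like representable range after scaling.
    q = torch.clamp(torch.round(tensor * scale), -448.0, 448.0)
    return (q / scale).to(out_dtype)


# FP32 master weight, BF16 activations: cache the fake-quantized weight
# in BF16 once, so no per-forward cast is needed.
weight = torch.randn(4, 4, dtype=torch.float32)
cached = fake_quantize(weight, scale=16.0, dtype=torch.bfloat16)
assert cached.dtype == torch.bfloat16
```

Without the `dtype` argument, the cached tensor would come back in FP32 and still require a cast to BF16 each iteration.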

pggPL and others added 9 commits April 2, 2025 11:35
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL (Collaborator Author) commented Apr 2, 2025

/te-ci pytorch L1

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL added 3 commits April 7, 2025 13:14
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL (Collaborator Author) commented Apr 7, 2025

/te-ci pytorch L1


pggPL and others added 6 commits April 8, 2025 11:49
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
return [None] * 5
return [None] * 6
grad_input_quantizer = None
grad_weight_quantizer = None
Collaborator:

I strongly oppose adding a quantizer for the grad weight. It doesn't make sense logically since we never plan to output a quantized grad weight, barring some major development in optimizers. I get that the asymmetry is weird, but the real problem is that output_quantizer and grad_input_quantizer are also illogical. We should be using the input_quantizer/grad_output_quantizer from the layer that consumes them, like what te.Sequential does, but that's very tricky to implement.

Collaborator Author:

OK, but one may want to experiment with weight quantization, and the debug tools help with that. Moreover, this quantizer enables us to log statistics.

Member:

@timmoon10 Actually, I would argue that grad_weight_quantizer makes more sense than the other quantizers you listed, since it is actually "local", whereas the other ones logically belong to other operations. Even in te.Sequential, the grad_weight quantizer would belong to the op that holds those weights.
I also don't agree that we never plan to output a quantized grad weight - while the convergence story would need to be understood, projects like MS-AMP have tried to provide FP8 gradient functionality, since it saves a lot of memory during training. So it is not completely out of the question that we will want to provide that as well.

@timmoon10 (Collaborator), Apr 16, 2025:

The MS-AMP case is a good example. I suppose the right answer would be to gracefully support multiple quantizer types in a layer. In most cases we would have grad_weight_quantizer=None; it would only be non-trivial if we want quantized grads or are doing something unusual, like debug mode.

Unresolving for future visibility.
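The arity point behind the `[None] * 5` → `[None] * 6` change in the diff can be sketched with a hypothetical autograd function (illustrative only, not the actual TE implementation): `backward` must return one gradient slot per `forward` input, so adding a `grad_weight_quantizer` argument adds one more `None` for non-tensor inputs, while also giving a hook to quantize or log the weight gradient.

```python
import torch


class LinearWithQuantizers(torch.autograd.Function):
    """Toy linear op whose forward takes four (possibly None) quantizer slots."""

    @staticmethod
    def forward(ctx, inp, weight, input_quantizer, weight_quantizer,
                grad_output_quantizer, grad_weight_quantizer):
        ctx.save_for_backward(inp, weight)
        ctx.grad_weight_quantizer = grad_weight_quantizer
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        grad_input = grad_out @ weight
        grad_weight = grad_out.t() @ inp
        if ctx.grad_weight_quantizer is not None:
            # Hook point: fake-quantize or log statistics of the weight grad.
            grad_weight = ctx.grad_weight_quantizer(grad_weight)
        # One gradient slot per forward() input: 2 tensors + 4 quantizers.
        # Adding a quantizer argument is what grows the all-None early-exit
        # return from [None] * 5 to [None] * 6.
        return (grad_input, grad_weight) + (None,) * 4


x = torch.randn(3, 4, requires_grad=True)
w = torch.randn(5, 4, requires_grad=True)
out = LinearWithQuantizers.apply(x, w, None, None, None, None)
out.sum().backward()
```

With all quantizer slots set to None the op behaves like a plain linear layer; a non-None `grad_weight_quantizer` only changes what is stored into `w.grad`.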

@timmoon10 timmoon10 self-requested a review April 12, 2025 01:15
pggPL and others added 9 commits April 14, 2025 10:04
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
@pggPL (Collaborator Author) commented Apr 15, 2025

/te-ci pytorch L1

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@ptrendx (Member) left a comment:

From my side LGTM.

@timmoon10 (Collaborator) left a comment:

I haven't fully gone through the code, but things look reasonable to me. I would be fine merging.

@timmoon10 timmoon10 self-requested a review April 16, 2025 00:35
@pggPL (Collaborator Author) commented Apr 16, 2025

/te-ci pytorch L1

@pggPL pggPL merged commit beaecf8 into NVIDIA:main Apr 16, 2025
24 of 29 checks passed
4 participants