Skip to content

Add profiler#1

Merged
wanchaol merged 4 commits intopytorch:mainfrom
lessw2020:add_profiler
Jan 19, 2024
Merged

Add profiler#1
wanchaol merged 4 commits intopytorch:mainfrom
lessw2020:add_profiler

Conversation

@lessw2020
Copy link
Copy Markdown
Contributor

Adds option to do torch profile tracing via:
--run_profiler (T/F)
--profile_folder (str)

Traces are saved out with rank_X as part of the trace name.
rank_named_traces

Implemented as context wrapper around the main training loop.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 17, 2024
Copy link
Copy Markdown
Contributor

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks for the quick progress! I think this looks great, have a few comments about how to trigger the profiler inlined.

…rofiling.py file, global dumps folder, logging_utils.py
@lessw2020
Copy link
Copy Markdown
Contributor Author

pr is updated to address the previous feedback.
adds user config control for profiling via train_config.toml,
separate profiling.py file, global dumps folder, logging_utils.py.

Copy link
Copy Markdown
Contributor

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, only have one comment about profiling flags

…filing, create custom named folders for traces
@wanchaol wanchaol merged commit f1e86e4 into pytorch:main Jan 19, 2024
lessw2020 added a commit that referenced this pull request Apr 18, 2024
Adds option to do torch profile tracing via:
--run_profiler  (T/F)
--profile_folder (str) 

Traces are saved out with rank_X as part of the trace name.
<img width="1711" alt="rank_named_traces"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/6eb3c3e0-6034-4d1f-8ea8-f43988755714">

Implemented as context wrapper around the main training loop.
jinsun-yoo pushed a commit to jinsun-yoo/torchtitan that referenced this pull request Oct 30, 2024
wconstab added a commit that referenced this pull request Nov 21, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:#27 std::error_code::default_error_condition() const from ??:0
[rank4]:#28 start_thread from ??:0
[rank4]:#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Dec 2, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:pytorch#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:pytorch#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:pytorch#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:pytorch#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:pytorch#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:pytorch#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:pytorch#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:pytorch#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:pytorch#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:pytorch#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:pytorch#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:pytorch#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:pytorch#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:pytorch#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:pytorch#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:pytorch#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:pytorch#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:pytorch#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:pytorch#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:pytorch#27 std::error_code::default_error_condition() const from ??:0
[rank4]:pytorch#28 start_thread from ??:0
[rank4]:pytorch#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Dec 17, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:pytorch#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:pytorch#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:pytorch#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:pytorch#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:pytorch#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:pytorch#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:pytorch#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:pytorch#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:pytorch#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:pytorch#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:pytorch#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:pytorch#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:pytorch#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:pytorch#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:pytorch#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:pytorch#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:pytorch#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:pytorch#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:pytorch#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:pytorch#27 std::error_code::default_error_condition() const from ??:0
[rank4]:pytorch#28 start_thread from ??:0
[rank4]:pytorch#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Dec 17, 2025
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:pytorch#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:pytorch#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:pytorch#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:pytorch#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:pytorch#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:pytorch#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:pytorch#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:pytorch#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:pytorch#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:pytorch#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:pytorch#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:pytorch#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:pytorch#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:pytorch#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:pytorch#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:pytorch#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:pytorch#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:pytorch#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:pytorch#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:pytorch#27 std::error_code::default_error_condition() const from ??:0
[rank4]:pytorch#28 start_thread from ??:0
[rank4]:pytorch#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
H-Huang pushed a commit to H-Huang/torchtitan that referenced this pull request Jan 22, 2026
recomputing MoE during backward

```
[rank4]: (Triggered internally at /data/users/whc/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
[rank4]:  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:Exception in thread Thread-2 (run_backward):
[rank4]:Traceback (most recent call last):
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 384, in stage_backward
[rank4]:    torch.autograd.backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/__init__.py", line 364, in backward
[rank4]:    _engine_run_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/autograd/graph.py", line 865, in _engine_run_backward
[rank4]:    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank4]:RuntimeError: NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank4]:Exception raised from throw_nccl_error at /data/users/whc/pytorch/torch/csrc/cuda/nccl.cpp:259 (most recent call first):
[rank4]:C++ CapturedTraceback:
[rank4]:pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank4]:pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
[rank4]:pytorch#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
[rank4]:pytorch#7 torch::cuda::nccl::detail::throw_nccl_error(torch::cuda::nccl::ncclResult) from ??:0
[rank4]:pytorch#8 torch::cuda::nccl::detail::NCCL_CHECK_TIMEOUT(torch::cuda::nccl::ncclResult, void*) from nccl.cpp:0
[rank4]:pytorch#9 torch::cuda::nccl::all2all_single_unequal_split(void*, unsigned long const*, unsigned long const*, void*, unsigned long const*, unsigned long const*, unsigned long, c10::ScalarType, void*
, c10::cuda::CUDAStream&) from ??:0
[rank4]:pytorch#10 c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from ?
?:0
[rank4]:pytorch#11 c10d::ops::(anonymous namespace)::alltoall_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
> const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from Ops.cpp:0
[rank4]:pytorch#12 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >
 (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vec
tor<long, std::allocator<long> >, bool, long), c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c
10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >
, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank4]:pytorch#13 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#14 c10::impl::BoxedKernelWrapper<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGrou
p, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long), void>::call(c10::Box
edKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup>
 > const&, std::vector<long, std::allocator<long> >, std::vector<long, std::allocator<long> >, bool, long) from :0
[rank4]:pytorch#15 c10d::ProcessGroup::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) from :0
[rank4]:pytorch#16 c10d::all_to_all_single(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ?
?:0
[rank4]:pytorch#17 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, st
d::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::
allocator<c10::IValue> >*) from :0
[rank4]:pytorch#18 void c10::BoxedKernel::make_boxed_function<&torch::autograd::basicAutogradNotImplementedFallbackImpl>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c
10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank4]:pytorch#19 c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> >), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__cxx11::basic_stri
ng<char, std::char_traits<char>, std::allocator<char> >) from :0
[rank4]:pytorch#20 std::vector<at::Tensor, std::allocator<at::Tensor> > torch::autograd::CppNode_apply_functional<(anonymous namespace)::AllToAllSingle>(std::vector<at::Tensor, std::allocator<at::Tensor> >
&&, torch::autograd::AutogradContext&, std::vector<bool, std::allocator<bool> > const&, std::vector<torch::autograd::VariableInfo, std::allocator<torch::autograd::VariableInfo> > const&, std::__cxx1
1::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from Functional.cpp:0
[rank4]:pytorch#21 torch::autograd::CppNode<(anonymous namespace)::AllToAllSingle>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from Functional.cpp:0
[rank4]:pytorch#22 torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) from :0
[rank4]:pytorch#23 torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueu
e> const&) from ??:0
[rank4]:pytorch#24 torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) from ??:0
[rank4]:pytorch#25 torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from ??:0
[rank4]:pytorch#26 torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) from :0
[rank4]:pytorch#27 std::error_code::default_error_condition() const from ??:0
[rank4]:pytorch#28 start_thread from ??:0
[rank4]:pytorch#29 __clone3 from :0
[rank4]:
[rank4]:
[rank4]:The above exception was the direct cause of the following exception:
[rank4]:
[rank4]:Traceback (most recent call last):
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[rank4]:    self.run()
[rank4]:  File "/home/whc/.conda/envs/pytorch-3.10/lib/python3.10/threading.py", line 953, in run
[rank4]:    self._target(*self._args, **self._kwargs)
[rank4]:  File "/data/users/whc/torchtitan/torchtitan/distributed/dual_pipe_v.py", line 254, in run_backward
[rank4]:    backward_stage.backward_one_chunk(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 799, in backward_one_chunk
[rank4]:    grads_input, _ = self.backward_maybe_with_nosync(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 653, in backward_maybe_with_nosync
[rank4]:    result = perform_backward(backward_type)()
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/stage.py", line 607, in <lambda>
[rank4]:    stage_backward(
[rank4]:  File "/data/users/whc/pytorch/torch/distributed/pipelining/_backward.py", line 425, in stage_backward
[rank4]:    raise RuntimeError(exc_msg) from e
[rank4]:RuntimeError:
[rank4]:        Failed to run stage backward:
[rank4]:        Stage output: ('Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)',)
[rank4]:        Output gradient: ('Tensor(torch.Size([1, 4096, 2048]), grad=False, dtype=torch.bfloat16)',)
[rank4]:        Input: ['Tensor(torch.Size([1, 4096, 2048]), grad=True, dtype=torch.bfloat16)']
```
@felipemello1 felipemello1 mentioned this pull request Feb 13, 2026
Erland366 added a commit to Erland366/torchtitan that referenced this pull request Mar 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants