Reorganize the building of active-masked tensors in context #2929
tdene wants to merge 6 commits into NVIDIA:main
Conversation
Force-pushed from f938a05 to 0c2c1c5
Force-pushed from 0c2c1c5 to 63c19dc
Phlip79 left a comment
Please add tests for the new logic.
Force-pushed from 63c19dc to 682cfca
```python
self.token_to_position_in_request = torch.empty_like(self.token_to_input_ids)
self.token_to_local_position_within_kv_block = torch.empty_like(self.token_to_input_ids)

# Static tensor addresses of active slices to enable fast inference kernels.
```
What's your timeline for merging this? This might be a good argument for re-ordering the context tensors and even simplifying what we already have.
I really don't like the idea of continually adding duplicated tensors, including the ones we already have, if all we need to do is bite the bullet and redesign things.
At a high level, I see this PR as the base that enables several bite-sized optimization PRs and benchmarks them to see how much improvement each optimization leads to, in order to prove their value in a clean apples-to-apples comparison.
One of these bite-sized optimization PRs is the one that reorders the context tensors.
That is, we could reorder the context tensor today, but we would not have a clear idea of how much throughput/latency we gain by doing so. Going through the work of building things on top of this PR would clearly prove how much of a practical win we can get from reordering the context tensor.
But your point is valid. I'm saying that we should do things in order so that we have proper ablation studies. We do not have to actually merge things in that order though; I can do the ablation study, come back with results, and then say "Okay, let's just skip #2929 and go straight to re-ordering the context tensors."
so will your follow-up PRs be relatively independent from this duplicated-tensor style? As in, if we re-order active & paused requests in the near future, will that require changing all of your follow-up PRs significantly?
```python
def build_active_slices(self, batch_size: int):
    """Build the active slices of specific tensors. This is run on every forward step.

    If the context is reordered to active -> paused -> finished, this can be graphed.
```
this method can be graphed? Or we wouldn't need this method at all?
exactly == this method can be graphed?
-or-
exactly == we won't need this method at all?
```python
)

# The following tensor slices are used in various kernels.
self.active_request_ids[:batch_size].copy_(self.request_ids[padded_slice])
```
if we re-order active and paused, can all these copies be avoided?
Everything listed in this PR, yes.
But all PRs that build on top of this one end up needing to add things to pad_active_slices, and that method is used to pad tensors on the interval active_request_count : padded_active_request_count. The need to pad means that we do need copies of the tensors, so that we don't overwrite real data.
We see that today in the attention metadata tensors, which currently do a copy up to padded_active_request_count followed by a pad on the interval active_request_count : padded_active_request_count. Exactly the same pattern is needed for graphed logprobs, graphed sampling, and graphing time-critical parts of _dynamic_step_context_bookkeeping.
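To make the copy-then-pad pattern above concrete, here is a minimal sketch. The tensor and variable names (`src`, `dst`, `active_request_count`, `padded_active_request_count`) are illustrative, not the actual attention-metadata fields:

```python
import torch

# Illustrative sizes: 3 real active requests, padded up to a
# CUDA-graph-friendly batch of 4 (hypothetical numbers).
active_request_count = 3
padded_active_request_count = 4

src = torch.arange(8)  # stand-in for a full context tensor
dst = torch.empty(padded_active_request_count, dtype=src.dtype)

# Copy up to padded_active_request_count into a separate tensor, so the
# padding step below never overwrites real data in `src`.
dst.copy_(src[:padded_active_request_count])
# Pad only the interval active_request_count : padded_active_request_count.
dst[active_request_count:padded_active_request_count].fill_(0)
```

Because `dst` is a separate, statically sized buffer, both the copy and the pad have fixed shapes and addresses, which is what makes the pattern graphable.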
So, in a nutshell, here's what's happening:
- This method, as it is in this PR, copies several tensors that are used in CUDA graphs in follow-up PRs. All of these copies go away if we reorder the context tensors.
- There are also several tensors that are copied & padded in follow-up PRs. If we reorder the context tensors, those copies do not go away, but they become graphable.
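A small sketch of why reordering removes the first class of copies: once active requests form a contiguous prefix, a plain slice is a zero-copy view, whereas a gather over scattered rows must allocate. All names here are illustrative:

```python
import torch

# Hypothetical context tensor, already ordered active -> paused -> finished.
request_ids = torch.tensor([10, 11, 12, 20, 21, 30])
active_count = 3

# Without ordering guarantees, selecting the active rows is a gather,
# which materializes a new tensor (a real copy).
active_idx = torch.arange(active_count)
copied = request_ids.index_select(0, active_idx)

# With the active -> paused -> finished ordering, a prefix slice suffices;
# it is a view into the same storage, so no copy happens at all.
active_view = request_ids[:active_count]
```

The view shares `request_ids`' storage (same `data_ptr()`), while `index_select` does not, which is exactly the copy this PR's follow-ups would eliminate.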
Orthogonal to your PRs and possibly less practical, but I wonder if we can merge the regular and padded tensors and just have a single tensor. We care mostly about the CUDA-graph case (or at least that's our gold standard for performance), so it seems like we could just use the padded tensors in the non-CUDA-graph case as well, for simplicity.
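A rough sketch of that single-tensor idea, assuming the non-graph path can tolerate reading a narrower slice of the padded buffer (names and sizes are hypothetical):

```python
import torch

PAD_VALUE = 0
max_padded = 8  # hypothetical maximum padded batch size

# One statically allocated buffer serves both execution paths.
single_buf = torch.full((max_padded,), PAD_VALUE, dtype=torch.long)

def fill_active(ids: torch.Tensor) -> torch.Tensor:
    """Write active ids into the shared padded buffer and re-pad the tail."""
    n = ids.numel()
    single_buf[:n].copy_(ids)
    single_buf[n:].fill_(PAD_VALUE)
    return single_buf

ids = torch.tensor([4, 5, 6])
padded = fill_active(ids)            # CUDA-graph path consumes the full buffer
eager_view = padded[: ids.numel()]   # eager path just slices the same storage
```

The eager path pays nothing extra since the slice is a view, so maintaining a second unpadded tensor would only be needed if some consumer cannot handle trailing pad values.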
What does this PR do?
This enables future optimizations.