Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available for 30 days after the last update.
ArthurZucker left a comment
Nice! Is this benched as well?
    # Also free CPU-offloaded cache for cancelled states
    for state in cancelled_states:
        self.offloading_manager.free_request_cpu_cache(state)
Suggested change:
-    # Also free CPU-offloaded cache for cancelled states
-    for state in cancelled_states:
-        self.offloading_manager.free_request_cpu_cache(state)
+    self.offloading_manager.free_requests(cancelled_states)
API-wise, can you isolate this code?
Not sure I understand, sorry!
I mean, can you wrap the loop over the cancelled states inside the free_requests API? Re: my comments about memory transfer, you are better off with the full list.
Does not change much: this is a CPU-only operation, torch doesn't even intervene. Added a note about this.
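For reference, a minimal sketch of what the suggested bulk wrapper could look like. The manager internals here are hypothetical; only free_request_cpu_cache appears in the diff, and the point is that the loop is pure CPU bookkeeping, so wrapping it changes the call site but not the cost:

```python
class OffloadingManager:
    """Minimal sketch of the discussed wrapper, not the PR implementation."""

    def __init__(self):
        self._cpu_slots = {}  # hypothetical per-request pinned-CPU page bookkeeping

    def free_request_cpu_cache(self, state) -> None:
        # Return the pinned-CPU pages held by this request to the free pool.
        self._cpu_slots.pop(state, None)

    def free_requests(self, states) -> None:
        # Bulk entry point: pure CPU-side bookkeeping, so looping here issues no
        # device transfers; it only keeps the scheduler call site to a single call.
        for state in states:
            self.free_request_cpu_cache(state)
```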
def _stream_ctx(self):
    """Returns a context manager that runs enclosed ops on the compute stream, or a no-op when none is set."""
    return torch.cuda.stream(self._compute_stream) if self._compute_stream is not None else nullcontext()
I don't see this or the offloading manager being gated on this capability (for MPS you have shared memory, and IDK if Neuron, TPU, etc. have the same API).
Not sure what you mean: this is a no-op unless there is already a compute stream around.
I just mean that torch.cuda will error out on MPS!
No, I'm pretty sure we are safe: the ternary only evaluates the CUDA part if there is a compute stream that's not None, and that will not happen on MPS.
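To illustrate the point, a small standalone sketch (mirroring the helper above, not the PR class itself) showing that the torch.cuda branch is never evaluated when no compute stream was created, e.g. on MPS or CPU:

```python
from contextlib import nullcontext
import torch

class _StreamCtxSketch:
    def __init__(self, compute_stream=None):
        # On non-CUDA backends no CUDA stream is ever created, so this stays None.
        self._compute_stream = compute_stream

    def _stream_ctx(self):
        # torch.cuda.stream(...) is only reached when _compute_stream is not None,
        # so MPS/CPU never touch the torch.cuda API here.
        return torch.cuda.stream(self._compute_stream) if self._compute_stream is not None else nullcontext()

# On a machine without CUDA this is a plain no-op context:
with _StreamCtxSketch()._stream_ctx():
    pass
```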
| """Returns a context manager that runs enclosed ops on the compute stream, or a no-op when none is set.""" | ||
| return torch.cuda.stream(self._compute_stream) if self._compute_stream is not None else nullcontext() | ||
|
|
||
| def offload_one_request(self) -> None: |
Makes sense to copy in bulk after we compute which pages we need to copy, imo!
That's what happens, I think.
I don't think so, no: each copy_ is a DMA / copy request, and it's not async unless requested, no?
Ha yes, I thought this was about restore. For the async part, I think it's OK because there is no async mode yet. As for the "bulk" part, we currently only offload one request at a time, so we group as much as possible, but it's still the cache for one request only.
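As a rough illustration of the grouping idea, here is a hypothetical helper (not the PR's offload_one_request, whose signature and paging layout differ) that issues one copy_ per contiguous run of GPU pages for a single request instead of one copy_ per page:

```python
import torch

def offload_pages_bulk(gpu_cache: torch.Tensor, pinned_cpu_buffer: torch.Tensor,
                       page_indices: list[int], cpu_offset: int) -> None:
    """Hypothetical sketch: copy one request's KV pages to a pinned CPU buffer,
    grouping contiguous GPU pages into a single copy_ each."""
    pages = sorted(page_indices)
    i = 0
    while i < len(pages):
        j = i
        # Grow the run while the source pages are contiguous in GPU memory.
        while j + 1 < len(pages) and pages[j + 1] == pages[j] + 1:
            j += 1
        n = j - i + 1
        src = gpu_cache[pages[i]: pages[i] + n]
        dst = pinned_cpu_buffer[cpu_offset: cpu_offset + n]
        # Each copy_ issues one device-to-host copy; it is synchronous unless
        # non_blocking=True is requested (which needs pinned memory to overlap).
        dst.copy_(src)
        cpu_offset += n
        i = j + 1
```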
* Stacked commits
* Fix compile and processors
* Review compliance
* Added None for cpu pool swap size
* TODO and disable
* First fixes
* Small fixes
* Fixes for 2s (1/n)
* Fixes for 2s (2/n)
* Fix for 2s (3/n)
* End of fixes (4/4)
* review compliance
* nits
Summary
This PR adds CPU offloading to continuous batching. It's in draft until perf and test status are reported.
When the GPU KV cache is full and a request must be evicted, we check whether there is enough room in a pre-allocated pinned CPU buffer to copy the request's KV cache into, instead of discarding it. Otherwise we fall back to the legacy "soft reset" path, which forces a full re-prefill on restore. This avoids redundant compute when cache pressure is temporary.
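Roughly, the eviction decision looks like the sketch below; the names and method shapes are illustrative assumptions, not the PR's actual API:

```python
class EvictionSketch:
    """Hypothetical sketch of the eviction flow described above."""

    def __init__(self, offloading_manager, soft_reset_fn):
        self.offloading_manager = offloading_manager
        self.soft_reset = soft_reset_fn

    def evict(self, state) -> None:
        if self.offloading_manager.has_cpu_room_for(state):  # hypothetical capacity check
            # Copy the request's KV pages into the pinned CPU buffer; restoring
            # becomes a copy back instead of a full re-prefill.
            self.offloading_manager.offload(state)
        else:
            # Legacy "soft reset" path: discard the cache and re-prefill on restore.
            self.soft_reset(state)
```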
Main additions:
preserved), and forking.
Performance
As expected, no major difference for generation without offloading:
Tests
The following tests pass: