🚨 [Kernels] Fix kernel function registration #45420
Conversation
Cyrilvallez left a comment
Nice! Mostly nits to avoid relying on internals too much!
```python
Mode.TRAINING: FuncRepository(
    repo_id="kernels-community/rotary", func_name="apply_rotary_transformers"
),
```
Not sure why this wasn't included for training before, but it runs with DeepSpeed just fine + https://github.com/Dao-AILab/flash-attention/blob/b65ae6b175f2438de55601695b6a21971fc5e429/flash_attn/layers/rotary.py#L38-L90
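(For context: a repo_id/func_name pair like the one above can be resolved through the kernels library's public `get_kernel` API; the `FuncRepository` plumbing itself is transformers-internal, so this is just a rough sketch.)

```python
from kernels import get_kernel

# Load the kernel repo from the Hub, then grab the function by name.
rotary = get_kernel("kernels-community/rotary")
apply_rotary = getattr(rotary, "apply_rotary_transformers")
```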
```python
def attach_hidden_kernels(module):
    for name, fn in getattr(module, "_hidden_kernels", {}).items():
        if name not in dict(module.named_children()):
            module.register_module(name, fn)


def detach_hidden_kernels(module):
    for name in getattr(module, "_hidden_kernels", {}):
        delattr(module, name)
```
Removed the internal structure and now rely on native APIs instead, as suggested.
```python
self.apply(attach_hidden_kernels)
try:
```
Would put the apply inside the try as well, but not a big deal at all!
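Something like this, i.e. so that a failure during the attach is also cleaned up (the surrounding structure is a guess):

```python
try:
    # Attaching inside the try means a failure here still hits the cleanup.
    self.apply(attach_hidden_kernels)
    ...  # work that needs the kernels temporarily visible as submodules
finally:
    self.apply(detach_hidden_kernels)
```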
* fix attempt
* proper fix - also works with deepspeed
* rely less on internals and add rotary to training
* move under the try as well
Breaking change
🚨 Slightly breaking change: we no longer register the hidden `rotary_fn`. Users shouldn't have relied on it, but marking it in any case: e.g. `self.rotary_fn(...)` within the Attention module does not work anymore, as the reference is deleted from now on.
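Concretely, a hedged before/after sketch (module path and names are hypothetical):

```python
attn = model.model.layers[0].self_attn  # hypothetical path to an Attention module

# Before this PR, the exchanged kernel was a registered child module:
"rotary_fn" in dict(attn.named_children())  # previously True, now False

# The hidden registration is deleted, so relying on self.rotary_fn(...)
# inside the module no longer works; per the description below, only a
# pre-existing self reference survives, and only as a plain attribute.
```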
Description

As per title, we do not want proper nn.Modules to be registered for kernel-exchanged functions - they are not proper modules (and they are never called as such)! They act as an exchange format for kernels, but functionally they should stay pure functions only.

The exact reasons are numerous, but one recent example is DeepSpeed ZeRO-3, which cannot handle this: the module is never properly called in the forward on the module directly (untraceable), and it changes the module structure after model construction (fixable by changing the order of inits, tbh).
This PR changes the core functionality to register the modules only temporarily under the parent module, discover the exchangeable functions, and then delete them from the visible interface. For BC purposes, we still keep a self reference that already exists (now as a simple attribute, not a module).
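A minimal sketch of that flow, assuming hypothetical helper names and that the exchanged functions are wrapped as nn.Module instances (which `register_module` requires):

```python
import torch.nn as nn

def swap_in_kernel_functions(module: nn.Module, kernel_fns: dict):
    # 1) Temporarily register each function under the parent module so it
    #    can be discovered like any other child module.
    for name, fn in kernel_fns.items():
        module.register_module(name, fn)

    # 2) Discover the exchangeable functions via the native API.
    discovered = {n: m for n, m in module.named_children() if n in kernel_fns}

    # 3) Delete them from the visible nn.Module interface (removes them
    #    from _modules, state_dict traversal, hook application, ...).
    for name, fn in discovered.items():
        delattr(module, name)
        # 4) Keep a plain attribute for BC; bypassing nn.Module.__setattr__
        #    avoids re-registering the value as a submodule.
        object.__setattr__(module, name, fn)
```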