[LoRA] Add LoRA model converter with dynamic class inheritance #2923
mori360 wants to merge 1 commit into pytorch:main
Conversation
@claude review the pr
Claude finished @mori360's task in 6m 48s — View job

PR Review: [LoRA] Add LoRA model converter with dynamic class inheritance

Overall this is a well-structured PR. The dynamic class inheritance approach is a good design choice for composing LoRA with arbitrary nn.Linear subclasses. Below are findings organized by severity.

Issues: 1. … 2. … 3. … 4. No validation that …

Suggestions (non-blocking): 5. Consider reusing the shared … 6. … 7. Moving … 8. The …

Positive Notes: …
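To make the "dynamic class inheritance" idea concrete, here is a minimal sketch of the technique: the converter mutates each linear's `__class__` into a freshly created subclass of its own class, so LoRA composes with any nn.Linear subclass without copying weights. All names (`convert_to_lora`, `lora_a`, `lora_b`) are illustrative; the PR's actual converter API may differ.

```python
import torch
import torch.nn as nn

def convert_to_lora(linear: nn.Linear, rank: int = 8) -> nn.Linear:
    """Sketch: re-parent `linear` onto a dynamic subclass of its own class
    and attach trainable low-rank adapters (hypothetical names)."""
    base_cls = linear.__class__

    class LoRAWrapped(base_cls):
        def forward(self, x):
            # Frozen base path plus low-rank update: W x + B(A x)
            return base_cls.forward(self, x) + self.lora_b(self.lora_a(x))

    linear.__class__ = LoRAWrapped  # in-place re-parenting, no weight copy
    linear.weight.requires_grad_(False)          # freeze base weight
    if linear.bias is not None:
        linear.bias.requires_grad_(False)
    # Adapters become child nn.Modules of the wrapped linear.
    linear.lora_a = nn.Linear(linear.in_features, rank, bias=False)
    linear.lora_b = nn.Linear(rank, linear.out_features, bias=False)
    nn.init.zeros_(linear.lora_b.weight)         # zero init: no-op at start
    return linear
```

Because `LoRAWrapped` subclasses the original class, `isinstance(converted, nn.Linear)` (and any subclass checks) keep working, and the zero-initialized `lora_b` means the converted module initially reproduces the base output exactly.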
Summary
Parallelism Compatibility
FSDP (Data Parallel) — Works out of the box. LoRA adapters (lora_a, lora_b) are registered as child nn.Modules of the wrapped linear, so fully_shard() treats them like any other submodule. Base weights are frozen (requires_grad=False) and LoRA weights are trainable — FSDP only all-gathers/reduces gradients for trainable parameters, so no wasted communication.
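The frozen-base/trainable-adapter split above can be sketched with a toy stand-in module (names like `base`, `lora_a`, `lora_b` are illustrative, not the PR's exact structure):

```python
import torch.nn as nn

class LoRAAdapted(nn.Module):
    """Toy stand-in for a LoRA-wrapped linear: frozen base weights plus
    trainable adapters registered as child modules."""
    def __init__(self, d_in: int, d_out: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # frozen
        self.base.bias.requires_grad_(False)     # frozen
        self.lora_a = nn.Linear(d_in, rank, bias=False)   # trainable child
        self.lora_b = nn.Linear(rank, d_out, bias=False)  # trainable child

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

m = LoRAAdapted(8, 8)
# fully_shard(m) would treat lora_a/lora_b like any other submodule;
# only these trainable parameters produce gradients to reduce.
trainable = [n for n, p in m.named_parameters() if p.requires_grad]
```

Here `trainable` contains only the adapter weights, which is why FSDP spends no gradient communication on the frozen base parameters.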
Pipeline Parallel (PP) — Works. LoRA doesn't change the model's module structure or the number of layers, so pipeline stage splitting is unaffected. Each stage's linear layers independently get LoRA adapters during convert().
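A small sketch of why stage splitting is stable: conversion attaches adapters *inside* each linear, so the top-level module list that pipeline splitting indexes into is unchanged (the adapter names below are illustrative):

```python
import torch.nn as nn

layers = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
before = [name for name, _ in layers.named_children()]

for m in layers:
    if isinstance(m, nn.Linear):
        # stand-in for convert(): attach adapter child modules in place
        m.lora_a = nn.Linear(8, 4, bias=False)
        m.lora_b = nn.Linear(4, 8, bias=False)

after = [name for name, _ in layers.named_children()]
# before == after: a split such as layers[:2] / layers[2:] lands on the
# same stage boundaries before and after conversion.
```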
Context Parallel (CP) — Works. CP operates on the attention module's sequence dimension, not on linear layer internals. LoRA adapters participate in the same forward pass as the base linear.
Tensor Parallel (TP) — Not supported yet. PyTorch's ColwiseParallel._partition_linear_fn (in torch/distributed/tensor/parallel/style.py) uses named_parameters(recurse=True), which yields dotted names like lora_a.weight from child modules. register_parameter() rejects names containing ".", causing a KeyError. RowwiseParallel is safe (it explicitly accesses module.weight and module.bias only). This requires an upstream PyTorch fix to ColwiseParallel._partition_linear_fn — either switching to recurse=False or explicitly accessing module.weight/module.bias as RowwiseParallel does. Until then, LoRA should not be used with TP. The debug config llama3_debugmodel_lora uses TP degree 1.
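The failure mode can be reproduced without TP at all, since it comes down to nn.Module's own naming rules (the adapter name `lora_a` below is illustrative):

```python
import torch
import torch.nn as nn

parent = nn.Linear(4, 4)
parent.lora_a = nn.Linear(4, 2, bias=False)  # adapter as a child module

# recurse=True includes dotted names contributed by child modules.
names = [n for n, _ in parent.named_parameters(recurse=True)]
# -> ['weight', 'bias', 'lora_a.weight']

# register_parameter() rejects any name containing ".", which is what
# ColwiseParallel._partition_linear_fn hits when it re-registers each
# yielded parameter on the wrapped linear.
try:
    parent.register_parameter("lora_a.weight", nn.Parameter(torch.empty(2, 4)))
except KeyError as e:
    print("KeyError:", e)

# Restricting to recurse=False (one proposed upstream fix) sidesteps it.
safe = [n for n, _ in parent.named_parameters(recurse=False)]
# -> ['weight', 'bias']
```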
Expert Parallel (EP) — Works in principle. LoRA applies to nn.Linear subclasses, and MoE expert layers use GroupedExperts (not nn.Linear), so LoRA would only wrap the shared/dense linear layers (attention projections, shared expert FFN), not the routed experts. EP sharding of the expert dimension is unaffected.
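A sketch of why routed experts are skipped: selection is by isinstance check against nn.Linear, and a grouped-experts module stores its weights as a single 3-D parameter rather than subclassing nn.Linear (the `GroupedExperts` and `lora_target_names` definitions below are toy stand-ins, not the real torchtitan classes):

```python
import torch
import torch.nn as nn

class GroupedExperts(nn.Module):
    """Toy stand-in for a routed-experts module: all expert weights live
    in one 3-D parameter, so it is not an nn.Linear subclass."""
    def __init__(self, num_experts: int, dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(num_experts, dim, dim))

def lora_target_names(model: nn.Module) -> list[str]:
    # Hypothetical selection pass: match nn.Linear subclasses only, so
    # EP sharding over the expert dimension is left untouched.
    return [name for name, m in model.named_modules()
            if isinstance(m, nn.Linear)]

moe = nn.ModuleDict({
    "shared_expert": nn.Linear(8, 8),         # gets a LoRA adapter
    "routed_experts": GroupedExperts(4, 8),   # skipped by the converter
})
```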
Test Plan
Unit tests (pytest tests/unit_tests/test_model_converter.py):