Varlen and testing tweaks #408
Conversation
```python
# Takes ~6s, much more if it needs to compile, reducing the hidden size doesn't help.
@pytest.mark.slow
@pytest.mark.skip("Dropless MoE is broken")
```
btw, torch now comes with grouped_mm, which is much faster than naive looping... huggingface/transformers#42697
I think it's very similar to our sparse linear kernel, though it doesn't seem to use padding, so it could be simpler. However, most of the difficulty is in the preparation and the sparse data copy, so I'm not sure it would help much by itself.
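For reference, a minimal NumPy sketch (names hypothetical, not the repo's actual kernel) of the naive per-expert loop that a grouped matmul would replace. The gather/scatter indexing here is the "preparation and sparse data copy" cost mentioned above, which a faster matmul alone does not remove:

```python
import numpy as np

def moe_expert_matmul_loop(x, expert_ids, weights):
    """Naive dropless-MoE forward: loop over experts, gather each
    expert's tokens, run one dense matmul, scatter results back.

    x:          (tokens, d_in) activations
    expert_ids: (tokens,) expert assignment per token
    weights:    (n_experts, d_in, d_out) expert weight matrices
    """
    out = np.zeros((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for e in range(weights.shape[0]):
        # Sparse gather: the preparation/data-copy step that dominates cost.
        idx = np.nonzero(expert_ids == e)[0]
        if idx.size:
            out[idx] = x[idx] @ weights[e]  # one dense matmul per expert
    return out
```

A grouped matmul fuses the per-expert matmuls into one kernel call, but the token-to-expert sorting and copies still have to happen somewhere.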
Thanks for making this better and better! There was an issue where VLM training would crash if the model debug level was > 0; this commit was an attempt to fix it. I think you're already discovering that there are issues with distributed training...
I saw this in #409 after it was merged and posted some comments. Concerning the vision dim, I think I missed a few bugs because I used the same hidden size for the vision and text models in the tests; I will address this.
✨ Description
- Add a varlen implementation for Mamba (based on hybrid_dev); broken, see [bug] Can't compile varlen mamba with base image 25.11 #416
- Add `test_varlen`, also test attention and Mamba
- Add a `get_stage` util to allow simpler usage
- Mark MoE tests as broken (moved to Ensure compatibility between models and datasets #402)
- Add `test_reverse_kl`, but `_torch_reverse_kl_forward_backward` is using `loss.backward`, which is unlikely to work in a distributed setting.

Current test failures:
I don't know about the first two; the other ones are gradient mismatches for reverse KL.
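For readers unfamiliar with the varlen layout the PR adds for Mamba: variable-length sequences are packed into one flat buffer with cumulative sequence lengths (`cu_seqlens`) marking the boundaries, instead of padding to a common length. A minimal sketch (helper names hypothetical, not this repo's API):

```python
import numpy as np

def pack_varlen(sequences):
    """Pack variable-length sequences into one flat buffer plus
    cumulative sequence lengths (cu_seqlens), the layout commonly used
    by varlen attention/Mamba kernels instead of a padded batch."""
    lengths = [len(s) for s in sequences]
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    packed = np.concatenate(sequences, axis=0)
    return packed, cu_seqlens

def unpack_varlen(packed, cu_seqlens):
    """Recover the individual sequences from the packed layout."""
    return [packed[cu_seqlens[i]:cu_seqlens[i + 1]]
            for i in range(len(cu_seqlens) - 1)]
```

The kernel then iterates over `cu_seqlens` windows so state never leaks across sequence boundaries, which is what `test_varlen` would need to verify.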
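On the reverse-KL failures: for reference, a NumPy sketch of the quantity involved, reverse KL being KL(student || teacher) rather than the usual forward direction. This is only the forward computation; the real `_torch_reverse_kl_forward_backward` also runs the backward pass, which is where distributed gradient mismatches would surface:

```python
import numpy as np

def reverse_kl(student_logits, teacher_logits):
    """Reverse KL divergence KL(student || teacher), averaged over rows.
    Numerically stable log-softmax sketch; torch's version would be
    differentiable, this one just illustrates the quantity."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_q = log_softmax(student_logits)   # student distribution (q)
    log_p = log_softmax(teacher_logits)   # teacher distribution (p)
    q = np.exp(log_q)
    # Reverse KL weights the log-ratio by the student's probabilities.
    return (q * (log_q - log_p)).sum(axis=-1).mean()
```

Since the loss reduces over the (potentially sharded) vocabulary dimension, a `loss.backward` that ignores the sharding would plausibly produce exactly the kind of gradient mismatch seen in the failures.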