Proposal
The current TL config for Mistral sets its context length at 32k, which results in ~34 GB of attention masks being allocated (32768 * 32768 * 32 layers, at one byte per entry), putting it out of range for all consumer-grade single-GPU setups. As described in #479, a more memory-efficient implementation of attention masks would be ideal, but in the short term the context can simply be capped at 2048/4096. If the longer context is needed, it could instead be parameterized, but I suspect 4k is enough, at least for now.
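For reference, a quick back-of-the-envelope check of the mask footprint at the current and proposed context lengths (assuming one byte per boolean mask entry and 32 layers, consistent with the ~34 GB figure above):

```python
# Rough attention-mask memory footprint for Mistral-sized models
# (assumption: one boolean byte per position pair, 32 layers).
N_LAYERS = 32

def mask_bytes(n_ctx: int) -> int:
    # One n_ctx x n_ctx mask per layer, 1 byte per entry.
    return n_ctx * n_ctx * N_LAYERS

for n_ctx in (32768, 4096, 2048):
    print(f"n_ctx={n_ctx}: {mask_bytes(n_ctx) / 1e9:.2f} GB")
# n_ctx=32768: 34.36 GB
# n_ctx=4096:  0.54 GB
# n_ctx=2048:  0.13 GB
```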
I can put up a PR if needed.
Checklist