Proposal
The current TL config for Mistral sets its context length at 32k, which results in ~34 GB of attention masks being allocated (32768 * 32768 * 32 layers, at one byte per entry), putting it out of range for all consumer-grade single-GPU setups. As described in #479, a more memory-efficient implementation of attention masks would be ideal, but in the short term the context can simply be capped at 2048/4096. If the longer context is needed, it could instead be parameterized, but I suspect 4k is enough, at least for now.
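For reference, a quick back-of-the-envelope check of the mask footprint at the current and proposed context lengths (assuming one byte per boolean mask entry and 32 layers, consistent with the ~34 GB figure above):

```python
# Rough attention-mask memory footprint for Mistral-sized models
# (assumption: one boolean byte per position pair, 32 layers).
N_LAYERS = 32

def mask_bytes(n_ctx: int) -> int:
    # One n_ctx x n_ctx mask per layer, 1 byte per entry.
    return n_ctx * n_ctx * N_LAYERS

for n_ctx in (32768, 4096, 2048):
    print(f"n_ctx={n_ctx}: {mask_bytes(n_ctx) / 1e9:.2f} GB")
# n_ctx=32768: 34.36 GB
# n_ctx=4096:  0.54 GB
# n_ctx=2048:  0.13 GB
```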
I can put up a PR if needed.
Checklist