A ~10M-parameter LLM designed for reasoning
TerryLM is a compact Transformer project for training and chatting with Terry, a tiny synthetic assistant. The model is designed for long-context reasoning and supports sequences of up to 25K tokens using efficient sliding window attention.
- Long Context Support: Handle 10K-25K token sequences with sliding window attention
- Memory Efficient: Gradient checkpointing and mixed precision training
- Reasoning Capabilities: Improved attention mechanism for better reasoning over long contexts
- Compact Architecture: 256-dimensional embeddings, 8 layers, 8 attention heads
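
Sliding window attention keeps each token's attention local: a query attends only to itself and a fixed number of preceding tokens, so attention cost grows roughly linearly with sequence length instead of quadratically. A minimal sketch of the corresponding attention mask (the function name and tensor layout are illustrative assumptions, not the project's actual implementation):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True marks key positions a query may attend to: causal (no future
    # tokens) and at most `window - 1` tokens behind the query.
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]   # query index minus key index
    return (rel >= 0) & (rel < window)

# Example: with window=4, token 6 attends to tokens 3..6 only.
print(sliding_window_causal_mask(seq_len=8, window=4).int())
```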
- Generate Terry conversations:

  ```bash
  python data/generate_terry_dataset.py
  ```

  This writes:
  - src/terry_daily_chat_train.jsonl
  - src/terry_daily_chat_valid.jsonl
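  A quick way to sanity-check the generated data, for example (assumes the standard one-JSON-object-per-line .jsonl layout; the exact record schema is not specified here):

  ```python
  import json

  # Print the first few generated conversations from the training split.
  with open("src/terry_daily_chat_train.jsonl", encoding="utf-8") as f:
      for i, line in enumerate(f):
          if i == 3:
              break
          print(json.loads(line))
  ```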
- Prepare tokenized training data:

  ```bash
  python prepare_data.py
  ```

  This writes:
  - src/processed/terry_train_tokens.txt
  - src/processed/terry_valid_tokens.txt
  - tokenizer/terry_byte/tokenizer_config.json
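  To confirm the prepared data, a rough sketch (the whitespace-separated integer token format is an assumption, not documented behavior):

  ```python
  # Hypothetical sanity check: count tokens in the prepared training split.
  # Assumes the .txt file stores whitespace-separated integer token IDs;
  # the project's actual on-disk format may differ.
  with open("src/processed/terry_train_tokens.txt", encoding="utf-8") as f:
      n_tokens = sum(len(line.split()) for line in f)
  print(f"training tokens: {n_tokens}")
  ```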
- Train:

  ```bash
  python train.py
  ```

  Key parameters in config.py:

  ```python
  @dataclass
  class ModelConfig:
      d_model: int = 256
      n_layers: int = 8
      n_heads: int = 8
      max_seq_len: int = 8192       # Maximum sequence length
      sliding_window: int = 2048    # Local attention window
      use_sliding_window: bool = True
  ```
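  The feature list above mentions sequences up to 25K tokens; a hedged example of overriding the defaults for a longer-context run (assumes `ModelConfig` is importable from `config.py` as shown; whether other settings also need to change is not covered here):

  ```python
  from config import ModelConfig

  # Hypothetical long-context setup: raise the maximum sequence length toward
  # the 25K-token range while keeping the 2048-token local attention window.
  long_ctx_config = ModelConfig(
      max_seq_len=25_000,
      sliding_window=2048,
      use_sliding_window=True,
  )
  print(long_ctx_config)
  ```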
The project uses a local byte-level tokenizer with fixed special token IDs:

- 0: `<|pad|>`
- 1: `<|im_start|>`
- 2: `<|im_end|>`
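
The `<|im_start|>`/`<|im_end|>` pair suggests a ChatML-style conversation format. A sketch of how a prompt could be assembled with these tokens (the exact role names and template the project uses are assumptions here):

```python
def build_chat_prompt(user_message: str) -> str:
    # ChatML-style layout using the special tokens listed above; the precise
    # template (role labels, newlines) is an assumption, not verified behavior.
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_chat_prompt("Hi Terry, how was your day?"))
```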
Run the example script:

```bash
python example_usage.py
```