This implementation is:
- decoder-only transformer
- no training
- autoregressive (argmax of model logits at each step)
- no symbolic/carry solver branch at inference
- calibrated for 10-digit + 10-digit addition
- Counted parameters (nn.Parameter): 22
- Trainable parameters: 0
- Stored weight buffers: 0
- 1 decoder layer
- hidden size = 3
- attention heads = 4
- KV heads = 1
- head dim = 2
- MLP hidden = 4
- vocab size = 10 (digit tokens only)
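The architecture spec above can be sketched as a plain config dataclass. All field names here are illustrative (this is not the repository's actual config object); note that with grouped-query attention the per-head dimension is set independently of the hidden size:

```python
from dataclasses import dataclass

@dataclass
class TinyAdderConfig:
    # Hypothetical names mirroring the spec above.
    n_layers: int = 1
    hidden_size: int = 3
    n_heads: int = 4      # query heads
    n_kv_heads: int = 1   # one shared KV head (grouped-query attention)
    head_dim: int = 2     # set explicitly, not hidden_size // n_heads
    mlp_hidden: int = 4
    vocab_size: int = 10  # digit tokens 0-9 only

cfg = TinyAdderConfig()
# Query heads must divide evenly over KV heads for GQA.
assert cfg.n_heads % cfg.n_kv_heads == 0
```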
The compressed handwritten design follows the reference-style setup:
- large constant embedding channel for stable RMSNorm
- RoPE offset-targeted queries
- attention extracts previous/current aligned digits
- MLP implements carry/overflow logic via thresholded linear pieces
- tied embedding decode produces digit logits
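The "thresholded linear pieces" idea for the carry can be illustrated with two ReLU hinges: for an integer digit sum s in 0..19 (two digits plus an incoming carry), relu(s - 9) - relu(s - 10) is exactly the carry bit. A minimal sketch of the mechanism, not the actual trained-free weights:

```python
def relu(x: float) -> float:
    return max(0.0, x)

def carry_bit(digit_sum: int) -> int:
    # Integer sums only: 0 for sums 0..9, 1 for sums 10..19.
    return int(relu(digit_sum - 9) - relu(digit_sum - 10))

# The difference of hinges is a step function on the integers.
assert [carry_bit(s) for s in range(20)] == [0] * 10 + [1] * 10
```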
Prompt tokens:
[0] + reverse(a_10_digits) + [0] + [0] + reverse(b_10_digits) + [0]
Generated tokens:
11 reversed sum digits (fixed length).
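The token layout above can be sketched as an encoder/decoder pair (helper names are hypothetical; only the layout itself is taken from the spec):

```python
def encode_prompt(a: int, b: int, n_digits: int = 10) -> list[int]:
    # [0] + reverse(a digits) + [0] + [0] + reverse(b digits) + [0]
    da = [(a // 10**i) % 10 for i in range(n_digits)]  # least-significant first
    db = [(b // 10**i) % 10 for i in range(n_digits)]
    return [0] + da + [0, 0] + db + [0]

def decode_sum(tokens: list[int]) -> int:
    # Generated tokens are the 11 sum digits, least-significant first.
    return sum(d * 10**i for i, d in enumerate(tokens))

# Sanity check on the ground-truth reversed-digit sum:
a, b = 9_999_999_999, 1
expected = [(a + b) // 10**i % 10 for i in range(11)]
assert decode_sum(expected) == a + b
assert len(encode_prompt(a, b)) == 24  # 1 + 10 + 2 + 10 + 1 tokens
```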
python generate_test_cases.py --n-digits 10 --size 100000 --seed 12345 --out data/heldout_autoreg_10digit.jsonl
python evaluate.py --cases data/heldout_autoreg_10digit.jsonl --n-digits 10 --batch-size 2048
Observed:
total_parameters=22, accuracy=1.000000 on 100000 held-out cases
python stress_boundaries.py --digit-sizes 10 --cases-per-size 2000 --batch-size 1024
n_digits > 10 is intentionally unsupported by this handwritten weight set.