This project trains a Deep Q-Network (DQN) agent to play 2048 using a strategy-guided reinforcement learning pipeline.
Direct link: assets/demo.mov
- rl2048/: core RL code (environment, model, replay buffer, shaping, teacher policy)
- scripts/: executable training and play scripts
- web/: Flask-based visualizer UI and API
- models/: saved checkpoints and model weights
- train.py: main training entrypoint
- play.py: terminal play entrypoint
- serve_web.py: web visualizer entrypoint
The model is a Q-network (rl2048/qnet.py) that predicts Q-values for the 4 actions:
- 0: up
- 1: down
- 2: left
- 3: right
Input is the board encoded in log2 space:
- 0 -> 0
- 2 -> 1
- 4 -> 2
- ...
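The mapping above can be sketched in a few lines. This is a minimal illustration, not the project's encoder: the function name encode_log2 and the nested-list board representation are assumptions made for the example.

```python
def encode_log2(board):
    """Map raw tile values to log2 indices: 0 -> 0, 2 -> 1, 4 -> 2, ..."""
    # For a power of two v, v.bit_length() - 1 equals log2(v).
    # Empty cells (0) stay 0.
    return [[v.bit_length() - 1 if v > 0 else 0 for v in row] for row in board]


board = [
    [0, 2, 4, 8],
    [0, 0, 16, 32],
    [0, 0, 0, 2048],
    [0, 0, 0, 0],
]
encoded = encode_log2(board)
print(encoded[0])     # [0, 1, 2, 3]
print(encoded[2][3])  # 11, since 2048 == 2 ** 11
```

Encoding tiles as small integers keeps the input range compact and lets each tile value index directly into an embedding table.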
Architecture:
- Tile embedding layer (nn.Embedding)
- Two convolution layers with ReLU
- Fully connected Q-head outputting 4 Q-values
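As a rough picture of how these three pieces could fit together, here is a PyTorch sketch. Only the layer types come from the list above; the class name QNetSketch, the embedding size, channel counts, kernel sizes, the hidden layer in the head, and the number of tile ids are all hypothetical and may differ from rl2048/qnet.py.

```python
import torch
import torch.nn as nn


class QNetSketch(nn.Module):
    """Illustrative Q-network; all sizes below are assumptions."""

    def __init__(self, num_tile_ids=18, embed_dim=16, channels=64, hidden=128):
        super().__init__()
        # One learned vector per log2 tile index (0 = empty cell).
        self.embed = nn.Embedding(num_tile_ids, embed_dim)
        # Two convolution layers with ReLU; padding keeps the 4x4 grid size.
        self.conv = nn.Sequential(
            nn.Conv2d(embed_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Fully connected head producing one Q-value per action.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 4 * 4, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, obs):
        # obs: (batch, 4, 4) integer tensor of log2 tile indices.
        x = self.embed(obs)         # (batch, 4, 4, embed_dim)
        x = x.permute(0, 3, 1, 2)   # (batch, embed_dim, 4, 4) for Conv2d
        return self.head(self.conv(x))  # (batch, 4)


net = QNetSketch()
obs = torch.zeros(2, 4, 4, dtype=torch.long)  # two empty boards
q_values = net(obs)
print(q_values.shape)  # torch.Size([2, 4])
```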
At inference time, the policy selects the highest-Q legal move.
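A minimal sketch of that selection rule, assuming the Q-values arrive as a plain list indexed by action and the legal actions as a list of action indices (the function name select_action is hypothetical):

```python
def select_action(q_values, legal_actions):
    """Return the legal action with the highest Q-value."""
    if not legal_actions:
        raise ValueError("no legal moves: the game is over")
    # Only legal actions are candidates, so an illegal move is never chosen
    # even when it has the highest raw Q-value.
    return max(legal_actions, key=lambda a: q_values[a])


q_values = [1.5, 3.2, 0.7, 2.9]  # up, down, left, right
action = select_action(q_values, legal_actions=[0, 2, 3])
print(action)  # 3 (right): action 1 scores highest but is not legal here
```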
The environment is Game2048Env in rl2048/game.py.
Base reward:
- merge score increase from the move
Optional strategy shaping (--use_shaping):
- corner adherence improvement
- anchor-row fill improvement
- monotone snake improvement
- big-tile proximity improvement
- smoothness improvement
- empty-cell improvement
- trapped-small-tile reduction
Illegal/no-op moves are penalized strongly.
Default shaping weights:
- merge_reward: 1.0
- corner_bonus: 2.0
- anchor_row_fill: 0.5
- monotone_snake: 0.2
- big_tile_proximity: 0.02
- smoothness: 0.05
- empty: 0.3
- trap: 0.3
- illegal_penalty: 3.0
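To make the idea of "improvement" terms concrete, the sketch below combines the merge score with two of the shaping terms as weighted before/after differences. It is an assumption that the project computes shaping this way; the feature functions, the function name shaped_reward, and the exact formula are illustrative, and only the weight names and values are taken from the list above.

```python
# Subset of the default weights listed above.
WEIGHTS = {"merge_reward": 1.0, "corner_bonus": 2.0, "empty": 0.3, "illegal_penalty": 3.0}


def empty_cells(board):
    """Number of empty cells on the board."""
    return sum(v == 0 for row in board for v in row)


def max_in_top_right(board):
    """1.0 if the largest tile sits in the top-right (TR) corner, else 0.0."""
    return 1.0 if board[0][-1] == max(max(row) for row in board) else 0.0


def shaped_reward(prev, nxt, merge_gain, w=WEIGHTS):
    """Merge score plus weighted feature improvements (assumed formula)."""
    if prev == nxt:
        # The move changed nothing: treat it as illegal/no-op.
        return -w["illegal_penalty"]
    reward = w["merge_reward"] * merge_gain
    reward += w["corner_bonus"] * (max_in_top_right(nxt) - max_in_top_right(prev))
    reward += w["empty"] * (empty_cells(nxt) - empty_cells(prev))
    return reward


prev = [[4, 4, 0, 0], [0, 0, 0, 2], [0, 0, 0, 0], [0, 0, 0, 0]]
# After moving right (4 + 4 -> 8) and a new 2 spawning in the bottom-left.
nxt = [[0, 0, 0, 8], [0, 0, 0, 2], [0, 0, 0, 0], [2, 0, 0, 0]]
print(shaped_reward(prev, nxt, merge_gain=8))   # 10.0 = 8 (merge) + 2 (corner) + 0 (empty)
print(shaped_reward(prev, prev, merge_gain=0))  # -3.0
```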
rl2048/strategy_teacher.py provides a deterministic teacher policy with an anchor corner (default TR).
Behavior:
- prefers anchor-preserving moves (RIGHT/UP for TR)
- uses fallback directions when needed
- avoids illegal moves
- avoids illegal moves
- prefers moves that allow restoring the corner quickly if displaced
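The first three behaviors can be approximated by a fixed preference order filtered by legality. The sketch below is a simplified stand-in, not the code in rl2048/strategy_teacher.py: the fallback order (LEFT before DOWN) is an assumption, the helper names are hypothetical, and the corner-restoration lookahead is left out.

```python
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3


def slide_left(row):
    """Slide and merge one row toward index 0 using standard 2048 rules."""
    tiles = [v for v in row if v]
    out, i = [], 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            out.append(tiles[i] * 2)  # merge a pair once
            i += 2
        else:
            out.append(tiles[i])
            i += 1
    return out + [0] * (len(row) - len(out))


def move(board, action):
    """Return the board after a move, without spawning a new tile."""
    if action == LEFT:
        return [slide_left(r) for r in board]
    if action == RIGHT:
        return [slide_left(r[::-1])[::-1] for r in board]
    cols = [list(c) for c in zip(*board)]  # transpose: columns become rows
    if action == UP:
        moved = [slide_left(c) for c in cols]
    else:
        moved = [slide_left(c[::-1])[::-1] for c in cols]
    return [list(r) for r in zip(*moved)]


def teacher_action(board, preference=(RIGHT, UP, LEFT, DOWN)):
    """First legal move in preference order (TR anchor), or None if stuck."""
    for a in preference:
        if move(board, a) != board:  # a move is legal only if it changes the board
            return a
    return None


packed = [[0, 0, 2, 4], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(teacher_action(packed))  # 2 (LEFT): RIGHT and UP would not change the board
```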
It can generate demonstrations (.npz) with:
- obs (log2 board)
- action
- reward
- next_obs
- done
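For orientation, the snippet below writes and reads one transition using those five keys with NumPy. The array shapes and dtypes are assumptions, and an in-memory buffer stands in for a real file such as models/teacher_demos.npz.

```python
import io

import numpy as np

# One transition: moving right merges the two 2-tiles (log2 index 1) into a 4 (index 2).
obs = np.array([[[0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]], dtype=np.int64)
next_obs = np.array([[[0, 0, 0, 2], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0]]], dtype=np.int64)
action = np.array([3], dtype=np.int64)      # 3 = right
reward = np.array([4.0], dtype=np.float32)  # merge score of the move
done = np.array([False])

buffer = io.BytesIO()  # stand-in for a file path
np.savez(buffer, obs=obs, action=action, reward=reward, next_obs=next_obs, done=done)

buffer.seek(0)
demos = np.load(buffer)
keys = sorted(demos.files)
print(keys)                # ['action', 'done', 'next_obs', 'obs', 'reward']
print(demos["obs"].shape)  # (1, 4, 4)
```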
Main script: train.py (calls scripts/train.py).
Training uses:
- Double DQN
- Huber loss
- target network updates
- replay buffer sampling
- epsilon-greedy exploration
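The Double DQN target and the Huber loss can be worked through on a single transition in plain Python. The discount factor gamma and the Huber threshold delta below are assumed values, and the function names are hypothetical; the project's actual hyperparameters live in its training script.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online net picks the action, the target net scores it."""
    if done:
        return reward  # no bootstrapping from a terminal state
    best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best]


def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


next_q_online = [1.0, 5.0, 2.0, 3.0]  # online network prefers action 1
next_q_target = [2.0, 1.5, 6.0, 0.5]  # target network's value for action 1 is 1.5

target = double_dqn_target(4.0, False, next_q_online, next_q_target)
loss = huber(5.0 - target)  # 5.0 is a made-up current prediction Q(s, a)
print(target, loss)
```

With these numbers the target is 4.0 + 0.99 * 1.5, whereas plain DQN would bootstrap from max(next_q_target) = 6.0. Decoupling action selection from evaluation in this way is what reduces overestimation.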
Curriculum phases:
- Phase A (warm-up): teacher-only actions to seed replay
- Phase B (mixed): epsilon-greedy policy with teacher override probability decay
- Phase C (pure RL): normal RL policy without teacher override
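One way to picture the three phases is as a function of the environment step count. The sketch below assumes a linear decay of the teacher override probability, counted from the end of warm-up, and introduces a hypothetical mixed_env_steps parameter to mark where Phase B ends; the default values mirror the flags used in the strategy-guided run in this README, but the scheduling logic itself is an assumption.

```python
import random


def teacher_mix_prob(step, start=0.3, end=0.1, decay_steps=1_000_000):
    """Linearly interpolate the teacher override probability (assumed schedule)."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return start + frac * (end - start)


def phase(env_step, warmup_env_steps=100_000, mixed_env_steps=1_000_000):
    """Map an environment step to curriculum phase A, B, or C."""
    if env_step < warmup_env_steps:
        return "A"
    if env_step < warmup_env_steps + mixed_env_steps:
        return "B"
    return "C"


def use_teacher(env_step, rng, warmup_env_steps=100_000, mixed_env_steps=1_000_000):
    """Decide whether the teacher chooses the action at this step."""
    p = phase(env_step, warmup_env_steps, mixed_env_steps)
    if p == "A":
        return True   # warm-up: teacher-only actions seed the replay buffer
    if p == "C":
        return False  # pure RL: no teacher override
    # Mixed phase: override with a decaying probability.
    return rng.random() < teacher_mix_prob(env_step - warmup_env_steps)


rng = random.Random(0)
print(phase(0), phase(100_000), phase(1_100_000))  # A B C
print(use_teacher(50_000, rng), use_teacher(2_000_000, rng))  # True False
```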
pip install -r requirements.txt
python strategy_teacher.py --episodes 200 --out models/teacher_demos.npz --anchor_corner TR
Strategy-guided run:
python train.py \
--use_shaping \
--anchor_corner TR \
--teacher_mix_start 0.3 \
--teacher_mix_end 0.1 \
--teacher_mix_decay_steps 1000000 \
--warmup_env_steps 100000 \
--demo_path models/teacher_demos.npz \
--ckpt models/checkpoint.pt
Baseline-like run (no shaping, no demos):
python train.py --ckpt models/checkpoint.pt
Training outputs:
- checkpoint: models/checkpoint.pt
- model weights: models/qnet_2048_dqn.pt
python play.py --model models/qnet_2048_dqn.pt --delay 0.1 --seed 123
python serve_web.py --model models/qnet_2048_dqn.pt --host 127.0.0.1 --port 5000
Open:
http://127.0.0.1:5000
The UI supports:
- manual moves
- single model step
- autoplay
- live score and max tile
- per-action Q-values
Evaluation inside training logs includes:
- average score
- average max tile
- best tile
- illegal move rate
- corner adherence rate
You can trigger these metrics by running training with periodic eval enabled (eval_every in scripts/train.py).
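As an illustration of how such metrics could be aggregated over a batch of evaluation episodes, here is a small sketch. The per-episode record fields and the rate definitions (illegal moves and corner-adherent steps divided by total moves) are assumptions for the example, not the exact definitions in scripts/train.py.

```python
def summarize(episodes):
    """Aggregate per-episode records into the evaluation metrics listed above."""
    n = len(episodes)
    total_moves = sum(e["moves"] for e in episodes)
    return {
        "avg_score": sum(e["score"] for e in episodes) / n,
        "avg_max_tile": sum(e["max_tile"] for e in episodes) / n,
        "best_tile": max(e["max_tile"] for e in episodes),
        # Assumed definition: fraction of attempted moves that were illegal.
        "illegal_move_rate": sum(e["illegal_moves"] for e in episodes) / total_moves,
        # Assumed definition: fraction of steps with the max tile in the anchor corner.
        "corner_adherence_rate": sum(e["corner_steps"] for e in episodes) / total_moves,
    }


episodes = [
    {"score": 1000, "max_tile": 128, "moves": 100, "illegal_moves": 5, "corner_steps": 80},
    {"score": 3000, "max_tile": 512, "moves": 300, "illegal_moves": 3, "corner_steps": 270},
]
metrics = summarize(episodes)
print(metrics["avg_score"], metrics["best_tile"])  # 2000.0 512
```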
- Different devices (cpu/mps) and random seeds produce different outcomes.
- Teacher demos can improve early stability and reduce illegal-move loops.