Add Q-Learning AI to control Pac-Man autonomously #11
vck77 wants to merge 3 commits into greyblue9:master
Conversation
Replaces keyboard-driven Pac-Man with a tabular Q-Learning agent that
learns to navigate the maze, eat pellets, and avoid ghosts through
self-play.
New file – pacman/q_learning_ai.py:
• QLearningAgent class with epsilon-greedy policy, Q(s,a) update rule,
and JSON persistence (q_table.json survives between runs).
• 13-feature binary state: walls × 4, dangerous-ghost × 4, ghost
vulnerable flag, pellet visible × 4.
• Wall-aware action selection (never picks an immediately blocked move
during exploitation).
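As a rough illustration of what such an agent could look like, here is a minimal tabular Q-learning sketch. The class and method names mirror the PR description, but the hyperparameter defaults, the string state keys, and the simplified `blocked` parameter (standing in for the wall-aware `player`/`level_obj` lookup) are assumptions, not the PR's actual code.

```python
import json
import random


class QLearningAgent:
    """Tabular Q-learning agent (illustrative sketch, not the PR's code)."""

    ACTIONS = [0, 1, 2, 3]  # up, down, left, right (order assumed)

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=1.0,
                 epsilon_min=0.05, epsilon_decay=0.995,
                 qtable_path="q_table.json"):
        self.alpha, self.gamma = alpha, gamma
        self.epsilon = epsilon
        self.epsilon_min, self.epsilon_decay = epsilon_min, epsilon_decay
        self.qtable_path = qtable_path
        self.q_table = {}  # {state_key: [q_up, q_down, q_left, q_right]}

    def _key(self, state):
        # JSON object keys must be strings, so serialize the feature tuple
        return ",".join(map(str, state))

    def _q(self, state, action):
        return self.q_table.get(self._key(state), [0.0] * 4)[action]

    def choose_action(self, state, blocked=()):
        """Epsilon-greedy; never exploits an immediately blocked move."""
        legal = [a for a in self.ACTIONS if a not in blocked] or self.ACTIONS
        if random.random() < self.epsilon:
            return random.choice(legal)
        return max(legal, key=lambda a: self._q(state, a))

    def update(self, state, action, reward, next_state):
        """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        row = self.q_table.setdefault(self._key(state), [0.0] * 4)
        best_next = max(self.q_table.get(self._key(next_state), [0.0] * 4))
        row[action] += self.alpha * (reward + self.gamma * best_next - row[action])

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def save(self, path=None):
        # Persist the table so learning survives between runs
        with open(path or self.qtable_path, "w") as f:
            json.dump(self.q_table, f)
```

The 13-feature state described above would be a tuple of 13 binary values (4 wall flags, 4 dangerous-ghost flags, 1 vulnerable flag, 4 pellet-visible flags), which keeps the table small enough for JSON persistence.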
Changes to pacman/pacman.pyw:
• AI_ENABLED = True flag at the top (set False to play manually).
• AIStep() called every frame: makes a movement decision every
AI_DECISION_INTERVAL (8) frames, computes rewards from score deltas
(+10 pellet, +100 power pellet, etc.), applies -500 death penalty
and +1000 level-win bonus, auto-restarts on game over, saves the
Q-table every 5 episodes.
• DrawAIStats() overlays episode count, epsilon, state count, and
total steps in the top-left corner during play.
• Manual keyboard input is preserved when AI_ENABLED = False.
https://claude.ai/code/session_01EKGJKXQ5ahXkGuXTyAVZsA
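The reward scheme described above can be sketched as a score-delta computation with terminal adjustments. The function name and signature here are hypothetical; the actual logic lives inline in `AIStep()`.

```python
def compute_reward(prev_score, curr_score, died=False, won_level=False):
    """Reward = score delta (pellet +10, power pellet +100, ...),
    plus a -500 death penalty or +1000 level-win bonus (sketch)."""
    reward = curr_score - prev_score
    if died:
        reward -= 500
    if won_level:
        reward += 1000
    return reward
```

Deriving rewards from score deltas means any scoring event (pellets, ghosts, fruit) automatically feeds the agent without per-event wiring.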
Reviewer's Guide

Introduces a tabular Q-learning agent that can autonomously control Pac-Man, wires it into the main game loop behind an AI_ENABLED flag, adds per-mode reward handling and auto-restart, and overlays basic AI training stats while persisting the learned Q-table across runs.

Sequence diagram for AIStep Q-learning control loop and game integration:

sequenceDiagram
participant GameLoop
participant AIStep
participant QLearningAgent as Agent
participant Player
participant Level as LevelObj
participant Game as GameObj
GameLoop->>AIStep: AIStep()
alt mode == 1 (normal gameplay)
AIStep->>AIStep: ai_frame_counter += 1
alt ai_frame_counter >= AI_DECISION_INTERVAL
AIStep->>Agent: get_state(Player, Ghosts, LevelObj, GameObj)
Agent-->>AIStep: curr_state
alt Agent.prev_state is not None
AIStep->>Agent: update(prev_state, prev_action, reward, curr_state)
AIStep->>Agent: decay_epsilon()
end
AIStep->>Agent: choose_action(curr_state, Player, LevelObj)
Agent-->>AIStep: action
AIStep->>AIStep: compute dx, dy from ACTION_VELS[action] * Player.speed
AIStep->>LevelObj: CheckIfHitWall(Player.x+dx, Player.y+dy, nearestRow, nearestCol)
alt no wall hit
AIStep->>Player: set velX = dx, velY = dy
end
AIStep->>Agent: set prev_state, prev_action, prev_score
end
else mode == 2 (death)
alt ai_prev_mode == 1 and Agent.prev_state is not None
AIStep->>Agent: update(prev_state, prev_action, -500, terminal_state)
AIStep->>Agent: decay_epsilon()
AIStep->>Agent: prev_state = None
end
else mode == 3 (game over)
AIStep->>Agent: episode += 1
alt Agent.episode % 5 == 0
AIStep->>Agent: save(AI_QTABLE_PATH)
end
AIStep->>Agent: prev_state = None
AIStep->>GameObj: StartNewGame()
else mode == 6 (level complete)
alt ai_prev_mode == 1 and Agent.prev_state is not None
AIStep->>Agent: update(prev_state, prev_action, 1000, terminal_state)
AIStep->>Agent: prev_state = None
end
end
AIStep->>AIStep: ai_prev_mode = GameObj.mode
AIStep->>GameLoop: return
GameLoop->>GameLoop: update entities and render
alt AI_ENABLED
GameLoop->>Agent: DrawAIStats() via HUD overlay
end
alt ESC pressed
GameLoop->>Agent: save(AI_QTABLE_PATH)
GameLoop->>GameLoop: sys.exit(0)
end
Class diagram for the new QLearningAgent AI controller:

classDiagram
class QLearningAgent {
<<class>>
+float alpha
+float gamma
+float epsilon
+float epsilon_min
+float epsilon_decay
+string qtable_path
+dict q_table
+int episode
+float total_reward
+int steps
+tuple prev_state
+int prev_action
+int prev_score
+list ACTIONS
+dict ACTION_VELS
+dict ACTION_NAMES
+QLearningAgent __init__(float alpha, float gamma, float epsilon, float epsilon_min, float epsilon_decay, string qtable_path)
+float _q(tuple state, int action)
+void _set_q(tuple state, int action, float value)
+tuple get_state(object player, dict ghosts, object level_obj, object game_obj)
+int choose_action(tuple state, object player, object level_obj)
+void update(tuple state, int action, float reward, tuple next_state)
+void decay_epsilon()
+void save(string path)
+void load(string path)
}
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="pacman/pacman.pyw" line_range="1432-1433" />
<code_context>
+ # --- Mode 2: Pac-Man just died -- apply death penalty ---
+ elif thisGame.mode == 2:
+ if ai_prev_mode == 1 and ai_agent.prev_state is not None:
+ terminal = (0,) * 13
+ ai_agent.update(ai_agent.prev_state, ai_agent.prev_action, -500, terminal)
+ ai_agent.decay_epsilon()
+ ai_agent.prev_state = None
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider handling terminal transitions without bootstrapping off a dummy next state.
Using a synthetic `terminal = (0,) * 13` still allows the update to bootstrap from whatever Q-values get learned for that dummy state, which can distort the intended -500 terminal penalty. Instead, handle terminal transitions without a next-state value (e.g., `new_q = current + alpha * (reward - current)` with no `gamma * max Q(s')`, or by allowing `next_state=None` and skipping the `best_next` term) so the terminal reward isn’t coupled to an arbitrary placeholder state.
Suggested implementation:
```
# --- Mode 2: Pac-Man just died -- apply death penalty ---
elif thisGame.mode == 2:
if ai_prev_mode == 1 and ai_agent.prev_state is not None:
# Terminal transition: no next state, so apply pure terminal penalty
ai_agent.update(ai_agent.prev_state, ai_agent.prev_action, -500, None)
ai_agent.decay_epsilon()
ai_agent.prev_state = None
```
To fully implement the suggested behavior, you should also adjust the `ai_agent.update` method (likely in the AI agent class) so that:
1. Its signature allows `next_state` to be `None`.
2. When `next_state is None`, it performs a non-bootstrapping terminal update, e.g.:
- `new_q = current_q + alpha * (reward - current_q)`
- i.e., do **not** add `gamma * max_a' Q(next_state, a')` in this branch.
3. When `next_state` is not `None`, keep the existing Q-learning update with the bootstrap term.
This ensures the `-500` death penalty is not coupled to any arbitrary placeholder state.
</issue_to_address>
Pressing + speeds up the AI training simulation by running multiple game-update iterations per rendered frame (1x/2x/4x/8x/16x). Pressing - slows it back down. Current speed is shown in the AI HUD overlay. https://claude.ai/code/session_01EKGJKXQ5ahXkGuXTyAVZsA
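The speed control described in this commit can be sketched as a multiplier on logic updates per rendered frame. The function names and globals below are illustrative, not the actual `pacman.pyw` code.

```python
# Illustrative sketch of the +/- simulation-speed control.
SPEED_LEVELS = [1, 2, 4, 8, 16]
speed_index = 0


def handle_speed_keys(key):
    """Step the speed multiplier up on '+' and down on '-'."""
    global speed_index
    if key == "+":
        speed_index = min(speed_index + 1, len(SPEED_LEVELS) - 1)
    elif key == "-":
        speed_index = max(speed_index - 1, 0)
    return SPEED_LEVELS[speed_index]


def game_loop_tick(update_fn, render_fn):
    # Run several logic updates per rendered frame at higher speeds,
    # so rendering cost doesn't bottleneck training.
    for _ in range(SPEED_LEVELS[speed_index]):
        update_fn()
    render_fn()
```

Rendering once per tick regardless of speed is what makes 16x training cheap: the expensive draw calls stay at the display frame rate while game logic (and Q-table updates) run many times faster.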
FollowNextPathWay() recursed after finding a new path, but if FindPath returned an empty string (ghost already at destination), it would recurse infinitely. Guard both recursive calls with `if self.currentPath:` so they only fire when the new path is non-empty. This was latent in the original code but became reliably triggered at 2x+ sim speed. https://claude.ai/code/session_01EKGJKXQ5ahXkGuXTyAVZsA
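A minimal illustration of the guard (this is a stand-in class, not the game's actual pathfinding code; `_find_path` stands in for `FindPath`):

```python
class Ghost:
    def __init__(self, find_path):
        self.currentPath = ""
        self._find_path = find_path  # stand-in for FindPath

    def FollowNextPathWay(self):
        if not self.currentPath:
            # FindPath may return "" when the ghost is already at its
            # destination; without the guard this recursed forever.
            self.currentPath = self._find_path()
            if self.currentPath:  # guard: only recurse on a non-empty path
                self.FollowNextPathWay()
            return
        # consume the next step of the current path
        self.currentPath = self.currentPath[1:]
```

At 1x speed the ghost rarely sat exactly on its destination when a new path was requested, which is why the bug stayed latent until higher simulation speeds made the coincidence frequent.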
Summary by Sourcery
Integrate a tabular Q-learning agent to autonomously control Pac-Man, with optional keyboard control preserved via a feature flag.