A Python implementation of the Q-Learning reinforcement learning algorithm, with optional Dyna-Q model-based planning, for robot navigation in grid-world environments.
This project provides a flexible Q-Learning agent that can learn optimal navigation policies through trial and error. The agent builds a Q-table mapping state-action pairs to expected rewards, enabling it to navigate through grid environments while avoiding obstacles and reaching goals efficiently.
- Standard Q-Learning: Classic temporal difference learning algorithm
- Epsilon-Greedy Exploration: Balances exploration vs exploitation
- Dyna-Q Planning: Model-based learning for significantly faster convergence
- Configurable Parameters: Easily tune learning rate, discount factor, and exploration
- Grid-World Navigation: Designed for discrete state-action environments
- Policy Extraction: Methods to visualize and analyze learned behavior
- Python 3.6+
- NumPy
```
pip install numpy
```

Simply download or clone the repository:

```
git clone <repository-url>
cd qlearning-navigation
```
```python
from QLearner import QLearner

# Create a Q-Learning agent
learner = QLearner(
    num_states=100,   # 10x10 grid world
    num_actions=4,    # N, E, S, W movements
    alpha=0.2,        # Learning rate
    gamma=0.9,        # Discount factor
    rar=0.5,          # Initial exploration rate
    radr=0.99,        # Exploration decay
    dyna=200,         # Dyna-Q planning steps
    verbose=False
)
```
```python
# Initialize the episode
action = learner.querysetstate(start_state)

# Learning loop
for step in range(max_steps):
    # Environment determines the next state and reward for the chosen action
    next_state, reward = environment.step(action)

    # Agent learns from the transition and selects the next action
    action = learner.query(next_state, reward)

    if environment.is_terminal(next_state):
        break
```

States are encoded as integers using the formula:
```
state = row * 10 + column
```
For a 10x10 grid, valid states range from 0 to 99.
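For example, a small helper for converting between grid coordinates and state indices could look like the sketch below (the `to_state` and `to_coords` names are illustrative, not part of the QLearner API):

```python
GRID_SIZE = 10  # assuming the 10x10 grid described above

def to_state(row, col):
    """Encode a (row, col) grid position as a single integer state."""
    return row * GRID_SIZE + col

def to_coords(state):
    """Decode an integer state back into a (row, col) position."""
    return divmod(state, GRID_SIZE)

assert to_state(3, 7) == 37
assert to_coords(37) == (3, 7)
```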
Four discrete actions are available:
- 0: North (move up)
- 1: East (move right)
- 2: South (move down)
- 3: West (move left)
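As a sketch of how an environment might interpret these action codes, the mapping below applies each action to a grid position and clips at the border (the `apply_action` helper is illustrative, not part of this project):

```python
GRID_SIZE = 10  # assumed grid dimensions

# (row offset, column offset) for actions 0-3
ACTION_DELTAS = {
    0: (-1, 0),  # North: move up
    1: (0, 1),   # East: move right
    2: (1, 0),   # South: move down
    3: (0, -1),  # West: move left
}

def apply_action(row, col, action):
    """Return the new (row, col) after an action, staying inside the grid."""
    dr, dc = ACTION_DELTAS[action]
    return (min(max(row + dr, 0), GRID_SIZE - 1),
            min(max(col + dc, 0), GRID_SIZE - 1))
```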
The agent updates Q-values using the temporal difference learning rule:
Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
Where:
- α (alpha): Learning rate
- γ (gamma): Discount factor
- r: Immediate reward
- s, a: Current state and action
- s': Next state
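In code, the update amounts to a one-line adjustment of a Q-table entry. The snippet below is a standalone sketch of that rule, not the exact internals of QLearner.py:

```python
import numpy as np

# Standalone sketch of the temporal difference update
num_states, num_actions = 100, 4
alpha, gamma = 0.2, 0.9
Q = np.zeros((num_states, num_actions))

def td_update(s, a, r, s_prime):
    """Nudge Q(s, a) toward the TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_prime])
    Q[s, a] += alpha * (td_target - Q[s, a])
```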
When enabled (dyna > 0), the agent:
- Learns from real experience: Updates Q-table with actual environment interactions
- Builds an internal model: Stores transition and reward information
- Simulates experiences: Samples from memory to generate additional training data
- Plans ahead: Updates Q-table with simulated experiences for faster learning
This hybrid approach combines the sample efficiency of model-based methods with the simplicity of model-free Q-learning.
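A minimal sketch of this learn-then-plan cycle, assuming a deterministic model keyed by (state, action), is shown below; the real QLearner may store and sample its model differently:

```python
import random
import numpy as np

# Illustrative Dyna-Q sketch, not the exact QLearner implementation
num_states, num_actions = 100, 4
alpha, gamma, dyna = 0.2, 0.9, 200
Q = np.zeros((num_states, num_actions))
model = {}  # (s, a) -> (s_prime, r) observed transitions

def learn(s, a, r, s_prime):
    # 1. Learn from real experience
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
    # 2. Update the internal model
    model[(s, a)] = (s_prime, r)
    # 3-4. Plan: replay simulated experiences sampled from the model
    for _ in range(dyna):
        (ps, pa), (p_next, pr) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[p_next]) - Q[ps, pa])
```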
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_states | int | 100 | Total number of states in the environment |
| num_actions | int | 4 | Number of available actions |
| alpha | float | 0.2 | Learning rate (0.0-1.0) |
| gamma | float | 0.9 | Discount factor for future rewards (0.0-1.0) |
| rar | float | 0.5 | Random action rate (exploration probability) |
| radr | float | 0.99 | Random action decay rate per step |
| dyna | int | 0 | Number of planning steps per real experience |
| verbose | bool | False | Enable debug output |
- Higher alpha (e.g., 0.3-0.5): Faster learning but less stable
- Lower alpha (e.g., 0.1-0.2): More stable but slower convergence
- Higher gamma (e.g., 0.95-0.99): Values long-term rewards more
- Lower gamma (e.g., 0.8-0.9): Focuses on immediate rewards
- Dyna = 0: Pure model-free Q-learning
- Dyna = 50-200: Good balance for most problems
- Dyna > 200: Faster learning but higher computational cost
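For instance, a pure model-free learner and a Dyna-Q learner differ only in the dyna argument; the values below simply reuse the defaults listed above:

```python
from QLearner import QLearner

# Pure model-free Q-learning: cheapest per step, slowest to converge
model_free = QLearner(num_states=100, num_actions=4, alpha=0.2, gamma=0.9,
                      rar=0.5, radr=0.99, dyna=0)

# Dyna-Q with 200 planning steps: more work per step, fewer real steps needed
dyna_q = QLearner(num_states=100, num_actions=4, alpha=0.2, gamma=0.9,
                  rar=0.5, radr=0.99, dyna=200)
```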
querysetstate(s): Initialize the agent at the start of an episode.
- Args: s - Starting state
- Returns: Initial action
query(s_prime, r): Main learning method, called after each action.
- Args:
  - s_prime - New state after the action
  - r - Reward received
- Returns: Next action to take
The learner also provides helper methods to:
- Return a copy of the Q-table for analysis
- Extract the greedy policy (best action per state)
- Return the state value function V(s) = max_a Q(s,a)
- Reset the exploration rate (useful between episodes)
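Putting the pieces together, a multi-episode training loop might look like the sketch below. The environment interface and the reset_exploration/get_policy names are placeholders standing in for the helper methods described above, not guaranteed parts of this API:

```python
from QLearner import QLearner

def train(environment, episodes=500, max_steps=1000):
    """Hypothetical multi-episode training loop. `environment` is assumed to
    provide reset(), step(action), and is_terminal(state); the
    reset_exploration() and get_policy() names are illustrative."""
    learner = QLearner(num_states=100, num_actions=4, dyna=200)
    for episode in range(episodes):
        state = environment.reset()
        action = learner.querysetstate(state)
        for _ in range(max_steps):
            next_state, reward = environment.step(action)
            action = learner.query(next_state, reward)
            if environment.is_terminal(next_state):
                break
        learner.reset_exploration()  # hypothetical: restore rar between episodes
    return learner.get_policy()      # hypothetical: greedy action per state
```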
```
qlearning-navigation/
├── QLearner.py   # Main Q-Learning implementation
├── README.md     # This file
└── LICENSE       # MIT License
```
- Memory: O(num_states × num_actions) for Q-table
- With Dyna: Additional O(num_states × num_actions) for model storage
- Time per step: O(1) for Q-learning, O(dyna) for planning
- Convergence: Typically faster with Dyna enabled (50-200 steps recommended)
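For the default 100-state, 4-action configuration these footprints are tiny; a quick back-of-the-envelope check:

```python
import numpy as np

# Q-table: num_states x num_actions float64 entries
q_table = np.zeros((100, 4))
print(q_table.nbytes)  # 100 * 4 * 8 = 3200 bytes
```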
MIT License - see LICENSE file for details
Timothy Bradford
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.)
- Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292
- Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning
Contributions are welcome! Please feel free to submit a Pull Request.
- Support for stochastic environments
- Double Q-Learning implementation
- Function approximation for large state spaces
- Visualization tools for policy and value functions
- Priority sweeping for Dyna-Q
- Experience replay buffer