
MuZero

This project is a Python implementation of the MuZero algorithm created by Google DeepMind, designed to be used with Gymnasium (formerly OpenAI Gym) environments. It is built to be modular and extensible, allowing for easy integration with different Gymnasium environments and neural network configurations.


Check out our presentation video below for an in-depth overview:

Watch MuZero Presentation

Our MuZero in action, demonstrating performance on the Car Racing environment from Gymnasium:

Watch Results

Overview

MuZero is an advanced model-based reinforcement learning algorithm that jointly learns a dynamics model, a value function, and a policy through self-play training. Unlike earlier systems such as AlphaZero, MuZero does not require explicit knowledge of the true environment dynamics. Instead, the model receives the last 32 frames as input and learns its own abstract internal representation of the environment. Monte Carlo Tree Search (MCTS) can then plan ahead entirely in this latent space, enabling sophisticated planning and superhuman performance across a range of complex environments.
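
As a rough illustration (not this repository's actual architecture), the sketch below shows the three learned functions MuZero relies on: a representation network h, a dynamics network g, and a prediction network f. It uses PyTorch with illustrative names and sizes; during MCTS, recurrent_inference is what lets the search step forward in latent space without touching the real environment.

```python
import torch
import torch.nn as nn


class MuZeroNets(nn.Module):
    """Toy representation (h), dynamics (g) and prediction (f) networks."""

    def __init__(self, obs_dim: int, num_actions: int, latent_dim: int = 64):
        super().__init__()
        # h: stacked past observations -> abstract latent state
        self.representation = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        # g: (latent state, action) -> next latent state, plus a predicted reward
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + num_actions, latent_dim), nn.ReLU())
        self.reward_head = nn.Linear(latent_dim, 1)
        # f: latent state -> policy logits and value estimate
        self.policy_head = nn.Linear(latent_dim, num_actions)
        self.value_head = nn.Linear(latent_dim, 1)

    def initial_inference(self, stacked_obs: torch.Tensor):
        """Encode real observations once, at the root of the search tree."""
        state = self.representation(stacked_obs)
        return state, self.policy_head(state), self.value_head(state)

    def recurrent_inference(self, state: torch.Tensor, action_onehot: torch.Tensor):
        """Step the latent state forward for one imagined action during MCTS."""
        next_state = self.dynamics(torch.cat([state, action_onehot], dim=-1))
        return (next_state, self.reward_head(next_state),
                self.policy_head(next_state), self.value_head(next_state))
```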

Because the model does not require the true environment dynamics, it can be applied to a much wider range of problems. This is especially useful in real-world applications, where the environment is almost never fully known or is too complex to model explicitly. However, the lack of an explicit environment model makes training much harder: the model has to learn the dynamics of the environment from scratch, and at the start it essentially garbles the signals it receives from the environment.

To improve sampling efficiency, we train the model preferentially on the data it finds most surprising and on the episodes where it performed best. This is done with Prioritized Experience Replay, which lets the model focus on the most important experiences and learn from them more effectively. The model is trained on a large number of episodes and learns to prioritize the experiences with the greatest impact on the learning process.
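
As a rough sketch of the idea (names and hyperparameters here are illustrative, not this repository's API), a prioritized buffer stores a priority alongside each experience, samples in proportion to that priority, and corrects for the skewed sampling with importance-sampling weights:

```python
import numpy as np


class PrioritizedReplayBuffer:
    def __init__(self, capacity: int, alpha: float = 0.6):
        self.capacity = capacity
        self.alpha = alpha                    # how strongly priorities skew sampling
        self.buffer = []                      # stored experiences / game histories
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.pos = 0

    def add(self, experience, priority: float):
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            self.buffer[self.pos] = experience    # overwrite oldest entry
        self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size: int, beta: float = 0.4):
        prios = self.priorities[: len(self.buffer)] ** self.alpha
        probs = prios / prios.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct for the non-uniform sampling.
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, new_priorities):
        # Called after training so surprising experiences stay likely to be replayed.
        self.priorities[idx] = new_priorities
```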

Distributed Prioritized Experience Replay

Because the model needs a lot of training data, we scaled up training with Distributed Prioritized Experience Replay: several instances of the program continuously generate training data in parallel. Each instance runs its own environment and collects experiences, which are stored in a shared experience replay buffer. A parameter server manages the model parameters and synchronizes them across all instances, while a trainer samples from the shared replay buffer and updates the model parameters based on the sampled experiences. A conceptual sketch of this layout follows below.
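
The sketch below illustrates the layout with Python's multiprocessing module. It is only an assumption for illustration; the repository may coordinate its processes differently, and generate_game is a toy placeholder for a self-play episode. In the full system the finished games would land in the prioritized replay buffer sketched above, and the trainer would sample from it rather than from a plain queue.

```python
import multiprocessing as mp
import random
import time


def generate_game(weights_version: int):
    # Placeholder for one self-play episode played with the current weights.
    time.sleep(0.01)
    return {"weights_version": weights_version, "return": random.random()}


def self_play_worker(experience_queue, shared_weights):
    # Each worker runs its own environment and keeps pushing finished games
    # into the shared experience queue.
    while True:
        game = generate_game(shared_weights["version"])
        experience_queue.put(game)


def trainer(experience_queue, shared_weights, num_updates: int = 100):
    # The trainer consumes games and bumps the weight version so workers
    # pick up the new parameters (stands in for a real gradient step).
    for _ in range(num_updates):
        experience_queue.get()
        shared_weights["version"] += 1


if __name__ == "__main__":
    manager = mp.Manager()
    shared_weights = manager.dict({"version": 0})   # tiny stand-in for a parameter server
    experience_queue = mp.Queue()                   # stand-in for the shared replay buffer
    workers = [
        mp.Process(target=self_play_worker, args=(experience_queue, shared_weights), daemon=True)
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    trainer(experience_queue, shared_weights)
    print("final weights version:", shared_weights["version"])
```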

Prerequisites

Usage

To start, run the following command in the root directory of the project:

python main.py

Testing

To run the tests, run the following command in the root directory of the project:

pytest

To run the tests with coverage, run the following command in the root directory of the project:

coverage run -m pytest

To see the coverage report, run the following command in the root directory of the project:

coverage html -i

📖 Documentation

Contributors

Kristoffer Nohr Olaisen

Olav Selnes Lorentzen

Simon Sandvik Lee

Sverre Nystad
