This repository contains the code implementing NIPS paper #3584. To set up the environment:
```
conda create -n imac python=3.6
conda activate imac
pip install tensorflow==1.12.0
conda install mkl_fft=1.0.10
pip install -r requirements.txt
```
- Known dependencies: Python (3.6.8), OpenAI gym (0.9.4), tensorflow (1.12.0), numpy (1.16.2)
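As a quick sanity check (not part of the original instructions), you can confirm that the pinned versions were installed and import cleanly:

```
python -c "import tensorflow as tf, numpy as np, gym; print(tf.__version__, np.__version__)"
# expected output: 1.12.0 1.16.2
```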
To run the code, `cd` into the `experiments` directory and run `train.py`:
```
python train.py --scenario simple_spread --exp-name debug --save-dir ./result_test/debug --batch-size 1024 --ibmac_com --trainer ibmac
```
You can use TensorBoard to visualize the results.
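For example, assuming the run above writes its event files under the directory given by `--save-dir` (adjust `--logdir` if your logs end up elsewhere):

```
tensorboard --logdir ./result_test/debug
```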
Command-line options:

Environment options:
- `--scenario`: defines which environment in the MPE is to be used (default: `"simple_spread"`)
- `--max-episode-len`: maximum length of each episode for the environment (default: `25`)
- `--num-episodes`: total number of training episodes (default: `60000`)
- `--num-adversaries`: number of adversaries in the environment (default: `0`)
- `--good-policy`: algorithm used for the 'good' (non-adversary) policies in the environment (default: `"maddpg"`; options: {`"maddpg"`, `"ddpg"`})
- `--adv-policy`: algorithm used for the adversary policies in the environment (default: `"maddpg"`; options: {`"maddpg"`, `"ddpg"`})
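For example, a run that sets the environment options explicitly (an illustrative command; the experiment name is a placeholder):

```
python train.py --scenario simple_spread --max-episode-len 25 --num-episodes 60000 --trainer ibmac --ibmac_com --exp-name env_demo
```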
Core training parameters:
- `--trainer`: training algorithm (default: `"ibmac"`; options: {`"ibmac"`, `"ibmac_inter"`}); `ibmac` trains the scheduler, while `ibmac_inter` trains the policy and message outputs
- `--lr`: learning rate (default: `1e-2`)
- `--gamma`: discount factor (default: `0.95`)
- `--batch-size`: batch size (default: `1024`)
- `--num-units`: number of units in the MLP (default: `64`)
- `--beta`: coefficient of the KL loss (default: `0.05`)
- `--ibmac_com`: boolean flag that enables communication (default: `False`)
- `--random-seed`: random seed (default: `42`)
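For example, to override the core training parameters (illustrative values; the experiment name is a placeholder):

```
python train.py --scenario simple_spread --trainer ibmac --ibmac_com --lr 1e-2 --gamma 0.95 --batch-size 1024 --beta 0.05 --random-seed 42 --exp-name params_demo
```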
Checkpointing:
- `--exp-name`: name of the experiment, used as the file name to save all results (default: `None`)
- `--save-dir`: directory where intermediate training results and model will be saved (default: `"/tmp/policy/"`)
- `--save-rate`: model is saved every time this number of episodes has been completed (default: `1000`)
- `--load-dir`: directory where training state and model are loaded from (default: `""`)
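For example, to checkpoint a run under a custom directory (the paths here are placeholders):

```
python train.py --scenario simple_spread --trainer ibmac --ibmac_com --exp-name run1 --save-dir ./result_test/run1 --save-rate 1000
```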
Evaluation:
- `--restore`: restores previous training state stored in `load-dir` (or in `save-dir` if no `load-dir` has been provided), and continues training (default: `False`)
- `--display`: displays to the screen the trained policy stored in `load-dir` (or in `save-dir` if no `load-dir` has been provided), but does not continue training (default: `False`)
- `--benchmark`: runs benchmarking evaluations on a saved policy, saving results to the `benchmark-dir` folder (default: `False`)
- `--benchmark-iters`: number of iterations to run benchmarking for (default: `100000`)
- `--benchmark-dir`: directory where benchmarking data is saved (default: `"./benchmark_files/"`)
- `--plots-dir`: directory where training curves are saved (default: `"./learning_curves/"`)
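For example, to resume training from a saved run, or to watch a trained policy without further training (assuming a model was previously saved to `./result_test/run1`):

```
python train.py --scenario simple_spread --trainer ibmac --ibmac_com --load-dir ./result_test/run1 --restore
python train.py --scenario simple_spread --trainer ibmac --ibmac_com --load-dir ./result_test/run1 --display
```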
Our code is based on the version in:
```
@article{lowe2017multi,
  title={Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments},
  author={Lowe, Ryan and Wu, Yi and Tamar, Aviv and Harb, Jean and Abbeel, Pieter and Mordatch, Igor},
  journal={Neural Information Processing Systems (NIPS)},
  year={2017}
}
```
We slightly modified the environment's `act_space` setting, so there will be some differences in the final reward output if you directly install the original version of the environment.

We also add a new scenario: `simple_spread_partially_observed`. The number of agents (`num_agents`) can be modified to run with more agents.
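For instance, to train on the new scenario (an illustrative command; the experiment name and save directory are placeholders):

```
python train.py --scenario simple_spread_partially_observed --trainer ibmac --ibmac_com --exp-name po_debug --save-dir ./result_test/po_debug --batch-size 1024
```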