Competitive self-play with Unity ML-Agents
October 22, 2021 • Joy Zhang • Tutorial • 6 minutes
Competitive self-play involves training an agent against itself. It was used in famous systems such as AlphaGo and OpenAI Five (Dota 2). By playing against increasingly stronger versions of themselves, agents can discover new and better strategies.
In this post, we walk through using competitive self-play in Unity ML-Agents to train agents to play volleyball. This article is also part 5 of the series 'A hands-on introduction to deep reinforcement learning using Unity ML-Agents'.
We previously trained agents using PPO with the following setup:
This resulted in agents that were able to successfully volley the ball back-and-forth after ~20M training steps:
You can see that the agents make 'easy' passes by aiming the ball towards the centre of the court. This is because we set the reward function to incentivize keeping the ball in play.
Our aim now is to train competitive agents that are rewarded for winning (i.e. landing the ball in the opponent's court). We expect this will lead to agents that learn interesting strategies and make passes that are harder to return.
To follow along with this section, you will need:
Set the TeamId of one of the agents to 1 (the actual value doesn't matter, as long as the PurpleAgent and BlueAgent have different Team IDs):
Our previous reward function was +1 for hitting the ball over the net.
For self-play, we'll switch to +1 for winning and -1 for losing.
Open VolleyballEnvController.cs and add the rewards to the Event.HitBlueGoal and Event.HitPurpleGoal cases:
    case Event.HitBlueGoal:
        // blue wins
        blueAgent.AddReward(1f);
        purpleAgent.AddReward(-1f);

        // turn floor blue
        StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.blueGoalMaterial, RenderersList, .5f));

        // end episode
        blueAgent.EndEpisode();
        purpleAgent.EndEpisode();
        ResetScene();
        break;

    case Event.HitPurpleGoal:
        // purple wins
        purpleAgent.AddReward(1f);
        blueAgent.AddReward(-1f);

        // turn floor purple
        StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.purpleGoalMaterial, RenderersList, .5f));

        // end episode
        blueAgent.EndEpisode();
        purpleAgent.EndEpisode();
        ResetScene();
        break;
Remove AddReward from the other cases (such as case Event.HitOutOfBounds). In my experience, this may mean the agents take longer to learn to hit the ball.
Create a new .yaml file and copy in the following:
    behaviors:
      Volleyball:
        trainer_type: ppo
        hyperparameters:
          batch_size: 2048
          buffer_size: 20480
          learning_rate: 0.0002
          beta: 0.003
          epsilon: 0.15
          lambd: 0.93
          num_epoch: 4
          learning_rate_schedule: constant
        network_settings:
          normalize: true
          hidden_units: 256
          num_layers: 2
          vis_encode_type: simple
        reward_signals:
          extrinsic:
            gamma: 0.96
            strength: 1.0
        keep_checkpoints: 5
        max_steps: 80000000
        time_horizon: 1000
        summary_freq: 20000
        self_play:
          window: 10
          play_against_latest_model_ratio: 0.5
          save_steps: 20000
          swap_steps: 10000
          team_change: 100000
During self-play, one of the agents will be set as the learning agent and the other as the fixed policy opponent.
Every save_steps=20000 steps, a snapshot of the learning agent's current policy is taken. Up to window=10 snapshots are stored. When a new snapshot is taken, the oldest one is discarded. These past versions of itself become the 'opponents' that the learning agent trains against.
Every swap_steps=10000 steps, the opponent's policy is swapped with a different snapshot. With a probability of play_against_latest_model_ratio=0.5, the opponent is the latest policy (i.e. the strongest opponent); otherwise, it is one of the stored snapshots. This helps to prevent overfitting to a single opponent playstyle.
Every team_change=100000 steps, the learning agent and opponent teams are switched.
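The snapshot mechanics described above can be sketched in a few lines of Python. This is a toy model for intuition only, not ML-Agents code: the `SnapshotPool` class and `simulate` function are hypothetical names, a "policy" is represented simply by the step count at which it was captured, and picking uniformly among the stored snapshots is an assumption.

```python
import math
import random
from collections import deque


class SnapshotPool:
    """Toy model of the snapshot window (hypothetical, not the ML-Agents API)."""

    def __init__(self, window, play_against_latest_model_ratio, seed=None):
        # A deque with maxlen drops the oldest snapshot when a new one arrives.
        self.snapshots = deque(maxlen=window)
        self.latest_ratio = play_against_latest_model_ratio
        self.rng = random.Random(seed)

    def save(self, policy):
        """Store a snapshot of the learning agent's current policy."""
        self.snapshots.append(policy)

    def sample_opponent(self, current_policy):
        """Choose the opponent for the next swap interval."""
        # With probability latest_ratio (or if nothing is stored yet),
        # play against the latest policy; otherwise pick a stored snapshot.
        if not self.snapshots or self.rng.random() < self.latest_ratio:
            return current_policy
        return self.rng.choice(list(self.snapshots))


def simulate(total_steps, save_steps=20000, swap_steps=10000,
             window=10, ratio=0.5, seed=0):
    """Walk through training steps, saving snapshots and swapping opponents."""
    pool = SnapshotPool(window, ratio, seed)
    opponents = []
    stride = math.gcd(save_steps, swap_steps)
    for step in range(stride, total_steps + 1, stride):
        current = step  # stand-in for "the policy as of this step"
        if step % save_steps == 0:
            pool.save(current)
        if step % swap_steps == 0:
            opponents.append(pool.sample_opponent(current))
    return pool, opponents


pool, opponents = simulate(400000)
print("stored snapshots:", list(pool.snapshots))
print("opponents faced:", opponents)
```

Raising `ratio` towards 1.0 makes the agent face its latest self almost all the time, while lowering it spreads play across older snapshots.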
Feel free to play around with these default hyperparameters (more information available in the official ML-Agents documentation).
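Before changing the values, it can help to work out what schedule they imply. The short sketch below does that arithmetic for the config above. The `schedule_summary` helper is a hypothetical name, and it treats every step count as the same unit, which is a simplification.

```python
# Values copied from the self_play section of the config above.
self_play = {
    "window": 10,
    "save_steps": 20000,
    "swap_steps": 10000,
    "team_change": 100000,
}
max_steps = 80_000_000


def schedule_summary(max_steps, cfg):
    """Derive rough schedule figures from the self-play settings."""
    return {
        # How many snapshots get taken over the whole run.
        "snapshots_taken": max_steps // cfg["save_steps"],
        # How far back in training the stored window reaches.
        "window_span_steps": cfg["window"] * cfg["save_steps"],
        # How many opponent swaps happen between team switches.
        "swaps_per_team_change": cfg["team_change"] // cfg["swap_steps"],
        # How many times the learning team switches over the run.
        "team_changes": max_steps // cfg["team_change"],
    }


for name, value in schedule_summary(max_steps, self_play).items():
    print(f"{name}: {value}")
```

With these numbers, the window only ever covers the most recent 200,000 steps of training, so a larger `window` or `save_steps` keeps older, weaker opponents in the mix for longer.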
Training with self-play in ML-Agents is done the same way as any other form of training:
mlagents-learn <path to config file> --run-id=VB_1 --time-scale=1
Run tensorboard --logdir results from your working directory to observe the training process.
In a stable training run, you should see the ELO gradually increase.
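To get a feel for why the ELO curve rises as the agent improves, here is a minimal sketch of the standard Elo rating update. The starting rating of 1200 and the K-factor of 16 are illustrative assumptions, and this is not claimed to match how ML-Agents computes the value internally.

```python
def expected_score(rating, opponent_rating):
    """Expected result (between 0 and 1) under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_elo(rating, opponent_rating, result, k=16.0):
    """Return the new rating; result is 1.0 for a win, 0.5 draw, 0.0 loss.

    k is an assumed K-factor chosen for illustration.
    """
    return rating + k * (result - expected_score(rating, opponent_rating))


# An agent that keeps beating an equally rated snapshot gains rating,
# but each win is worth less as the rating gap grows.
rating, snapshot_rating = 1200.0, 1200.0
for game in range(5):
    rating = update_elo(rating, snapshot_rating, result=1.0)
    print(f"after win {game + 1}: {rating:.1f}")
```

The rating only keeps climbing if the learning agent wins more often than the model expects, which is why a steadily rising curve suggests it is beating its past snapshots.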
In the diagram below, the three inflexion points correspond to the agent:
Compared to our previous training results, I found that even after ~80M steps, the agents trained using self-play don't serve or return the ball as reliably. However, they do learn to hit some interesting shots, like hitting the ball towards the edge of the court:
If you discover any other interesting playstyles, let me know on Twitter!
Thanks for reading! I hope you found this series useful.
Feel free to post any questions or feedback on the Ultimate Volleyball Repo.
If you're looking for further resources on reinforcement learning, check out: