The core RL task involves multiple agents (represented as colored circles with arrowheads) navigating from their starting positions to their respective goals while avoiding collisions with other agents and static obstacles.
To train them, we use a CTDE (Centralized Training, Decentralized Execution) approach: specifically MAPPO (Multi-Agent Proximal Policy Optimization), a variant of PPO inspired by MADDPG.
📺 Watch the Video for free
Training RL Agents to Navigate as a Team
If you find this content helpful, please consider supporting my work on Patreon. Your support helps me create more in-depth tutorials and content.
Patreon members also get access to a code walkthrough of this repo.
- Python 3.10+ (required)
- `uv` (recommended) or `pip` for package management
- To render movies and videos, install `ffmpeg`
- Clone the repository:

  ```bash
  git clone https://github.com/avbiswas/navigation-mappo-rl.git
  ```

- Install dependencies:

  ```bash
  # Using uv
  uv sync
  ```

- Train a new model:

  ```bash
  uv run train_mappo.py model_id configs/config.yaml
  ```

  For example:

  ```bash
  uv run train_mappo.py model_1 configs/basic_env.yaml
  ```

  This will create a new model inside `models/model_1`. Both the latest model and the all-time best model are saved.

- Run inference:

  ```bash
  uv run inference.py models/model_1
  ```

  By default, this will test on the environment config file the model was originally trained on, but you can also test on a new config:

  ```bash
  uv run inference.py models/model_1 configs/bottleneck.yaml
  ```
A variety of pre-configured environments are provided for your experiments in the configs/ directory.
You can define your own environments by creating a new .yaml file in the configs/ directory. The configuration schema is defined in nav/config_models.py.
Key components of a configuration file:
- Boundary: Defines the polygon vertices for the playable area.
- Agents: Specifies start/goal zones (rectangles), physical properties (radius, max speed), and sensor settings (FOV, range).
- Obstacles: Defines static or moving obstacles (rectangles or circles).
For reference, check configs/basic_env.yaml for a simple setup or configs/moving_env.yaml for dynamic obstacles.
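As a rough illustration, here is what a minimal custom environment might look like, sketched as a Python dict that you could serialize to YAML. All field names and value layouts below are assumptions for illustration only; the authoritative schema is in nav/config_models.py.

```python
# Hypothetical config sketch: field names and layouts are assumed, not the
# repo's actual schema (see nav/config_models.py for the real definitions).
import yaml

my_env = {
    "boundary": [[0, 0], [20, 0], [20, 20], [0, 20]],  # polygon vertices of the playable area
    "agents": [
        {
            "start_zone": [1, 1, 3, 3],    # rectangle: x_min, y_min, x_max, y_max (assumed format)
            "goal_zone": [17, 17, 19, 19],
            "radius": 0.3,
            "max_speed": 1.5,
            "fov": 360,                    # sensor field of view in degrees (assumed)
            "sensor_range": 5.0,
        }
    ],
    "obstacles": [
        {"type": "rectangle", "bounds": [8, 8, 12, 12], "moving": False},
        {"type": "circle", "center": [5, 15], "radius": 1.0, "moving": False},
    ],
}

with open("configs/my_env.yaml", "w") as f:
    yaml.safe_dump(my_env, f)
```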
The training configuration and hyperparameters are defined in train_mappo.py.
Key settings include:
- History Length: `history_length = 4` (number of past frames stacked).
- Batch Size: `batch_size=128` (number of samples per update).
- Learning Rate: `learning_rate=5e-4`.
- Inference Interval: `inference_interval=5` (how often to run evaluation episodes).
- Network Architecture: The policy network uses an `ObservationEncoder`, which processes LIDAR data and agent states, outputting a feature vector of size 384.
To modify these, edit the MAPPO initialization in train_mappo.py.
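For instance, a hedged sketch of that edit might look like the following; the keyword names mirror the settings listed above, but the actual constructor signature lives in rl/mappo.py and may differ.

```python
# Sketch only: these keyword names are assumptions about the real
# MAPPO constructor defined in rl/mappo.py.
from rl.mappo import MAPPO

mappo = MAPPO(
    history_length=4,      # past frames stacked into each observation
    batch_size=128,        # samples per gradient update
    learning_rate=5e-4,
    inference_interval=5,  # run evaluation episodes every 5 iterations
)
```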
Check out nav/live_renderer.py to see useful rendering settings.
The environment specifications are defined in nav/environment.py.
Each agent receives a composite observation consisting of:
- State Vector:
- Progress towards goal (normalized 0-1).
- Cosine of the angle between the agent's heading and the goal.
- Current speed ratio (current_speed / max_speed).
- Distance to goal.
- Goal vector (x, y).
- LIDAR Readings:
- A set of raycasts (default 60 rays).
- Each ray returns 3 channels: [Distance to Obstacle, Distance to Boundary, Distance to Agent].
- Stacked with `history_length` (default 4) frames to provide temporal context.
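To make the shapes concrete, here is a small sketch of what a single agent's observation might look like under the defaults. The exact array layout is an assumption; the authoritative definition is in nav/environment.py.

```python
import numpy as np

n_rays, n_channels, history_length = 60, 3, 4
state_dim = 6  # progress, cos(angle to goal), speed ratio, distance to goal, goal (x, y)

# One possible layout: LIDAR frames stacked along a leading history axis.
lidar_obs = np.zeros((history_length, n_rays, n_channels), dtype=np.float32)
state_vec = np.zeros(state_dim, dtype=np.float32)
print(lidar_obs.shape)  # (4, 60, 3)
```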
The agent controls its movement via a continuous 2D vector:
- `[vx, vy]`: Velocity components in the X and Y directions.
- Values are clipped to the range `[-1, 1]` and scaled by the agent's `max_speed`.
- The action is applied in the Local Coordinate Space of the agent, where the origin is at the agent's center and the Y-axis is the agent's goal vector.
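A minimal sketch of how such an action could be clipped, scaled, and rotated into the world frame is shown below. The function name, variable names, and the handedness of the local X-axis are assumptions; the real logic is in nav/environment.py.

```python
import numpy as np

def apply_action(action, goal_vec, max_speed):
    # Clip to [-1, 1], then scale by the agent's max_speed.
    vx, vy = np.clip(action, -1.0, 1.0) * max_speed
    # Local frame: the Y-axis points along the agent's goal vector.
    y_axis = goal_vec / np.linalg.norm(goal_vec)
    x_axis = np.array([y_axis[1], -y_axis[0]])  # perpendicular axis (orientation assumed)
    return vx * x_axis + vy * y_axis            # velocity in world coordinates
```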
The reward function incentivizes reaching the goal while avoiding collisions:
- Goal Reached: `+10`
- Collision (Obstacle, Boundary, or Agent): `-10`
- Progress Reward: Scaled by speed and alignment with the goal direction (encourages moving efficiently towards the target).
- Time Penalty (Stay Alive): `-0.05` per step (encourages reaching the goal quickly).
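Putting the pieces together, the per-step reward behaves roughly like the sketch below. The exact progress scaling is an assumption; the real function lives in nav/environment.py.

```python
def compute_reward(reached_goal, collided, speed_ratio, alignment):
    """Hedged sketch of the reward shaping described above."""
    if reached_goal:
        return 10.0
    if collided:  # obstacle, boundary, or agent
        return -10.0
    # Progress term: faster movement, better aligned with the goal => more reward.
    progress = speed_ratio * alignment
    return progress - 0.05  # time penalty per step
```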
The model architecture (found in rl/mappo.py and networks/actor_critic_network.py) implements the Actor-Critic method with shared features.
Both the Actor and Critic networks share a common backbone called the ObservationEncoder. This ensures that feature extraction from the raw environment state is consistent and efficient.
- Input: Takes the agent's observation, which includes stacked history frames.
- Structure:
- 1D CNN: Processes the LIDAR data (which is treated as a 1D sequence of ray intersections).
- Agent Network (MLP): Processes the scalar agent states (velocity, distance to goal, etc.).
- Dense Layer: Concatenates the outputs of the CNN and the MLP to produce a unified feature vector (default dimension 384).
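For intuition, here is a minimal PyTorch sketch of this encoder's shape. The kernel sizes, channel counts, and the way history frames fold into input channels are all assumptions; the real module lives in networks/actor_critic_network.py.

```python
import torch
import torch.nn as nn

class ObservationEncoderSketch(nn.Module):
    """Illustrative sketch, not the repo's actual ObservationEncoder."""

    def __init__(self, n_rays=60, n_channels=3, history=4, state_dim=6, out_dim=384):
        super().__init__()
        # 1D CNN over the ray dimension; history frames and per-ray channels
        # are folded into the input channels (one possible arrangement).
        self.lidar_cnn = nn.Sequential(
            nn.Conv1d(history * n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),  # -> (batch, 64)
        )
        # MLP branch for the scalar agent states.
        self.agent_mlp = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        # Dense layer fusing both branches into a 384-d feature vector.
        self.head = nn.Linear(64 + 64, out_dim)

    def forward(self, lidar, state):
        # lidar: (batch, history * n_channels, n_rays); state: (batch, state_dim)
        return self.head(torch.cat([self.lidar_cnn(lidar), self.agent_mlp(state)], dim=-1))
```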
The critic evaluates the value of a state for all agents, making it "centralized".
- Attention Mechanism: It uses a Multi-Head Attention (MHA) layer to process the states of all agents. By attending to other agents' encoded states, the critic can learn cooperative behaviors and anticipate interactions.
- Output: A single scalar value representing the estimated return (value) of the global state.
The environment also exposes a `state` function, but it is currently unused and left for future work.
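A hedged PyTorch sketch of such an attention-based centralized critic is below; the mean-pooling over agents and the layer sizes are assumptions, not the repo's exact design.

```python
import torch
import torch.nn as nn

class CentralizedCriticSketch(nn.Module):
    """Illustrative sketch of a centralized critic with multi-head attention."""

    def __init__(self, feat_dim=384, n_heads=4):
        super().__init__()
        # Each agent's encoded state attends to every other agent's state.
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.value_head = nn.Linear(feat_dim, 1)

    def forward(self, agent_feats):
        # agent_feats: (batch, n_agents, feat_dim) - encoded states of ALL agents.
        attended, _ = self.attn(agent_feats, agent_feats, agent_feats)
        # Pool over agents (assumed) to produce one scalar value for the global state.
        return self.value_head(attended.mean(dim=1))  # (batch, 1)
```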
The actor decides the action for a single agent, making it "decentralized" during execution.
- Input: Takes the shared feature vector from the `ObservationEncoder`.
- Structure: A standard MLP (Multi-Layer Perceptron) that maps features to the parameters of the action distribution.
- Output: A probability distribution (Diagonal Gaussian) over the action space, from which the agent samples its movement commands.
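A minimal PyTorch sketch of this actor head follows; the hidden sizes and the state-independent log standard deviation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActorSketch(nn.Module):
    """Illustrative sketch of the decentralized actor head."""

    def __init__(self, feat_dim=384, action_dim=2):
        super().__init__()
        # MLP mapping the shared 384-d features to a Gaussian mean over [vx, vy].
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )
        # State-independent log standard deviation (one common PPO choice).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, features):
        mean = self.mlp(features)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist  # dist.sample() gives [vx, vy]; dist.log_prob() feeds the PPO loss
```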