Multi-Agent Reinforcement Learning: A2C with LSTM, CNN, FC Layers, and Graph Attention Networks

Hello everyone,

I’m currently working on a Multi-Agent Reinforcement Learning (MARL) project focused on traffic signal control in the SUMO simulator. The environment is a 3x3 grid of intersections, each controlled by a separate agent; the agents coordinate to optimize traffic flow by adjusting signal phases.

Here’s a brief overview of the environment and model setup:

Observations: At each step, the environment returns an observation of shape (9, 3, 12, 20): each of the 9 agents receives a local, partial observation of shape (3, 12, 20).
Decentralized Approach: Each agent optimizes its policy using its current local observation, as well as the past 9 observations (stored in a buffer). Additionally, agents consider the influence of their 1-hop neighboring agents to enhance coordination.
Model Architecture:
Base Network: This is shared across all agents and consists of a CNN followed by fully connected layers (CNN + FC) to embed the local observations.
LSTM Network: To capture temporal information, each agent’s past 9 observations are combined with its current local observation. This sequence of observations is then processed by the agent’s LSTM network, which captures sequential dependencies and historical trends in the traffic flow.
Graph Attention Network (GAT): I also embed the stacked 9 observations for each agent and use a shared GAT to model the interactions between agents (1-hop neighbors).
Actor-Critic Networks (A2C): The outputs from the LSTM and GAT are concatenated and then fed into separate Actor and Critic networks for each agent to optimize its policy (a stripped-down sketch of this wiring is given right after this list).
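To make the wiring concrete, here is a stripped-down sketch of how the blocks fit together. All layer sizes, the GATConv import from PyTorch Geometric, and the toy graph are illustrative only; the real networks in my repo are larger and defined in separate modules.

import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # illustrative choice of GAT layer

class BaseNetwork(nn.Module):
    """Shared CNN + FC embedder for one local observation of shape (3, 12, 20)."""
    def __init__(self, emb_dim=64):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 12 * 20, emb_dim)

    def forward(self, obs):                               # obs: (B, 3, 12, 20)
        x = torch.relu(self.conv1(obs))
        x = torch.relu(self.conv2(x))
        return torch.relu(self.fc1(x.flatten(1)))         # (B, emb_dim)

class AgentHead(nn.Module):
    """Per-agent LSTM over the current + 9 past embedded observations, plus actor/critic heads."""
    def __init__(self, emb_dim=64, gat_dim=32, hidden=64, n_actions=4):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.actor = nn.Linear(hidden + gat_dim, n_actions)
        self.critic = nn.Linear(hidden + gat_dim, 1)

    def forward(self, obs_seq, gat_feat):                 # obs_seq: (B, 10, emb_dim), gat_feat: (B, gat_dim)
        _, (h, _) = self.lstm(obs_seq)
        z = torch.cat([h[-1], gat_feat], dim=-1)          # LSTM summary || GAT features
        return torch.distributions.Categorical(logits=self.actor(z)), self.critic(z)

# Shared components
base = BaseNetwork()
gat = GATConv(64, 32)                                     # shared GAT over the 9-intersection graph

# Toy forward pass for one agent
obs_seq = torch.randn(1, 10, 3, 12, 20)                   # current + 9 past local observations
emb_seq = base(obs_seq.flatten(0, 1)).view(1, 10, -1)     # embed every observation in the sequence
node_feats = torch.randn(9, 64)                           # one embedding per intersection
edge_index = torch.tensor([[0, 1], [1, 0]])               # toy 1-hop neighbour edges
gat_out = gat(node_feats, edge_index)                     # (9, 32)

head = AgentHead()
dist, value = head(emb_seq, gat_out[0].unsqueeze(0))      # agent 0's GAT features
action = dist.sample()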

My model is a custom, simplified version of the architecture described in this article, which proposes a Multi-Agent Deep Reinforcement Learning approach for traffic signal control. Unfortunately, the code used in the paper has not been open-sourced, so I had to build the architecture from scratch based on the concepts outlined in the paper.

I have implemented the entire model in Python using PyTorch, and my code is available on GitHub: GitHub - nicolas-svgn/MARL-GAT. While I have successfully interfaced the various neural network components of the model (CNN, LSTM, GAT, Actor-Critic), I am struggling to keep the gradients flowing correctly through all of these different network types during backpropagation.

In train2.py, in my train_loop function, I use .clone():

def train_loop(self):
    print()
    print("Start Training")

    # Enable anomaly detection
    T.autograd.set_detect_anomaly(True)  

    """for step in itertools.count(start=self.agent.resume_step):
        self.agent.step = step"""

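    # Bootstrap: take one random action per traffic light, embed the resulting
    # observation for each agent, and compute the shared GAT output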
    actions = [random.randint(0,3) for tl_id in self.tls]
    obs, rew, terminated, infos = self.env.step(actions)

    graph_features = self.embedder.graph_embed_state(obs)

    gat_output = self.gat_block.gat_output(graph_features)

    for agent in self.agents:
       agent.gat_features = gat_output.clone()
       agent_obs = obs[agent.tl_map_id].copy()
       embedded_agent_obs = self.embedder.embed_agent_obs(agent_obs)
       agent.current_t_obs = embedded_agent_obs.clone()

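    # Main loop: select actions, step the environment, then update all networks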
    for step in range(3):

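        # Each agent selects an action from its current embedded observation and its GAT features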
        actions = []
        agent_log_probs = []

        for agent in self.agents:
            action, log_prob = agent.select_action(agent.current_t_obs, agent.gat_features)
            agent.current_action = action
            actions.append(agent.current_action)
            agent_log_probs.append(log_prob)

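        # Step the environment with the joint action, then embed the new observations and recompute the GAT output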
        new_obs, rew, terminated, infos = self.env.step(actions)
        new_graph_features = self.embedder.graph_embed_state(new_obs)
        new_gat_output = self.gat_block.gat_output(new_graph_features)

        for agent in self.agents:
            agent.new_gat_features = new_gat_output.clone()
            agent_new_obs = new_obs[agent.tl_map_id].copy()
            embedded_agent_new_obs = self.embedder.embed_agent_obs(agent_new_obs)
            agent.new_t_obs = embedded_agent_new_obs.clone()


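        # Compute each agent's A2C value and policy losses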
        vlosses = []
        plosses = []

        for agent in self.agents:
            print('--------------------')
            print('agent id')
            print(agent.tl_id)
            print('agent map id')
            print(agent.tl_map_id)
            agent_action = agent.current_action
            agent_action_log_prob = agent_log_probs[agent.tl_map_id]
            print('agent action')
            print(agent_action)
            agent_reward = rew[agent.tl_map_id]
            print('agent reward')
            print(agent_reward)
            agent_terminated = terminated[agent.tl_map_id]
            print('agent is done ?')
            print(agent_terminated)
            print('--------------------')

            vloss, ploss = agent.learn(agent.gat_features, agent.new_gat_features, agent_action_log_prob, agent.current_t_obs, agent.new_t_obs, agent_reward, agent_terminated)
            vlosses.append(vloss)
            plosses.append(ploss)

        # Calculate the average losses across all agents
        avg_value_loss = sum(vlosses) / len(vlosses)
        avg_policy_loss = sum(plosses) / len(plosses)

        # Combine the average losses
        total_loss = avg_value_loss + avg_policy_loss

        # Zero gradients for all optimizers (shared and individual)
        self.embedder.base_network.optimizer.zero_grad()
        self.gat_block.gat_network.optimizer.zero_grad()
        for agent in self.agents:
            agent.lstm_network.optimizer.zero_grad()
            agent.actor_network.optimizer.zero_grad()
            agent.critic_network.optimizer.zero_grad()

        # Disable dropout for backpropagation
        self.gat_block.gat_network.train(False)

        # Backpropagate the total loss only once
        print('we re about to backward')
        total_loss.backward(retain_graph=True)
        print('backward done !')

        # Check gradients for the BaseNetwork
        for name, param in self.embedder.base_network.named_parameters():
            if param.grad is not None:
                print(f"Gradient computed for {name}")
            else:
                print(f"No gradient computed for {name}")

        # Re-enable dropout
        self.gat_block.gat_network.train(True)

        # Update all optimizers (shared and individual)
        self.embedder.base_network.optimizer.step()
        self.gat_block.gat_network.optimizer.step()
        for agent in self.agents:
            agent.lstm_network.optimizer.step()
            agent.actor_network.optimizer.step()
            agent.critic_network.optimizer.step()

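        # Advance each agent to the next step: push the current observation into its
        # history buffer and adopt the new observation and GAT features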
        for agent in self.agents:
            agent.load_hist_buffer(agent.current_t_obs)
            agent.gat_features = agent.new_gat_features.clone()
            agent.current_t_obs = agent.new_t_obs.clone()

Specifically, when updating each agent’s current observations and GAT features, if I use .clone() I get the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [16, 8]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

This error suggests that an in-place operation is modifying the variable, but I am not explicitly using any in-place operations in my code. If I switch to .detach() instead of .clone(), the error disappears, but the gradients of the base network are no longer computed:

Gradient computed for conv1.weight
Gradient computed for conv1.bias
Gradient computed for conv2.weight
Gradient computed for conv2.bias
Gradient computed for fc1.weight
Gradient computed for fc1.bias
Gradient computed for fc2.weight
Gradient computed for fc2.bias
Gradient computed for fc3.weight
Gradient computed for fc3.bias
Gradient computed for fc4.weight
Gradient computed for fc4.bias

Can anyone offer insight into how to handle gradient computation properly in a complex architecture like this? When is it appropriate to use .clone(), .detach(), or other operations to avoid issues with in-place modifications while still maintaining gradient flow? Any advice on handling this type of architecture would be greatly appreciated.
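To make the second question concrete, here is a tiny standalone example of the .clone()/.detach() difference as I currently understand it (completely separate from my actual networks):

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

# clone(): the copy stays in the autograd graph, so gradients still reach w
loss_clone = (w * x).clone().sum()
loss_clone.backward()
print(w.grad)                 # populated

# detach(): the copy is cut out of the graph, so nothing upstream receives a gradient
w.grad = None
cut = (w * x).detach()
print(cut.requires_grad)      # False -> any loss built from `cut` will never update w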

Thank you!