Does ModuleList behave differently from Sequential?

Hi there,

I was trying to implement an A2C model to train on one of the OpenAI Gym environments.

I created two models that look identical to me in terms of structure and forward logic. The main difference between the two is that one is built using ModuleList with Sequential blocks wrapped inside, while the other uses Sequential directly. However, only the Sequential implementation is learning.
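
My understanding is that iterating a ModuleList manually should be equivalent to calling the corresponding Sequential directly (with independently initialized weights, of course), e.g.:

seq = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
mlist = nn.ModuleList([nn.Linear(4, 8), nn.ReLU()])

x = torch.randn(1, 4)
out_seq = seq(x)     # modules applied in registration order
out_list = x
for m in mlist:      # manual iteration over the same modules
    out_list = m(out_list)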

Model 1

import torch
import torch.nn as nn
import torch.nn.functional as F  # F is used by Model 2 below

torch.manual_seed(999)
base_h_num = 0
actor_h_num = 1 
critic_h_num = 1
act_size = 4
input_dim = 33

class A2C_model(nn.Module):
    def __init__(self, input_dim, act_size):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.base = self.create_base(self.input_dim)
        self.mu = self.create_actor()
        self.val = self.create_critic()
        self.std = nn.Parameter(torch.ones(1, act_size))
  
    def create_base(self, input_dim):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(input_dim, 128)
        layer.add_module("fc_layer_1", fc)
        layer.add_module("RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, base_h_num, 128, 128)
        return module_list

    def create_actor(self):
        module_list = nn.ModuleList()
        self.add_hidden_layer(module_list, actor_h_num, 128, 128)
        module_list.append(nn.Sequential(nn.Linear(128, self.act_size)))
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        self.add_hidden_layer(module_list, critic_h_num, 128, 128)
        module_list.append(nn.Sequential(nn.Linear(128, 1)))
        return module_list

    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)

    def forward(self, x):
        for b in self.base:
            x = b(x)
        mu = x
        for m in self.mu:
            mu = m(mu)
        dist = torch.distributions.Normal(mu, self.std)
        actions = dist.sample()     
        log_prob = dist.log_prob(actions)
        for v in self.val:
            x = v(x)
        return torch.clamp(actions, -1, 1), log_prob, x

Model 2

class A2C_model2(nn.Module):
    def __init__(self, input_dim, act_size):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.actor_fc = nn.Linear(128, 128)
        self.actor_out = nn.Linear(128, act_size)
        self.critic_fc = nn.Linear(128, 128)
        self.critic_out = nn.Linear(128, 1)
        # std registered last so parameters() lines up with Model 1
        # (relevant for the parameter-copying snippet further down)
        self.std = nn.Parameter(torch.ones(1, act_size))
        
    def forward(self, state):
        x = F.relu(self.fc1(state))
        mean = self.actor_out(F.relu(self.actor_fc(x)))
        dist = torch.distributions.Normal(mean, self.std)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        value = self.critic_out(F.relu(self.critic_fc(x)))
        return torch.clamp(action, -1, 1), log_prob, value

I created the ModuleList version to be able to play around with the number of hidden layers, as it makes changing them easier. However, it performs very badly compared to the Sequential implementation.

I have tried different seeds, but the ModuleList version has had no luck. This really makes me wonder what I have done wrong. I hope someone can help me find the root cause of this discrepancy so I won’t make the same mistake again.

I have been trying to figure this out for days; I would really appreciate it if someone can help me!

Cheers!

That’s a bit strange, as both models yield the same output and get the same gradients if you initialize them with the same values and set the seed before calling the forward method:

model1 = A2C_model(input_dim, act_size)
model2 = A2C_model2(input_dim, act_size)

with torch.no_grad():
    for param1, param2 in zip(model1.parameters(), model2.parameters()):
        param1.copy_(param2)


x = torch.randn(1, input_dim)
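# seeding right before each forward pass makes dist.sample() draw the same noise for both models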
torch.manual_seed(2809)
output1 = model1(x)
torch.manual_seed(2809)
output2 = model2(x)
print(output1)
print(output2)

output1[2].backward()
output2[2].backward()

for param1, param2 in zip(model1.parameters(), model2.parameters()):
    if param1.grad is None and param2.grad is None:
        print('both none')
    else:
        print((param1.grad == param2.grad).all())

Could you use my code snippet to copy the parameters of your “good” model to the bad one and train the bad model for a bit, just to see if it achieves a similar accuracy?

I am also experiencing a similar issue, and I am really confused about it. I implemented the exact same two models, one with ModuleList() (and/or as individual single layers) and another one with Sequential(). In my case also, only the Sequential() model is learning properly. It seems like the ModuleList() model is learning the general dataset mean instead of reacting to the input, so it seems like the gradients are somehow not flowing properly back to the input. It happens in a task where two twin networks are trained that influence each other. I ran it over multiple runs; every time only the Sequential() model is able to find a proper solution.

This really feels like a super strange bug.

I am using PyTorch 1.2.0 with CUDA 10.0.

Did you track down the problem in your case? Or is it really a bug in PyTorch/CUDA?

You could check it by printing all .grad attributes of the model parameters after the backward call.
If some of them are None, your computation graph was (accidentally) detached at some point.
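
A minimal sketch of such a check (assuming loss.backward() has already been called on your model):

for name, param in model.named_parameters():
    if param.grad is None:
        print(name, 'grad is None -> graph may be detached here')
    else:
        print(name, param.grad.abs().sum().item())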

Could you update to the latest stable version and post a code snippet to reproduce this issue?

My code snippet yields the same output and gradients for both models, so I assume it’s not a bug in PyTorch/CUDA but in the user code.

I will try to check whether the gradients are flowing back properly when I have time. Unfortunately, I am not able to update CUDA 10.0 that easily, and PyTorch 1.2 is the latest version for CUDA 10.0. I already updated PyTorch from 1.1 to 1.2, but I see the same behaviour in both versions.

This behaviour only happens when the same model is forwarded twice with two different inputs, where both outputs somehow influence each other.

E.g. something like this:

output1 = model.forward(input1)
output2 = model.forward(input2)

loss = (output1 - output2).pow(2).sum()

But where the optimal output can somehow only be achieved by considering the input, and not just by setting the output all to zero like in this example. I am trying to think of a simple reproducible example, because I cannot post my code here and also cannot reduce it, as it is quite complex.

Note that the binaries ship with their own CUDA, cudnn, etc. runtime libraries, so your system installed CUDA version will not be used.
If you want to use your system CUDA, you would have to rebuild PyTorch from source.
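
You can check which versions the binaries actually use via:

import torch
print(torch.__version__)
print(torch.version.cuda)              # CUDA runtime shipped with the binaries
print(torch.backends.cudnn.version())  # bundled cudnn version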

Your code snippet could yield different output tensors e.g. if dropout or batchnorm layers were used, so you could try to call model.eval() before comparing the output.
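
For example (a sketch reusing model1, model2, and x from the snippet above):

model1.eval()
model2.eval()
torch.manual_seed(2809)
out1 = model1(x)
torch.manual_seed(2809)
out2 = model2(x)
print(torch.allclose(out1[2], out2[2]))  # compare the value estimates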

PS: don’t use model.forward(input), as this will not use potentially registered hooks and could yield unwanted behavior. Call the model directly via model(input). :wink:
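
For example, a forward hook (a hypothetical print_hook, just for illustration) would be silently skipped:

def print_hook(module, inp, out):
    print('forward hook fired')

model.register_forward_hook(print_hook)
out = model(x)          # hook fires
out = model.forward(x)  # hook is skipped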

Ah okay, interesting, thank you :slight_smile:. But the NVIDIA driver must be compatible with the PyTorch CUDA version, I guess? I will try to give it a go later with the newest stable version.

Yes, you would have to install an appropriate NVIDIA driver for the CUDA version you are using (via the binaries or locally on your system).