Does ModuleList behave differently from Sequential?

Hi there,

I was trying to implement an A2C model to train on one of the OpenAI Gym environments.

I created two models that look identical to me in terms of structure and forward logic. The main difference between the two is that one is built using ModuleList with Sequential blocks wrapped inside, while the other uses Sequential directly. However, only the Sequential implementation is learning.
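
My understanding is that iterating a ModuleList manually should be equivalent to calling the corresponding Sequential directly (with independently initialized weights, of course), e.g.:

seq = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
mlist = nn.ModuleList([nn.Linear(4, 8), nn.ReLU()])

x = torch.randn(1, 4)
out_seq = seq(x)     # modules applied in registration order
out_list = x
for m in mlist:      # manual iteration over the same modules
    out_list = m(out_list)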

Model 1

import torch
import torch.nn as nn
import torch.nn.functional as F  # F is used by Model 2 below

torch.manual_seed(999)
base_h_num = 0
actor_h_num = 1 
critic_h_num = 1
act_size = 4
input_dim = 33

class A2C_model(nn.Module):
    def __init__(self, input_dim, act_size):
        super().__init__()
        self.input_dim = input_dim
        self.act_size = act_size
        self.base = self.create_base(self.input_dim)
        self.mu = self.create_actor()
        self.val = self.create_critic()
        self.std = nn.Parameter(torch.ones(1, act_size))
  
    def create_base(self, input_dim):
        module_list = nn.ModuleList()
        layer = nn.Sequential()
        fc = nn.Linear(input_dim, 128)
        layer.add_module("fc_layer_1", fc)
        layer.add_module("RELU_layer_1", nn.ReLU())
        module_list.append(layer)
        self.add_hidden_layer(module_list, base_h_num, 128, 128)
        return module_list

    def create_actor(self):
        module_list = nn.ModuleList()
        self.add_hidden_layer(module_list, actor_h_num, 128, 128)
        module_list.append(nn.Sequential(nn.Linear(128, self.act_size)))
        return module_list
    
    def create_critic(self):
        module_list = nn.ModuleList()
        self.add_hidden_layer(module_list, critic_h_num, 128, 128)
        module_list.append(nn.Sequential(nn.Linear(128, 1)))
        return module_list

    def add_hidden_layer(self, module_list, num_hidden_layer,
                         input_dim, output_dim):
        if num_hidden_layer == 0:
            return
        for i in range(1, num_hidden_layer+1):
            layer = nn.Sequential()
            fc = nn.Linear(input_dim, output_dim)
            layer.add_module(f"fc_layer_{i}", fc)
            layer.add_module(f"RELU_layer_{i}", nn.ReLU())
            module_list.append(layer)

    def forward(self, x):
        for b in self.base:
            x = b(x)
        mu = x
        for m in self.mu:
            mu = m(mu)
        dist = torch.distributions.Normal(mu, self.std)
        actions = dist.sample()     
        log_prob = dist.log_prob(actions)
        for v in self.val:
            x = v(x)
        return torch.clamp(actions, -1, 1), log_prob, x

Model 2

class A2C_model2(nn.Module):
    def __init__(self, input_dim, act_size):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.actor_fc = nn.Linear(128, 128)
        self.actor_out = nn.Linear(128, act_size)
        self.critic_fc = nn.Linear(128, 128)
        self.critic_out = nn.Linear(128, 1)
        # std registered last so parameters() lines up with Model 1
        # (relevant for the parameter-copying snippet further down)
        self.std = nn.Parameter(torch.ones(1, act_size))
        
    def forward(self, state):
        x = F.relu(self.fc1(state))
        mean = self.actor_out(F.relu(self.actor_fc(x)))
        dist = torch.distributions.Normal(mean, self.std)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        value = self.critic_out(F.relu(self.critic_fc(x)))
        return torch.clamp(action, -1, 1), log_prob, value

I created the ModuleList version to be able to play around with the number of hidden layers, as it makes changing them easier. However, it performs very badly compared to the Sequential implementation.

I have tried different seeds, but the ModuleList version has had no luck. This really makes me wonder what I have done wrong. I hope someone can help me find the root cause of this discrepancy so I won’t make the same mistake again.

I have been trying to figure this out for days; I would really appreciate it if someone can help me!

Cheers!

That’s a bit strange, as both models yield the same output and get the same gradients if you initialize them with the same values and set the seed before calling the forward method:

model1 = A2C_model(input_dim, act_size)
model2 = A2C_model2(input_dim, act_size)

with torch.no_grad():
    for param1, param2 in zip(model1.parameters(), model2.parameters()):
        param1.copy_(param2)


x = torch.randn(1, input_dim)
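# seeding right before each forward pass makes dist.sample() draw the same noise for both models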
torch.manual_seed(2809)
output1 = model1(x)
torch.manual_seed(2809)
output2 = model2(x)
print(output1)
print(output2)

output1[2].backward()
output2[2].backward()

for param1, param2 in zip(model1.parameters(), model2.parameters()):
    if param1.grad is None and param2.grad is None:
        print('both none')
    else:
        print((param1.grad == param2.grad).all())

Could you use my code snippet to copy the parameters of your “good” model to the bad one and train the bad model for a bit, just to see if it achieves a similar accuracy?

I am also experiencing a similar issue, and I am really confused about it. I implemented the exact same two models, one with ModuleList() (and/or as individual single layers) and another one with Sequential(). In my case also, only the Sequential() model is learning properly. It seems like the ModuleList() model is learning the general dataset mean instead of reacting to the input, so it seems like the gradients are somehow not flowing properly back to the input. It happens in a task where two twin networks are trained that influence each other. I ran it over multiple runs; every time only the Sequential() model is able to find a proper solution.

This really feels like a super strange bug.

I am using PyTorch 1.2.0 with CUDA 10.0.

Did you track down the problem in your case? Or is it really a bug in PyTorch/CUDA?

You could check it by printing all .grad attributes of the model parameters after the backward call.
If some of them are None, your computation graph was (accidentally) detached at some point.
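
A minimal sketch of such a check (assuming loss.backward() has already been called on your model):

for name, param in model.named_parameters():
    if param.grad is None:
        print(name, 'grad is None -> graph may be detached here')
    else:
        print(name, param.grad.abs().sum().item())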

Could you update to the latest stable version and post a code snippet to reproduce this issue?

My code snippet yields the same output and gradients for both models, so I assume it’s not a bug in PyTorch/CUDA but in the user code.

I will try to check whether the gradients are flowing back properly when I have time. Unfortunately, I am not able to update CUDA 10.0 that easily, and PyTorch 1.2 is the latest version for CUDA 10.0. I already updated PyTorch from 1.1 to 1.2, but I see the same behaviour in both versions.

This behaviour only happens when the same model is forwarded twice with two different inputs, where both outputs somehow influence each other.

E.g. something like this:

output1 = model.forward(input1)
output2 = model.forward(input2)

loss = (output1 - output2).pow(2).sum()

But where the optimal output can somehow only be achieved by considering the input, and not just by setting the output all to zero like in this example. I am trying to think of a simple reproducible example, because I cannot post my code here and also cannot reduce it, as it is quite complex.

Note that the binaries ship with their own CUDA, cudnn, etc. runtime libraries, so your system installed CUDA version will not be used.
If you want to use your system CUDA, you would have to rebuild PyTorch from source.
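
You can check which versions the binaries actually use via:

import torch
print(torch.__version__)
print(torch.version.cuda)              # CUDA runtime shipped with the binaries
print(torch.backends.cudnn.version())  # bundled cudnn version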

Your code snippet could yield different output tensors e.g. if dropout or batchnorm layers were used, so you could try to call model.eval() before comparing the output.
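
For example (a sketch reusing model1, model2, and x from the snippet above):

model1.eval()
model2.eval()
torch.manual_seed(2809)
out1 = model1(x)
torch.manual_seed(2809)
out2 = model2(x)
print(torch.allclose(out1[2], out2[2]))  # compare the value estimates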

PS: don’t use model.forward(input), as this will not use potentially registered hooks and could yield unwanted behavior. Call the model directly via model(input). :wink:
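
For example, a forward hook (a hypothetical print_hook, just for illustration) would be silently skipped:

def print_hook(module, inp, out):
    print('forward hook fired')

model.register_forward_hook(print_hook)
out = model(x)          # hook fires
out = model.forward(x)  # hook is skipped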

Ah okay, interesting, thank you :slight_smile:. But the NVIDIA driver must be compatible with the PyTorch CUDA version, I guess? I will try to give it a go later with the newest stable version.

Yes, you would have to install an appropriate NVIDIA driver for the CUDA version you are using (via the binaries or locally on your system).