Hi,
I am trying to train a model which has an MLP layer. But when I run the training code, it is stuck at nn.Linear() initialization and I cannot even terminate the process using Ctrl+C
My PyTorch version is 2.2.2
CUDA version is 12.2.1
I have also tried a different CUDA version (11.8), but it still does not work.
The MLP class code is as follows:
class MLP_5layer(nn.Module):
    """Five-layer MLP whose hidden widths interpolate linearly from
    ``in_features`` to ``out_features``.

    Layers 1-4 are Linear -> BatchNorm1d -> LeakyReLU -> Dropout
    (p=0.1 after layer 1, p=0.5 after layers 2-4); layer 5 is
    Linear -> BatchNorm1d -> LeakyReLU with no dropout.

    NOTE(review): construction allocates every weight matrix eagerly. With a
    very large ``in_features`` (e.g. ~350k, as in the reported hang) the first
    Linear alone holds in_features * layer1_out ~ 1.1e11 parameters — hundreds
    of gigabytes — so ``nn.Linear`` appears to freeze while allocating.
    Reduce the feature dimension before this MLP rather than inside it.

    Args:
        in_features: size of each input sample (last dimension of ``x``).
        out_features: size of each output sample.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        # Widths interpolate linearly between in_features and out_features.
        # int() truncates toward zero, matching the original per-layer
        # expressions int(in_features + k*(out_features - in_features)/5).
        step = (out_features - in_features) / 5
        w = [in_features] + [int(in_features + k * step) for k in range(1, 5)] + [out_features]

        self.linear1 = torch.nn.Linear(w[0], w[1])
        self.batchnorm1 = torch.nn.BatchNorm1d(w[1])
        self.linear2 = torch.nn.Linear(w[1], w[2])
        self.batchnorm2 = torch.nn.BatchNorm1d(w[2])
        self.linear3 = torch.nn.Linear(w[2], w[3])
        self.batchnorm3 = torch.nn.BatchNorm1d(w[3])
        self.linear4 = torch.nn.Linear(w[3], w[4])
        self.batchnorm4 = torch.nn.BatchNorm1d(w[4])
        self.linear5 = torch.nn.Linear(w[4], w[5])
        self.batchnorm5 = torch.nn.BatchNorm1d(w[5])

        self.leakyrelu = torch.nn.LeakyReLU()
        self.dropout_01 = torch.nn.Dropout(p=0.1)
        self.dropout_05 = torch.nn.Dropout(p=0.5)

    def forward(self, x):
        """Apply the five layers; input ``x`` is (batch, in_features),
        output is (batch, out_features). BatchNorm1d requires batch > 1
        in training mode."""
        x = self.dropout_01(self.leakyrelu(self.batchnorm1(self.linear1(x))))
        x = self.dropout_05(self.leakyrelu(self.batchnorm2(self.linear2(x))))
        x = self.dropout_05(self.leakyrelu(self.batchnorm3(self.linear3(x))))
        x = self.dropout_05(self.leakyrelu(self.batchnorm4(self.linear4(x))))
        x = self.leakyrelu(self.batchnorm5(self.linear5(x)))
        return x
When I run the main script, I get the following output:
-init-mlp5-
226867
in_features: 347904
layer1_out: 317644
The process is stuck at this point. Please let me know what the issue is.
Thanks!