RuntimeError: cuda runtime error (9) : invalid configuration argument at /pytorch/aten/src/THC/generic/THCTensorMathPointwise.cu:124

I encounter this problem during training; it occurs within the first 400 iterations. I have tried decreasing the batch_size, but it doesn't help. I also monitor GPU memory during training, and it looks normal.

THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMathPointwise.cu line=124 error=9 : invalid configuration argument
Traceback (most recent call last):
  File "train.py", line 669, in <module>
    main()
  File "train.py", line 665, in main
    train(G_model, G_net, G_optimizer, G_pair_dataloader, G_unpair_dataloader, R_model, R_net, R_optimizer, R_pair_dataloader, R_unpair_dataloader)
  File "train.py", line 565, in train
    tmp_r_lips = generate_lg(G_model, tmp_g_txts, tmp_guide_imgs)
  File "train.py", line 320, in generate_lg
    G_imgs = G_net(guide_imgs, None, g_txts)
  File "/home/WeicongChen/anaconda3/envs/pt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/WeicongChen/codes/DualLip/LipGAN/model_G_att.py", line 322, in forward
    text_z, text_h = self.text_encoder(text_inputs)
  File "/home/WeicongChen/anaconda3/envs/pt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/WeicongChen/codes/DualLip/LipGAN/model_G_att.py", line 77, in forward
    hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
RuntimeError: cuda runtime error (9) : invalid configuration argument at /pytorch/aten/src/THC/generic/THCTensorMathPointwise.cu:124

More details: I train two models at the same time, and no error occurs when I train either of them individually.

Are you training these models in a single script or separate ones?
Also, did you change anything in the script(s) for the “dual” run?

I train these models in a single script. They are a Generator (G) and a Classifier (C), and the error occurs while the Generator is generating data for the Classifier. Because the Generator's batch size is smaller than the Classifier's, I run the Generator several times to produce enough data for the Classifier. I observed that if I run the Generator only once, the error does not occur.

Here is an example snippet.

num = C_batch_size // G_batch_size  # if I set num=1, this error won't occur
data = []
for i in range(num):
    # take the i-th Generator-sized slice of the Classifier batch
    input_i = input[i * G_batch_size : (i + 1) * G_batch_size]
    data.append(G(input_i))  # G generates data for C; this is where the error occurs
C(torch.cat(data, dim=0))

Could you post a code snippet to reproduce this error, please?

Sorry, the code is too long… I guess this may be because the data generated by my Generator is too large; its dimension is (bs X 75 X 80 X 160 X 3). A typical bs in my setting is 128, which means the generated data occupies about 0.34 GB of GPU memory. I am investigating the problem along this conjecture.
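For reference, a quick way to estimate the footprint of one generated batch (a rough sketch; it assumes float32, i.e. 4 bytes per element, so the number changes for other dtypes):

import torch

# Footprint of one generated batch, assuming float32 (4 bytes per element).
bs = 128
x = torch.empty(bs, 75, 80, 160, 3)
print(f"{x.element_size() * x.nelement() / 1024**3:.2f} GiB")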

Are you able to reproduce the error by creating a random tensor in the shape of the accumulated generator output?
The initial error points to an invalid kernel launch configuration, which might be triggered by the tensor shape, but we would at least need to know the layer and the input shape for further debugging.
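Something along these lines would do as a rough sketch (the shape is taken from your earlier post; bs and num are hypothetical, so scale them down if this does not fit on your GPU):

import torch

bs, num = 128, 4                                  # hypothetical values for the loop above
chunks = [torch.randn(bs, 75, 80, 160, 3, device='cuda') for _ in range(num)]
accumulated = torch.cat(chunks, dim=0)            # mimic accumulating the generator output
del chunks                                        # free the per-step copies
accumulated.tanh_()                               # same kind of pointwise CUDA kernel as the failing torch.tanh
print(accumulated.shape)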

Actually, I have created a dummy dataset in which all the data are randomly generated. When I run with this dummy dataset, no error has occurred so far, after 2000 iterations.

This is the relevant part of the Generator network. The error occurs in the marked line near the end.

import torch
import torch.nn as nn

# Note: ConvNorm and RNNModel are custom modules defined elsewhere in the codebase.
class TextEncoder(nn.Module):
    def __init__(self, hidden_size, embed_size, vocab_size, rnn_type, num_layers=1, bidirectional=True, dropout=0.5, num_convs=3):
        super(TextEncoder, self).__init__()
        self.bidirectional = bidirectional
        self.num_convs = num_convs
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        convs = []
        for _ in range(num_convs):
            conv_layer = nn.Sequential(
                    ConvNorm(embed_size,
                            embed_size,
                            kernel_size=5, stride=1,
                            padding=(5 - 1) // 2,
                            dilation=1, w_init_gain='relu'),
                    nn.BatchNorm1d(embed_size),
                    nn.ReLU(inplace=True))
            convs.append(conv_layer)
        self.conv = nn.ModuleList(convs)
        
        self.rnn = RNNModel(embed_size, hidden_size, rnn_type, num_layers, bidirectional)
        scale = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_size * scale, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # seq_len X bs
        batch_size = x.shape[1]
        # seq_len X bs X emb_dim
        x = self.embedding(x)
        if self.num_convs > 0:
            # bs X emb_dim X seq_len
            x = x.permute(1, 2, 0)
            for layer in self.conv:
                x = layer(x)
            x = x.permute(2, 0, 1)
        x = self.dropout(x)

        # seq_len X bs X hid_size * num_direc, num_direc*num_layer*hid_size
        outputs, hidden = self.rnn(x)
        if self.bidirectional:
            hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))) # error occurs in this line
        else:
            hidden = torch.tanh(self.fc(hidden[-1,:,:]))
        return outputs, hidden

Thanks for the code! Could you post the input shapes, which trigger this error?

For the TextEncoder:
hidden_size=512, embed_size=256, vocab_size=28, rnn_type='GRU', num_convs=0.

For the input x, its shape is bs X seq_len, where seq_len is adapted to the max length in a batch. According to my statistics, seq_len ranges from 25 to 35.
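For completeness, a standalone sketch with these settings (the forward above reads batch_size from x.shape[1], so the dummy input is built as (seq_len, bs); the batch size of 128 is just an example, and RNNModel comes from the original codebase):

import torch

encoder = TextEncoder(hidden_size=512, embed_size=256, vocab_size=28,
                      rnn_type='GRU', num_convs=0).cuda()
x = torch.randint(0, 28, (30, 128), device='cuda')   # seq_len=30, bs=128
outputs, hidden = encoder(x)
print(outputs.shape, hidden.shape)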

Hope this information helps you locate the problem. Many thanks.

I have a similar problem, but with reflection padding: RuntimeError: cuda runtime error (9) : invalid configuration argument at /pytorch/aten/src/THCUNN/generic/TemporalReflectionPadding.cu:64. The error disappears when the batch size is small, such as 8.
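For what it's worth, a hypothetical sketch of the pattern (made-up shapes; TemporalReflectionPadding is, as far as I know, the backend of nn.ReflectionPad1d):

import torch
import torch.nn as nn

pad = nn.ReflectionPad1d(2).cuda()
x = torch.randn(100000, 8, 64, device='cuda')   # (batch, channels, length); the error only appears at large batch sizes
print(pad(x).shape)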

@xwang233 is already fixing it in this PR.


Thanks! I also got the notification from GitHub.