RuntimeError: code is too big

Ahmed_m · August 12, 2019, 11:42am

Hi,

I am using the latest stable version of PyTorch (1.2.0) to train and test a model. For testing, I map the trained model to the CPU and run it there. However, I get the following error:

RuntimeError: code is too big

Could anybody tell me the main cause of this error, or at least what it means? Unfortunately, I could not diagnose the problem from this limited error message.

This issue only happens when I run on the CPU; on the GPU everything is fine. I suspected a RAM issue. However, the error occurs on a workstation with 64 GB of RAM, and it disappears when I run on an ordinary computer with 8 GB of RAM.

Many thanks in advance

andrewpatterson2018 · August 12, 2019, 12:45pm

What is the code you're working on?

Ahmed_m · August 12, 2019, 1:02pm

It is a normal nn model like the following one:

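import torch
import torch.nn as nn
import torch.nn.functional as F

# SpectralNorm is a user-defined wrapper module (not shown here).

class G(nn.Module):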
    """G"""
    def __init__(self):
        super().__init__()
        torch.manual_seed(5)


        # Shape comments assume an input of [B x 1 x 16384].
        self.pad1 = nn.ReflectionPad1d(15)
        self.enc1 = SpectralNorm(nn.Conv1d(in_channels=1, out_channels=16, kernel_size=32, stride=2, padding=0, bias=False))  # out: [B x 16 x 8192]
        self.enc1_nl = nn.PReLU()  # non-linear transformation after encoder layer 1
        self.pad2 = nn.ReflectionPad1d(15)
        self.enc2 = SpectralNorm(nn.Conv1d(16, 16, 32, 2, 0, bias=True))  # [B x 16 x 4096]
        self.enc2_nl = nn.PReLU()
        self.pad3 = nn.ReflectionPad1d(15)
        self.enc3 = SpectralNorm(nn.Conv1d(16, 32, 32, 2, 0, bias=True))  # [B x 32 x 2048]
        self.enc3_nl = nn.PReLU()
        self.pad4 = nn.ReflectionPad1d(15)
        self.enc4 = SpectralNorm(nn.Conv1d(32, 32, 32, 2, 0, bias=True))  # [B x 32 x 1024]
        self.enc4_nl = nn.PReLU()
        self.pad5 = nn.ReflectionPad1d(15)
        self.enc5 = SpectralNorm(nn.Conv1d(32, 64, 32, 2, 0, bias=True))  # [B x 64 x 512]
        self.enc5_nl = nn.PReLU()
        self.pad6 = nn.ReflectionPad1d(15)
        self.enc6 = SpectralNorm(nn.Conv1d(64, 64, 32, 2, 0, bias=True))  # [B x 64 x 256]
        self.enc6_nl = nn.PReLU()
        self.pad7 = nn.ReflectionPad1d(15)
        self.enc7 = SpectralNorm(nn.Conv1d(64, 128, 32, 2, 0, bias=True))  # [B x 128 x 128]
        self.enc7_nl = nn.PReLU()

        # Decoder: each 1x1 conv collapses its input to one channel. After upsampling,
        # that channel is concatenated with an encoder output, so in_channels = 1 + skip channels.
        self.conv1x1_1 = SpectralNorm(nn.Conv1d(in_channels=128, out_channels=1, kernel_size=1, stride=1, padding=0, bias=False))

        self.dec7_nl = nn.PReLU()
        self.conv1x1_2 = SpectralNorm(nn.Conv1d(in_channels=65, out_channels=1, kernel_size=1, stride=1, padding=0, bias=False))  # 1 + 64 (enc6)

        self.dec6_nl = nn.PReLU()
        self.conv1x1_3 = SpectralNorm(nn.Conv1d(in_channels=65, out_channels=1, kernel_size=1, stride=1, padding=0, bias=False))  # 1 + 64 (enc5)

        self.dec5_nl = nn.PReLU()
        self.conv1x1_4 = SpectralNorm(nn.Conv1d(in_channels=33, out_channels=1, kernel_size=1, stride=1, padding=0, bias=False))  # 1 + 32 (enc4)

        self.dec4_nl = nn.PReLU()
        self.conv1x1_5 = SpectralNorm(nn.Conv1d(in_channels=33, out_channels=1, kernel_size=1, stride=1, padding=0, bias=False))  # 1 + 32 (enc3)

        self.dec3_nl = nn.PReLU()
        self.conv1x1_6 = SpectralNorm(nn.Conv1d(in_channels=17, out_channels=1, kernel_size=1, stride=1, padding=0, bias=False))  # 1 + 16 (enc2)

        self.dec2_nl = nn.PReLU()
        self.conv1x1_7 = SpectralNorm(nn.Conv1d(in_channels=17, out_channels=1, kernel_size=1, stride=1, padding=0, bias=False))  # 1 + 16 (enc1)

        self.dec1_nl = nn.PReLU()
        
    def forward(self, x):
        """
        Forward pass of generator.
        Args:
            x: input batch (signal)
        """
        # encoding step: pad, strided conv, then PReLU
        enc1 = self.enc1(self.pad1(x))
        enc1_out = self.enc1_nl(enc1)
        enc2 = self.enc2(self.pad2(enc1_out))
        enc2_out = self.enc2_nl(enc2)
        enc3 = self.enc3(self.pad3(enc2_out))
        enc3_out = self.enc3_nl(enc3)
        enc4 = self.enc4(self.pad4(enc3_out))
        enc4_out = self.enc4_nl(enc4)
        enc5 = self.enc5(self.pad5(enc4_out))
        enc5_out = self.enc5_nl(enc5)
        enc6 = self.enc6(self.pad6(enc5_out))
        enc6_out = self.enc6_nl(enc6)
        enc7 = self.enc7(self.pad7(enc6_out))
        enc7_out = self.enc7_nl(enc7)

        code = enc7_out

        # decoding step: 1x1 conv, upsample by 2, then concatenate the
        # pre-activation encoder output of the matching level as a skip connection
        dec7 = F.interpolate(self.conv1x1_1(code), scale_factor=2, mode="linear")
        dec7_out = self.dec7_nl(torch.cat((dec7, enc6), dim=1))

        dec6 = F.interpolate(self.conv1x1_2(dec7_out), scale_factor=2, mode="linear")
        dec6_out = self.dec6_nl(torch.cat((dec6, enc5), dim=1))

        dec5 = F.interpolate(self.conv1x1_3(dec6_out), scale_factor=2, mode="linear")
        dec5_out = self.dec5_nl(torch.cat((dec5, enc4), dim=1))

        dec4 = F.interpolate(self.conv1x1_4(dec5_out), scale_factor=2, mode="linear")
        dec4_out = self.dec4_nl(torch.cat((dec4, enc3), dim=1))

        dec3 = F.interpolate(self.conv1x1_5(dec4_out), scale_factor=2, mode="linear")
        dec3_out = self.dec3_nl(torch.cat((dec3, enc2), dim=1))

        dec2 = F.interpolate(self.conv1x1_6(dec3_out), scale_factor=2, mode="linear")
        dec2_out = self.dec2_nl(torch.cat((dec2, enc1), dim=1))

        dec1 = F.interpolate(self.conv1x1_7(dec2_out), scale_factor=2, mode="linear")
        out = self.dec1_nl(dec1)

        return out

When I try to run it on the workstation’s CPU, I get this error:

g=G()
z= torch.randn((1,1,16000))
o= g(z)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-129-2d7f710cef1a> in <module>
      1 g=G()
      2 z= torch.randn((1,1,16000))
----> 3 o= g(z)
      4 out.shape

/home/amm-er/ahd/anaconda3/envs/amustenv/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

<ipython-input-128-de262c205903> in forward(self, x)
    129         enc1= self.enc1(self.pad1(x))
    130         enc1_out= self.enc1_nl(enc1)
--> 131         enc2= self.enc2(self.pad2(enc1_out))
    132         enc2_out= self.enc2_nl(enc2)
    133         enc3= self.enc3(self.pad3(enc2_out))

/home/amm-er/ahd/anaconda3/envs/amustenv/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

<ipython-input-65-68e726f6c853> in forward(self, *args)
     62     def forward(self, *args):
     63         self._update_u_v()
---> 64         return self.module.forward(*args)

/home/amm-er/ahd/anaconda3/envs/amustenv/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    198                             _single(0), self.dilation, self.groups)
    199         return F.conv1d(input, self.weight, self.bias, self.stride,
--> 200                         self.padding, self.dilation, self.groups)
    201 
    202 

RuntimeError: code is too big

On the GPU, everything works fine!
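For reference, the sequence lengths through the encoder can be checked by hand, which helps rule out a shape mismatch as the cause. The following is only a sketch: it assumes the usual Conv1d output-length formula with dilation 1, applied after a reflection pad of 15 on each side, and the helper names `conv1d_out_len` and `encoder_lengths` are made up for illustration.

```python
def conv1d_out_len(length, kernel_size=32, stride=2, pad=15):
    """Output length of ReflectionPad1d(pad) followed by Conv1d(padding=0, dilation=1)."""
    return (length + 2 * pad - kernel_size) // stride + 1


def encoder_lengths(length, num_layers=7):
    """Sequence length after each of the encoder layers."""
    lengths = []
    for _ in range(num_layers):
        length = conv1d_out_len(length)
        lengths.append(length)
    return lengths


# Input length used in the failing call above.
print(encoder_lengths(16000))  # [8000, 4000, 2000, 1000, 500, 250, 125]
# Input length implied by the first shape comment in the model.
print(encoder_lengths(16384))  # [8192, 4096, 2048, 1024, 512, 256, 128]
```

Each layer halves the length exactly for both inputs, so doubling in the decoder lines up with the skip connections, and the shapes themselves do not explain the error.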

andrewpatterson2018 · August 12, 2019, 1:27pm

It’s hard to say without looking at the rest of the code, but maybe try reducing the batch size? Maybe even try reducing the size of your dataset?

Try monitoring it on the GPU using the command nvidia-smi, or monitor your CPU/memory usage when you start it.
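For the CPU side, one way to check memory from inside the Python process is the standard-library `resource` module. This is a minimal sketch, assuming a Unix-like system (the module is not available on Windows); the helper name `peak_rss_mb` is made up for illustration, and the `bytearray` allocation is only a stand-in for running the model.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
buffer = bytearray(64 * 1024 * 1024)  # stand-in for the forward pass
after = peak_rss_mb()
print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

If the peak stays far below the machine's RAM when the error is raised, memory exhaustion is an unlikely explanation.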

Ahmed_m · August 12, 2019, 1:31pm

Actually, the batch size I use for running on CPU is equal to 1. “nvidia-smi” provides information about memory utilization by the GPU, which is not relevant to this problem.

Do you at least have any idea what the error “code is too big” refers to?

andrewpatterson2018 · August 12, 2019, 2:00pm

What do you mean when you say “map the trained model to the CPU”?

Ahmed_m · August 12, 2019, 2:16pm

Assume you have an nn module which was trained and saved to a .pkl file.

For inference, you need to load this model and decide whether you would like to run it on GPU or CPU. This is done by setting the argument “map_location” of “torch.load” to “cpu” for running on the CPU.
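A minimal sketch of that loading step follows. It makes a few assumptions: it uses a tiny stand-in `nn.Conv1d` rather than the generator from this thread, it saves a `state_dict` rather than the whole module, and it writes to a temporary file named `model.pkl`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in model (hypothetical; not the generator from this thread).
model = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    torch.save(model.state_dict(), path)
    # map_location="cpu" remaps all stored tensors to the CPU,
    # whichever device they were saved from.
    state = torch.load(path, map_location="cpu")

restored = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3)
restored.load_state_dict(state)
restored.eval()

with torch.no_grad():
    out = restored(torch.randn(1, 1, 16))
print(out.shape)
```

Here the file is written from the CPU as well, so the remapping is a no-op; the same call applies unchanged to a file saved from a GPU.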

I think if you could give me any info about this error message then I would understand the issue better.

andrewpatterson2018 · August 12, 2019, 2:32pm

To be honest, it’s difficult to reproduce, so I couldn’t give a definitive answer. My guess is that you’re training on the GPU and then serialising the model in an unexpected way, such that it doesn’t work when you try to reload the model into a CPU backend.

To check this, how are you saving the model?

Ahmed_m · August 12, 2019, 3:10pm

It is the normal save/load of modules. I have already posted on GitHub, and it seems to be an internal issue.