Small Conv1d network crashes on CPU

Hello,

I get a crash when running a small Conv1d network on the CPU. Here is a code example that crashes:

import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
from torch.autograd import Variable

BATCH = 64
def get_xy():
    X = np.random.randn(BATCH, 50)
    Y = np.random.randn(BATCH, 1)
    return X, Y

use_cuda = False
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

class CNNRet(nn.Module):
    def __init__(self):
        super(CNNRet, self).__init__()
        self.cnn1 = nn.Conv1d(in_channels=1, out_channels=5, kernel_size=10, stride=5)
        self.cnn2 = nn.Conv1d(in_channels=5, out_channels=3, kernel_size=5, stride=2)
        self.lin = nn.Linear(9, 1)
    
    def forward(self, x):
        print x
        x = torch.unsqueeze(x, 1)
        x = self.cnn1(x)
        x = F.relu(x)
        x = self.cnn2(x)
        x = F.relu(x)
        x = x.view(-1, 9)
        x = self.lin(x)
        return x

model = CNNRet()
if use_cuda:
    model.cuda()
    
def get_loss():
    X, Y = get_xy()
    X = X.astype(np.float32)
    Y = Y.astype(np.float32)
    X = torch.from_numpy(X)
    Y = torch.from_numpy(Y)
    if use_cuda:
        X = X.cuda()
        Y = Y.cuda()
    X = Variable(X, requires_grad=False)
    Y = Variable(Y, requires_grad=False)
    p = model(X)
    loss = F.mse_loss(p, Y)
    return loss
    
def optimize_model(opt):
    opt.zero_grad()
    loss = get_loss()
    loss.backward()
    opt.step(get_loss)
    return loss

optimizer = optim.RMSprop(model.parameters(), lr=0.001, momentum=0.9)
mloss = 0.0
for it in xrange(10000):
    loss = optimize_model(optimizer)
    mloss = loss * 0.01 + 0.99 * mloss
    if it % 1000 == 999:
        print it, mloss

The error:
PC: @ 0x4a6f8a PyErr_Occurred
*** SIGSEGV (@0x48) received by PID 10298 (TID 0x7f98c5d50700) from PID 72; stack trace: ***
@ 0x7f994bdd3390 (unknown)
@ 0x4a6f8a PyErr_Occurred
@ 0x7f993cc73b27 _array_dealloc_buffer_info
@ 0x7f993cc2615e array_dealloc
@ 0x7f98ffeca8d5 NumpyArrayAllocator::free()
@ 0x7f98f3a2844e THFloatStorage_free
@ 0x7f98f3a3ebd7 THFloatTensor_free
@ 0x7f98f329e029 thpp::THTensor<>::~THTensor()
@ 0x7f98fff00ae1 torch::autograd::ConvBackward::releaseVariables()
@ 0x7f98ffed2d8b torch::autograd::Engine::evaluate_function()
@ 0x7f98ffed3573 torch::autograd::Engine::thread_main()
@ 0x7f98ffef7417 PythonEngine::thread_main()
@ 0x7f98ffed777a std::thread::_Impl<>::_M_run()
@ 0x7f9900cd7a20 execute_native_thread_routine
@ 0x7f994bdc96ba start_thread
@ 0x7f994baff3dd clone
@ 0x0 (unknown)

The strange thing is that this code doesn’t crash on the GPU, but the prediction p has a cyclic structure (if you plot it).

The code doesn’t crash on my machine with an install from master. I’d suggest two things to try:

  1. change mloss = loss * 0.01 + 0.99 * mloss to mloss = loss.data[0] * 0.01 + 0.99 * mloss (see the sketch below)
  2. update to master

Could you try and see if these solve the issue?
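
For reference, here is a minimal sketch of the first suggestion (same Python 2 / torch 0.2 Variable-style API as your snippet). Accumulating the running average from loss.data[0], which is a plain Python float, instead of from the loss Variable keeps Python from holding a reference to the autograd graph of every iteration:

mloss = 0.0
for it in xrange(10000):
    loss = optimize_model(optimizer)
    # loss.data[0] is a plain float, so the graph behind loss can be freed
    mloss = loss.data[0] * 0.01 + 0.99 * mloss
    if it % 1000 == 999:
        print it, mloss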

I just installed PyTorch from http://pytorch.org/.

Using:
pip install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl
pip install torchvision

It didn’t help (I also changed the mloss line as suggested).

Maybe it was fixed somewhere between 0.2 and master. Sorry that I can’t help more if I can’t reproduce it on master.

Building from source is easy though! You might want to try it.

I built it from the GitHub repo and still get the same error. Could it be that I’m using a different Python version (2.7) than you?

The code also crashed on my machine (on the CPU).
It seems the segmentation fault is thrown when calling loss.backward().
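
For anyone who wants to confirm that, here is a minimal sketch using the faulthandler module (built into Python 3, and available for 2.7 as the faulthandler package on PyPI); it dumps the Python tracebacks of all threads when SIGSEGV is received. This is a hypothetical diagnostic snippet, not part of the original repro:

import faulthandler
faulthandler.enable()   # install handlers for SIGSEGV and friends

loss = get_loss()
loss.backward()         # if this is the crashing call, the dumped traceback points here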

I moved the loss.backward() call into get_loss and the segmentation fault disappeared.
Unfortunately, I have no idea why you cannot return the loss and call backward() on it afterwards.

Besides that, I would remove the get_loss argument from opt.step(), since you don’t need a closure with RMSprop.
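
Roughly, a sketch of both changes together (same Variable-style API as the original snippet, CPU-only branch for brevity; treat it as an outline rather than a verified fix):

def get_loss():
    X, Y = get_xy()
    X = Variable(torch.from_numpy(X.astype(np.float32)), requires_grad=False)
    Y = Variable(torch.from_numpy(Y.astype(np.float32)), requires_grad=False)
    p = model(X)
    loss = F.mse_loss(p, Y)
    loss.backward()   # backward is called here, inside get_loss
    return loss

def optimize_model(opt):
    opt.zero_grad()
    loss = get_loss()
    opt.step()        # no closure argument; RMSprop does not need one
    return loss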