Small Conv1d network crashes on CPU

Hello,

I get a crash when running a small Conv1d network on the CPU. Here is a code example that crashes:

import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
from torch.autograd import Variable

BATCH = 64
def get_xy():
    X = np.random.randn(BATCH, 50)
    Y = np.random.randn(BATCH, 1)
    return X, Y

use_cuda = False
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

class CNNRet(nn.Module):
    def __init__(self):
        super(CNNRet, self).__init__()
        self.cnn1 = nn.Conv1d(in_channels=1, out_channels=5, kernel_size=10, stride=5)
        self.cnn2 = nn.Conv1d(in_channels=5, out_channels=3, kernel_size=5, stride=2)
        self.lin = nn.Linear(9, 1)
    
    def forward(self, x):
        print x
        x = torch.unsqueeze(x, 1)
        x = self.cnn1(x)
        x = F.relu(x)
        x = self.cnn2(x)
        x = F.relu(x)
        x = x.view(-1, 9)
        x = self.lin(x)
        return x

model = CNNRet()
if use_cuda:
    model.cuda()
    
def get_loss():
    X, Y = get_xy()
    X = X.astype(np.float32)
    Y = Y.astype(np.float32)
    X = torch.from_numpy(X)
    Y = torch.from_numpy(Y)
    if use_cuda:
        X = X.cuda()
        Y = Y.cuda()
    X = Variable(X, requires_grad=False)
    Y = Variable(Y, requires_grad=False)
    p = model(X)
    loss = F.mse_loss(p, Y)
    return loss
    
def optimize_model(opt):
    opt.zero_grad()
    loss = get_loss()
    loss.backward()
    opt.step(get_loss)
    return loss

optimizer = optim.RMSprop(model.parameters(), lr=0.001, momentum=0.9)
mloss = 0.0
for it in xrange(10000):
    loss = optimize_model(optimizer)
    mloss = loss * 0.01 + 0.99 * mloss
    if it % 1000 == 999:
        print it, mloss

The error:
PC: @ 0x4a6f8a PyErr_Occurred
*** SIGSEGV (@0x48) received by PID 10298 (TID 0x7f98c5d50700) from PID 72; stack trace: ***
@ 0x7f994bdd3390 (unknown)
@ 0x4a6f8a PyErr_Occurred
@ 0x7f993cc73b27 _array_dealloc_buffer_info
@ 0x7f993cc2615e array_dealloc
@ 0x7f98ffeca8d5 NumpyArrayAllocator::free()
@ 0x7f98f3a2844e THFloatStorage_free
@ 0x7f98f3a3ebd7 THFloatTensor_free
@ 0x7f98f329e029 thpp::THTensor<>::~THTensor()
@ 0x7f98fff00ae1 torch::autograd::ConvBackward::releaseVariables()
@ 0x7f98ffed2d8b torch::autograd::Engine::evaluate_function()
@ 0x7f98ffed3573 torch::autograd::Engine::thread_main()
@ 0x7f98ffef7417 PythonEngine::thread_main()
@ 0x7f98ffed777a std::thread::_Impl<>::_M_run()
@ 0x7f9900cd7a20 execute_native_thread_routine
@ 0x7f994bdc96ba start_thread
@ 0x7f994baff3dd clone
@ 0x0 (unknown)

The strange thing is that this code doesn’t crash on the GPU, but the prediction p has a cyclic structure (if you plot it).

The code doesn’t crash on my machine with an install from master. I’d suggest two things to try:

  1. change mloss = loss * 0.01 + 0.99 * mloss to mloss = loss.data[0] * 0.01 + 0.99 * mloss (see the sketch below)
  2. update to master

Could you try and see if these solve the issue?
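
For reference, here is a minimal sketch of the first suggestion (same Python 2 / torch 0.2 Variable-style API as your snippet). Accumulating the running average from loss.data[0], which is a plain Python float, instead of from the loss Variable keeps Python from holding a reference to the autograd graph of every iteration:

mloss = 0.0
for it in xrange(10000):
    loss = optimize_model(optimizer)
    # loss.data[0] is a plain float, so the graph behind loss can be freed
    mloss = loss.data[0] * 0.01 + 0.99 * mloss
    if it % 1000 == 999:
        print it, mloss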

I just installed PyTorch from http://pytorch.org/.

Using:
pip install http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl
pip install torchvision

It didn’t help (I also changed the mloss line as suggested).

Maybe it was fixed somewhere between 0.2 and master. Sorry that I can’t help more if I can’t reproduce it on master.

Building from source is easy though! You might want to try it.

I built it from the GitHub repo and still get the same error. Could it be that I’m using a different Python version (2.7) than you?

The code also crashed on my machine (on the CPU).
It seems the segmentation fault is thrown when calling loss.backward().
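
For anyone who wants to confirm that, here is a minimal sketch using the faulthandler module (built into Python 3, and available for 2.7 as the faulthandler package on PyPI); it dumps the Python tracebacks of all threads when SIGSEGV is received. This is a hypothetical diagnostic snippet, not part of the original repro:

import faulthandler
faulthandler.enable()   # install handlers for SIGSEGV and friends

loss = get_loss()
loss.backward()         # if this is the crashing call, the dumped traceback points here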

I moved the loss.backward() call into get_loss and the segmentation fault disappeared.
Unfortunately, I have no idea why you cannot return the loss and call backward() on it afterwards.

Besides that, I would remove the get_loss argument from opt.step(), since you don’t need a closure with RMSprop.
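
Roughly, a sketch of both changes together (same Variable-style API as the original snippet, CPU-only branch for brevity; treat it as an outline rather than a verified fix):

def get_loss():
    X, Y = get_xy()
    X = Variable(torch.from_numpy(X.astype(np.float32)), requires_grad=False)
    Y = Variable(torch.from_numpy(Y.astype(np.float32)), requires_grad=False)
    p = model(X)
    loss = F.mse_loss(p, Y)
    loss.backward()   # backward is called here, inside get_loss
    return loss

def optimize_model(opt):
    opt.zero_grad()
    loss = get_loss()
    opt.step()        # no closure argument; RMSprop does not need one
    return loss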