CPU sleeping after one iteration of loss.backward()

Hi

I can run the code on one machine, but it doesn't work on another machine with the same PyTorch version, 0.4.1.

All the CPUs go into sleep mode after one iteration of loss.backward(); there are no errors, and the program just gets stuck.

The code looks like this:

for i,batchdata in enumerate(train_loader):
    optimizer.zero_grad()
    x,y=batchdata
    y_pred=model(x)
    print(x)
    print(len(x))
    print(y_pred)
    loss_train=criterion(y_pred,y)
    print(loss_train)
    loss_train.backward(retain_graph=True)
    print('done batch',i)
    optimizer.step()
    print("Finished Step")
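
If it helps narrow things down, one thing I could try is dumping the Python stacks while the program is stuck, using the standard-library faulthandler module. This is just a sketch; I'm not sure it will show anything useful if the hang is inside native code:

import faulthandler
import sys

# periodically dump the stack of every thread to stderr,
# so there is a traceback to inspect once the program hangs
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=sys.stderr)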

Any idea about what might be happening here? Many thanks in advance.

Your code looks ok.
Could you provide a minimal (<100 lines of code) example that reproduces this so that we can look into it in more detail?

Hi

Thank you for your quick reply.

For more info:

I'm using a DNN with 2 hidden layers and cross entropy as the loss function; the batch size is 32.

criterion=nn.CrossEntropyLoss()

The output of the code in my post looks like this:

tensor([[0., 0., 0., …, 0., 0., 1.],
[1., 0., 0., …, 0., 1., 1.],
[0., 0., 0., …, 0., 1., 1.],
…,
[1., 0., 0., …, 0., 0., 0.],
[0., 0., 0., …, 0., 0., 1.],
[1., 0., 0., …, 0., 1., 1.]])
32
tensor([[ 0.6209, -0.8086, 0.1675, …, -0.2908, -1.3713, -1.6336],
[ 0.5608, -0.6456, 0.2424, …, -0.4206, -1.1513, -1.7551],
[ 0.5057, -0.9428, 0.2893, …, -0.5379, -1.2654, -1.5686],
…,
[ 0.6501, -0.6999, 0.3204, …, -0.3298, -1.3569, -1.6235],
[ 0.8948, -0.9117, 0.1544, …, -0.3279, -1.1618, -1.6269],
[ 0.7063, -0.7219, 0.2326, …, -0.3535, -1.2034, -1.7179]],
grad_fn=<ThAddmmBackward>)
tensor(5.2543, grad_fn=<NllLossBackward>)
done batch 0

The strange thing is that when I change the batch size to 1, loss_train.backward() doesn't work at all.

The output looks like this if the batch size is 1:

begin batch 0
tensor([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0.]])
1
tensor([[-0.7247, 1.0524, -2.2573, -1.1343, -0.4040, -0.3480, 1.0350, 1.2908,
1.9561, 0.6749, 0.2105, -1.3460, 1.5483, 1.3202, -1.6915, -2.9539,
0.4566, -0.0062, -0.7644, 1.4134, -2.7010, 0.1742, 3.0185, -0.8109,
-0.1226, -0.1662, 1.5913, -1.1074, 1.0465, -0.3631, 0.7734, -1.1649,
-0.7790, -0.4970, -3.5045, 1.2011, -1.0401, -1.7327, 0.3079, 0.1145,
3.1292, -0.2242, -1.2624, 0.1882, 0.7860]],
grad_fn=<ThAddmmBackward>)
tensor(1.6626, grad_fn=<NllLossBackward>)

Hi,

This sounds very weird.
I'm afraid we'll need an example that reproduces this to investigate it further, though, as I have no idea where it could come from.

Hi

Please see below for an example:

from numpy import binary_repr
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as utils
import torch.nn.init as init

class ThreeLayerNet(torch.nn.Module):

    def __init__(self,D_in,H,D_out):
        super(ThreeLayerNet,self).__init__()
        self.linear1=torch.nn.Linear(D_in,H)
        self.linear2=torch.nn.Linear(H,H)
        self.linear3=torch.nn.Linear(H,D_out)

    def forward(self,x):
        h_relu_1=self.linear1(x).clamp(min=0)
        h_relu_2=self.linear2(h_relu_1).clamp(min=0)
        y_pred=self.linear3(h_relu_2)
        return y_pred

def weight_init(m):
    if isinstance(m,nn.Linear):
        init.xavier_normal_(m.weight.data)
        init.normal_(m.bias.data)

if __name__ == '__main__':

    device=torch.device("cpu")
    print('a')
    D_in,D_out=76,45
    H=128
    model=ThreeLayerNet(D_in,H,D_out)
    print('b')
    model.apply(weight_init)
    print('c')
    criterion=nn.CrossEntropyLoss()
    optimizer=torch.optim.Adam(model.parameters(),lr=1e-3)

    x=torch.tensor([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
    y=torch.tensor([2])
    print('d')
    y_pred=model(x)
    print('e')
    loss_train=criterion(y_pred,y)
    print('start bp')
    loss_train.backward(retain_graph=True)
    print('done bp')
    optimizer.step()
    print('done step')

Using this code, I can only get it to print up to 'd', and then it gets stuck.

Hi,

I can actually run this code locally with no issues on the latest PyTorch, all the way to "done step".
Could you try version 1.0, as it might be a bug that has already been resolved?
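
For reference, something like this should show which build each machine is actually running. The set_num_threads call is just a guess on my side: if upgrading doesn't help, limiting the CPU thread count can help rule out a threading issue in the CPU backend.

import torch

# confirm which PyTorch build this machine is running
print(torch.__version__)

# just a guess: limit CPU threads to rule out a threading issue
torch.set_num_threads(1)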