Very tiny model with two convolutions hangs on loss.backward() call.
UNet(
  (down_path): ModuleList(
    (0): UNetConvBlock(
      (block): Conv2d(3, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (last): Conv2d(2, 13, kernel_size=(1, 1), stride=(1, 1))
)
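For completeness, a minimal sketch of model classes consistent with this printout (the original UNet/UNetConvBlock definitions are not shown in the report, so this is an assumption reconstructed from the repr above):

import torch.nn as nn

class UNetConvBlock(nn.Module):
    # Single 3x3 convolution, matching the printed (block) module.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.block(x)

class UNet(nn.Module):
    # Tiny UNet: one conv block plus a 1x1 classifier head, as in the printout.
    def __init__(self, n_classes):
        super().__init__()
        self.down_path = nn.ModuleList([UNetConvBlock(3, 2)])
        self.last = nn.Conv2d(2, n_classes, kernel_size=1)

    def forward(self, x):
        for block in self.down_path:
            x = block(x)
        return self.last(x)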
Snippet from the stack trace captured in GDB:
#0 0x00007fca0afed99f in __GI___poll (fds=0x7fc9ebbd0040, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:29
It reproduces only with the CPU build of torch; the GPU build works fine.
The batch size also matters: there is no hang with batch size = 1, but it freezes with batch size = 4.
Profiling the script with different batch sizes shows that different implementations are called:
no hang with _slow_conv2d_forward and aten::_slow_conv2d_backward;
the hang happens with aten::mkldnn_convolution and aten::convolution_backward.
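For reference, a minimal sketch of how that per-operator breakdown can be collected with torch.profiler; it reuses model, inputs, and labels from the reproducing script below (with batch size 4 the trace will of course stop at the hanging backward):

from torch.profiler import profile, ProfilerActivity

# Profile one forward/backward pass on CPU to see which convolution
# kernels are dispatched (aten::mkldnn_convolution vs. the slow reference path).
with profile(activities=[ProfilerActivity.CPU]) as prof:
    outputs = model(inputs)
    loss = nn.CrossEntropyLoss()(outputs, labels)
    loss.backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))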
Snapshot of the reproducing script:
import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader

def main():
    train_set = MockDataset()  # synthetic dataset, defined elsewhere in the repro
    train_loader = DataLoader(train_set, batch_size=4, num_workers=1, drop_last=True)
    model = UNet(n_classes=13)
    print(model)
    device = 'cpu'
    model.to(device)
    optimizer = SGD(model.parameters(), lr=1e-3)
    for epoch in range(1):
        model.train()
        for step, batch_data in enumerate(train_loader):
            inputs = batch_data[0].to(device)
            labels = batch_data[1].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            print(outputs.shape, labels.shape)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            print(f"\nBefore HANG {loss}\n")
            loss.backward()
            print("\nAFTER HANG\n")
CPU: Intel(R) Core™ i9-10980XE CPU @ 3.00GHz (can’t reproduce on Intel(R) Core™ i9-10920X CPU @ 3.50GHz)
Ubuntu 20.04
Python 3.8.10
torch==1.13.1+cpu
torchvision==0.14.1+cpu