Backward hangs on Torch for CPU

A very small model with two convolutions hangs on the loss.backward() call.

  (down_path): ModuleList(
    (0): UNetConvBlock(
      (block): Conv2d(3, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (last): Conv2d(2, 13, kernel_size=(1, 1), stride=(1, 1))

Snippet from the stack trace in GDB:

#0  0x00007fca0afed99f in __GI___poll (fds=0x7fc9ebbd0040, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:29
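
A Python-level traceback can complement the native frame above. The following is a minimal sketch using the standard-library faulthandler module; the helper names and the 30-second default are assumptions for illustration and are not part of the original script.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=30.0, stream=sys.stderr):
    """Dump every thread's Python traceback if still running after timeout_s seconds."""
    # stream must be a real file object with a file descriptor (fileno()).
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=stream, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump once the suspect call has returned."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_hang_watchdog() right before loss.backward() and disarm_hang_watchdog() right after it would print the Python stack of each thread if the call does not return in time.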

It reproduces only on CPU; the GPU run works fine.

The batch size also matters: there is no hang with batch size = 1, but the script freezes with batch size = 4.
Profiling the script with different batch sizes shows that different implementations are called:
No hang with _slow_conv2d_forward and aten::_slow_conv2d_backward.
The hang happens with aten::mkldnn_convolution and aten::convolution_backward.
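
One way to see which convolution kernels get dispatched is to profile a single forward/backward pass and list the recorded op names. This is only a sketch under assumptions: the two-layer model is a stand-in for the UNet in the reproducer, the 32x32 input size is arbitrary, and conv_ops is a hypothetical helper. It uses torch.backends.mkldnn.flags to toggle the MKLDNN backend for comparison; this is a diagnostic aid, not a confirmed fix.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile


def conv_ops(batch_size, use_mkldnn):
    """Return the convolution-related op names recorded for one forward/backward pass."""
    # Stand-in for the two convolutions in the report (assumption, not the real UNet).
    model = nn.Sequential(
        nn.Conv2d(3, 2, kernel_size=3, padding=1),
        nn.Conv2d(2, 13, kernel_size=1),
    )
    inputs = torch.randn(batch_size, 3, 32, 32)  # arbitrary spatial size
    # Temporarily enable or disable the MKLDNN (oneDNN) backend.
    with torch.backends.mkldnn.flags(enabled=use_mkldnn):
        with profile(activities=[ProfilerActivity.CPU]) as prof:
            model(inputs).sum().backward()
    return sorted({evt.key for evt in prof.key_averages() if "conv" in evt.key})


if __name__ == "__main__":
    # With MKLDNN disabled, no mkldnn_convolution entry should appear.
    print(conv_ops(batch_size=4, use_mkldnn=False))
```

Note that calling conv_ops(4, True) on an affected machine may itself hang, so it is safer to start with use_mkldnn=False.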

Snapshot of reproducing script:

import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader

# MockDataset and UNet are defined in the full reproducer (see the gist).

def main():
    train_set = MockDataset()
    train_loader = DataLoader(train_set, batch_size=4, num_workers=1, drop_last=True)
    model = UNet(n_classes=13)
    device = 'cpu'
    optimizer = SGD(model.parameters(), lr=1e-3)
    for epoch in range(1):
        for step, batch_data in enumerate(train_loader):
            inputs = batch_data[0].to(device)
            labels = batch_data[1].to(device)
            outputs = model(inputs)
            print(outputs.shape, labels.shape)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            print(f"\nBefore HANG {loss}\n")
            loss.backward()  # the call that hangs on the affected CPU
            print("\nAFTER HANG\n")

CPU: Intel(R) Core™ i9-10980XE CPU @ 3.00GHz (can’t reproduce on Intel(R) Core™ i9-10920X CPU @ 3.50GHz)
Ubuntu 20.04
Python 3.8.10

minimal reproducer for hang on backward · GitHub

Thanks for reporting this issue! Could you create another issue on GitHub so that MKL devs could take a look at it and try to reproduce it?
I’ve checked a few nodes I can lease but couldn’t find any with an i9-10980XE CPU and thus cannot reproduce the hang.

Thank you for trying!
I’ve created the issue on GitHub: The backward call hangs on Torch for CPU · Issue #91547 · pytorch/pytorch · GitHub
I didn’t mention earlier that the issue reproduces inside a Docker container when the Python script is launched this way:

sh -c "( > log.txt 2>&1; wait)" >&- 2>&- &
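
Because the launch command above closes stdout and stderr of the outer shell (>&- and 2>&-), it can help to record which standard descriptors are actually open when the script starts. A minimal sketch, with hypothetical helper names that are not part of the original script:

```python
import os


def fd_is_open(fd):
    """Return True if the file descriptor refers to an open file."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


def describe_std_fds():
    """Map stdin/stdout/stderr to whether each descriptor is open in this process."""
    return {name: fd_is_open(fd) for name, fd in (("stdin", 0), ("stdout", 1), ("stderr", 2))}
```

Writing the result of describe_std_fds() to a log file at startup would show whether the process inherited closed descriptors.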

Content of the

#!/bin/bash -xe
. venv/bin/activate

I’ve added this to the issue description.