Backward hangs on Torch for CPU

NikolayL · December 30, 2022, 5:43pm

A very tiny model with two convolutions hangs on the loss.backward() call.

UNet(
  (down_path): ModuleList(
    (0): UNetConvBlock(
      (block): Conv2d(3, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
  )
  (last): Conv2d(2, 13, kernel_size=(1, 1), stride=(1, 1))
)

Snippet from the stack trace in GDB:

#0  0x00007fca0afed99f in __GI___poll (fds=0x7fc9ebbd0040, nfds=1, timeout=5000) at ../sysdeps/unix/sysv/linux/poll.c:29

It reproduces only with the CPU build of torch; the GPU build works fine.

Batch size also matters: there is no hang with batch size = 1, but the script freezes with batch size = 4.
Profiling the script with each batch size shows that different convolution implementations are called:
No hang: aten::_slow_conv2d_forward and aten::_slow_conv2d_backward.
Hang: aten::mkldnn_convolution and aten::convolution_backward.

Snippet from the reproducing script:

import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader

# UNet and MockDataset are defined elsewhere in the full script.

def main():
    train_set = MockDataset()
    train_loader = DataLoader(train_set, batch_size=4, num_workers=1, drop_last=True)
    model = UNet(n_classes=13)
    print(model)
    device = 'cpu'
    model.to(device)
    optimizer = SGD(model.parameters(), lr=1e-3)
    for epoch in range(1):
        model.train()
        for step, batch_data in enumerate(train_loader):
            inputs = batch_data[0].to(device)
            labels = batch_data[1].to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            print(outputs.shape, labels.shape)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            print(f"\nBefore HANG {loss}\n")
            loss.backward()  # hangs here with batch_size=4 on the affected CPU
            print("\nAFTER HANG\n")

CPU: Intel(R) Core™ i9-10980XE CPU @ 3.00GHz (can’t reproduce on Intel(R) Core™ i9-10920X CPU @ 3.50GHz)
Ubuntu 20.04
Python 3.8.10
torch==1.13.1+cpu
torchvision==0.14.1+cpu

minimal reproducer for hang on backward · GitHub

ptrblck · December 30, 2022, 9:56pm

Thanks for reporting this issue! Could you create another issue on GitHub so that MKL devs could take a look at it and try to reproduce it?
I’ve checked a few nodes I can lease but couldn’t find any with an i9-10980XE CPU and thus cannot reproduce the hang.

NikolayL · December 31, 2022, 11:11am

Thank you for trying!
I’ve created the issue on GitHub: The backward call hangs on Torch for CPU · Issue #91547 · pytorch/pytorch · GitHub
I didn’t mention that the issue reproduces inside Docker when the Python script is launched this way:

sh -c "(script.sh > log.txt 2>&1; wait)" >&- 2>&- &

Contents of script.sh:

#!/bin/bash -xe
. venv/bin/activate
python3 main.py

I’ve added this to the issue description.

NikolayL · March 16, 2023, 10:31am

The reason for the hang is described here: The backward call hangs on Torch for CPU · Issue #91547 · pytorch/pytorch · GitHub
OMP_THREAD_LIMIT should be set to a number equal to or larger than OMP_NUM_THREADS to avoid the hang.
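
For anyone hitting the same thing, here is a minimal sketch of one way to enforce that rule from Python before torch is imported. The helper name ensure_thread_limit is hypothetical (it is not part of torch or OpenMP), and the sketch assumes both variables hold plain integers; it leaves the environment untouched otherwise. OpenMP runtimes typically read these variables once at initialization, so the adjustment is assumed to be needed before the import.

```python
import os


def ensure_thread_limit(env):
    """Return a copy of env where OMP_THREAD_LIMIT >= OMP_NUM_THREADS.

    Hypothetical helper for illustration. It only acts when both variables
    are present and parse as single integers; anything else (unset values,
    comma-separated lists) is returned unchanged.
    """
    fixed = dict(env)
    try:
        num_threads = int(fixed["OMP_NUM_THREADS"])
        limit = int(fixed["OMP_THREAD_LIMIT"])
    except (KeyError, ValueError):
        return fixed
    if limit < num_threads:
        # Raise the limit rather than lowering the requested thread count.
        fixed["OMP_THREAD_LIMIT"] = str(num_threads)
    return fixed


if __name__ == "__main__":
    # Assumption: this runs before torch is imported, so the OpenMP runtime
    # sees the corrected values when it initializes.
    os.environ.update(ensure_thread_limit(os.environ))
    # import torch  # import only after the environment is consistent
```

Raising the limit is a choice made for this sketch; lowering OMP_NUM_THREADS to match the limit would satisfy the same condition. Exporting consistent values in script.sh before launching python3 should have the same effect.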