Hi,
I am writing a training harness from scratch for work that involves iterative pruning, which uses DDP to train at each pruning level.
tl;dr
SIGTERM/SIGSEGV while running inference during a DDP run with a model that has been torch.compile'd.
Exact error:
W0526 00:24:17.229000 22419848091456 torch/multiprocessing/spawn.py:145] Terminating process PROCESS_ID via signal SIGTERM
Traceback (most recent call last):
  File "/harness.py", line 346, in <module>
    mp.spawn(main, args=(init_model, world_size, threshold, i), nprocs=world_size, join=True)
  File "torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "torch/multiprocessing/spawn.py", line 169, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
Details:
To describe my setup:
- Model: ResNet-18/50, with a per-layer mask (same shape as its weight parameter) kept as a buffer; a minimal sketch of how the masks are registered follows this list.
- Batch-Size: 256 (each)
- Autocast: BFloat16
- Dataset: ImageNet (224x224)
- FFCV for dataloading (but for all intents and purposes this shouldn't matter)
- torch.compile (mode='reduce-overhead') applied before passing the model to DDP, as below:

self.model = torch.compile(self.model, mode='reduce-overhead')
self.model = DDP(self.model, device_ids=[self.gpu_id])

- Hardware: 2x A100 GPUs (80 GB) with abundant memory/CPU resources.
- num_workers (if relevant): 8

Tried with:
- Different PyTorch versions: 2.0.1, 2.1, 2.2.1, 2.2.2, 2.3
- CUDA: 11.8 and 12.1
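For context, this is roughly how the per-layer masks are kept as buffers (a simplified sketch; the helper and buffer names here are illustrative, not my exact code):

import torch
import torch.nn as nn

def register_masks(model: nn.Module):
    # One mask buffer per weight, same shape as the weight parameter.
    # Buffers so they move with .to(device) and are broadcast by DDP.
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            module.register_buffer("weight_mask", torch.ones_like(module.weight))

def apply_masks(model: nn.Module):
    # Zero out pruned weights in-place.
    with torch.no_grad():
        for module in model.modules():
            if hasattr(module, "weight_mask"):
                module.weight.mul_(module.weight_mask)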
I launch training as follows:
for i, threshold in enumerate(thresholds):
    mp.spawn(main, args=(init_model, world_size, threshold, i), nprocs=world_size, join=True)
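For completeness, each spawned rank does roughly the following in main() (a simplified sketch of the per-rank setup using the compile + DDP wrapping shown above; everything else in my actual main is omitted):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(rank, init_model, world_size, threshold, level):
    # mp.spawn prepends the process rank to the args tuple
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = init_model.to(rank)
    # prune to `threshold` for this level (details omitted), then wrap:
    model = torch.compile(model, mode='reduce-overhead')
    model = DDP(model, device_ids=[rank])

    # ... train this pruning level, then run the test loop below ...

    dist.destroy_process_group()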
I can make no sense of this issue:
At a random point (but consistently across training attempts), DDP throws a SIGSEGV and exits with a SIGTERM, e.g. after 33/97 test batches have been evaluated, or after 10/97, etc.
I am at a loss as to why this is happening: plenty of GPU VRAM, host memory, etc. is available, and I do not get any more information; the process simply exits.
EDIT
Additional context: when I run only 1-2 batches of training, break, and then run the test loop, it completes as expected. The exception only occurs when a full epoch of training is run and I subsequently run inference.
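To illustrate the reproduction condition (a simplified sketch of my per-level loop; the helper name is illustrative, not my exact code):

def run_level(self):
    for batch_idx, (inputs, targets) in enumerate(self.train_loader):
        self._train_step(inputs, targets)  # illustrative helper
        # if batch_idx == 1:
        #     break                        # with this early break, test() completes fine
    # after a full training epoch, the test loop below segfaults part-way through
    return self.test()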
I have done the following:
- Experimented with the aforementioned PyTorch/CUDA versions.
- Machines: GCP and a local cluster (same hardware setup)
My test function is very simple:
def test(self):
    self.model.eval()
    test_loss = 0
    correct = 0
    total = 0
    tloader = tqdm.tqdm(self.test_loader, desc='Testing')
    with torch.no_grad():
        for inputs, targets in tloader:
            with autocast(dtype=torch.bfloat16):
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)
            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    test_loss /= len(self.test_loader)
    accuracy = 100. * correct / total
    return test_loss, accuracy
When I run this without torch.compile, everything runs fine.
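That is, the only change in the working configuration is dropping the compile call:

self.model = DDP(self.model, device_ids=[self.gpu_id])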