Hi,
I have a machine with 2 GPUs, and I am using mp.spawn to create two subprocesses that run DDP evaluation. In the parent process I create the model on the CPU; each spawned subprocess then moves its copy of the model to the GPU matching its rank. However, I run into an illegal memory access error at the second iteration in the second subprocess.
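To make the setup concrete before the full snippet below, this is the shape of the pattern I am describing (condensed; nn.Linear stands in here for my custom model):

import torch
import torch.multiprocessing as mp

def evaluate_ddp(rank, world_size, model):
    # each spawned worker receives its own pickled copy of the CPU model
    # and moves that copy to the GPU matching its rank
    model = model.to(rank)
    print(next(model.parameters()).device)  # cuda:0 on rank 0, cuda:1 on rank 1

if __name__ == '__main__':
    world_size = 2                 # one worker per GPU
    model = torch.nn.Linear(8, 8)  # stand-in for the CPU model
    mp.spawn(evaluate_ddp, args=(world_size, model),
             nprocs=world_size, join=True)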
Environment
NVIDIA-SMI 455.32.00, Driver Version: 455.32.00, CUDA Version: 11.1
GPU 0: GeForce GTX 1080, 8119 MiB (2 MiB in use, 0% util)
GPU 1: GeForce GTX 1080, 8119 MiB (19 MiB in use, 0% util)
No processes were running on either GPU.
Name: torch
Version: 1.9.1+cu111
Code Snippet
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torchvision import datasets, transforms


def evaluate_ddp(rank, world_size, model):
    print('world size', world_size)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12328"
    # create default process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    batch_size = 64
    num_workers = 4
    imagenet_dir = '/usr/local/workspace/datasets/imagenet/ILSVRC2012_PyTorch/'
    val_dir = os.path.join(imagenet_dir, 'val')
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    val_set = datasets.ImageFolder(
        val_dir,
        transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))
    # val_sampler = DistributedSampler(dataset=val_set)
    val_loader = torch.utils.data.DataLoader(val_set,
                                             batch_size=batch_size,
                                             shuffle=False,
                                             num_workers=num_workers,
                                             # sampler=val_sampler,
                                             pin_memory=True)
    # model = torchvision.models.resnet18(pretrained=True)
    # model.metric = metric
    # print(f'rank {rank}', next(model.parameters()).device)
    model = model.to(rank)  # move this worker's copy of the model to cuda:<rank>
    # model = DDP(model, device_ids=[rank])
    # model.eval()
    group = dist.new_group([0, 1])
    with torch.no_grad():
        for i, (images, target) in enumerate(val_loader):
            # compute output
            images = images.to(rank, non_blocking=False)
            target = target.to(rank, non_blocking=False)
            print(f"image device: {images.device}, target device: {target.device}, "
                  f"model device: {next(model.parameters()).device}")
            _ = model(images)
            print_freq = 10
            if i % print_freq == 0:
                print(f"batch {i}, rank {rank}")
    print(f'Done rank: {rank}')
    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = 2      # one process per GPU
    model = MyModel()   # model is created on the CPU in the parent process
    mp.spawn(evaluate_ddp, args=(world_size, model), nprocs=world_size, join=True)
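For completeness, the DistributedSampler variant that is commented out in the snippet would look like this inside evaluate_ddp (reusing the same val_set, batch_size, etc.; num_replicas and rank are optional and default to the current process group, I am just writing them out explicitly):

from torch.utils.data.distributed import DistributedSampler

# expanded form of the commented-out sampler lines: each rank would
# then iterate over its own shard of the validation set
val_sampler = DistributedSampler(dataset=val_set,
                                 num_replicas=world_size,
                                 rank=rank,
                                 shuffle=False)
val_loader = torch.utils.data.DataLoader(val_set,
                                         batch_size=batch_size,
                                         shuffle=False,  # the sampler does the sharding
                                         num_workers=num_workers,
                                         sampler=val_sampler,
                                         pin_memory=True)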
Terminal output
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:1, target device: cuda:1, model device: cuda:1
THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=278 error=700 : an illegal memory access was encountered
batch 0, rank 0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
image device: cuda:0, target device: cuda:0, model device: cuda:0
Traceback (most recent call last):
File "minimal_example.py", line 81, in <module>
mp.spawn(evaluate_ddp, args=(world_size, model), nprocs=world_size, join=True)
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/usr/local/workspace/aimet_ddp/minimal_example.py", line 68, in evaluate_ddp
_ = model(images)
File "/usr/local/lib/python3.8/dist-packages/torch/fx/graph_module.py", line 513, in wrapped_call
raise e.with_traceback(None)
RuntimeError: CUDA error: an illegal memory access was encountered
What’s weird is that this seems to be model dependent: I don’t hit any issues when running torchvision.models.resnet18, but the problem appears as soon as I switch to my own model, which contains a custom op. I would appreciate it if you could shed some light on this (e.g. how to debug it further, or how to implement a custom model that supports DDP).
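The only debugging step I know of so far is forcing synchronous kernel launches so the traceback stops at the kernel that actually faults rather than at a later sync point, along these lines:

# must be set before torch is imported / before any CUDA context exists
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the env var is set

If there is a better way to localize this kind of error, please let me know.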
Thank you in advance!