CUDA error: an illegal memory access from tensor.sum()

My PyTorch program crashes seemingly at random, about one minute into training, even though both GPUs have plenty of free memory.

loss1 = loss1.sum() #/ region_mask.sum()
RuntimeError: CUDA error: an illegal memory access was encountered

Traceback (most recent call last):
File "/src/line-crop/line-crop-trainer/line_crop_trainer.py", line 705, in
torch.cuda.empty_cache()
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/memory.py", line 125, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
[W CUDAGuardImpl.h:124] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)

loss1 = self.criterion_line(predicted_line_mask, all_line_mask)
loss1 = loss1.sum()

I added os.environ['CUDA_LAUNCH_BLOCKING'] = "1" as instructed by the crash log of the previous run.

Any suggestions on what is wrong or how to debug?

I added some asserts before the crashing line and removed the singleton dimension from the mask, but it made no difference.

    assert loss1.shape[0] == 16
    assert loss1.shape[1] == 512
    assert loss1.shape[2] == 512

    assert region_mask.shape[0] == 16
    assert region_mask.shape[1] == 1
    assert region_mask.shape[2] == 512
    assert region_mask.shape[3] == 512
    
    loss1 = loss1 * region_mask[:,0,:,:]
    loss1 = loss1.sum() #/ region_mask.sum()

and I still get a crash at

File "/src/line-crop/common-worker-and-trainer/model_line_crop.py", line 155, in common_step
loss1 = loss1.sum() #/ region_mask.sum()
RuntimeError: CUDA error: an illegal memory access was encountered

ChatGPT suggested adding torch.cuda.synchronize() to force the error to be raised at a specific point: if the crash is reported at the synchronize() call, the illegal memory access is likely happening somewhere in the lines of code executed before that synchronization point.

So my new code looks like

    torch.cuda.synchronize() # not 
    
    loss1 = self.criterion_line(predicted_line_mask, all_line_mask) # reduction none, so I can mask
    
    torch.cuda.synchronize() # line 146 in traceback

    assert loss1.shape[0] == 16
    assert loss1.shape[1] == 512
    assert loss1.shape[2] == 512

    assert region_mask.shape[0] == 16
    assert region_mask.shape[1] == 1
    assert region_mask.shape[2] == 512
    assert region_mask.shape[3] == 512
    
    loss1 = loss1 * region_mask[:,0,:,:]
    loss1 = loss1.sum()

where

self.criterion_line = nn.CrossEntropyLoss(reduction='none')

It ran for 185 mini-batches, then:

File "/src/line-crop/common-worker-and-trainer/model_line_crop.py", line 146, in common_step
torch.cuda.synchronize()
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered

You might be setting the blocking-launch env variable too late in your code, so set it in your terminal instead. The illegal memory access is most likely caused by a previous operation. If you have a full CUDA toolkit installed, you might want to rerun your code with compute-sanitizer to check which kernel causes the memory violation.
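
(As an aside, "too late" matters because the variable is typically only picked up if it is in the environment before the CUDA context is initialized. Below is a minimal sketch of the in-code variant, written as an assumption about ordering rather than code from this thread, for cases where it cannot simply be exported in the terminal or baked into the Docker image:)

    # Sketch (assumption, not code from this thread): set the variable at the very
    # top of the entry script, before torch (or anything that imports torch) runs,
    # so it is already in the environment when CUDA gets initialized.
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make kernel launches synchronous

    import torch  # imported only after the environment is prepared

Setting it in the terminal, or as an ENV line in the Dockerfile, avoids the ordering question entirely.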

Thank you for the suggestion.

I use a Docker image derived from

FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime

and set

  • CUDA_LAUNCH_BLOCKING = "1"

No difference.

I also tried setting:

    - TORCH_CUDA_SANITIZER="1"

but it looks like this option and Lightning with DDP do not get along, and I get the error

To use CUDA with multiprocessing, you must use the 'spawn' start method

My model is a simple UNet, so I am not using anything exotic. What does this synchronization error mean, and how could I trigger it intentionally?

You are not running into a synchronization error but a memory violation. In case you get stuck, could you post a minimal and executable code snippet reproducing the issue so that I could try to narrow down the kernel?

I'm not quite sure how a "memory violation" is triggered. How can I intentionally trigger such a memory violation? I am just trying to figure out whether I am hunting for a bug in my code or in some PyTorch library.

Here is the traceback of the simplified code:

torch.cuda.synchronize() # *********** THIS LINE IN TRACEBACK ***********

File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered

The code, with lots of asserts to make sure the data fed into the model is OK:

class SimpleModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.save_hyperparameters(hparams)

        # just to make sure save_hyperparameters will create self.hparams
        assert self.hparams['lr']

        assert ENCODED_CHANNELS==50

        self.img_to_lines_unet = unet.UNet(in_channels=1, out_channels=50, UNET_CHNLS=[32, 64, 128, 256, 512, 1024], UNET_ATTENTION=[False, False, False, False, False], bilinear=True)

        self.conv = nn.Conv2d(in_channels=50, out_channels=4, kernel_size=1)

        self.criterion_line = nn.CrossEntropyLoss(reduction='none')
        self.criterion_line_old = nn.BCEWithLogitsLoss(reduction='none')

    def forward(self, img_in):
        intermediate = self.img_to_lines_unet(img_in)
        return intermediate

    def common_step(self, batch, batch_idx):

        image, region_mask, all_line_mask = batch
        selected_line_mask = None
        selected_line_img_crop = None

        batch_size = image.shape[0]

        assert image.shape[0]==16
        assert image.shape[1]==1
        assert image.shape[2]==512
        assert image.shape[3]==512

        torch.cuda.synchronize()

        intermediate = self(image)

        torch.cuda.synchronize()

        assert intermediate.shape[0]==16
        assert intermediate.shape[1]==50
        assert intermediate.shape[2]==512
        assert intermediate.shape[3]==512

        torch.cuda.synchronize()

        predicted_line_mask = self.conv(intermediate)

        torch.cuda.synchronize()

        assert predicted_line_mask.shape[0]==16
        assert predicted_line_mask.shape[1]==4
        assert predicted_line_mask.shape[2]==512
        assert predicted_line_mask.shape[3]==512

        assert all_line_mask.shape[0]==16
        assert all_line_mask.shape[1]==512
        assert all_line_mask.shape[2]==512

        torch.cuda.synchronize()

        loss1 = self.criterion_line(predicted_line_mask, all_line_mask) # reduction none, so I can mask

        torch.cuda.synchronize() # *********** THIS LINE IN TRACEBACK ***********

        assert loss1.shape[0] == 16
        assert loss1.shape[1] == 512
        assert loss1.shape[2] == 512

        assert region_mask.shape[0] == 16
        assert region_mask.shape[1] == 1
        assert region_mask.shape[2] == 512
        assert region_mask.shape[3] == 512

        loss1 = loss1 * region_mask[:,0,:,:]
        loss1 = loss1.sum()

        return loss1, loss1

    def training_step(self, batch, batch_idx):

        loss, loss1 = self.common_step(batch, batch_idx)

        self.log('train_loss_line', loss1)
        self.log("train_loss", loss)

        return loss

    def validation_step(self, batch, batch_idx):

        loss, loss1 = self.common_step(batch, batch_idx)

        self.log('val_loss_line', loss1)
        self.log('val_loss', loss)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams['lr'])
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=self.hparams['gamma'])

        return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "val_loss"}

You cannot manually trigger a memory violation as it’s caused by an invalid read or write instruction in the kernel. You could of course write a broken CUDA kernel, but this doesn’t help in isolating the issue.
Your code is not properly formatted and not executable since the inputs are also missing. Use random inputs in the expected shapes so that I can directly copy/paste the code to trigger the memory violation.
Also, post the output of python -m torch.utils.collect_env here.
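
(For reference, here is a minimal, self-contained sketch of the kind of repro being asked for, using random inputs in the shapes from the asserts above. As an assumption to keep it short, the UNet is replaced by a random "intermediate" tensor feeding the 1x1 conv; it is not the author's actual code.)

    # Standalone sketch (assumption): random inputs in the asserted shapes,
    # driving only the final conv + masked cross entropy from common_step.
    import torch
    import torch.nn as nn

    device = "cuda"
    criterion = nn.CrossEntropyLoss(reduction='none')
    conv = nn.Conv2d(in_channels=50, out_channels=4, kernel_size=1).to(device)

    intermediate = torch.randn(16, 50, 512, 512, device=device)          # stand-in for the UNet output
    predicted_line_mask = conv(intermediate)                             # (16, 4, 512, 512)
    all_line_mask = torch.randint(0, 4, (16, 512, 512), device=device)   # valid class ids 0..3
    region_mask = torch.randint(0, 2, (16, 1, 512, 512), device=device)

    loss1 = criterion(predicted_line_mask, all_line_mask)                # (16, 512, 512)
    loss1 = (loss1 * region_mask[:, 0, :, :]).sum()
    torch.cuda.synchronize()
    print(loss1.item())

With valid targets this runs cleanly; the point of such a snippet is that anyone can copy, paste and run it, and then vary the inputs until the memory violation appears.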

I found the error: the exit(-123) below is triggered. So somehow my training data contains extremely rare label values above 3, and I speculate that nn.CrossEntropyLoss does not range-check its targets.

    assert predicted_line_mask.shape[0]==16
    assert predicted_line_mask.shape[1]==4
    assert predicted_line_mask.shape[2]==512
    assert predicted_line_mask.shape[3]==512

    assert all_line_mask.shape[0]==16
    assert all_line_mask.shape[1]==512
    assert all_line_mask.shape[2]==512

    assert not torch.any(torch.isinf(predicted_line_mask))
    if torch.any(all_line_mask > 3):
        print(all_line_mask.dtype)
        print(all_line_mask.min())
        print(all_line_mask.max())
        exit(-123)
    assert not torch.any(all_line_mask > 3)

    torch.cuda.synchronize()

    loss1 = self.criterion_line(predicted_line_mask, all_line_mask) # reduction none, so I can mask

    torch.cuda.synchronize() 

Should nn.CrossEntropyLoss do a range check on its targets?
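
(To tie this back to the earlier question about triggering the failure intentionally: an out-of-range class index is exactly the kind of thing that does it. The sketch below is an illustrative assumption, not code from the thread; on the CPU the loss rejects the bad target with an immediate, readable error, while on the GPU the bad index reaches the kernel and, depending on the PyTorch/CUDA version, only surfaces later as a device-side assert or an illegal memory access at the next synchronizing call, which matches the asynchronous crashes seen above.)

    # Minimal sketch (assumption): what an out-of-range class index does.
    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss(reduction='none')
    logits = torch.randn(2, 4, 8, 8)           # 4 classes -> valid targets are 0..3
    targets = torch.randint(0, 4, (2, 8, 8))
    targets[0, 0, 0] = 7                       # invalid class index

    try:
        criterion(logits, targets)             # CPU: fails immediately with a clear message
    except (IndexError, RuntimeError) as e:
        print("CPU range check caught it:", e)

    # On CUDA the same call is launched asynchronously, so the failure would only be
    # reported by a later synchronizing op (.sum(), .item(), torch.cuda.synchronize(), ...):
    # criterion(logits.cuda(), targets.cuda()).sum()

In other words, the check effectively exists on the CPU path but does not surface as a Python-level error on the CUDA path, which is why validating the targets on the host, as in the asserts above, is the practical workaround.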