Error in transferring data to cuda

Hello,

I am using this in my training function:

for epoch in range(num_epochs):
    train_mean_loss = 0
    train_mean_acc = 0
    rand_var = 0

    for i, (train_input, train_label) in enumerate(train_dataloader):
        if train_input.device != device_available:
            print("Train Data wasn't on cuda but now is.")
            # move the batch to the GPU
            train_input = train_input.to(device_available)
            train_label = train_label.to(device_available)
            .
            .
            .

When I run the training loop a second time (for example, after it failed with some other error), I get an illegal memory access error.

RuntimeError: CUDA error: an illegal memory access was encountered

I am confused about these things:

  1. Does cuda throw an error if you try to push a tensor to cuda when it's already on cuda? (see the sketch after this list)
  2. I faced the same issue with my model (I had to factory reset the runtime to get it running): I first created an object of my model class and pushed it to cuda, then I changed something in the model and tried to push it to cuda again, and got the same illegal memory access error.
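
As a quick empirical check of question 1 (a minimal sketch, not from the original post; it only needs a CUDA runtime): calling .to() with the device a tensor already lives on returns the same tensor and does not raise an error.

import torch

device_available = torch.device('cuda')
x = torch.randn(4, 4, device=device_available)

# Moving a tensor to the device it is already on is a no-op:
# .to() returns the original tensor object and raises no error.
y = x.to(device_available)
print(y is x)       # True
print(y.device)     # cuda:0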

I am using Google Colab and a big dataset, so it's very difficult to debug when I have to factory reset my runtime every time I get an error during training.

CUDA operations are asynchronous, so the illegal memory access might be created in the model and a later CUDA operation would then run into this error and raise it.
Could you rerun your script with CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace here, please?
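
To make the asynchronous reporting concrete, here is a minimal sketch (not from the original thread; it forces an out-of-bounds index, so the exact error text may be a device-side assert rather than an illegal memory access, but the timing behaviour is the same):

import torch

x = torch.randn(10, device='cuda')
bad_idx = torch.tensor([100], device='cuda')   # index far out of range

# The indexing kernel below is launched asynchronously; the device-side
# failure it causes is usually only reported at a later synchronizing call.
y = x[bad_idx]

# Without CUDA_LAUNCH_BLOCKING=1 the RuntimeError tends to surface here and
# points at an unrelated line; with it set, the stack trace points at the
# indexing line above. Either way the CUDA context is unusable afterwards.
print(y.cpu())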

I am facing this error on Colab, so I just used this in the cell:

CUDA_LAUNCH_BLOCKING = 1
model = MLP_network()
model = model.to(device_available)

Will this work? What does this command do?

No, this won’t work and you would have to use:

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

at the beginning of your script. Make sure to restart the runtime and set this env var before PyTorch or any other library is imported; otherwise this variable might not have any effect.
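
In a Colab notebook that means the environment variable has to go into the very first cell executed after restarting the runtime, before anything imports torch. A minimal sketch:

import os

# The value must be a string, and it has to be set before torch (or any
# library that imports torch) is loaded, as described above.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
print(os.environ.get('CUDA_LAUNCH_BLOCKING'))   # '1'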

Alright. Thank you so much!

So I am no longer facing the error from my question, because I fixed the problems in my training function; the error only showed up when I ran the training function a second time (after it had already failed for some other reason). I am going to close the topic. Thanks again @ptrblck for your time, I appreciate it.