Model works on CPU but breaks when moved to CUDA

For reference, I'm running this on the Lambda Cloud GPU service in a Jupyter notebook, with the environment variable CUDA_LAUNCH_BLOCKING=1 set.
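In case it matters, I set the variable roughly like this at the top of the notebook (it is read when the CUDA context is created, so it has to be set before anything initializes CUDA):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must run before CUDA is initialized

import torch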

The error below comes from a test cell containing only a single line:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_2510/3640003828.py in <module>
----> 1 model = Reconstruction(decoder_embedding_size = 512, additional_encoder_nhead=5, additional_encoder_dim_feedforward = 2048).to(device)

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
   1338                     raise
   1339 
-> 1340         return self._apply(convert)
   1341 
   1342     def register_full_backward_pre_hook(

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _apply(self, fn, recurse)
    898         if recurse:
    899             for module in self.children():
--> 900                 module._apply(fn)
    901 
    902         def compute_should_use_set_data(tensor, tensor_applied):

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _apply(self, fn, recurse)
    898         if recurse:
    899             for module in self.children():
--> 900                 module._apply(fn)
    901 
    902         def compute_should_use_set_data(tensor, tensor_applied):

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _apply(self, fn, recurse)
    898         if recurse:
    899             for module in self.children():
--> 900                 module._apply(fn)
    901 
    902         def compute_should_use_set_data(tensor, tensor_applied):

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _apply(self, fn, recurse)
    925             # `with torch.no_grad():`
    926             with torch.no_grad():
--> 927                 param_applied = fn(param)
    928             p_should_use_set_data = compute_should_use_set_data(param, param_applied)
    929 

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in convert(t)
   1324                         memory_format=convert_to_format,
   1325                     )
-> 1326                 return t.to(
   1327                     device,
   1328                     dtype if t.is_floating_point() or t.is_complex() else None,

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

However, running the same line without .to(device):

model = Reconstruction(decoder_embedding_size=512, additional_encoder_nhead=5, additional_encoder_dim_feedforward=2048)

works with no error.

How could this be? If needed, I can post the code for Reconstruction().

Device asserts are often triggered by invalid indexing operations (e.g. an out-of-range index passed into an embedding or gather op), so try to narrow down whether that could be the case here, too.
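For example, this minimal standalone snippet (a toy nn.Embedding, not your model) triggers the same assert via an out-of-range index:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4).to("cuda")
idx = torch.tensor([10], device="cuda")  # valid indices are 0-9, so 10 is out of range

out = emb(idx)            # the kernel launch triggers the device-side assert
torch.cuda.synchronize()  # RuntimeError: CUDA error: device-side assert triggered

Note that once such an assert fires, the CUDA context is corrupted: every subsequent CUDA call in the same process reports the same error, so the op shown in the stack trace (here .to(device)) isn't necessarily the one that actually failed.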

If that were the case, why would running exactly the same thing, just without .to(device), work? The model even works when I run random data through it on the CPU.

Differences in data processing, randomness in the model (if applicable), etc. could explain it. I'm not familiar with your model or training code, so I can't say why the GPU run is failing, but this isn't the first issue debugged here where device-specific processing caused valid errors.
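One way to narrow it down: restart the kernel so no earlier failed kernel is still poisoning the CUDA context, set CUDA_LAUNCH_BLOCKING=1 first, and then move the model to the GPU before running anything else. A sketch with an nn.Linear stand-in (swap in your Reconstruction model):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous launches -> accurate stack traces

import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(4, 4)   # stand-in for Reconstruction(...)
model = model.to(device)  # in a fresh process this should succeed
print(next(model.parameters()).device)  # cuda:0

If .to(device) now succeeds and the assert only appears after some later cell runs, that cell contains the actual invalid operation.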