CUDA illegal memory access when using batched torch.cholesky

Hi,

I’m implementing an Unscented Kalman Filter in PyTorch 1.1.0 and I am having issues with the following function:

    def sigma_points(self, mu, sigma):
        # Cholesky factor of the scaled, batched covariance
        U = torch.cholesky((self.l + self.n) * sigma)
        sigmas = [mu]

        # One pair of sigma points per state dimension
        for i in range(self.n):
            x1 = mu - U[:, :, i]
            x2 = mu + U[:, :, i]
            sigmas.extend([x1, x2])

        return torch.stack(sigmas, 1).view(-1, self.n)

Here sigma is a batch of square (covariance) matrices.

I get this error when running with CUDA_LAUNCH_BLOCKING=1:

CUDA runtime error: an illegal memory access was encountered (77) in magma_spotrf_batched at /magma-2.5.0/src/spotrf_batched.cpp:234
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /magma-2.5.0/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /magma-2.5.0/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /magma-2.5.0/interface_cuda/interface.cpp:946
Traceback (most recent call last):
  File "run_pendulum_analogy.py", line 120, in <module>
    overshooting_d=args.overshooting_d, beta=args.beta, kl_anneal=args.kl_annealing)
  File "/home/tpower/dev/xdtl/pendulum_analogy/experiments/ukfvae.py", line 55, in run
    loss, recon, ql, ll, imgs = self.train(batch_size, train_obs, train_act)
  File "/home/tpower/dev/xdtl/pendulum_analogy/experiments/ukfvae.py", line 133, in train
    a, z_batch, prior_mu, prior_std)
  File "/home/tpower/dev/xdtl/pendulum_analogy/filters/ukf.py", line 80, in filter
    mu_bar, sigma_bar, measurement_fn)
  File "/home/tpower/dev/xdtl/pendulum_analogy/filters/ukf.py", line 23, in update
    sigma_points = self.sigma_point_selector.sigma_points(mu_bar, sigma_bar)
  File "/home/tpower/dev/xdtl/pendulum_analogy/filters/ukf.py", line 227, in sigma_points
    U = torch.cholesky((self.l + self.n) * sigma)
RuntimeError: CUDA error: an illegal memory access was encountered

I don’t see this error when running on CPU, and oddly enough, if I instead loop over the batch and run torch.cholesky on each matrix individually, I don’t get an error either.
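
A minimal sketch of that per-matrix fallback (the helper name and the assumption that sigma has shape (batch, n, n) are mine, just for illustration):

    # Hypothetical per-matrix fallback: factorize each matrix in the batch separately.
    # Assumes sigma has shape (batch, n, n).
    import torch

    def cholesky_per_matrix(sigma):
        factors = [torch.cholesky(sigma[b]) for b in range(sigma.shape[0])]
        return torch.stack(factors, dim=0)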

When I monitor nvidia-smi, my memory usage is well below the maximum, so I don’t think I am running out of memory.

Thanks

Did you make any progress with this?

I’m having a similar issue with torch.cholesky() inside MultivariateNormal.__init__().
I don’t believe I’m anywhere near running out of memory. It seems to happen somewhat randomly, but always late in my training process.
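
For context, roughly the pattern that triggers it looks like this (a minimal sketch with made-up shapes, not my actual model code):

    # Minimal sketch of the failing pattern; shapes and values are made up.
    import torch
    from torch.distributions import MultivariateNormal

    mu = torch.zeros(4096, 2, device="cuda")
    covar = 0.1 * torch.eye(2, device="cuda").expand(4096, 2, 2)
    # MultivariateNormal.__init__ calls torch.cholesky(covariance_matrix),
    # which is where the illegal memory access is reported.
    dist = MultivariateNormal(mu, covariance_matrix=covar)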

torch.version.cuda returns '10.0.130', while nvidia-smi reports CUDA version 10.1. Is this a problem?

So far I’ve been unable to reproduce it on CPU.
Below is what gets reported on the GPU (both with and without CUDA_LAUNCH_BLOCKING=1):

CUDA runtime error: an illegal memory access was encountered (77) in magma_spotrf_batched at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/src/spotrf_batched.cpp:234
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:946
Traceback (most recent call last):
  File "/home/alex/repos/human-motion-rnn/lstm_motion_model/scripts/train", line 527, in <module>
    trainer.run(args)
  File "/home/alex/repos/human-motion-rnn/lstm_motion_model/scripts/train", line 513, in run
    validation_trajectories, save_directory, prefix=mode)
  File "/home/alex/repos/human-motion-rnn/lstm_motion_model/scripts/train", line 55, in train_and_validate
    model, optimizer, training_dataset, config)
  File "/home/alex/repos/human-motion-rnn/lstm_motion_model/scripts/train", line 192, in tbptt
    loss = model.compute_loss(predicted, target)
  File "/home/alex/repos/human-motion-rnn/lstm_motion_model/src/lstm_motion_model/models.py", line 1233, in compute_loss
    return super().compute_loss(predicted, target)
  File "/home/alex/repos/human-motion-rnn/lstm_motion_model/src/lstm_motion_model/models.py", line 281, in compute_loss
    return self.compute_loss_2d_gaussian(predicted, target)
  File "/home/alex/repos/human-motion-rnn/lstm_motion_model/src/lstm_motion_model/models.py", line 323, in compute_loss_2d_gaussian
    dist = MultivariateNormal(mu, covariance_matrix=covar)
  File "/home/alex/python-env/pytorch-env/lib/python3.6/site-packages/torch/distributions/multivariate_normal.py", line 149, in __init__
    self._unbroadcasted_scale_tril = torch.cholesky(covariance_matrix)
RuntimeError: CUDA error: an illegal memory access was encountered
> /home/alex/python-env/pytorch-env/lib/python3.6/site-packages/torch/distributions/multivariate_normal.py(149)__init__()
-> self._unbroadcasted_scale_tril = torch.cholesky(covariance_matrix)

I’m on PyTorch version 1.2.0.

I’m afraid I never got to the bottom of this. For now I’m working around it by transferring the matrices to the CPU, doing the Cholesky there, and sending the result back to the GPU.
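
Roughly, the hack looks like this (a minimal sketch; sigma is a placeholder for whatever batched matrix needs factorizing):

    # Hypothetical CPU fallback: factorize on the CPU, then move the result back.
    U = torch.cholesky(sigma.cpu()).to(sigma.device)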

For reference, I am using torch 1.1.0 and CUDA 9.0.

Thanks for the reply.

What a pain, I’m really at a loss. I’m setting up on a different system today to check whether it’s somehow related to my environment or hardware.

I expect that moving this operation to the CPU will slow things down quite a lot.
Let me know how you go.

Could this issue be related to this magma bug?
If not, could you please open an issue in the repo so that we can track it?

Yes, possibly. I’ll do some experiments with the batch size and report back.

Thanks for looking into it.

Have you tried padding your call with extra (unused) batches?

If that doesn’t work, you may want to try padding and then taking a view of only the batches you’re using. Then pass that view to the Cholesky call.
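
A rough sketch of the first idea (padding the batch with unused entries; the identity-matrix filler and the amount of padding are my own illustrative assumptions):

    # Hypothetical padding workaround: append unused identity matrices to the batch,
    # run the batched factorization, then keep only the factors you actually need.
    import torch

    pad = 8                      # extra (unused) batch entries; purely illustrative
    n = sigma.shape[-1]
    filler = torch.eye(n, device=sigma.device, dtype=sigma.dtype).expand(pad, n, n)
    padded = torch.cat([sigma, filler], dim=0)
    U = torch.cholesky(padded)[:sigma.shape[0]]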

I face the same problem; I also want to use PyTorch to implement a Kalman filter. After a lot of attempts, I found this to work better than looping:

    # Collect the per-track covariance tensors on the CPU
    covariances = []
    for track_idx in track_indices:
        track = tracks[track_idx]
        covariances.append(track.covariance.cpu())

    # Combine them into one batch and copy it to the GPU
    covariances = torch.cat(covariances, dim=0).to("cuda:0")

    # The factorization itself then runs on the GPU
    U = torch.cholesky(covariances)

Same issue. In my case it happens only for certain batch sizes. I haven’t found a pattern though.
Would be great to get to the bottom of this. I am also circumventing the issue in the way @GlassyWing posted.

Thanks @GlassyWing for the suggested workaround.
I was indeed curious about this bug, since switching to the CPU or doing the concatenation slows the code down either way. In our project the same error shows up from time to time in a very unpredictable way when torchdist.MultivariateNormal() is called. This is very frustrating, because running on the CPU would be painfully slow, and nobody knows when training will get stuck (it is usually fine in the early stages). We see the same issue on multiple devices, including an RTX 2080 Ti, an RTX 3090, and a Tesla T100, with PyTorch 1.4.0, 1.6.0, and 1.8.0.
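
For what it’s worth, the CPU fallback can also be applied when the Cholesky happens inside MultivariateNormal, by factorizing on the CPU yourself and passing scale_tril so the constructor skips its own torch.cholesky call (a hedged sketch; mu and covar are placeholders):

    # Hypothetical variant of the CPU fallback for the MultivariateNormal case:
    # compute the Cholesky factor on the CPU, move it back to the GPU, and pass
    # it as scale_tril so the constructor does not call torch.cholesky itself.
    scale_tril = torch.cholesky(covar.cpu()).to(covar.device)
    dist = torchdist.MultivariateNormal(mu, scale_tril=scale_tril)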

We are currently working on enabling cuSOLVER for cholesky for (hopefully) performance benefits and to avoid issues in MAGMA.

Thanks for your message. Really looking forward to it!!!

Hi, I have hit the same problem in my code. My environment is PyTorch 1.7.1+cu110 on an A100. Sometimes the problem shows up very early (e.g. in the 1st epoch); sometimes it shows up in the 48th epoch.

I tried upgrading the environment to torch==1.8.1+cu111, but the program becomes extremely slow (about 100 times slower).

I was calling torchdist.MultivariateNormal(), training on 1 billion samples with a batch size of 16.

Can you give some advice?

I would recommend updating to the latest stable or nightly release and checking whether the CUDA 11.5 runtime improves your use case.
E.g.

pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu115

should work.

Yes, I upgraded to CUDA 11.3 + PyTorch 1.10.2, and it works. Thanks a lot!