LowRankMultivariateNormal throws CUDA error

Hi there!

I’m using the LowRankMultivariateNormal distribution to get a distribution of logits for every pixel of an image.
I run into an issue when using this distribution for square images whose side length is even and large (>= 512x512).

The following code fails in a Colab notebook when the size is set to 512x512, but works for 513x513.
I also tried it on different GPUs, with the same result.
It works fine on the CPU.

import torch
from torch.distributions import LowRankMultivariateNormal

DEVICE = "cuda"

torch.manual_seed(23)
for i in range(10):
    print(i)
    distrib = LowRankMultivariateNormal(
        torch.randn(1, 512, 512, 2).to(DEVICE),      # loc: a 2-dim event per pixel
        torch.randn(1, 512, 512, 2, 10).to(DEVICE),  # cov_factor: rank-10 factor per pixel
        torch.randn(1, 512, 512, 2).to(DEVICE).exp() # cov_diag: positive diagonal per pixel
    )

(I added the for-loop because it sometimes works for the first iteration, although that was not the case the last few times I ran it.)

I don’t have any clue how to explain this behavior.

[edit]
The issue occurs for shapes that are a power of 2, not for all even shapes.
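
For what it’s worth, since the CUDA context is unusable after the illegal memory access, each size has to be checked in a fresh process. A rough sweep sketch (the sizes and helper script are just illustrative, not exactly what I ran):

import subprocess
import sys

# Each size runs in its own interpreter: after an illegal memory access
# the CUDA context is corrupted, so later calls in the same process
# would fail regardless of the size being tested.
SNIPPET = """
import torch
from torch.distributions import LowRankMultivariateNormal
torch.manual_seed(23)
size = {size}
LowRankMultivariateNormal(
    torch.randn(1, size, size, 2, device="cuda"),
    torch.randn(1, size, size, 2, 10, device="cuda"),
    torch.randn(1, size, size, 2, device="cuda").exp(),
)
"""

for size in (255, 256, 257, 511, 512, 513, 1024):
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET.format(size=size)],
        capture_output=True,
    )
    print(f"{size}x{size}: {'ok' if result.returncode == 0 else 'FAILED'}")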

Thanks for reporting the issue! 🙂
I could reproduce it with the latest nightly binary and have created this issue to track it.

It seems that the magma_spotrf_batched call causes the illegal memory access.
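
If my reading of the internals is correct, the constructor takes a batched Cholesky of the small rank×rank “capacitance” matrix for every pixel, so the failure might reduce to a single batched Cholesky with a power-of-two batch count. A sketch of a more minimal repro (the dispatch to MAGMA’s batched kernel is my assumption, not verified):

import torch

# 1 * 512 * 512 * 2 capacitance matrices of shape 10x10 (power-of-two batch count)
batch = 1 * 512 * 512 * 2
W = torch.randn(batch, 10, 10, device="cuda")
spd = W @ W.transpose(-1, -2) + 10 * torch.eye(10, device="cuda")  # make the matrices SPD
L = torch.cholesky(spd)  # float32 batched Cholesky; presumably hits magma_spotrf_batched
torch.cuda.synchronize()
print("ok", L.shape)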

Thanks for the response.
Do you think that there could be a workaround in the meantime?

I’m unsure at the moment, as we are using MAGMA for this particular operation, and that is where the failure occurs.
Let’s see if the code owners have an idea for a workaround.
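
One stopgap that might work in the meantime (a sketch, assuming you only need samples or log-probs and can afford the slower CPU path) is to construct the distribution on the CPU and move the outputs to the GPU afterwards:

import torch
from torch.distributions import LowRankMultivariateNormal

# Build the distribution on the CPU, where the Cholesky path works,
# and move only the results back to the GPU.
distrib = LowRankMultivariateNormal(
    torch.randn(1, 512, 512, 2),
    torch.randn(1, 512, 512, 2, 10),
    torch.randn(1, 512, 512, 2).exp(),
)
sample = distrib.rsample().to("cuda")  # shape (1, 512, 512, 2), now on the GPU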