Non-deterministic tensor index arithmetic on cuda

The code in this gist runs some indexing and simple arithmetic on exactly the same inputs 20,000 times.
It is supposed to find the indices of the first and last elements in each group of consecutive elements.
When run on a CPU it always returns the same, correct result.
However, when run on CUDA it returns a wrong result 4-20 times out of 20,000 iterations.

I pared the code down as much as I could. Here are a few observations:

  • The error is data dependent: if I delete any more data than I already have, the error stops occurring.
  • The cumsum() in line 27 seems to be part of the problem. Commenting it out stops the error from occurring, even though it should not affect the result in any way, since its result is not used.
  • By keeping a copy of the correct results of isFirst and isLast in cFi and cLa, I can see that the error is in isLast.
  • By subtracting the correct result from the incorrect one in lines 39 and following, I can see that there is a single mismatch at position 767.
  • The results were obtained with PyTorch 0.4 and a Pascal GPU with CUDA 9.0.

Could this be related to me moving elements to overlapping regions of the tensor (lines 20 and 31)?

Any other ideas?

A typical result I see is below.
The first row shows that the CPU version ran correctly 20000/20000 times.
The last row shows that the GPU version gave incorrect results 6 out of 20000 times.

cpu (20000, 0)
countOnes isFirst=1127 isLast=1128
No mismatches in isFirst tensor(0, device='cuda:0') tensor(0, device='cuda:0') tensor(0, device='cuda:0')
pos of mismatch in isLast tensor(767, device='cuda:0') tensor(0, device='cuda:0') tensor(1, device='cuda:0')

countOnes isFirst=1127 isLast=1128
No mismatches in isFirst tensor(0, device='cuda:0') tensor(0, device='cuda:0') tensor(0, device='cuda:0')
pos of mismatch in isLast tensor(767, device='cuda:0') tensor(0, device='cuda:0') tensor(1, device='cuda:0')

countOnes isFirst=1127 isLast=1128
No mismatches in isFirst tensor(0, device='cuda:0') tensor(0, device='cuda:0') tensor(0, device='cuda:0')
pos of mismatch in isLast tensor(767, device='cuda:0') tensor(0, device='cuda:0') tensor(1, device='cuda:0')

countOnes isFirst=1127 isLast=1128
No mismatches in isFirst tensor(0, device='cuda:0') tensor(0, device='cuda:0') tensor(0, device='cuda:0')
pos of mismatch in isLast tensor(767, device='cuda:0') tensor(0, device='cuda:0') tensor(1, device='cuda:0')

countOnes isFirst=1127 isLast=1128
No mismatches in isFirst tensor(0, device='cuda:0') tensor(0, device='cuda:0') tensor(0, device='cuda:0')
pos of mismatch in isLast tensor(767, device='cuda:0') tensor(0, device='cuda:0') tensor(1, device='cuda:0')

countOnes isFirst=1127 isLast=1128
No mismatches in isFirst tensor(0, device='cuda:0') tensor(0, device='cuda:0') tensor(0, device='cuda:0')
pos of mismatch in isLast tensor(767, device='cuda:0') tensor(0, device='cuda:0') tensor(1, device='cuda:0')

cuda (19994, 6)
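
For readers without the gist at hand: the (correct, wrong) pairs in the output above are tallies over repeated runs. A loop along the following lines would produce them. This is only a sketch; tally and run_once are hypothetical names, not taken from the gist, and the lambda is a stand-in for the real computation.

```python
def tally(run_once, expected, iterations=20000):
    """Call run_once repeatedly and count results that match or differ from expected."""
    correct = wrong = 0
    for _ in range(iterations):
        if run_once() == expected:
            correct += 1
        else:
            wrong += 1
    return correct, wrong

# Stand-in for the real computation: a deterministic function never mismatches.
print("cpu", tally(lambda: sum(range(10)), expected=45))  # prints: cpu (20000, 0)
```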

Couldn’t reproduce the issue using

  • PyTorch 0.4.1 + CUDA 8.0.61
  • and 0.5.0a0+2c7c12f + CUDA 9.0.176

on a GTX1070.
Both runs return:

cpu (20000, 0)
cuda (20000, 0)

Thanks for trying!!!

Any ideas what I could do next to pinpoint the cause?

Are you using the latest release, i.e. 0.4.1? If you are on an older release, e.g. 0.4.0, could you update?

Do you get the correct results if you comment out line 27?

Just to be clear: I just copied your code without any commenting and ran the script.
Should I change something to see the issue?

Yes: no changes were necessary to see the problem with the version downloaded from the gist. I double-checked that before submitting the question.

I need to talk to our IT people to upgrade from 0.4.0 to 0.4.1. It might take a few days.

The problem seems to be quite sensitive to all the environmental factors. In your eyes does the code in firstLast() (lines 19-32) look solid?

I also noticed that the code is about 2x slower on GPU than CPU. I am optimizing a re-written version of what I need that does not seem to cause the problem. But I am worried the problem could crop up elsewhere without me noticing.
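
For anyone following along, one possible shape for such a rewrite is sketched below. It is not the actual rewritten code, and it assumes that a "group" means a run of equal values in a non-empty 1-D tensor; first_last and ids are hypothetical names. The point is that the neighbour comparison only reads from ids and writes into fresh tensors, so nothing is shifted in place.

```python
import torch

def first_last(ids):
    """Flag the first and last element of each run of equal values in a 1-D tensor."""
    # Compare neighbours out of place: ids[1:] and ids[:-1] are only read.
    changed = ids[1:] != ids[:-1]
    edge = torch.ones(1, dtype=changed.dtype, device=ids.device)
    is_first = torch.cat([edge, changed])  # element 0 always starts a run
    is_last = torch.cat([changed, edge])   # the final element always ends a run
    return is_first, is_last

device = "cuda" if torch.cuda.is_available() else "cpu"
ids = torch.tensor([3, 3, 5, 5, 5, 7], device=device)
is_first, is_last = first_last(ids)
print(is_first.tolist())
print(is_last.tolist())
```

The mask dtype depends on the PyTorch version (uint8 in older releases, bool in newer ones), which is why edge takes its dtype from changed.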

The code looks alright.
Could you try to create a new conda environment with 0.4.1 or is this not possible due to IT restrictions?

Our IT person installed 0.4.1 and the problem is still present.
He also has done some further testing. Here are his findings:

Some initial tests confirm the issue:

  1. Running with the installed 0.4.0+CUDA 9.0.176 - problem

  2. Running with the newly installed 0.4.1+CUDA 9.0.176 - problem

  3. Running with Nvidia-built version in a container 0.5.0a0 + CUDA 9.0.176 - problem

  4. Locally pip install 0.4.1 - still the same problem

  5. I couldn’t build it from scratch right away, because we use the gcc 6.3.0 compiler in our toolchains and there is a known issue with CUDA and one of the C++ libraries. I have made some custom fixes to get it to work with TensorFlow, and if all else fails I can try the same with PyTorch, but I want to try a few other things first.

I have tested the hardware with a lot of TensorFlow benchmarks with pretty good results, but have not run any PyTorch benchmarks yet, so I will do that next, just to confirm that things are working. I do have to say that I am a bit puzzled though.
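
When comparing setups like the ones listed above, it can help to print the version details from inside the same Python process that runs the test, so each result is tied to the build that produced it. A minimal sketch follows; environment_report is a hypothetical helper name, while the torch attributes it reads are standard.

```python
import torch

def environment_report():
    """Collect the version details that were compared in this thread."""
    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
    return report

for key, value in environment_report().items():
    print(f"{key}: {value}")
```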

Thanks again for following-up!

Here is some more info. Our person in IT did some more testing (he is great!).

Just to be sure, I went on AWS and ran it on one of the new GPU systems with Volta GPUs. I used the deep-learning AMI with PyTorch already set up, with CUDA 9.0, Python 3.6.6, NVIDIA driver 396.37, and PyTorch versions 0.4.0 and 0.4.1. Both failed, with far worse results than on our system (the last run was cuda (9079, 10921), and that was the best result!!). The GPU version was also quite slow. At this point I am willing to rule out hardware setup or build issues on our system, and think that it is either some kind of PyTorch problem or a potential issue with the code. I admit that I have not had a chance to look at the code yet, but will try to get to it over the next few days. The AMI that I used was:

https://aws.amazon.com/marketplace/pp/B077GCH38C

This now seems to point to a general problem, not just one on our systems.

Thanks for digging into that problem.
I’ve tested your code again on another server with GTX1080 and CUDA 8.0.61 and it ran successfully a lot of times.
However, I managed to get one error for a single run. I cannot reproduce the error currently, but will have a look at your code again.

I got down to this small code snippet:

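import torch

for i in range(100000):
    print(i)
    # Random 0/1 vector on the GPU; force the first element to 1.
    a = torch.empty(2000, dtype=torch.uint8, device='cuda').random_(2)
    a[0] = 1
    b = a.clone()
    # Shift left by one in place: source and destination overlap.
    b[:-1] = b[1:]
    b[-1] = 1

    # The shift drops a[0] == 1 and appends a 1, so both sums should match.
    if a.sum().cpu().item() != b.sum().cpu().item():
        break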

The error occurs often at index 1023.
I’m no CUDA expert, but maybe it could be related to some kernel size etc.
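
To make the "kernel size" hunch concrete, here is a toy model in plain Python. It is not how PyTorch's copy kernel is actually implemented: the block size of 1024 and the block-by-block execution are assumptions made purely for illustration. Each simulated block reads all of its inputs and then writes its outputs, and blocks may run in any order.

```python
BLOCK = 1024  # assumed number of elements handled per block (hypothetical)

def shift_left_in_blocks(buf, block_order):
    """Toy model of buf[:-1] = buf[1:] executed block by block, in place."""
    n = len(buf) - 1  # number of elements to copy
    for k in block_order:
        start, stop = k * BLOCK, min((k + 1) * BLOCK, n)
        values = [buf[i + 1] for i in range(start, stop)]  # the whole block reads first
        buf[start:stop] = values                            # then the whole block writes
    return buf

data = list(range(2000))
expected = data[1:] + data[-1:]  # what a buffered copy would give
n_blocks = (len(data) - 1 + BLOCK - 1) // BLOCK

in_order = shift_left_in_blocks(list(data), range(n_blocks))
reversed_order = shift_left_in_blocks(list(data), reversed(range(n_blocks)))

mismatches = [i for i, (x, y) in enumerate(zip(reversed_order, expected)) if x != y]
print(in_order == expected)  # True
print(mismatches)            # [1023]
```

In this model the only element that can go wrong is the last one of a block, because it reads a value that the next block overwrites. If the next block happens to run first, position 1023 picks up the wrong value, which would also explain why the failure is intermittent.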

@richard, @SimonW, @colesbury Do you guys have any idea where I could start debugging / looking into the code?

The issue is that the line:

b[:-1] = b[1:]

reads and writes overlapping elements of b. There’s a similar issue in the original snippet, which @bgobbi mentions in the original post.

You can fix this by writing:

b[:-1] = b[1:].clone()

At some point in the future, I’d like to add overlap detection to kernels so that the copy automatically buffers the right-hand-side when necessary, but you have to watch out for these for now.
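
Until something like that exists, a small guard in user code can apply the same idea. The sketch below is not how PyTorch would implement overlap detection; shares_memory_range and safe_assign are hypothetical helpers, and the address-range check assumes both tensors are contiguous (true for the 1-D slices used here). It buffers the right-hand side only when the two views might overlap.

```python
import torch

def shares_memory_range(x, y):
    """Conservative check: do the byte ranges of two contiguous tensors intersect?"""
    x_start, y_start = x.data_ptr(), y.data_ptr()
    x_end = x_start + x.numel() * x.element_size()
    y_end = y_start + y.numel() * y.element_size()
    return x_start < y_end and y_start < x_end

def safe_assign(dst, src):
    """Copy src into dst, buffering src first when the two views may overlap."""
    if shares_memory_range(dst, src):
        src = src.clone()
    dst.copy_(src)

b = torch.arange(10)
expected = torch.cat([b[1:], b[-1:]])  # shifted copy, built out of place
safe_assign(b[:-1], b[1:])
print(torch.equal(b, expected))
```

The check errs on the side of cloning: two views into the same buffer that do not actually touch the same elements (for example, interleaved strided views) would still need a more careful test.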

I remembered a similar issue, but couldn’t find it. Thanks for pointing this out!

Yes, thanks for clarifying!
Alberto