Seed changed when debugging acos in CUDA

Generally, the seed is 1234 in testing. When I was debugging the test code (test_ops.py) on Linux (Ubuntu 16.04, GeForce 1080), I found that when the op is acos and the device is cuda:0, the seed is changed.
The result of the make_tensor function in testing\common_utils.py is:
input:tensor([-0.7754+0.9690j, -0.2091-0.4136j, 0.9258+0.6882j, -0.8382-0.4225j,
0.8441-0.1108j, -0.5019+0.6919j, -0.1726+0.5296j, 0.3494-0.6687j,
-0.2975+0.4149j, -0.3018-0.2714j, -0.8510+0.9209j, 0.8011-0.6919j,
-0.9471-0.7434j, -0.2749+0.2165j, -0.9148-0.3147j, 0.4287-0.4181j,
-0.7237-0.8627j, -0.9948+0.8622j, 0.9395+0.6238j, -0.0843-0.6137j],
device='cuda:0', dtype=torch.complex128, requires_grad=True)
If the seed is 1234, the result is:
tensor([ 0.7069-0.7754j, -0.3517-0.2091j, 0.8470+0.9258j, 0.4945-0.8382j,
0.2541+0.8441j, 0.4346-0.5019j, 0.6110-0.1726j, -0.7419+0.3494j,
0.2255-0.2975j, 0.7646-0.3018j, -0.9519-0.8510j, 0.5869+0.8011j,
-0.5018-0.9471j, 0.0435-0.2749j, -0.8904-0.9148j, -0.8258+0.4287j,
0.2637-0.7237j, 0.7252-0.9948j, -0.5960+0.9395j, -0.8582-0.0843j],
device='cuda:0', dtype=torch.complex128, requires_grad=True)

I debugged on Linux CPU, and on Windows CPU and GPU. In all those cases the input tensors are generated with a seed of 1234 under the same conditions.

So far, I haven’t found where the seed is changed, nor any bad impact of this change.

Hi,

Could this be related to this issue: https://github.com/pytorch/pytorch/issues/42952 ?

Not exactly. I found it while debugging a similar issue to the one you mentioned.
The test still passes even though the input value is not the expected one, due to an unknown seed.
I think the test result is reasonable, but I suspect there may be another potential issue in how the input is created. So I raised it on the forum rather than in the issues.
The steps to reproduce are:

  1. At line 92 in test_ops.py:
    @dtypes(torch.double, torch.cdouble)
    @ops(op_db)
    def test_fn_grad(self, device, dtype, op):
        if op.name == "acos" and device == "cuda:0":
            self._grad_test_helper(device, dtype, op, op.get_op())
  2. Set a breakpoint on self._grad_test_helper(device, dtype, op, op.get_op()).
  3. Step in, and step in again, until line 1464 of common_utils.py:
    the value there is not generated by the seed of 1234.
    I don’t know what the seed is, because there’s no way to recover the seed from the random value (a sketch of how to at least check the RNG state follows this list).
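
As a side note: while the seed itself cannot be recovered from a sampled value, one can at least check at the breakpoint whether the CUDA RNG state still matches a freshly seeded 1234. A minimal sketch (the helper name is mine, not part of the test suite):

import torch

def cuda_rng_matches_seed(seed=1234, device="cuda:0"):
    # Hypothetical debugging helper: compares the live CUDA RNG state
    # against the state produced by a fresh seed, then restores it.
    current = torch.cuda.get_rng_state(device)  # state at the breakpoint
    torch.manual_seed(seed)                     # reseed CPU and all CUDA devices...
    fresh = torch.cuda.get_rng_state(device)    # ...and capture the fresh state
    torch.cuda.set_rng_state(current, device)   # restore so debugging can continue
    return torch.equal(current, fresh)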

Is it because other things generated random Tensors before hitting this line?

at line 1464 of common_utils.py

Can you link to the code on GitHub at a particular commit (press y when on the page with the file you want, then select the line)? I think this line number is outdated.

Thank you for taking care of it.
At the line that gets the random complex values, we’ll find the real and imag are not as expected.

A seed of 1234 on CPU and GPU will generate different values; this is expected, as we use different random number generators on these two devices.
If you generate both on CPU though, you will get the same result, as long as you don’t draw any other random number between the time you set the seed and the time you sample. For example, Windows might not run all the tests that run on Linux, so by the time you get to this test, you might have drawn a different number of samples from the random number generator, giving you different values.
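
For instance, on CPU, any extra draw between seeding and sampling offsets the stream (a small self-contained illustration):

import torch

torch.manual_seed(1234)
a = torch.rand(3)         # first draw after seeding

torch.manual_seed(1234)
_ = torch.rand(1)         # an extra draw advances the generator...
b = torch.rand(3)         # ...so this no longer matches a

print(torch.equal(a, b))  # False: the streams are offset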

I know the CPU and GPU random number generators are different.
The issue is on Linux, not Windows.

We could add one line to force manual_seed to 1234 and comment it out later.
We’ll find the real and imag values are different if the device is cuda:0:
# torch.manual_seed(1234)
real = torch.rand(size, device=device, dtype=float_dtype) * span + low
imag = torch.rand(size, device=device, dtype=float_dtype) * span + low

So, is it caused by samples generated by other tests?

So, is it caused by samples generated by other tests?

Or more likely other inputs generated for this test?

Even after re-reading your message above, I feel like I’m missing something, as I don’t see what the exact problem is. Could you rephrase what is unexpected here, please?

Thanks for your reply. I’m just curious why the test input value isn’t computed with the seed of 1234 under this condition (acos, cuda:0, Linux).

Under other conditions, I can get the same values as the PyTorch source code with the snippet below.
Anyway, it’s not a big issue so far.

import torch
span = 1.99998
low = -0.99999
size = (20,)
float_dtype = torch.float64
torch.manual_seed(1234)
device = torch.device("cuda:0")  # change to cpu or other devices
real = torch.rand(size, device=device, dtype=float_dtype) * span + low
imag = torch.rand(size, device=device, dtype=float_dtype) * span + low
c = torch.complex(real, imag)
c.requires_grad = True
print(f"c = {c}")
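
As an aside, one way to make such input generation independent of earlier draws would be to use a dedicated torch.Generator instead of the global RNG. This is only a sketch of the idea, not what make_tensor actually does:

import torch

gen = torch.Generator(device="cuda:0")
gen.manual_seed(1234)  # a private stream, unaffected by draws elsewhere

size, span, low = (20,), 1.99998, -0.99999
real = torch.rand(size, generator=gen, device="cuda:0", dtype=torch.float64) * span + low
imag = torch.rand(size, generator=gen, device="cuda:0", dtype=torch.float64) * span + low
c = torch.complex(real, imag).requires_grad_()
print(f"c = {c}")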

Ho,

Maybe it’s because acos samples the input at a different time here, since the input needs to be clamped and so the helper cannot just specify the size? (Notice that the real part of the unexpected input above equals the imag part of the expected tensor: the generator was exactly one size-20 draw ahead.)
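
That would be consistent with the numbers above. Here is a CPU illustration of the pattern, assuming one extra size-20 draw happens before the inspected one:

import torch

span, low = 1.99998, -0.99999

torch.manual_seed(1234)
real_expected = torch.rand((20,), dtype=torch.float64) * span + low
imag_expected = torch.rand((20,), dtype=torch.float64) * span + low

torch.manual_seed(1234)
_ = torch.rand((20,), dtype=torch.float64)  # one extra size-20 draw, e.g. for clamping
real_shifted = torch.rand((20,), dtype=torch.float64) * span + low

print(torch.allclose(real_shifted, imag_expected))  # True: exactly one draw ahead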