Hello,
I have a non-determinism issue when calling the torch.fft.rfft API on two different Linux machines.
The first machine is:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz
The second machine is:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 112
On-line CPU(s) list: 0-111
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz
The Python code I'm running is below. I have two checksum functions to check the input and output tensors when calling the rfft API.
When running this code on the two machines above, I get the same input checksum but different output checksums.
I am using PyTorch 1.12.0 and Python 3.7, and I tried with both Float and Double tensors.
I also observed the same issue in C++ code.
How can I resolve this non-determinism issue?
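For context on why the checksums are so sensitive: two doubles that differ by a single ulp have different byte patterns and therefore produce completely unrelated CRC32 values. A minimal sketch (independent of the code below):

```python
import binascii
import struct

# 0.1 + 0.2 and 0.3 differ by one ulp in IEEE-754 double precision,
# but their byte patterns yield completely different CRC32 checksums.
a = struct.pack('<d', 0.1 + 0.2)
b = struct.pack('<d', 0.3)
print(abs((0.1 + 0.2) - 0.3))                  # ~5.55e-17
print(binascii.crc32(a) == binascii.crc32(b))  # False
```

So even a last-bit difference in a single FFT output element (e.g. from a different SIMD code path on each CPU) changes the checksum entirely.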
import struct
import torch.fft
import binascii
import numbers
import random
import numpy as np
np.random.seed(0)
random.seed(0)
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)
def float_to_hex(f):
    return hex(struct.unpack('<Q', struct.pack('<d', f))[0])
# Checksum function to be run on a list
def getCheckSum(array):
    c = 0
    for x in array:
        if x != 0.0:
            c = binascii.crc32(binascii.a2b_hex(float_to_hex(x)[2:]), c)
    return c
# Checksum function to be run on a tensor
def getCheckSumT(tensor):
    c = 0
    tensor = tensor.view(tensor.numel())
    for x in tensor:
        if isinstance(x.item(), numbers.Complex):
            value = x.item().real
            if x.item().imag != 0.0:
                c = binascii.crc32(binascii.a2b_hex(float_to_hex(x.item().imag)[2:]), c)
        else:
            value = x.item()
        if value != 0.0:
            c = binascii.crc32(binascii.a2b_hex(float_to_hex(value)[2:]), c)
    return c
# Code testing the non determinism
array = [1.0] * 25
noise = np.random.random(len(array)) / 10e3
array = [a + n for a, n in zip(array, noise)]
#print(array)
tensor = torch.DoubleTensor(array)
T2 = torch.fft.rfft(tensor)
print("pytorch version:", torch.__version__)
# Run the checksum on the input array and on the tensor; the expectation is that they match
ArrayCS = getCheckSum(array)
TensorCS = getCheckSumT(tensor)
assert(ArrayCS == TensorCS)
print("Array and Tensor checksums are the same:", ArrayCS)
print("Output tensor checksum:", getCheckSumT(T2))
print("Output tensor:", T2)
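For what it's worth, a tolerance-based comparison may be more appropriate than a bitwise checksum for floating-point FFT output. A minimal sketch of the idea, simulating a one-ulp cross-machine difference (the perturbation here is artificial, just to illustrate):

```python
import torch

out = torch.fft.rfft(torch.ones(25, dtype=torch.float64))

# Simulate a one-ulp cross-machine discrepancy: nudge every real/imag
# component to the next representable double toward +inf.
re_im = torch.view_as_real(out.clone())
nudged = torch.view_as_complex(torch.nextafter(re_im, re_im + 1.0))

print(torch.equal(out, nudged))    # False: bit patterns differ
print(torch.allclose(out, nudged)) # True: numerically equivalent
```

An exact (bit-for-bit) comparison or CRC32 checksum fails on such data, while torch.allclose with a reasonable rtol/atol still treats the two results as equal.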