Dataset input NaN after GPU transfer, but fine when indexing the dataset

I get the error below while training a WideResNet on CIFAR-10.

RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

So I wondered what the problem was, and using the code below I found that the input contains NaN values.

for epoch in range(epochs):
    for i, (X, y) in enumerate(train_data):
        X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
        for j in range(X.size(0)):
            if X[j].isnan().any():
                print(j, X[j])

But the interesting thing is that if I index directly from the dataset, the samples are fine (no NaN values).

What could the problem be? I still cannot figure it out…
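For reference, this is the kind of check I mean by "indexing from the dataset" (a sketch, not my exact code; train_dataset is the CIFAR-10 dataset defined further below):

# index samples directly from the dataset (CPU tensors, before any GPU transfer)
for j in range(len(train_dataset)):
    x, _ = train_dataset[j]
    if x.isnan().any():
        print('NaN in sample', j)
# -> prints nothing, i.e. no NaN when indexing the dataset directly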

I’m using the official PyTorch Docker image (pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel) and an A100-SXM4-40GB GPU.

for epoch in range(epochs):
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
        print(X.isnan().any())

I checked further and found that it is the transfer to the GPU that causes this weird behavior.

Am I loading tensors to the GPU the wrong way, or could this be a hardware (GPU) issue?

Try to synchronize via torch.cuda.synchronize() before using this tensor, since you are using a non-blocking data transfer.
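For example, the sync would go right after the transfer and before the forward pass (a minimal sketch; self.gpu is taken from your code and model stands in for your network):

X = X.cuda(self.gpu, non_blocking=True)
y = y.cuda(self.gpu, non_blocking=True)
torch.cuda.synchronize(self.gpu)  # wait for the async host-to-device copies to finish
output = model(X)                 # only use X after the sync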

It doesn’t work…
Is there another way to prevent this weird behavior?
By the way, I’m using a single GPU (A100-SXM4-40GB).

Thanks for the update.
Could you execute these runs and see if the behavior changes? A sync wouldn’t be strictly necessary, but it would help us isolate whether an internal method is broken:

  • add synchronizations in the train_data loop before and after each line of code
  • set non_blocking=False
  • rerun the test with and without pin_memory in the DataLoader

Also, do you have a code snippet to reproduce the issue, and could you post the output of python -m torch.utils.collect_env here?

I tried what you suggested and it worked!

If I set pin_memory=False, it works just fine :slight_smile:

I’m sharing my experiments with your suggestions below (you can see more details in the code).

  • Experiment 1 - torch.cuda.synchronize() - did not work
  • Experiment 2 - non_blocking=False - did not work
  • Experiment 3 - pin_memory=False - worked

I’m also sharing my environment for reproduction (for reference, I used the official PyTorch Docker image pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel and just installed Jupyter Notebook with pip).

Also, just out of curiosity: I’m sharing one GPU with other users because of a shortage of GPUs. Does this matter?

# env
!python -m torch.utils.collect_env 

# result
Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB
GPU 4: A100-SXM4-40GB
GPU 5: A100-SXM4-40GB
GPU 6: A100-SXM4-40GB
GPU 7: A100-SXM4-40GB

Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.8.1
[pip3] torchelastic==0.2.2
[pip3] torchtext==0.9.1
[pip3] torchvision==0.9.1
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.3.0            py38h54f3939_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.19.2           py38h54aff64_0  
[conda] numpy-base                1.19.2           py38hfa32c7d_0  
[conda] pytorch                   1.8.1           py3.8_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchtext                 0.9.1                      py38    pytorch
[conda] torchvision               0.9.1                py38_cu111    pytorch

Experiment 1. Insert torch.cuda.synchronize() - didn’t work

  • I tried the experiments below, adding torch.cuda.synchronize() in more and more places.
# Default Dataset, Dataloader codes for Experiment 1
batch_size = 128

train_transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize(mean=[0.4921, 0.4828, 0.4474], std=[0.1950, 0.1922, 0.1940])])

test_transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize(mean=[0.4921, 0.4828, 0.4474], std=[0.1950, 0.1922, 0.1940])])

train_dataset = torchvision.datasets.CIFAR10('./cifar10/', train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

test_dataset = torchvision.datasets.CIFAR10('./cifar10/', train=False, download=True, transform=test_transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

Experiment 1-1

# train code - 1
for epoch in range(epochs):
    if epoch % self.epoch_print == 0: print('Epoch {} Started...'.format(epoch+1))
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
        print(X.isnan().any())


# result
Epoch 1 Started...
tensor(False)
tensor(False)
tensor(True, device='cuda:1')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-16c095588069> in <module>
----> 1 WRN_28_10.train(train_loader, test_loader, save, epochs, lr, momentum, weight_decay, nesterov, milestones)

/workspace/jaeseong_lee/WideResNet/train.py in train(self, train_data, test_data, save, epochs, lr, momentum, weight_decay, nesterov, milestones)
     46 
     47                 optimizer.zero_grad()
---> 48                 loss.backward()
     49                 optimizer.step()
     50 

/opt/conda/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243                 create_graph=create_graph,
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143         retain_graph = create_graph
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Experiment 1-2

# train code - 2
for epoch in range(epochs):
    if epoch % self.epoch_print == 0: print('Epoch {} Started...'.format(epoch+1))
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())


# result
Epoch 1 Started...
tensor(False)
tensor(False)
tensor(True, device='cuda:1')
tensor(True, device='cuda:1')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-16c095588069> in <module>
----> 1 WRN_28_10.train(train_loader, test_loader, save, epochs, lr, momentum, weight_decay, nesterov, milestones)

/workspace/jaeseong_lee/WideResNet/train.py in train(self, train_data, test_data, save, epochs, lr, momentum, weight_decay, nesterov, milestones)
     46 
     47                 optimizer.zero_grad()
---> 48                 loss.backward()
     49                 optimizer.step()
     50 

/opt/conda/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243                 create_graph=create_graph,
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143         retain_graph = create_graph
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Experiment 1-3

# train code - 3
for epoch in range(epochs):
    if epoch % self.epoch_print == 0: print('Epoch {} Started...'.format(epoch+1))
    torch.cuda.synchronize(self.gpu)
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())


# result
Epoch 1 Started...
tensor(False)
tensor(False)
tensor(True, device='cuda:1')
tensor(True, device='cuda:1')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-16c095588069> in <module>
----> 1 WRN_28_10.train(train_loader, test_loader, save, epochs, lr, momentum, weight_decay, nesterov, milestones)

/workspace/jaeseong_lee/WideResNet/train.py in train(self, train_data, test_data, save, epochs, lr, momentum, weight_decay, nesterov, milestones)
     46 
     47                 optimizer.zero_grad()
---> 48                 loss.backward()
     49                 optimizer.step()
     50 

/opt/conda/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243                 create_graph=create_graph,
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143         retain_graph = create_graph
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Experiment 1-4

# train code - 4
torch.cuda.synchronize(self.gpu)
for epoch in range(epochs):
    if epoch % self.epoch_print == 0: print('Epoch {} Started...'.format(epoch+1))
    torch.cuda.synchronize(self.gpu)
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())


# result
Epoch 1 Started...
tensor(False)
tensor(False)
tensor(True, device='cuda:1')
tensor(True, device='cuda:1')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-16c095588069> in <module>
----> 1 WRN_28_10.train(train_loader, test_loader, save, epochs, lr, momentum, weight_decay, nesterov, milestones)

/workspace/jaeseong_lee/WideResNet/train.py in train(self, train_data, test_data, save, epochs, lr, momentum, weight_decay, nesterov, milestones)
     46 
     47                 optimizer.zero_grad()
---> 48                 loss.backward()
     49                 optimizer.step()
     50 

/opt/conda/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243                 create_graph=create_graph,
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143         retain_graph = create_graph
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Experiment 2. Set non_blocking=False - didn’t work

# Default Dataset, Dataloader codes for Experiment 2
batch_size = 128

train_transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize(mean=[0.4921, 0.4828, 0.4474], std=[0.1950, 0.1922, 0.1940])])

test_transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize(mean=[0.4921, 0.4828, 0.4474], std=[0.1950, 0.1922, 0.1940])])

train_dataset = torchvision.datasets.CIFAR10('./cifar10/', train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

test_dataset = torchvision.datasets.CIFAR10('./cifar10/', train=False, download=True, transform=test_transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

Experiment 2-1 (without inserting torch.cuda.synchronize())

# train code - 1
#torch.cuda.synchronize(self.gpu)
for epoch in range(epochs):
    if epoch % self.epoch_print == 0: print('Epoch {} Started...'.format(epoch+1))
    #torch.cuda.synchronize(self.gpu)
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        #torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=False), y.cuda(self.gpu, non_blocking=False)
        print(X.isnan().any())
        #torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())


# result
Epoch 1 Started...
tensor(False)
tensor(False)
tensor(True, device='cuda:1')
tensor(True, device='cuda:1')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-16c095588069> in <module>
----> 1 WRN_28_10.train(train_loader, test_loader, save, epochs, lr, momentum, weight_decay, nesterov, milestones)

/workspace/jaeseong_lee/WideResNet/train.py in train(self, train_data, test_data, save, epochs, lr, momentum, weight_decay, nesterov, milestones)
     46 
     47                 optimizer.zero_grad()
---> 48                 loss.backward()
     49                 optimizer.step()
     50 

/opt/conda/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243                 create_graph=create_graph,
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143         retain_graph = create_graph
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Experiment 2-2 (inserting torch.cuda.synchronize() as Experiment 1-4)

# train code - 2
torch.cuda.synchronize(self.gpu)
for epoch in range(epochs):
    if epoch % self.epoch_print == 0: print('Epoch {} Started...'.format(epoch+1))
    torch.cuda.synchronize(self.gpu)
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=False), y.cuda(self.gpu, non_blocking=False)
        print(X.isnan().any())
        torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())


# result
Epoch 1 Started...
tensor(False)
tensor(False)
tensor(True, device='cuda:1')
tensor(True, device='cuda:1')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-16c095588069> in <module>
----> 1 WRN_28_10.train(train_loader, test_loader, save, epochs, lr, momentum, weight_decay, nesterov, milestones)

/workspace/jaeseong_lee/WideResNet/train.py in train(self, train_data, test_data, save, epochs, lr, momentum, weight_decay, nesterov, milestones)
     46 
     47                 optimizer.zero_grad()
---> 48                 loss.backward()
     49                 optimizer.step()
     50 

/opt/conda/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243                 create_graph=create_graph,
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143         retain_graph = create_graph
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

Experiment 3. Set pin_memory=False - worked

Experiment 3-1 (without torch.cuda.synchronize(), non_blocking=True)

# Dataset, Dataloader codes
batch_size = 128

train_transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize(mean=[0.4921, 0.4828, 0.4474], std=[0.1950, 0.1922, 0.1940])])

test_transform = transforms.Compose([
    transforms.ToTensor(), transforms.Normalize(mean=[0.4921, 0.4828, 0.4474], std=[0.1950, 0.1922, 0.1940])])

train_dataset = torchvision.datasets.CIFAR10('./cifar10/', train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=False, pin_memory=False)

test_dataset = torchvision.datasets.CIFAR10('./cifar10/', train=False, download=True, transform=test_transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, pin_memory=False)


# train code - 1
#torch.cuda.synchronize(self.gpu)
for epoch in range(epochs):
    if epoch % self.epoch_print == 0: print('Epoch {} Started...'.format(epoch+1))
    #torch.cuda.synchronize(self.gpu)
    for i, (X, y) in enumerate(train_data):
        print(X.isnan().any())
        #torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())
        X, y = X.cuda(self.gpu, non_blocking=True), y.cuda(self.gpu, non_blocking=True)
        print(X.isnan().any())
        #torch.cuda.synchronize(self.gpu)
        print(X.isnan().any())


# result
Epoch 1 Started...
tensor(False)
tensor(False)
tensor(False, device='cuda:1')
tensor(False, device='cuda:1')
tensor(False)
tensor(False)
tensor(False, device='cuda:1')
tensor(False, device='cuda:1')
tensor(False)
tensor(False)
tensor(False, device='cuda:1')
tensor(False, device='cuda:1')
tensor(False)
tensor(False)
tensor(False, device='cuda:1')
tensor(False, device='cuda:1')
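To make it easier to run elsewhere, here is a condensed version of the failing configuration (a sketch pieced together from the snippets above, not my full training script; it only transfers batches and checks for NaN):

# condensed repro sketch: pin_memory=True loader + non_blocking=True transfer
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4921, 0.4828, 0.4474], std=[0.1950, 0.1922, 0.1940])])

dataset = torchvision.datasets.CIFAR10('./cifar10/', train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=False, pin_memory=True)

device = torch.device('cuda:1')  # the GPU I happened to use; any index should do
for i, (X, y) in enumerate(loader):
    assert not X.isnan().any(), f'batch {i} already has NaN on the CPU'
    X_gpu = X.cuda(device, non_blocking=True)
    torch.cuda.synchronize(device)
    if X_gpu.isnan().any():
        print(f'NaN appeared after the transfer in batch {i}')
        break
else:
    print('no NaN observed in this pass')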

Thanks for the detailed analysis as well as the minimal code snippet, which is really helpful!
I’ll try to reproduce it using the wheels on our systems.

Unfortunately, I cannot reproduce the issue on our 8x A100 node using the pip wheels, conda binaries, and a source build. Are you seeing the same behavior on all GPUs (could you try to use another one, as you are currently using GPU1)? Did this behavior start after a specific update (of the driver etc.)?

1. Checking on other GPUs

Since I found this error on Sunday, I mostly checked the GPUs I could access that day.
(This is the report with captures that I wrote up to send as a bug report; on Sunday I checked GPUs 0, 1, 3, 4, 5, 6, and 7.)

It’s somewhat strange that sometimes everything is fine and sometimes this weird behavior appears.
(See GPUs 0 and 6: sometimes they are okay and sometimes they produce NaN.)

I can still reproduce the same issue on GPUs 0, 1, 3, 4, and 6.
(Since I share the GPU server with other users, it’s quite hard to check all GPUs, but I checked as many as I could.)

Also, some trials run fine while others produce NaN.
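The kind of loop I used to check this across devices looks roughly like this (a sketch, simplified from the notebooks in the link below; train_loader is the pin_memory=True loader from above):

# count how often a full pass over the loader produces NaN after the GPU transfer
import torch

def count_nan_batches(loader, gpu_id, trials=3):
    failures = 0
    for _trial in range(trials):
        for X, _y in loader:
            X_gpu = X.cuda(gpu_id, non_blocking=True)
            torch.cuda.synchronize(gpu_id)
            if X_gpu.isnan().any():
                failures += 1
    return failures

for gpu_id in [0, 1, 3, 4, 5, 6, 7]:  # the GPUs I was able to test
    print(gpu_id, count_nan_batches(train_loader, gpu_id))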

For more details, I’ve uploaded my debugging code to a Google Drive folder with open access!

I ran the code in Jupyter, so you can check the logs that I printed.

https://drive.google.com/drive/folders/1eQavapn1NQ_NQkU6Kg84WX42UYy7CnSD

2. Specific updates

Actually, I’m renting the GPU server, so I don’t know whether the owner applied any specific updates…

I know this could be a hardware issue as well as a software issue, so I’ve asked the GPU server owner about it.

Also, I’m not the only one using the GPU server; could that be the reason? (There are more than two processes running per GPU on each of the eight GPUs.)

If you need more details, I’ll try to reproduce it or explain whatever I can.

Multiple processes using the same devices shouldn’t cause this issue, and since I cannot reproduce it on our A100 servers, I would guess it might be a setup issue. Hardware defects could of course also cause such weird behavior, but since you are able to reproduce it on multiple devices in the server, I doubt that’s the case.
Did the admin provide any information about the setup of the machine? I’m also not familiar with your lease of the node, but could you try another one?

I don’t have another server available to check :frowning:

I’ve sent an email to the admin (the server owner), and if I get a reply, I’ll post an update here!

I’ve been having the same problem for the past six months! It’s a pain to reproduce, as it only happens every now and then. On the CPU the tensors are fine, but when sending them to the GPU, some individual values may flip to NaN.
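A check along these lines is how the flipped values show up (a sketch of the kind of comparison I mean; batch is whatever CPU tensor comes out of the DataLoader, and the device index is just an example):

# copy a CPU batch to the GPU and report any values that changed or became NaN
import torch

def check_transfer(batch, device=torch.device('cuda:0')):
    gpu = batch.to(device, non_blocking=True)
    torch.cuda.synchronize(device)
    back = gpu.cpu()
    nan_count = int(gpu.isnan().sum())
    mismatch = int((back != batch).sum())  # NaN != NaN, so corrupted values also show up here
    if nan_count or mismatch:
        print(f'{nan_count} NaN values, {mismatch} mismatched values after transfer')
    return nan_count == 0 and mismatch == 0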

I’ve never observed the problem on our 4 x 4 P100 server when I run my models on a single node, but when training models on multiple nodes (SLURM and DDP) the problem occurs again. I’ve discussed this problem with our server provider and we tried swapping GPUs, but the problem persists.

pin_memory=False does not seem to fix the problem for me. I’m happy to provide additional details about our system if it helps! Here’s the output of collect_env, in case it’s useful.

Collecting environment information...
PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
GPU 2: Tesla P100-PCIE-16GB
GPU 3: Tesla P100-PCIE-16GB

Nvidia driver version: 450.102.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] pytorch-lightning==1.3.0
[pip3] torch==1.8.1
[pip3] torchmetrics==0.2.0
[pip3] torchvision==0.9.1
[conda] _pytorch_select           0.1                       cpu_0  
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.0.221             h6bb024c_0  
[conda] libmklml                  2019.0.5                      0  
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.3.0            py38h54f3939_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.20.2                   pypi_0    pypi
[conda] numpy-base                1.19.2           py38hfa32c7d_0  
[conda] pytorch-lightning         1.3.0                    pypi_0    pypi
[conda] torch                     1.8.1                    pypi_0    pypi
[conda] torchmetrics              0.2.0                    pypi_0    pypi
[conda] torchvision               0.9.1                    pypi_0    pypi