NaN values in Tensors for no obvious reason

Hi,
I'm doing a small test run of DINOv2 (https://github.com/facebookresearch/dinov2).
Since I have a pretty special setup that takes extremely long to reproduce, I’ll just try to explain the problem as clearly as possible.

I'm training on 2x L4 GPUs with pytorch==2.0.0+cu117, FSDP, and torchrun with the NCCL backend.
Basically, my loss function includes sinkhorn_knopp, which is coded as follows:
def sinkhorn_knopp_teacher(self, teacher_output, teacher_temp, n_iterations=3):
    import os
    # print("teacher sinkhorn on RANK", os.environ["RANK"])
    # if os.environ["RANK"] == '0':
    #     teacher_output = torch.load("tmp0.pth")
    #     teacher_temp = torch.load("temperature0.pth")
    # else:
    #     teacher_output = torch.load("tmp.pth")
    #     teacher_temp = torch.load("temperature.pth")
    teacher_output = teacher_output.float()
    # print("starting sinkhorn iteration on RANK", os.environ["RANK"])
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    # print("world size on RANK", os.environ["RANK"], world_size)
    Q = torch.exp(teacher_output / teacher_temp).t()  # Q is K-by-B for consistency with notations from our paper
    B = Q.shape[1] * world_size  # number of samples to assign
    K = Q.shape[0]  # how many prototypes
    Q_prenorm = Q.clone()

    # make the matrix sums to 1
    sum_Q = torch.sum(Q)
    if dist.is_initialized():
        torch.cuda.synchronize()
        dist.barrier()
        # print("reducing sum_Q on RANK", os.environ["RANK"])
        # print("tensor device on RANK", os.environ["RANK"], sum_Q.device)
        dist.all_reduce(sum_Q)
        torch.cuda.synchronize()
    Q /= sum_Q
    
    for it in range(n_iterations):
        # normalize each row: total weight per prototype must be 1/K
        locals()['Q' + str(it)] = Q.clone()
        sum_of_rows = torch.sum(Q, dim=1, keepdim=True)
        locals()["sum_of_rows_local" + str(it)] = sum_of_rows.clone()
        if dist.is_initialized():
            torch.cuda.synchronize()
            dist.barrier()
            dist.all_reduce(sum_of_rows)
            torch.cuda.synchronize()
        locals()["sum_of_rows" + str(it)] = sum_of_rows.clone()
        Q /= sum_of_rows
        locals()["Q_div_rowsum" + str(it)] = Q.clone()
        Q /= K
        locals()["Q_div_K" + str(it)] = Q.clone()
        

        # normalize each column: total weight per sample must be 1/B
        # Q /= torch.sum(Q, dim=0, keepdim=True)
        sum_of_cols = torch.sum(Q, dim=0, keepdim=True)
        locals()["sum_of_cols" + str(it)] = sum_of_cols.clone()
        Q /= sum_of_cols
        locals()["Q_div_colsum" + str(it)] = Q.clone()
        Q /= B
        locals()["Q_div_B" + str(it)] = Q.clone()

    Q *= B  # the columns must sum to 1 so that Q is an assignment
    # print("finished sinkhorn iteration on RANK", os.environ["RANK"])
    if Q.isnan().any():
        torch.save(teacher_output, "teacher_output_nan_rank" + os.environ["RANK"] + ".pth")
        torch.save(teacher_temp, "teacher_temp_nan_rank" + os.environ["RANK"] + ".pth")
    return Q.t()

However, roughly every thousand training iterations, at a random iteration within sinkhorn_knopp's for loop, sum_of_cols will contain a SINGLE NaN value. Q_div_K2 has normal numerical values (somewhere between 1e-10 and 1e-5, definitely nothing that should cause overflow/underflow), and so does everything before it.

As you can see in the code, I saved a lot of intermediate variables under unique local names (such as Q0, Q1, Q2, Q_div_K0, Q_div_K1, Q_div_K2, etc.). I added a conditional debugger breakpoint at the last line that triggers if Q.isnan().any(). It gets triggered after several thousand training iterations.

I evaluated the following expressions at my breakpoint (at the last line, starting with "return"):
torch.sum(Q_div_K2, dim=0, keepdim=True)
tensor([[0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048]], device='cuda:0')
sum_of_cols2
tensor([[0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
nan, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048]], device='cuda:0')

As you can see, there is a SINGLE NaN value in sum_of_cols2, and it is not reproducible in the debugger (sum_of_cols2 should have exactly the same values as torch.sum(Q_div_K2, dim=0, keepdim=True)).

I have been stuck on this problem for 2 days and just can't figure it out for the life of me. As you can see, I already added a synchronization and a barrier before the all_reduce call, and another synchronization after all_reduce.

My suspicion now is that this might be due to memory fragmentation, with PyTorch for some reason not triggering an OOM error. Due to FSDP, the memory usage doesn't seem to increase beyond a certain batch size (I forgot exactly how big, but right now I'm using a batch size of 52 per GPU, and even a batch size of 46 per GPU uses similar memory); see the logging sketch after the nvidia-smi output below. nvidia-smi looks like this:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M.  |
|                                         |                      |               MIG M.  |
|=========================================+======================+=======================|
|   0  NVIDIA L4                      On  | 00000000:00:03.0 Off |                    0  |
| N/A   67C    P0              25W /  75W |    106MiB / 23034MiB |      0%      Default  |
|                                         |                      |                  N/A  |
+-----------------------------------------+----------------------+-----------------------+
|   1  NVIDIA L4                      On  | 00000000:00:04.0 Off |                    0  |
| N/A   69C    P0              26W /  75W |      4MiB / 23034MiB |      0%      Default  |
|                                         |                      |                  N/A  |
+-----------------------------------------+----------------------+-----------------------+
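To actually test the fragmentation theory, the sketch below (my own addition, not part of the DINOv2 code) logs the CUDA caching-allocator state every few hundred iterations so that allocator growth or allocation retries can be correlated with the NaN events; log_every and the call site are just illustrative:

import torch

def log_cuda_memory(step, log_every=500):
    # Only take a snapshot every `log_every` steps.
    if step % log_every != 0:
        return
    allocated = torch.cuda.memory_allocated() / 1024**2  # MiB currently held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**2    # MiB reserved by the caching allocator
    stats = torch.cuda.memory_stats()                     # detailed allocator counters
    print(f"step {step}: allocated={allocated:.1f} MiB, "
          f"reserved={reserved:.1f} MiB, "
          f"alloc_retries={stats.get('num_alloc_retries', 0)}")

If reserved memory or num_alloc_retries keeps climbing while allocated memory stays flat, fragmentation is at least plausible; if everything stays flat, the NaNs are probably unrelated to the allocator.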

This error is very stochastic; it usually happens after a few thousand training iterations, regardless of whether I'm resuming from a previous checkpoint or not.

Can someone explain this mysterious bug to me? Is it just due to memory fragmentation and overflow that don't explicitly trigger errors?

Update: I moved the synchronization and barrier to after all_reduce, and the problem was solved. Apparently, although all_reduce is SUPPOSED to enqueue a blocking operation on the GPU, it behaves non-blockingly in my case (tensor operations that use the all_reduce output can run before the reduce has completed).
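For completeness, this is roughly what the changed block looks like now (a minimal sketch; only the position of the synchronize/barrier differs from the code above):

# Context: inside the sinkhorn loop, with Q and dist as in the function above.
sum_of_rows = torch.sum(Q, dim=1, keepdim=True)
if dist.is_initialized():
    dist.all_reduce(sum_of_rows)
    # synchronize/barrier AFTER the collective, so the division below cannot
    # consume sum_of_rows before the reduced values are in place
    torch.cuda.synchronize()
    dist.barrier()
Q /= sum_of_rows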

@ptrblck Begging for some guidance. In another example:
sum_of_rows = torch.sum(Q, dim=1, keepdim=True)
locals()["sum_of_rows_local" + str(it)] = sum_of_rows.clone()
if dist.is_initialized():
    torch.cuda.synchronize()
    dist.barrier()
    dist.all_reduce(sum_of_rows)
    torch.cuda.synchronize()
    dist.barrier()
locals()["sum_of_rows" + str(it)] = sum_of_rows.clone()
Q /= sum_of_rows
In this piece of code, sum_of_rows_local1 contains only valid values that are pretty small in magnitude; sum_of_rows (after the all_reduce), however, is all NaN. I'm pretty sure this is not a numerical issue; something with all_reduce seems seriously broken. This only happens every several hundred iterations. I'm using L4 GPUs with CUDA 11.7.
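For reference, an equivalent way to make completion explicit (a sketch of the idea, not something from the DINOv2 repo) would be to request the async work handle from all_reduce and wait on it before touching the result:

# Context: same spot as the snippet above (Q and dist as in sinkhorn_knopp_teacher).
sum_of_rows = torch.sum(Q, dim=1, keepdim=True)
if dist.is_initialized():
    # Request the handle explicitly; wait() blocks the current stream until the
    # collective has completed, so the division below is ordered after it.
    work = dist.all_reduce(sum_of_rows, async_op=True)
    work.wait()
Q /= sum_of_rows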

@ptrblck it seems that in other operations that don't involve all_reduce, CUDA also just randomly returns NaN, but when the exact same expression is run in the debugger in the exact same context, a reasonable number is produced. Is it possible that this is just a bug in the L4 GPU's driver? I really cannot dig out any other possible cause. I've checked the streams of execution etc.; all normal.
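Since recomputing the same expression afterwards always gives a clean result, one way to catch this at the moment it happens is a recompute-and-compare guard, roughly like this (my own debugging sketch, not repo code):

# Context: inside the sinkhorn loop; Q and os as in the function above.
sum_of_cols = torch.sum(Q, dim=0, keepdim=True)
if sum_of_cols.isnan().any():
    # Recompute the exact same reduction: if this second result is clean while
    # the first contained a NaN, the input Q was fine and the first kernel gave
    # a transient bad result (pointing away from a genuine numerical problem).
    recomputed = torch.sum(Q, dim=0, keepdim=True)
    torch.save(
        {"Q": Q, "first": sum_of_cols, "second": recomputed},
        "colsum_mismatch_rank" + os.environ.get("RANK", "0") + ".pth",
    )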

I don't think the GPU can deal with NaN values:

tensor = torch.tensor([1.0, float('nan'), 2.0]).cuda()

# Check for NaN values
nan_mask = torch.isnan(tensor)

Traceback (most recent call last):

tensor = torch.tensor([1.0, float('nan'), 2.0]).cuda()

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Works fine for me:

tensor = torch.tensor([1.0, float('nan'), 2.0]).cuda()

# Check for NaN values
nan_mask = torch.isnan(tensor)

print(nan_mask)
# tensor([False,  True, False], device='cuda:0')
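The device-side assert in the traceback above is most likely raised by an earlier kernel and only surfaces at the .cuda() call, since CUDA errors are reported asynchronously (the error message itself says as much). A quick way to pin down the real launch site is to force blocking launches; this is generic PyTorch/CUDA debugging rather than anything specific to this thread's code, and your_script.py is a placeholder:

# Launch with blocking kernels so the failing kernel is reported at its actual
# call site instead of at a later, unrelated API call:
#   CUDA_LAUNCH_BLOCKING=1 python your_script.py
import os
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")  # must be set before the CUDA context is created

import torch

t = torch.tensor([1.0, float('nan'), 2.0]).cuda()
print(torch.isnan(t))  # NaN values themselves are perfectly legal on the GPU
# tensor([False,  True, False], device='cuda:0')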