Hi,
I’m doing a small test run of DINOv2 (GitHub - facebookresearch/dinov2: PyTorch code and models for the DINOv2 self-supervised learning method).
Since I have a pretty special setup that takes an extremely long time to reproduce, I’ll just try to explain the problem as clearly as possible.
I’m training on 2× L4 GPUs with pytorch==2.0.0+cu117, FSDP, and torchrun with the NCCL backend.
Basically, my loss function includes sinkhorn_knopp, which is implemented as follows:
def sinkhorn_knopp_teacher(self, teacher_output, teacher_temp, n_iterations=3):
    import os
    # print("teacher sinkhorn on RANK", os.environ["RANK"])
    # if os.environ["RANK"] == '0':
    #     teacher_output = torch.load("tmp0.pth")
    #     teacher_temp = torch.load("temperature0.pth")
    # else:
    #     teacher_output = torch.load("tmp.pth")
    #     teacher_temp = torch.load("temperature.pth")
    teacher_output = teacher_output.float()
    # print("starting sinkhorn iteration on RANK", os.environ["RANK"])
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    # print("world size on RANK", os.environ["RANK"], world_size)
    Q = torch.exp(teacher_output / teacher_temp).t()  # Q is K-by-B for consistency with notations from our paper
    B = Q.shape[1] * world_size  # number of samples to assign
    K = Q.shape[0]  # how many prototypes
    Q_prenorm = Q.clone()
    # make the matrix sums to 1
    sum_Q = torch.sum(Q)
    if dist.is_initialized():
        torch.cuda.synchronize()
        dist.barrier()
        # print("reducing sum_Q on RANK", os.environ["RANK"])
        # print("tensor device on RANK", os.environ["RANK"], sum_Q.device)
        dist.all_reduce(sum_Q)
        torch.cuda.synchronize()
    Q /= sum_Q
    for it in range(n_iterations):
        # normalize each row: total weight per prototype must be 1/K
        locals()['Q' + str(it)] = Q.clone()
        sum_of_rows = torch.sum(Q, dim=1, keepdim=True)
        locals()["sum_of_rows_local" + str(it)] = sum_of_rows.clone()
        if dist.is_initialized():
            torch.cuda.synchronize()
            dist.barrier()
            dist.all_reduce(sum_of_rows)
            torch.cuda.synchronize()
        locals()["sum_of_rows" + str(it)] = sum_of_rows.clone()
        Q /= sum_of_rows
        locals()["Q_div_rowsum" + str(it)] = Q.clone()
        Q /= K
        locals()["Q_div_K" + str(it)] = Q.clone()
        # normalize each column: total weight per sample must be 1/B
        # Q /= torch.sum(Q, dim=0, keepdim=True)
        sum_of_cols = torch.sum(Q, dim=0, keepdim=True)
        locals()["sum_of_cols" + str(it)] = sum_of_cols.clone()
        Q /= sum_of_cols
        locals()["Q_div_colsum" + str(it)] = Q.clone()
        Q /= B
        locals()["Q_div_B" + str(it)] = Q.clone()
    Q *= B  # the columns must sum to 1 so that Q is an assignment
    # print("finished sinkhorn iteration on RANK", os.environ["RANK"])
    if Q.isnan().any():
        torch.save(teacher_output, "teacher_output_nan_rank" + os.environ["RANK"] + ".pth")
        torch.save(teacher_temp, "teacher_temp_nan_rank" + os.environ["RANK"] + ".pth")
    return Q.t()
However, roughly every thousand training iterations, at a random iteration within sinkhorn_knopp’s for loop, sum_of_cols contains A SINGLE NaN value. Q_div_K2 has normal numerical values (somewhere between 1e-10 and 1e-5, which definitely shouldn’t cause overflow/underflow), and so does everything before it.
As you can see in the code, I saved a lot of intermediate variables with unique local names (such as Q0, Q1, Q2, Q_div_K0, Q_div_K1, Q_div_K2, etc.). I added a conditional debugger breakpoint at the last line that triggers if Q.isnan().any(). It gets triggered after several thousand training iterations.
I examined the following expressions at my trigger point (the last line, starting with “return”):
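(Side note on the debugging setup: the locals() assignments are just a hack so the intermediates show up in the debugger frame. A cleaner sketch of the same idea, stashing each intermediate in a dict and failing fast on the first NaN, would look roughly like this; _debug_tensors and _check are hypothetical helper names, not part of the DINOv2 code:)

import torch

# Hypothetical debugging helper: keep a named copy of each intermediate
# and raise as soon as the first NaN shows up.
_debug_tensors = {}

def _check(name, t):
    _debug_tensors[name] = t.detach().clone()
    if torch.isnan(t).any():
        raise RuntimeError(f"NaN first appeared in {name}")

# Inside the sinkhorn loop I would then call, e.g.:
#     _check(f"sum_of_rows{it}", sum_of_rows)
#     _check(f"Q_div_K{it}", Q)
#     _check(f"sum_of_cols{it}", sum_of_cols)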
torch.sum(Q_div_K2, dim=0, keepdim=True)
tensor([[0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048]], device='cuda:0')
sum_of_cols2
tensor([[0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
nan, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048, 0.0048,
0.0048, 0.0048, 0.0048, 0.0048, 0.0048]], device='cuda:0')
As you can see, there’s a SINGLE NaN value in sum_of_cols2, and it is not reproducible in the debugger (sum_of_cols2 should have exactly the same values as torch.sum(Q_div_K2, dim=0, keepdim=True)).
I have been stuck on this problem for two days and just can’t figure it out for the life of me. As you can see, I already added a synchronization and a barrier before the all_reduce call, and another synchronization after all_reduce.
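To confirm that the NaN really is transient, the next thing I plan to do is re-check the column sum right where it is computed instead of later in the debugger. A minimal sketch (assuming Q itself is clean and the NaN is a one-off glitch in the reduction):

sum_of_cols = torch.sum(Q, dim=0, keepdim=True)
if torch.isnan(sum_of_cols).any():
    # Recompute the same reduction twice: once more on the GPU and once on a
    # CPU copy of Q. If Q contains no NaNs and both retries are clean, the
    # NaN did not come from the inputs.
    gpu_retry = torch.sum(Q, dim=0, keepdim=True)
    cpu_retry = torch.sum(Q.cpu(), dim=0, keepdim=True)
    print("Q has NaN:", torch.isnan(Q).any().item(),
          "| GPU retry has NaN:", torch.isnan(gpu_retry).any().item(),
          "| CPU retry has NaN:", torch.isnan(cpu_retry).any().item())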
My suspicion now is that this is due to memory fragmentation, with PyTorch for some reason not triggering an OOM. Due to FSDP, memory usage doesn’t seem to increase beyond a certain batch size (I forget exactly how big, but right now I’m using a batch size of 52 per GPU, and even a batch size of 46 per GPU uses similar memory). nvidia-smi looks like this (a small allocator-stats check I plan to run is sketched after the output):
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                      On  | 00000000:00:03.0 Off |                    0 |
| N/A   67C    P0              25W /  75W |    106MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L4                      On  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0              26W /  75W |      4MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
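This is the allocator-stats check mentioned above: to test the fragmentation theory I plan to log the caching allocator’s numbers from inside the training loop every so often. A minimal sketch using PyTorch’s built-in memory queries:

import torch

# Log allocator state on this rank; a large and growing gap between reserved
# and allocated memory would hint at fragmentation inside the caching allocator.
allocated_mib = torch.cuda.memory_allocated() / 2**20
reserved_mib = torch.cuda.memory_reserved() / 2**20
print(f"allocated: {allocated_mib:.0f} MiB | reserved: {reserved_mib:.0f} MiB")
print(torch.cuda.memory_summary(abbreviated=True))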
This error is very stochastic; it usually happens after a few thousand training iterations, regardless of whether I’m resuming from a previous checkpoint or not.
Can someone explain this mysterious bug to me? Is it just due to memory fragmentation and overflow that doesn’t explicitly trigger an error?