Dataloader's data going blank/corrupt after 1st iteration

I have been trying to fix this issue for some time and still have not managed. My main guess is that the problem resides in the DataLoader, as the class where I'm loading the data prints the paths and point clouds properly!

I am trying to train a network (PCN PyTorch) with two inputs: partial point clouds (input) and their labels, complete point clouds (ground truth). Something quite strange is happening: after the first epoch, the complete PCs get 'corrupted'. It is doubly strange because while the partials are always different, by design the complete PC of an object is always the same! So I tried loading one complete per object N times (N being the number of partials per object), and also loading N copies of the same complete, in case the problem was trying to open the same file concurrently.

train_dataset = ShapeNet('data/PCN', 'train', params.category)
train_dataloader = DataLoader(train_dataset, batch_size=params.batch_size, shuffle=True, num_workers=params.num_workers)
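One thing that might be worth ruling out first (this is an assumption on my side, not something confirmed yet) is the worker processes: with `num_workers=0` all loading happens in the main process, so if the completes stay intact in that configuration, the corruption is tied to the workers. A minimal sketch of that check, with a stand-in dataset where the real `ShapeNet` would go:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummySet(Dataset):
    """Hypothetical stand-in for ShapeNet, just to make the sketch runnable.
    Swap in the real dataset when debugging."""
    def __len__(self):
        return 4

    def __getitem__(self, index):
        # Same shapes as the real pipeline: 2048-point partial, 16384-point complete.
        return torch.rand(2048, 3), torch.rand(16384, 3)

# num_workers=0: everything runs in the main process, no worker subprocesses.
loader = DataLoader(DummySet(), batch_size=2, shuffle=True, num_workers=0)

for p, c in loader:
    # A corrupted batch like the one in this thread shows up as (almost) all zeros.
    assert torch.count_nonzero(c) > 0, "complete PC batch is all zeros!"
```

If the zeros disappear with `num_workers=0`, the next suspects are things each worker shares or reopens (file handles, memory-mapped buffers, library state).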

I have been checking the data up to ShapeNet's __getitem__(self, index), and after visualizing those point clouds I can tell they have no errors. The dataset is relatively small, so I can check them manually for now.

def __getitem__(self, index):

    partial_path = self.partial_paths[index]
    complete_path = self.complete_paths[index]

    partial_pc = self.random_sample(self.read_point_cloud(partial_path), 2048)
    complete_pc = self.random_sample(self.read_point_cloud(complete_path), 16384)

    ## To visualize the complete PC:
    # pcd = o3d.geometry.PointCloud()
    # pcd.points = o3d.utility.Vector3dVector(complete_pc)
    # o3d.visualization.draw_geometries([pcd])
    ##
    return torch.from_numpy(partial_pc), torch.from_numpy(complete_pc)
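One sanity check I can suggest (my own addition, not part of the original code): verify the array right before it is returned, and copy it so the tensor owns its memory, since `torch.from_numpy` shares the underlying buffer with the NumPy array. If `read_point_cloud` ever hands back a reused or memory-mapped buffer, the shared tensor would later change too. A sketch:

```python
import numpy as np
import torch

def to_tensor(pc: np.ndarray) -> torch.Tensor:
    """Defensive conversion for debugging (hypothetical helper).
    Checks the sampled cloud is non-degenerate, then copies it so the
    resulting tensor does not share memory with the source array."""
    assert pc.ndim == 2 and pc.shape[1] == 3, f"unexpected shape {pc.shape}"
    assert np.abs(pc).sum() > 0, "point cloud is all zeros before batching!"
    return torch.from_numpy(pc.copy())

# e.g. inside __getitem__:
# return to_tensor(partial_pc), to_tensor(complete_pc)
```

If the assertion never fires inside `__getitem__` but zeros still appear in the loop, the corruption happens after the sample leaves the Dataset.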

[Screenshot 2022-07-11: the 'corrupted' complete point cloud after the first epoch]

This is my main source of headaches, but I thought I could work around it somehow. I realised the corruption after using the DataLoader when trying to visualise those point clouds inside for i, (p, c) in enumerate(train_dataloader): (p for partial, c for complete) in the training loop:

for epoch in range(1, params.epochs + 1):
    # hyperparameter alpha: weight of the dense loss, scheduled by training step
    if train_step < 10000:
        alpha = 0.01
    elif train_step < 20000:
        alpha = 0.1
    elif train_step < 50000:
        alpha = 0.5
    else:
        alpha = 1.0

    # training
    model.train()
    for i, (p, c) in enumerate(train_dataloader):
        p, c = p.to(params.device), c.to(params.device)

        optimizer.zero_grad()

        # forward propagation
        coarse_pred, dense_pred = model(p)

        # loss function
After visualizing c during the first epoch, the complete PCs were alright. After the first epoch they looked as shown in the picture, regardless of the object. I thought: well, since it's just one complete per object, I might be able to load it here, convert it to a tensor, and substitute it for c in each iteration, but I am not managing that either.


            # path is the object's complete-PC file; random_sample and
            # read_point_cloud are the same helpers used in the Dataset
            complete_pc = random_sample(read_point_cloud(path), 16384)
            c = torch.from_numpy(complete_pc)
            c = c.to(params.device)

Why is the DataLoader doing this? Why would it load all the partials correctly, yet load the complete PCs correctly only during the first epoch?

Just to understand the issue correctly:

  • you are trying to load partial point clouds as well as complete ones
  • inside the Dataset.__getitem__ method every object looks correct
  • in the DataLoader loop the partials are correct, but the complete point clouds are wrong

If so, which values do the complete PCs show? Are you seeing random values or somehow “updated” values?
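A quick way to capture that per batch (just a sketch, with a hypothetical helper name) would be to log summary statistics of `c` each iteration and compare epochs:

```python
import torch

def describe(name: str, t: torch.Tensor) -> str:
    """One-line summary of a batch tensor: handy for spotting all-zero
    or exploding batches across epochs without printing the full tensor."""
    return (f"{name}: shape={tuple(t.shape)} "
            f"min={t.min().item():.4f} max={t.max().item():.4f} "
            f"nonzero={torch.count_nonzero(t).item()}")

# inside the training loop:
# print(describe("complete", c))
```

A healthy batch should show a large nonzero count; an "updated" batch would show drifting min/max values instead.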

Yes!

  1. I am trying to load partials and completes: one partial and its label (a complete PC of that partial).
  2. Every object looks correct.
  3. Exactly. I will now show one of the values of c in the first iteration (all of them seem correct, judging by the values), and then the values of c after the first iteration (a complete PC that is just 0s for some reason).

In the first iteration:

tensor([[[-0.0506,  0.0707,  0.1959],
         [-0.0563,  0.0352,  0.0951],
         [ 0.0155, -0.0445, -0.0087],
         ...,
         [-0.0146,  0.0055,  0.2190],
         [-0.0519, -0.0934,  0.0150],
         [-0.0448,  0.0730,  0.0953]],

        [[ 0.0221,  0.0643,  0.1358],
         [ 0.0154, -0.1023,  0.0147],
         [ 0.0234, -0.0052, -0.0036],
         ...,
         [ 0.0046,  0.0746,  0.0353],
         [-0.0515,  0.0245,  0.2120],
         [-0.0248, -0.0959, -0.0075]],

        [[-0.0042,  0.0739,  0.1641],
         [-0.0497, -0.0999,  0.1064],
         [-0.0545, -0.0351,  0.0054],
         ...,
         [-0.0051,  0.0738,  0.1947],
         [-0.0569,  0.0048,  0.1346],
         [-0.0149,  0.0740,  0.1650]],

        ...,

        [[-0.0546, -0.0146,  0.0049],
         [-0.0449,  0.0648, -0.0045],
         [-0.0254,  0.0708,  0.2109],
         ...,
         [-0.0508,  0.0706,  0.2040],
         [-0.0250, -0.0048, -0.0123],
         [-0.0573, -0.0546,  0.0449]],

        [[-0.0564, -0.0654,  0.1647],
         [ 0.0274, -0.0552,  0.1844],
         [ 0.0195,  0.0139,  0.1362],
         ...,
         [-0.0411, -0.0251, -0.0104],
         [ 0.0298, -0.0467,  0.0470],
         [ 0.0182, -0.0969,  0.1256]],

        [[ 0.0310,  0.0150,  0.0148],
         [-0.0504,  0.0705,  0.0838],
         [ 0.0221, -0.0055,  0.1249],
         ...,
         [-0.0354,  0.0752,  0.0651],
         [ 0.0216, -0.0548,  0.1447],
         [-0.0514,  0.0147, -0.0024]]], device='cuda:0')

And in the second iteration, where it can be seen that the PC is 'empty' for some reason:

tensor([[[0., 0., 0.],
         [0., 0., -0.],
         [0., 0., 0.],
         ...,
         [0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]],

        [[-0., 0., 0.],
         [-0., -0., 0.],
         [-0., -0., 0.],
         ...,
         [-0., -0., 0.],
         [-0., -0., 0.],
         [-0., 0., -0.]],

        [[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.],
         ...,
         [0., 0., 0.],
         [-0., 0., 0.],
         [-0., -0., -0.]],

        ...,

        [[0., 0., 0.],
         [0., 0., -0.],
         [-0., -0., 0.],
         ...,
         [0., 0., 0.],
         [-0., 0., -0.],
         [-0., -0., -0.]],

        [[0., 0., 0.],
         [-0., 0., 0.],
         [0., 0., -0.],
         ...,
         [-0., -0., 0.],
         [-0., -0., 0.],
         [-0., -0., -0.]],

        [[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.],
         ...,
         [-0., 0., -0.],
         [0., 0., 0.],
         [0., 0., 0.]]], device='cuda:0')
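For completeness (an addition on my side): the batch above is not literally empty, since the shape is intact, but every coordinate is ±0. That can be confirmed programmatically instead of eyeballing the printout:

```python
import torch

def is_degenerate(c: torch.Tensor, tol: float = 1e-8) -> bool:
    """True if every coordinate in the batch is (numerically) zero,
    i.e. the point cloud collapsed to the origin."""
    return c.abs().max().item() < tol

# Mimic the two batches shown in this thread (hypothetical stand-ins):
c_bad = torch.zeros(4, 16384, 3)         # second-iteration batch: all zeros
c_good = torch.rand(4, 16384, 3) - 0.5   # first-iteration batch: real coordinates

assert is_degenerate(c_bad)
assert not is_degenerate(c_good)
```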

Thanks for the quick reply! @ptrblck

Thanks for the update. Would it be possible to upload a sample and provide a minimal, executable code snippet to reproduce and debug the issue?


I will try to upload a small sample of my dataset (a few partials and completes of 3 objects) as well as the code. I am concerned that the network needs specific versions of CUDA and cuDNN, so it might be hard. I'll post it ASAP! Thank you very much for your quick replies.
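In the meantime, a stripped-down skeleton of what such a repro could look like (everything here is a placeholder standing in for the real ShapeNet pipeline, so it runs without CUDA or the real data):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TinyPCDataset(Dataset):
    """Placeholder dataset: two random clouds per index. The real repro
    would swap in read_point_cloud/random_sample and the actual files."""
    def __len__(self):
        return 6

    def __getitem__(self, index):
        return torch.rand(2048, 3), torch.rand(16384, 3)

# num_workers=0 keeps this runnable anywhere; bump it to match the real
# run when checking whether the bug depends on worker processes.
loader = DataLoader(TinyPCDataset(), batch_size=2, shuffle=True, num_workers=0)

for epoch in range(2):
    for i, (p, c) in enumerate(loader):
        # In the real run, c is fine in epoch 1 and all zeros afterwards;
        # this count is what would expose that.
        print(epoch, i, torch.count_nonzero(c).item())
```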