Model is not Training

Stanley_C · November 30, 2020, 6:32am

Here is my training loop code:

train_loader = torch.utils.data.DataLoader(MeshLoader("./train_mesh/", input_npoints=1024, output_npoints=2048), batch_size=batch_size, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(MeshLoader("./test_mesh/", input_npoints=1024, output_npoints=2048), batch_size=batch_size, num_workers=num_workers)
model =  PointNet_Upsample(npoint=1024, up_ratio_per_expansion=2, nstage=1).cuda()#PointNet(up_ratio=2, use_normal=False).cuda()
optimizer = torch.optim.Adam(model.parameters(),lr=0.0001)
count_train = 1
chamferDist = ChamferDistance()
model.train()

loss_array = []
for epoch in range(100):
	for step, (input_pt, target_pt) in enumerate(train_loader):
		optimizer.zero_grad()
		input_pt = input_pt.cuda().float()
		target_pt = target_pt.cuda().float()
		preds = model(input_pt)
		loss = chamferDist(target_pt, preds, bidirectional=True)	
		print("Epoch", epoch+1, ":", loss.detach().cpu().item()//input_pt.shape[0])
		loss.backward()
		optimizer.step()
		count_train = count_train + input_pt.shape[0]

However, this is the training loss I see during training.

The repeated pattern in the loss seems to indicate that the model is not training and that there is an issue with the training loop in particular, because even an incorrect model would still show some variance in the loss over time. I am using the Chamfer distance from the krrish94/chamferdist repo. It returns the sum of the distances from each point to its closest point, and I tried to use it the same way that MSE loss would be used, but it seems that the Chamfer distance is not interchangeable with it. The example.py in the repo (chamferdist/example.py at master · krrish94/chamferdist · GitHub) seems to indicate that it should be used the same way:

# Backprop using this loss!
cdist.backward()

but it does not seem to work. Could someone help me with this issue? I have tried many ways to fix this, but the issue still persists, so I was hoping someone here could help me. Thank you!

ptrblck · November 30, 2020, 7:09am

The repeated pattern indicates that you are not shuffling the data, which your code also shows, so could you enable shuffling and see if that helps?
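As a minimal sketch of what enabling shuffling looks like, the snippet below uses a small stand-in TensorDataset (hypothetical, in place of MeshLoader) and passes shuffle=True to the DataLoader. In the training code above, the equivalent change would be adding shuffle=True to the train_loader call.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset (hypothetical): 8 samples whose value equals their index.
dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))

# shuffle=True draws a new random permutation of the samples every epoch,
# so batches are no longer presented in the same fixed order.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

epoch_orders = []
for epoch in range(2):
    order = [int(v) for (batch,) in loader for v in batch.flatten()]
    epoch_orders.append(order)
    print("Epoch", epoch + 1, "sample order:", order)
```

Every sample is still visited once per epoch; only the order changes.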
Also, since the loss seems to be exactly the same, your computation graph might be detached, so check if all parameters get valid gradients after the loss.backward() operation via:

for name, param in model.named_parameters():
    print(name, param.grad)
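Printing full gradient tensors gets hard to read for larger models. A possible variation, sketched here with a toy model that only stands in for the real one, is to print one gradient norm per parameter. The helper name grad_norms is made up for this example.

```python
import torch
import torch.nn as nn

def grad_norms(model):
    """Map each parameter name to the L2 norm of its gradient (None if it has no gradient)."""
    return {
        name: (param.grad.norm().item() if param.grad is not None else None)
        for name, param in model.named_parameters()
    }

# Toy model and loss, used only to make the sketch runnable.
model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 3))
loss = model(torch.randn(16, 3)).pow(2).mean()
loss.backward()

norms = grad_norms(model)
for name, norm in norms.items():
    print(name, norm)
```

A value of None would point to a parameter that is not connected to the loss, and very large norms are easy to spot in a one-line-per-parameter listing.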

Stanley_C · November 30, 2020, 7:23am

Hi, thanks for responding! I added your code and got results like this:

FE_layer.0.weight tensor([[[[  43103.2578]],

         [[-176214.8594]],

         [[ 193186.0156]]],


        [[[ -65900.7109]],

         [[   9959.4365]],

         [[   7424.0977]]],


        [[[  -2614.8594]],

         [[ -31229.7559]],

         [[ -17630.7383]]],


        [[[  46508.3984]],

         [[  66506.5938]],

         [[ -16597.5312]]],


        [[[ -29938.4004]],

         [[  20681.8320]],

         [[  16445.1836]]],

I’m not sure if I’m interpreting it correctly, but it seems like the gradients are flowing; the model just doesn’t learn.

ptrblck · November 30, 2020, 7:26am

If you get valid gradient values for all parameters, then the computation graph is at least not detached.
Note, however, that the gradients seem to be quite large. I’m not familiar with your exact use case, but maybe try to scale the loss down a bit so that the gradient magnitudes will also be reduced.
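To illustrate how the reduction affects the gradient scale, here is a rough sketch of a bidirectional Chamfer distance written in plain PyTorch. It is not the chamferdist implementation; the function, its reduction argument, and the use of squared distances are assumptions made for this example. It compares a summed loss with a mean-reduced one on random point clouds.

```python
import torch

def chamfer_distance(a, b, reduction="mean"):
    """Bidirectional Chamfer distance between a (B, N, 3) and b (B, M, 3).

    Illustrative only: builds a (B, N, M, 3) tensor, so keep N and M small.
    """
    sq_dist = (a.unsqueeze(2) - b.unsqueeze(1)).pow(2).sum(dim=-1)  # (B, N, M)
    a_to_b = sq_dist.min(dim=2).values  # nearest point in b for each point of a
    b_to_a = sq_dist.min(dim=1).values  # nearest point in a for each point of b
    if reduction == "sum":
        return a_to_b.sum() + b_to_a.sum()
    return a_to_b.mean() + b_to_a.mean()

torch.manual_seed(0)
target = torch.randn(2, 256, 3)
preds = torch.randn(2, 256, 3, requires_grad=True)

loss_sum = chamfer_distance(target, preds, reduction="sum")
loss_mean = chamfer_distance(target, preds, reduction="mean")
(grad_sum,) = torch.autograd.grad(loss_sum, preds)
(grad_mean,) = torch.autograd.grad(loss_mean, preds)

print("sum loss:", loss_sum.item(), "max |grad|:", grad_sum.abs().max().item())
print("mean loss:", loss_mean.item(), "max |grad|:", grad_mean.abs().max().item())
```

With the summed loss, both the loss value and the gradients grow with the batch size and the number of points, while the mean-reduced version stays on a comparable scale.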

Stanley_C · November 30, 2020, 5:02pm

Okay, so I should divide the loss so that I don’t get “exploding gradients”? So my training loop is fine, and the loss is just too large, which pushes the gradient magnitudes to a point where the model does not learn anything?

ptrblck · November 30, 2020, 5:03pm

That would be one possible approach, the other one would be to lower the learning rate.

Stanley_C · November 30, 2020, 6:28pm

I tried scaling my inputs to between -1 and 1 and then dividing the loss a bit (although it is still in the six digits). I still see exploding gradients later in the neural network. Should I scale the loss down further, and try gradient clipping if that doesn’t work?
Example:

coordReconstruction_layer.0.10.bias tensor([  29565.5938,  107157.0859, -415445.0000,  -52302.0312, -174125.9688,
          91306.4375,  178269.4062, -365361.7500,  573210.0000,  207930.5625,
         -18185.3281,   58567.1094, -360612.0625,  128182.8359,   29353.1719,
        -199463.4062,  -17777.7500, -110101.8203, -532024.7500,  160549.6250,
        -576055.6875,  189424.9844,  157812.2344, -245538.3906,   85801.1406,
         176016.4375,  224826.5312,   23766.4766, -619454.1875, -294117.7500,
        -325290.5312,   60833.8672, -107854.8438,  -57861.6602, -264200.2500,
          83863.8906,  172188.1406,  385307.7500, -346806.8125,  108188.7266,
         440669.6250,  -97171.9609, -602740.8125,  129569.0156, -736367.2500,
          20825.3438,  -35092.2891,   22432.6875, -297975.8438, -282776.5938,
         226446.8125,   77869.7969,  -57261.6875, -323141.1875,  362888.2500,
        -118788.5781,   48330.5234, -223928.7188,  507906.5000,  -31208.7812,
        -106249.6562,  274496.8750, -547059.8750,  354355.5000],
       device='cuda:0')
coordReconstruction_layer.1.0.weight tensor([[[[ 1.2466e+06]],

         [[ 1.6526e+06]],

         [[ 5.6299e+06]],

         [[ 3.0220e+06]],

ptrblck · December 1, 2020, 5:02am

Yes, I would try to compare the gradient stats to a working model (e.g. use your model in a standard classification use case and just plot the gradient magnitudes for a few iterations/epochs).
I don’t know why your loss is that large, but from my naive point of view the gradient values look too large to allow proper training (disclaimer: I’m not a researcher, so take it with a grain of salt).
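For the gradient clipping mentioned above, a minimal sketch using torch.nn.utils.clip_grad_norm_ could look like the following. The toy model, the deliberately inflated loss, and the max_norm value of 1.0 are placeholders chosen for illustration, not recommended settings. The function returns the total gradient norm measured before clipping, which can also be logged to track gradient magnitudes over iterations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
max_norm = 1.0  # hypothetical threshold, tune for the actual model

x = torch.randn(16, 3)
target = torch.randn(16, 3)

optimizer.zero_grad()
# Deliberately huge loss to mimic very large gradients.
loss = 1e6 * (model(x) - target).pow(2).sum()
loss.backward()

# Rescales all gradients in place so their combined L2 norm is at most max_norm;
# the return value is the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()

print("grad norm before clipping:", norm_before.item())
print("grad norm after clipping:", norm_after.item())
```

The clipping call goes between loss.backward() and optimizer.step(), so the optimizer only ever sees the rescaled gradients.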