Hi all, I am having trouble with my model not training. I have printed out the median gradient magnitude for each parameter in the network, and the values are usually 0, and when they are not, they are very small (on the order of 10^-6).
```python
for epoch in range(num_epochs):
    # Note: at this point param.grad holds the gradients from the last
    # backward() of the previous epoch (None before the first backward()).
    for name, param in net.named_parameters():
        print(name, torch.median(torch.abs(param.grad)).data[0] if param.grad is not None else None)

    running_loss = 0.0
    for i, data_batch in enumerate(trainloader, 0):
        inputs, labels = data_batch
        labels = to_one_hot(labels, num_classes)
        labels_clone = labels.clone()
        inputs, labels = inputs.type(dtype), labels.type(dtype)
        inputs, labels = Variable(inputs), Variable(labels)

        optimizer.zero_grad()
        outputs = net(inputs)
        outputs_clone = outputs.clone().data
        metric_value = metric.update(outputs_clone, labels_clone)
        loss = criterion(outputs, labels, metric_value)
        running_loss += loss.data[0]
        loss.backward()
        optimizer.step()
```
I am training the all-convolutional network (https://github.com/StefOe/all-conv-pytorch) on CIFAR-100 with PyTorch 0.3.1.
Here is an example of the gradients that I get:

```
124.51171112060547
q1 epoch: 2 train_acc = 0.010145833333333333 val acc = 0.009166666666666667 loss = 186765.4358444214
conv1.weight 5.883157427888364e-07
conv1.bias 4.06868912250502e-06
conv2.weight 1.669390883307642e-07
conv2.bias 1.2699252692982554e-05
conv3.weight 2.7531811497283343e-07
conv3.bias 5.92384094488807e-06
conv4.weight 3.003013944180566e-07
conv4.bias 2.1904938876105007e-06
conv5.weight 3.168307500800438e-07
conv5.bias 1.6316347455358482e-06
conv6.weight 3.1667983080296835e-07
conv6.bias 1.25941642181715e-06
conv7.weight 3.2192451726587024e-07
conv7.bias 1.1171061942150118e-06
conv8.weight 3.980393898928014e-07
conv8.bias 1.3634778497362277e-06
class_conv.weight 6.011470077282866e-08
class_conv.bias 5.082645202492131e-07
```
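For intuition on why I suspect vanishing gradients: my rough understanding is that backprop composes per-layer local derivative factors multiplicatively, so if each layer scales the gradient down even modestly, the effect compounds geometrically with depth. A toy back-of-the-envelope sketch (the 0.3 per-layer factor is made up for illustration, not measured from my network):

```python
# Toy illustration of multiplicative gradient shrinkage through a deep
# stack. The per-layer factor below is a hypothetical number, not a
# measurement from the all-conv network.
per_layer_factor = 0.3   # assumed magnitude of each layer's local derivative
depth = 8                # the all-conv net has 8 conv layers before class_conv
scale = per_layer_factor ** depth
print(scale)             # about 6.6e-05: gradients near the input end up tiny
```

This would roughly match the shape of my log, where early-layer weight gradients sit around 10^-7, but it does not explain why many of them are exactly 0.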