Thanks @ptrblck for your answer! I checked with the following code whether the parameters get updated, and they don't seem to be, which matches the originally submitted plots showing that no learning is happening.
optim.zero_grad()
params_before = list(model.parameters())
loss.backward()
optim.step()
params_after = list(model.parameters())

all_params_equal = True
for p_before, p_after in zip(params_before, params_after):
    all_params_equal = all_params_equal and torch.equal(p_before.data, p_after.data)
print(f"Are all parameters equal? {all_params_equal}")
Not sure what you mean by checking whether the model is static.
Thanks for the check! Could you check the .grad attributes of all parameters before and after the first backward() call next?
They should be set to None before and should show a valid tensor afterwards.
If both print statements are showing None gradients, the computation graph seems to be detached and we would need to check why that’s the case.
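A quick way to spot a detached graph is to inspect the loss tensor's `grad_fn`. This is an illustrative sketch with a toy `nn.Linear`, not the actual model from this thread:

```python
import torch

# If the loss was computed from a detached tensor, it carries no grad_fn
# and backward() cannot populate any .grad attributes.
model = torch.nn.Linear(3, 1)
x = torch.randn(5, 3)

attached_loss = model(x).mean()
detached_loss = model(x).detach().mean()

print(attached_loss.grad_fn is not None)   # True: graph intact
print(detached_loss.grad_fn is not None)   # False: graph detached
```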
Thanks again @ptrblck . Modified the training loop as below:
optim.zero_grad()
params_before = [param.data.cpu().detach().numpy() for param in model.parameters()]
for p_before in model.parameters():
    print(f"Grad: before - {p_before.grad}")
loss.backward()
for p_after in model.parameters():
    print(f"Grad: after - {p_after.grad}")
optim.step()
params_after = [param.data.cpu().detach().numpy() for param in model.parameters()]

all_params_equal = True
for p_before, p_after in zip(params_before, params_after):
    all_params_equal = all_params_equal and np.array_equal(p_before, p_after)
print(f"Are all parameters equal? {all_params_equal}")
Previously, I think, the parameters shared the same references and were therefore marked as equal. I detached them and stored them as NumPy arrays for the comparison, and there are small changes in the parameters, i.e. they do get updated. Sorry for this!
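The pitfall can be shown in isolation. This is a minimal, self-contained sketch using a toy `nn.Linear` in place of the actual model:

```python
import torch

# list(model.parameters()) stores references to the live parameter tensors,
# so "before" and "after" point at the same storage and always compare equal.
lin = torch.nn.Linear(2, 2)
before_refs = list(lin.parameters())                       # aliases, not copies
before_copies = [p.detach().clone() for p in lin.parameters()]  # real snapshots

with torch.no_grad():
    for p in lin.parameters():
        p.add_(1.0)                                        # simulate an optimizer update

print(all(torch.equal(a, b) for a, b in zip(before_refs, lin.parameters())))    # True: aliases
print(all(torch.equal(a, b) for a, b in zip(before_copies, lin.parameters())))  # False: copies
```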
The gradients are None at first and become valid tensors after the first backward() call. Here is the output:
OK, so it seems the gradients are calculated and the parameters updated.
It’s still unclear why your model isn’t learning anything. Could you compare the gradient magnitudes between PyTorch 1.13.1 and 2.0.1?
I’m unsure what norm values would be expected, but I also see that you are using the mps backend.
Do you see the same behavior on any other backend (e.g. CPU or CUDA)?
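One way to dump per-parameter gradient norms for comparison across versions might look like the following. This is a sketch: the `Linear` model and synthetic loss are stand-ins for your own `model` and `loss`, so the snippet runs on its own:

```python
import torch

# Stand-ins for the real model and loss, so the snippet is runnable as-is.
model = torch.nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Print each parameter's gradient norm; run under both PyTorch versions
# and diff the printed values to compare magnitudes.
for name, p in model.named_parameters():
    grad_norm = p.grad.norm().item() if p.grad is not None else float("nan")
    print(f"{name}: grad norm = {grad_norm:.6e}")
```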
Thanks for checking. In this case I would start removing parts of the model until e.g. only a single layer is left, to check if you can train it at all. I haven't seen this kind of issue before, and since your parameters are updated, I guess something else might be blocking the training.
@ptrblck so the model has just two RNN layers.
Switching to just one layer gives the same bad results in PyTorch 2.0.1, while in 1.13.1 the performance goes from ~0.9 to ~0.75.
In that case it should be easy to post a minimal, executable code snippet showing the training stagnation, which we could then use to reproduce and debug the issue.
The sample contains various conda environment files depending on your hardware (Mac/CPU and Nvidia GPU) for each of the PyTorch versions 1.13.1 and 2.0.1.
To run it on CPU the following command should be executed: python main.py --num_shot 1 --num_classes 2.
Thanks! Unfortunately, your code isn’t executable without downloading an unknown dataset from an unknown source.
The comments in your model also don't seem to be valid, as using any value combination for [B, K+1, N, 784] for the data and [B, K+1, N, N] results in shape mismatches. Could you let me know what the expected shapes are?
@ptrblck we are using the Omniglot dataset, i.e. the transpose of MNIST, as per the MANN paper we are implementing in the code sample. If this poses a security concern for you, we could switch to MNIST, though it will take some time.
The code should definitely run as it is, so I'm not sure what you mean regarding the shape comments. Have you executed the code and received any errors?
In imageio 2.19.3, the min and max values are [0, 255].
However, in imageio 2.31.1, the min and max values are [0., 1.]. Hence (image / 255.) drives the image to (almost) 0 in the PyTorch 2.0.1 conda environment.
After handling this normalization correctly, all the PyTorch versions give similar performance (at least in my experiments).
image = image / image.max()
I am not sure if this is the same issue that you are facing though.
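A version-robust normalization along these lines might look like the sketch below. The helper name `to_unit_range` is hypothetical, and the range check is an assumption about how the two imageio versions differ, as described above:

```python
import numpy as np

def to_unit_range(image: np.ndarray) -> np.ndarray:
    """Normalize an image to [0, 1] regardless of whether the loader
    returned uint8 values in [0, 255] or floats already in [0, 1].
    This is a sketch of the idea, not a drop-in fix for the sample."""
    image = np.asarray(image, dtype=np.float32)
    if image.max() > 1.0:          # assume a 0-255 range from older imageio
        image = image / 255.0
    return image

print(to_unit_range(np.array([0, 128, 255], dtype=np.uint8)))   # values in [0, 1]
print(to_unit_range(np.array([0.0, 0.5, 1.0], dtype=np.float32)))  # unchanged
```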