State_dict contains erroneous parameters

Hello,

I have a strange issue when loading checkpoints from a long-running training process. Before I create an issue on GitHub, I wanted to ask whether anyone has encountered something like this, or whether it is perhaps a known issue.

I have a training process that has been running for about 22 days now (PyTorch 1.6, DistributedDataParallel on 4 GPUs, PyTorch native mixed precision training). I checked a saved checkpoint after about 8 days and, after loading it, it worked well (the inference results made sense). Now, after 22 days, I loaded the most recent checkpoint and got inference results that made no sense.
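For context, I save the checkpoints in the usual way (a minimal sketch; the helper below is illustrative rather than my exact code):

```python
import torch

def save_checkpoint(model, optimizer, scaler, epoch, path):
    """Illustrative checkpoint layout (not my exact code): unwrap the
    DistributedDataParallel wrapper and store the module's state_dict
    together with the optimizer and torch.cuda.amp.GradScaler state."""
    module = model.module if hasattr(model, "module") else model
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": module.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scaler_state_dict": scaler.state_dict(),
        },
        path,
    )
```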

So I started comparing the parameters in both checkpoints. What I found was that in the ‘faulty’ checkpoint the first two values of each parameter tensor are close to zero, while this is not the case in the good checkpoint. Please see the attached images.
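For anyone who wants to reproduce the comparison, this is roughly how I diffed the two files (a minimal sketch; the file names and the "model_state_dict" key are placeholders for my actual layout):

```python
import torch

# Placeholder paths and key names; adjust to the actual checkpoint layout.
good = torch.load("checkpoint_good.pt", map_location="cpu")["model_state_dict"]
faulty = torch.load("checkpoint_faulty.pt", map_location="cpu")["model_state_dict"]

for name, good_param in good.items():
    faulty_param = faulty[name]
    # Print the first few values of each flattened parameter tensor, so the
    # suspicious near-zero leading entries in the faulty checkpoint stand out.
    print(name)
    print("  good  :", good_param.flatten()[:4].tolist())
    print("  faulty:", faulty_param.flatten()[:4].tolist())
```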

Is any issue like this known to you?

Good checkpoint:

Faulty checkpoint:

Note: just to be clear, the faulty checkpoint was from much later in the training process and should have a much lower loss (as was calculated during training).

What was the training and validation accuracy of the “faulty” checkpoint before storing it?
If I understand your issue correctly, you are seeing worse results when reloading the checkpoint than during training?

Hi @ptrblck,

Thanks for responding. This is definitely not an overfitting issue, but to give you some insight into the model’s performance I added some of the TensorBoard graphs at the end. Yes, the issue occurs when I load the checkpoint.

To provide some more context, this is an open-domain chatbot model. I have trained these kinds of models with PyTorch many times before, and I know the difference between a model that is simply not trained well and one where something is really technically wrong. When it is not trained well (under-fitted or over-fitted), you can still hold a conversation with it, but it does not make much sense, and/or you might see glitches in the text generation. In the case of the ‘faulty’ checkpoint, the model simply outputs garbage, without any structure at all.

Furthermore, it is very strange that all parameter tensors, across all layers(!), start with the same kind of values (zero, or very close to zero). This can’t be right.

Some more details: chat_gru and conversation_gru are actually not GRUs, but Simple Recurrent Units (SRUs), see also GitHub - asappresearch/sru: Training RNNs as Fast as CNNs (https://arxiv.org/abs/1709.02755).

Because this is the first experiment in which I use SRUs, I thought the issue might be related to them. But if that were the case, why would other layers (e.g. the feed-forward layer out) also have their first two parameters close to zero? Could it be that memory gets overwritten at the C level? (SRU has a C/CUDA implementation.)

When I started the experiment I also switched to PyTorch 1.6.0, so maybe it is related to that; I can’t really tell.

Below are some screenshots from my TensorBoard for this experiment:

Moving average of cross entropy loss for training and validation set at earlier “good”/“functioning” checkpoint:

Moving average of cross entropy loss for training and validation set at later “faulty”/“garbage-out” checkpoint:

Cross entropy over whole training set, calculated once per epoch:

Cross entropy over whole validation set, calculated once per epoch:

Last note: I use a dropout of 0.2, which is why the training loss is higher than the validation loss.

Did you check the values in the good run and see that they are not close to zero?
I would start with a comparison between the “good” model after training vs. the model after reloading.
Also, could you use the latest nightly binary? I isolated a checkpoint issue a few weeks ago (which should be fixed by now) that corrupted the loading of state_dicts if CUDA tensors were stored.
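Something along these lines would show whether the values already change during a save/load round trip (a minimal sketch; model stands in for your trained model):

```python
import io
import torch

def compare_saved_vs_reloaded(model):
    """Sketch: push the state_dict through an in-memory save/load round trip
    and report any parameters whose values change during serialization."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    buffer.seek(0)
    reloaded = torch.load(buffer, map_location="cpu")

    for name, param in model.state_dict().items():
        if not torch.equal(param.detach().cpu(), reloaded[name]):
            print(f"mismatch after save/load round trip: {name}")
```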

Hey @ptrblck,

Did you check the values in the good run and see that they are not close to zero?

Yeah, that’s exactly what I did, as explained in my initial post. In the attached images of my first post you can see the parameters of the ‘good’ checkpoint and the ‘faulty’ checkpoint. As you can see, the parameter tensors in the good checkpoint do not start with values close to zero.

Also, could you use the latest nightly binary? I isolated a checkpoint issue a few weeks ago (which should be fixed by now) that corrupted the loading of state_dicts if CUDA tensors were stored.

Will do! Is there a merge request and/or a related issue where I can read more about it?

It’s this one here: https://github.com/pytorch/pytorch/issues/46020
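Until you can move to a fixed build, one possible workaround (just a sketch, assuming the corruption indeed comes from the CUDA-tensor serialization path) would be to move the state_dict to the CPU before saving:

```python
import torch

def save_cpu_state_dict(model, path):
    """Sketch of a possible workaround: move every tensor to the CPU before
    torch.save, so no CUDA tensors end up in the serialized checkpoint."""
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    torch.save(cpu_state, path)
```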


I see that this fix was merged into pytorch:release/1.7 on the 12th of October, and the 1.7.0 release was on the 23rd of October. So I can assume that your fix landed in PyTorch 1.7.0, correct?

Yes, that should be the case.