LSTM CTC model not learning in PyTorch

dnnagy · June 25, 2019, 2:04pm

Hi! I was working on an ASR model in TensorFlow Keras, but now I want to switch to PyTorch. I'm trying to reimplement the Keras model in PyTorch, but I think I made a mistake, because the same model on the same data does not learn in PyTorch.

Here is a full jupyter notebook of my problem: notebook on github

As you can see, the TF model overfits the random data, as expected, but the PyTorch model does not learn anything.

I'm using PyTorch 1.1.0 with CUDA and TensorFlow 2.0.0-beta1.

smth · June 25, 2019, 2:13pm

Initialization of weights will matter here.

Try initializing the weights of the nn.Linear layers you created with something other than the default initialization and see if it makes a difference. You can use https://pytorch.org/docs/stable/nn.html#torch-nn-init for convenience to try different initializations.
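As a minimal sketch of what trying a different initialization could look like with torch.nn.init: the helper name init_weights, the choice of Xavier/Glorot uniform, and the small stand-in model are assumptions for illustration, not taken from the notebook.

```python
import math

import torch
import torch.nn as nn


def init_weights(m):
    # Hypothetical helper: Xavier/Glorot-uniform weights and zero biases,
    # applied to nn.Linear modules only (other module types are left alone).
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)


# Stand-in model for illustration only; the real model lives in the notebook.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

# Module.apply() calls init_weights on every submodule, recursively.
model.apply(init_weights)
```

Initialization like this is normally done once, right after the model is constructed and before training starts. Repeating it every epoch or sample would throw away what the model has learned so far.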

dnnagy · June 25, 2019, 2:15pm

Could you post a simple example of how to properly initialize weights?
Do I have to re-initialize LSTM weights after each epoch or sample?

dnnagy · June 25, 2019, 2:30pm

OK, I initialized all my Linear weights based on this comment, but PyTorch is still not learning:

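import numpy as np
import torch.nn as nn

def weight_init(m):
  # Glorot/Xavier-style normal init, applied to Linear layers only.
  if isinstance(m, nn.Linear):
    fan_out, fan_in = m.weight.size()  # (rows, columns)
    # normal_() takes a standard deviation, not a variance.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    m.weight.data.normal_(0.0, std)

# FCBaseline and the SEGMENT_* constants are defined in the notebook.
baseline_model = FCBaseline(SEGMENT_WIDTH, SEGMENT_HEIGHT, SEGMENT_CHANNELS, num_classes)
baseline_model.apply(weight_init)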

a3VonG · June 25, 2019, 9:32pm

I may have missed it, but a possible reason is that you forgot to zero the gradients for each batch. You only seem to do it once, at the start. Try adding the following inside your training loop:

optimizer.zero_grad()

Does this solve your issue?

See here for an example or here for the reason why this is needed.
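To illustrate why the zeroing is needed, here is a small sketch (the toy tensor w and the variable names are made up for illustration). PyTorch adds each backward pass into .grad rather than overwriting it, so without zeroing, every minibatch's gradient is summed with all the earlier ones:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# First backward pass: d(2*w)/dw = 2, stored in w.grad.
(2 * w).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added
# to the one already stored in w.grad.
(2 * w).sum().backward()
accumulated = w.grad.clone()

# Zero the stored gradient, then run backward again: only the
# gradient of this pass remains.
w.grad.zero_()
(2 * w).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```

Here accumulated is twice first, while fresh matches first again. In a training loop, optimizer.zero_grad() does this reset for all parameters the optimizer manages.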

dnnagy · June 27, 2019, 4:48pm

Thank you, this solved my issue. I forgot to zero out the gradients after each minibatch. Now it works fine.

# Optimizer needs the gradients of this minibatch only, so zero out prev grads.
optimizer.zero_grad()
loss.backward() # Calculates derivatives with autograd
optimizer.step() # Update weights
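
For context, here is a self-contained sketch of where those three lines sit in a complete loop. It is not the notebook's code: the model TinyCTCModel, the tensor sizes, the Adam optimizer and its learning rate, and the random data are all assumptions chosen so that an LSTM + CTC model can overfit a single random batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed toy sizes: time steps, batch size, input features,
# classes (index 0 is the CTC blank), target length.
T, N, F, C, S = 30, 4, 8, 5, 6


class TinyCTCModel(nn.Module):
    """Hypothetical stand-in model: one LSTM layer plus a linear classifier."""

    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(F, 32)   # expects input of shape (T, N, F)
        self.fc = nn.Linear(32, C)

    def forward(self, x):
        out, _ = self.lstm(x)
        # nn.CTCLoss expects log-probabilities of shape (T, N, C).
        return self.fc(out).log_softmax(dim=2)


model = TinyCTCModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
ctc_loss = nn.CTCLoss(blank=0)

# One fixed random batch to overfit; target labels avoid the blank index 0.
x = torch.randn(T, N, F)
targets = torch.randint(1, C, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

losses = []
for step in range(100):
    optimizer.zero_grad()            # drop gradients from the previous step
    loss = ctc_loss(model(x), targets, input_lengths, target_lengths)
    loss.backward()                  # compute gradients for this batch only
    optimizer.step()                 # update weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Moving optimizer.zero_grad() out of the loop reproduces the original problem: each step would then use the sum of all gradients computed so far.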