A very strange phenomenon I encountered while training a machine translation model


Namespace(batch_size=32, cuda=True, d_inner_hid=2048, d_k=64, d_model=512, d_v=64, d_word_vec=512, data='data/multi30k.atok.low.pt', dropout=0.1, embs_share_weight=False, epoch=10, label_smoothing=True, log=None, max_token_seq_len=102, n_head=8, n_layers=6, n_warmup_steps=4000, no_cuda=False, proj_share_weight=True, save_mode='best', save_model='data/trained', src_vocab_size=28699, tgt_vocab_size=52799)
cuda device count: 2
[ Epoch 0 ]

  • (Training) : 79%|▊| 7034/8883 [56:34<15:09, 2.03it/s]

Learning rate update code:

    def _get_lr_scale(self):
        ''' Linear warmup for n_warmup_steps, then inverse square-root decay '''
        return np.min([
            np.power(self.n_current_steps, -0.5),
            np.power(self.n_warmup_steps, -1.5) * self.n_current_steps])

    def _update_learning_rate(self):
        ''' Learning rate scheduling per step '''
        self.n_current_steps += 1
        lr = self.init_lr * self._get_lr_scale()
        for param_group in self._optimizer.param_groups:
            param_group['lr'] = lr
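
For reference, the schedule above can be reproduced outside the optimizer wrapper to see which values it produces. This is only a minimal sketch, not the code from the training script: the function names `lr_scale` and `learning_rate` are made up for illustration, and `init_lr = d_model ** -0.5` is an assumption, since the post does not show how `init_lr` is set.

```python
def lr_scale(step, n_warmup_steps=4000):
    # Same formula as _get_lr_scale: linear warmup, then step**-0.5 decay.
    return min(step ** -0.5, n_warmup_steps ** -1.5 * step)


def learning_rate(step, init_lr, n_warmup_steps=4000):
    # Learning rate that _update_learning_rate would write into the optimizer.
    return init_lr * lr_scale(step, n_warmup_steps)


# ASSUMPTION: init_lr is not shown in the post; d_model ** -0.5 is a guess.
d_model = 512
init_lr = d_model ** -0.5

# 7034 and 8883 are the step counts visible in the progress bar of epoch 0.
for step in (1, 1000, 4000, 7034, 8883):
    print(step, learning_rate(step, init_lr))
```

The scale grows linearly until step `n_warmup_steps` and shrinks afterwards, so the learning rate peaks at step 4000, which this run reaches partway through epoch 0 (8883 steps per epoch).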

I am training a Transformer model for machine translation, but during the first epoch I ran into the strange behavior shown above.

  1. Why does the loss increase?
  2. Why does the accuracy decrease?

The loss does not have to decrease on every batch within an epoch (it can go up on individual batches), but it should decrease from one epoch to the next.
If the loss is not decreasing across epochs, the learning rate could be the problem.

Yes, the learning rate was the problem.
I changed lr to min(self.init_lr * self._get_lr_scale(), 0.00001), and the phenomenon disappeared.
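
To see what this cap does to the schedule, here is a rough standalone sketch. The helper name `capped_lr` is hypothetical, and `init_lr = 512 ** -0.5` is again an assumption, because the actual value is not given in the thread.

```python
def capped_lr(step, init_lr, n_warmup_steps=4000, max_lr=1e-5):
    # Warmup/decay scale from _get_lr_scale, followed by the cap from the fix.
    scale = min(step ** -0.5, n_warmup_steps ** -1.5 * step)
    return min(init_lr * scale, max_lr)


# ASSUMPTION: init_lr is not shown in the thread; d_model ** -0.5 is a guess.
init_lr = 512 ** -0.5

# First step at which the cap, rather than the schedule, sets the learning rate.
first_capped = next(s for s in range(1, 10000) if capped_lr(s, init_lr) == 1e-5)
print("cap becomes active at step", first_capped)
print("lr at the end of warmup:", capped_lr(4000, init_lr))
```

Under that assumption the cap takes over very early in warmup, so training effectively runs at a constant learning rate of 0.00001 for the rest of the epoch. If the real `init_lr` is much smaller, the cap may matter less.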