Unable to reproduce results in Emformer

Hello everyone,
I tried to reproduce the results for streaming ASR on MuST-C, where the reference WERs are:

dev 0.190
tst-COMMON 0.213
tst-HE 0.186

However, following the training and eval scripts listed in the link above (except the slurm part), the loss becomes inf starting from the second epoch:

Epoch 2, global step 10770: ‘Losses/val_loss’ reached inf (best 72.69429), saving model to ‘pytorch/audio/examples/asr/emformer_rnnt/experiments_gradClip5.0/checkpoints/epoch=2-step=10770.ckpt’ as top 5
Epoch 2, global step 10770: ‘Losses/train_loss’ reached inf (best 169.92964), saving model to ‘pytorch/audio/examples/asr/emformer_rnnt/experiments_gradClip5.0/checkpoints/epoch=2-step=10770-v1.ckpt’ as top 5

Later, I removed --gradient-clip-val 5.0 from the training script and got very large WERs:

dev 0.385
tst-COMMON 0.389
tst-HE 0.352

This time the training log shows the loss becoming nan from the 8th epoch, whereas the streaming ASR (MuST-C) link above indicates the model converged by the 55th epoch (it uses epoch=55-step=106679.ckpt for eval).
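For context, Lightning's --gradient-clip-val clips gradients by global L2 norm by default. Conceptually it does something like the following (a minimal sketch with plain Python lists standing in for gradient tensors; not Lightning's actual implementation):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# A spiky gradient (norm 50) gets rescaled down to norm 5;
# a small gradient (norm 0.5) passes through unchanged.
print(clip_grad_norm([30.0, 40.0], 5.0))  # [3.0, 4.0]
print(clip_grad_norm([0.3, 0.4], 5.0))    # [0.3, 0.4]
```

Note that rescaling cannot repair a gradient that is already inf or nan, which is consistent with what you saw: clipping only bounds the step size while the loss itself is still finite.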

I'm quite confused about the nan and inf losses. Could someone show me the correct script or environment? Thank you very much.


My env:
GPU: V100 (8 GPUs, 1 node for training), CUDA 10.2
python 3.7.6
torch ‘1.11.0+cu102’
torchaudio ‘0.11.0+cu102’

my training script:

$PYTHON global_stats.py --model-type mustc --dataset-path $MUSTC_DATA

$PYTHON train.py \
    --model-type mustc \
    --exp-dir ./experiments \
    --dataset-path $MUSTC_DATA \
    --num-nodes 1 \
    --gpus 8 \
    --global-stats-path ./global_stats.json \
    --sp-model-path ./spm_bpe_500.model

the MUST-C dataset I used:

$ wc -l *_asr.tsv
    1419 dev_asr.tsv
  225278 train_asr.tsv
    2588 tst-COMMON_asr.tsv
     601 tst-HE_asr.tsv
  229886 total

To avoid GPU OOM, I changed the max_token_limit from 100 to 50 in lightning.py:

        dataset = CustomDataset(MUSTC(self.mustc_path, subset="train"), 50, 20)
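For reference, a token-limit batcher like the one behind max_token_limit typically packs utterances into a batch until a length budget is exceeded. Roughly (a hypothetical stdlib sketch, not torchaudio's actual implementation; names are made up):

```python
def batch_by_token_limit(lengths, max_token_limit):
    """Group sample indices so each batch's total length stays within the limit."""
    batches, current, current_tokens = [], [], 0
    for idx, length in enumerate(lengths):
        # Flush the current batch if adding this sample would exceed the budget.
        if current and current_tokens + length > max_token_limit:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        batches.append(current)
    return batches

# Halving the limit roughly halves the batch size, which lowers peak
# GPU memory but also changes the effective batch statistics.
print(batch_by_token_limit([20, 20, 20, 20], 50))   # [[0, 1], [2, 3]]
print(batch_by_token_limit([20, 20, 20, 20], 100))  # [[0, 1, 2, 3]]
```

If the learning-rate schedule was tuned for the original batch size, halving the limit can change training dynamics, which may be worth ruling out as a cause.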

Waiting for your kind help, thank you.

Hi @Rachel_Zhang, thanks for sharing the info. I also hit a nan gradient issue when I wrote the script; it was caused by a large vocab_size when training the SentencePiece model, and decreasing it to 500 resolved the issue.

Could you share how you computed the global_stats.json and spm_bpe_500.model? Thanks!

Hi @nateanl , Thank you for your quick reply.
Here shows my global_stats.json and spm_bpe_500.model.

Scripts for generating them:

$PYTHON global_stats.py --model-type mustc --dataset-path $MUSTC_DATA
$PYTHON mustc/train_spm.py --mustc-path $MUSTC_DATA

The default vocab_size is 500 and I haven’t changed it.

Thanks! The global_stats.json and SentencePiece model should be fine. In your $MUSTC_DATA directory, is the dataset laid out like this?


Thank you for the reply.

$ pwd

$ tree . -L 2
|-- dev
|   |-- h5
|   |-- txt
|   `-- wav
|-- train
|   |-- h5
|   |-- txt
|   `-- wav
|-- tst-COMMON
|   |-- h5
|   |-- txt
|   `-- wav
`-- tst-HE
    |-- h5
    |-- txt
    `-- wav

$ wc -l */txt/*.en    
    1423 dev/txt/dev.en
  229703 train/txt/train.en
    2641 tst-COMMON/txt/tst-COMMON.en
     600 tst-HE/txt/tst-HE.en
  234367 total

Further, could you help me debug whether changing max_token_limit from 100 to 50 could lead to such a collapse? (I only have V100-16GB for now.)

Sure. I changed max_token_limit to 50 on my side. It’s been running for 3 epochs and the loss is normal. Could you share the Lightning log and provide your environment information? You can get the latter by:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Sure. I’ve put the environment (env_res.txt) and Lightning log (experiment*.log) here; awaiting your help to debug :)


Thanks! From your env_res.txt, there are two versions of PyTorch installed (pytorch 1.7.1 and torch 1.11.0), and the cudatoolkit version (9.2) doesn’t match the PyTorch build (10.2).
Could you try uninstalling and updating pytorch, torchaudio, and cudatoolkit so the versions are consistent, then re-run the experiment to see whether the nan loss issue is resolved?
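One quick sanity check after reinstalling is to compare the CUDA tags in the reported versions, e.g. (a small stdlib helper, assuming the usual "+cuXXX" suffix convention in torch-style version strings):

```python
def cuda_tag(version: str) -> str:
    """Extract the CUDA build tag from a torch-style version string."""
    return version.split("+", 1)[1] if "+" in version else "unknown"

torch_version = "1.11.0+cu102"       # from torch.__version__
torchaudio_version = "0.11.0+cu102"  # from torchaudio.__version__

# Both builds should target the same CUDA toolkit.
assert cuda_tag(torch_version) == cuda_tag(torchaudio_version) == "cu102"
print("builds match:", cuda_tag(torch_version))
```

A mismatch here (or an "unknown" tag from a CPU-only or conda build) is a sign the environment still mixes incompatible packages.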

Hi @Rachel_Zhang, have you resolved the gradient issue?