Loss becomes NaN, but in CPU mode the loss is calculated normally

I made a speech recognition Transformer.
But after the first iteration, the loss becomes NaN.

Another strange thing is that the loss is calculated normally in CPU mode.

I will attach the GitHub link to my code below, along with my test code and test results.

# Copyright (c) 2020, Soohwan Kim. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
import torch.nn as nn

from kospeech.model_builder import build_transformer

batch_size = 4
seq_length = 200
target_length = 10
input_size = 80

cuda = torch.cuda.is_available()
device = torch.device('cuda' if cuda else 'cpu')

transformer = build_transformer(
    num_classes=10,
    d_model=16,
    d_ff=32,
    num_heads=2,
    input_dim=input_size,
    num_encoder_layers=3,
    num_decoder_layers=2,
    extractor='vgg',
    dropout_p=0.1,
    device=device,
    pad_id=0,
    sos_id=1,
    eos_id=2,
    joint_ctc_attention=False,
    max_length=10,
)

criterion = nn.CrossEntropyLoss(ignore_index=0, reduction='mean')
optimizer = torch.optim.Adam(transformer.parameters(), lr=1e-04)

for i in range(10):
    inputs = torch.FloatTensor(batch_size, seq_length, input_size).to(device)
    input_lengths = torch.LongTensor([seq_length, seq_length - 10, seq_length - 20, seq_length - 30])
    targets = torch.LongTensor([[1, 3, 3, 3, 3, 3, 4, 5, 6, 2],
                                [1, 3, 3, 3, 3, 3, 4, 5, 2, 0],
                                [1, 3, 3, 3, 3, 3, 4, 2, 0, 0],
                                [1, 3, 3, 3, 3, 3, 4, 2, 0, 0]]).to(device)

    outputs, _, _ = transformer(inputs, input_lengths, targets)
    loss = criterion(outputs.contiguous().view(-1, outputs.size(-1)), targets[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(loss)

  • Result in CUDA mode
tensor(10.6171, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
tensor(nan, device='cuda:0', grad_fn=<MeanBackward0>)
  • Result in CPU mode
tensor(12.8601, grad_fn=<MeanBackward0>)
tensor(12.5969, grad_fn=<MeanBackward0>)
tensor(12.4775, grad_fn=<MeanBackward0>)
tensor(10.8609, grad_fn=<MeanBackward0>)
tensor(11.9296, grad_fn=<MeanBackward0>)
tensor(9.9925, grad_fn=<MeanBackward0>)
tensor(11.5225, grad_fn=<MeanBackward0>)
tensor(9.3546, grad_fn=<MeanBackward0>)
tensor(11.4693, grad_fn=<MeanBackward0>)
tensor(9.2175, grad_fn=<MeanBackward0>)

Please help me!

Hi Soohwan!

If I am reading your code correctly, inputs (the tensor you pass into your
transformer) is uninitialized. This could definitely cause the problems you
see.
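
For example, here is a minimal sketch of one way to get well-defined dummy inputs (using torch.randn is my assumption; any explicit initialization would do):

torch.manual_seed(0)
# explicit random data instead of an uninitialized buffer, generated on the
# cpu and then moved, so cpu and cuda runs see the same well-defined values
inputs = torch.randn(batch_size, seq_length, input_size).to(device)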

It is true that your cuda result becomes nan after only one optimizer.step()
iteration. But even before your first iteration, your cuda and cpu results differ.
This could certainly be due to your uninitialized inputs. (It could also be due
to differing random initialization of the weights in transformer, depending on
how it is constructed.) But in any event, you should start by tracking down
this difference. You should be able to get the results (including, among other
things, the loss) returned by your cuda computation to agree (up to some
reasonable round-off error) with those of your cpu computation, at least before
round-off error has begun to accumulate over the course of a number of
iterations.
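
Here is a rough sketch of the kind of check I mean, using a toy nn.Linear as a stand-in for your transformer (so this is not your actual model, just an illustration of the cpu-vs-cuda comparison):

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model_cpu = nn.Linear(80, 10)                  # stand-in for your transformer
model_cuda = copy.deepcopy(model_cpu).cuda()   # identical weights on the gpu

x_cpu = torch.randn(4, 80)                     # well-defined, shared input data
x_cuda = x_cpu.cuda()

out_cpu = model_cpu(x_cpu)
out_cuda = model_cuda(x_cuda)

# the two results should agree up to round-off error
print((out_cpu - out_cuda.cpu()).abs().max())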

Best.

K. Frank

Hi, K. Frank!

Thanks for your answer.
But while that is how it looks in the test file, the loss actually behaves the same way when I train with real speech files.
I think there are other problems. Do you have any idea?