Transformer outputs nan in eval mode

Hi! We are using nn.TransformerEncoder for a simple binary classification task. Training is fine, but when evaluating with model.eval() the output of the transformer becomes nan even though the input is fine. Evaluating without model.eval() is also fine. Our sequence length is 3, the feature dimension is 64, and there are ~3000 samples per batch (i.e., the input size is [3000, 3, 64]).
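For context, a minimal sketch of the kind of setup described above (module names, the placeholder positional encoder, and hyperparameters are illustrative assumptions, not our exact code):

import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, hidden_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.pos_encoder = nn.Identity()  # placeholder for the positional encoding
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(hidden_dim, 1)  # binary classification head

    def forward(self, sequence, batch_masks=None):
        # sequence: [batch, seq, hidden_dim], batch_masks: [batch, seq] padding mask
        pos_enc_sequence = self.pos_encoder(sequence)
        z = self.transformer(pos_enc_sequence, src_key_padding_mask=batch_masks)
        return self.head(z.mean(dim=1)).squeeze(-1)  # pool over the sequence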

Basically, the nan comes when applying

# sequence.shape = [batch, seq, hidden_dim]
pos_enc_sequence = self.pos_encoder(sequence)  # this does not contain nan's
z = self.transformer(pos_enc_sequence, batch_masks)  # nan appears in z

The transformer module is just a wrapper around nn.TransformerEncoder. In fact, the output is not entirely nan; it looks like:

tensor([[[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],

        ...,

        [[-1.0353,  1.5367, -0.8378,  ...,  1.8553,  3.2305,  0.7344],
         [-1.2352,  1.6701, -0.6161,  ...,  1.2745,  2.7163,  1.2356],
         [-1.0250,  1.8169, -0.6667,  ...,  1.4378,  2.6771,  1.3972]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]],

        [[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
         [    nan,     nan,     nan,  ...,     nan,     nan,     nan]]],
       device='cuda:0')
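Since the nan's show up per sample, a quick sanity check (a sketch; z and pos_enc_sequence refer to the snippet above) is to count the affected samples and confirm the corresponding inputs are clean:

nan_rows = torch.isnan(z).any(dim=-1).any(dim=-1)  # [batch] bool, True where a sample contains nan
print(f"{int(nan_rows.sum())} of {z.shape[0]} samples are nan")
assert not torch.isnan(pos_enc_sequence).any(), "input to the transformer already contains nan"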

We suspect the problem is in the softmax inside the attention. We are inspecting the output of each layer of the Transformer, but we are not sure whether we can get at the softmax output directly without rewriting the Transformer module. Any pointer would be hugely appreciated.
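In the meantime, one way to look at intermediate activations without rewriting the module is to register forward hooks on each encoder layer and its self_attn submodule. A sketch, assuming the underlying nn.TransformerEncoder is reachable as model.transformer (adjust the attribute path to the actual wrapper):

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # nn.MultiheadAttention returns (attn_output, attn_weights); attn_weights may be None
        activations[name] = output
    return hook

handles = []
for i, layer in enumerate(model.transformer.layers):
    handles.append(layer.register_forward_hook(make_hook(f"layer_{i}")))
    handles.append(layer.self_attn.register_forward_hook(make_hook(f"layer_{i}_self_attn")))

# run a forward pass, inspect activations, then remove the hooks:
# for h in handles: h.remove()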

Well… it's getting weirder. Evaluation is fine when I use model.eval() without torch.no_grad(), or when I skip model.eval() altogether, but once the two are used together my hooks cannot record anything from any layer inside the transformer module (they still record the outputs of the encoders that come before the transformer, which we use for fusion). It is as if the transformer module is not being run at all…

It turned out to be a bug in how we built src_key_padding_mask: we had the 1/0 (True/False) convention inverted, so real tokens were being masked and padded positions kept. After fixing the mask, evaluation behaves normally, if not perfectly bug-free.
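For anyone hitting the same thing: in nn.TransformerEncoder, src_key_padding_mask expects True (or 1) at positions attention should ignore (padding) and False (or 0) at real tokens. If the convention is inverted and a row ends up all True, every attention logit for that sample becomes -inf and the softmax returns nan, which matches the per-sample nan pattern above. A sketch of the correct construction (lengths and names here are illustrative, not our actual code):

# lengths[i] = number of real (non-padded) positions in sample i
lengths = torch.tensor([3, 2, 3], device=sequence.device)
positions = torch.arange(sequence.shape[1], device=sequence.device).unsqueeze(0)  # [1, seq]
# True = padding position to ignore, False = real token
batch_masks = positions >= lengths.unsqueeze(1)  # [batch, seq], dtype=torch.bool
z = self.transformer(pos_enc_sequence, src_key_padding_mask=batch_masks)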