Hi! We are using `nn.TransformerEncoder` for a simple binary classification task. Training is fine, but when evaluating with `model.eval()`, the output of the transformer becomes NaN even though the input is fine. Evaluating without `model.eval()` is also fine. My input length is 3, the feature dimension of each element is 64, and there are ~3000 samples per batch (i.e., the input size is [3000, 3, 64]).

Basically, the NaN values appear when applying:

```
# sequence.shape = [batch, seq_len, hidden_dim]
pos_enc_sequence = self.pos_encoder(sequence)  # this does not contain NaNs
z = self.transformer(pos_enc_sequence, batch_masks)  # NaNs appear in z
```
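For debugging, this is roughly how we check which samples in the batch come out as NaN (`z` here is a dummy tensor just so the snippet runs on its own; in our code it is the transformer output above):

```
import torch

# Dummy stand-in for the transformer output z from the snippet above.
z = torch.randn(3000, 3, 64)
z[5] = float("nan")  # pretend sample 5 blew up

# One boolean flag per sample: does any element of that sample contain NaN?
nan_per_sample = torch.isnan(z).flatten(1).any(dim=1)
print("samples with NaN output:", nan_per_sample.nonzero().flatten().tolist())
```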

The `transformer` module is just a wrapper around `TransformerEncoder`. In fact, the output is not entirely NaN; it looks like this:

```
tensor([[[ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan]],

        [[ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan]],

        [[ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan]],

        ...,

        [[-1.0353,  1.5367, -0.8378, ...,  1.8553,  3.2305,  0.7344],
         [-1.2352,  1.6701, -0.6161, ...,  1.2745,  2.7163,  1.2356],
         [-1.0250,  1.8169, -0.6667, ...,  1.4378,  2.6771,  1.3972]],

        [[ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan]],

        [[ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan],
         [ nan, nan, nan, ..., nan, nan, nan]]],
       device='cuda:0')
```

We suspect the problem might be in the softmax inside the attention layer. We are actively investigating the output of each layer of the Transformer, but right now we are not sure whether we can directly get the output of the softmax layer without rewriting the Transformer module. Any pointer would be hugely appreciated.
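For context, this is the kind of thing we are considering in order to get at the attention softmax without touching the library code: register forward hooks on each encoder layer to find the first layer whose output contains NaNs, then re-run that layer's `self_attn` on the recorded input with `need_weights=True` to get the attention weights (the softmax output). The sketch below uses a toy standalone encoder rather than our real model and assumes the default post-norm layer (`norm_first=False`), where self-attention sees the raw layer input; we are not sure it is the right approach.

```
import torch
import torch.nn as nn

# Toy stand-in for the model in question, with the same [batch, seq, hidden] layout.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()

# 1) Hook every encoder layer to record its input and output,
#    so we can see which layer is the first to produce NaNs.
records = []

def make_hook(idx):
    def hook(module, inputs, output):
        records.append((idx, inputs[0].detach(), output.detach()))
    return hook

for i, enc_layer in enumerate(encoder.layers):
    enc_layer.register_forward_hook(make_hook(i))

x = torch.randn(8, 3, 64)  # toy input matching the shapes from the question
with torch.no_grad():
    encoder(x)

for idx, inp, out in records:
    print(f"layer {idx}: input NaN={torch.isnan(inp).any().item()}, "
          f"output NaN={torch.isnan(out).any().item()}")

# 2) For a given layer (layer 0 here), re-run just its self-attention on the
#    recorded input with need_weights=True to inspect the softmax output
#    (the attention weights, averaged over heads by default).
idx, inp, _ = records[0]
with torch.no_grad():
    _, attn_weights = encoder.layers[idx].self_attn(inp, inp, inp, need_weights=True)
print("attention weights contain NaN:", torch.isnan(attn_weights).any().item())
```

If there is a cleaner built-in way to get the per-layer attention weights out of `nn.TransformerEncoder`, we would be happy to use that instead.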