All NaN output of TransformerEncoder with "normal" input

I am using a transformer model (on the CPU) based on nn.TransformerEncoder. The two layers before the transformer encoder are an nn.Linear projection layer and a fixed positional encoding layer (i.e. one with no trainable parameters).

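For reference, the setup looks roughly like this (the class names, dimensions and the positional-encoding implementation below are only indicative, not my exact code):

import math
import torch
import torch.nn as nn

class FixedPositionalEncoding(nn.Module):
    """Sinusoidal positional encoding stored in a non-trainable buffer."""
    def __init__(self, d_model, max_len=1024):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)   # buffer, so no trainable parameters

    def forward(self, x):                # x: (seq_len, batch, d_model)
        return x + self.pe[: x.size(0)]

class MyTransformer(nn.Module):
    def __init__(self, feat_dim=9, d_model=64, n_heads=8, n_layers=3):
        super().__init__()
        self.project = nn.Linear(feat_dim, d_model)        # 1) linear projection layer
        self.pos_enc = FixedPositionalEncoding(d_model)    # 2) fixed positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x, padding_mask=None):               # x: (seq_len, batch, feat_dim)
        x = self.pos_enc(self.project(x))
        return self.encoder(x, src_key_padding_mask=padding_mask)
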
I have noticed that although the input to the model never contains NaN values or values that are very large in magnitude, the output of the model is always all-NaN. When I use a simple linear layer instead of nn.TransformerEncoder, the problem disappears. Gradients cannot be the issue, because I am evaluating the model immediately after initialization.

Now, I am monitoring the activations using hooks:

import torch

def recursively_hook(model, hook_fn):
    # Register hook_fn on every leaf module (i.e. every module without children).
    for name, module in model.named_children():
        if len(list(module.children())) > 0:   # not a leaf node: recurse into it
            recursively_hook(module, hook_fn)
        else:                                  # leaf node: attach the forward hook
            module.register_forward_hook(hook_fn)

def hook_fn(m, inp, out):
    # Forward hooks receive the module's positional inputs as a tuple.
    inp = inp[0] if isinstance(inp, tuple) else inp
    if torch.isnan(out).any() and not torch.isnan(inp).any():
        print(m)
        print("Input: [ Shape {} ]\n{}".format(inp.shape, inp))
        print("Output: [ Shape {} ]\n{}".format(out.shape, out))

model = create_model()
recursively_hook(model, hook_fn)

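Incidentally, an equivalent and flatter way to hook only the leaf modules is to iterate over named_modules() directly:

for name, module in model.named_modules():
    if len(list(module.children())) == 0:      # leaf module
        module.register_forward_hook(hook_fn)
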
Based on the activations, I have observed that:

  • the first two aforementioned layers produce outputs without any NaN; NaN first appears somewhere inside the nn.TransformerEncoder.
  • apparently, no module produces a NaN output without first having received NaN in its input - so the hook with the rule above never prints anything!

I am wondering how this second point is even possible (given that NaN never appears in the input to the model). Am I doing something wrong with the hooks?
At the same time, the model parameter values also appear to be fine (not NaN and not “abnormally” large).

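The parameter check itself is just a scan over named_parameters(), roughly:

for name, p in model.named_parameters():
    if not torch.isfinite(p).all():                     # does not trigger in my case
        print("Non-finite values in parameter:", name)
    print(name, "max |value| =", p.abs().max().item())
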
Weirdly enough, when I try to train the model, NaN now appears immediately in the output of the first nn.Linear layer, i.e. before the nn.TransformerEncoder (so the hook function above does print, but only for this first layer). Does this sound like an initialization problem?

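One way to narrow down where the NaN first appears during training is autograd's anomaly detection (the shapes and the loss below are placeholders for my actual batch and objective):

torch.autograd.set_detect_anomaly(True)      # extra run-time checks inside autograd

x = torch.randn(100, 16, 9)                  # placeholder training batch
loss = model(x).mean()                       # placeholder loss on the create_model() model
loss.backward()                              # if any backward op produces NaN, this raises
                                             # and points at the responsible operation
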
Update: I am still not sure why the hook_fn defined above doesn’t print anything, but I have identified the problem: I was not following the correct convention for padding_mask. True or 1 should mean “ignore/pad” and False or 0 should mean “consider”.
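
A minimal sketch of the correct convention for reference (the sequence lengths, feature dimension and the MyTransformer sketch from above are just placeholders):

# src_key_padding_mask: bool tensor of shape (batch, seq_len), where
# True  -> padding position, to be IGNORED by attention,
# False -> real position, to be attended to.
lengths = torch.tensor([100, 80, 60])                               # placeholder sequence lengths
seq_len, batch = 100, len(lengths)
padding_mask = torch.arange(seq_len)[None, :] >= lengths[:, None]   # (batch, seq_len), True = pad

x = torch.randn(seq_len, batch, 9)                                  # finite random input
out = MyTransformer()(x, padding_mask=padding_mask)
print(torch.isnan(out).any())                                       # tensor(False)

With the convention inverted (True meaning “real”), a sequence without padding turns into a fully masked row: every attention score becomes -inf, the softmax of that row is NaN, and the NaN spreads to the whole output. Since this presumably happens inside the functional attention call and not in any hooked leaf module, every hooked module that outputs NaN also receives NaN in its input, which would explain why hook_fn stays silent.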