I am using a transformer model (on the CPU) based on nn.TransformerEncoder. The first 2 layers before the transformer encoder are an nn.Linear projection layer and a fixed positional encoding layer (i.e. with no trainable parameters).
I have noticed that, although the input to the model never contains NaN values or values of very large magnitude, the output of the model is always all-NaN. When I use a simple linear layer instead of nn.TransformerEncoder, the problem disappears. Gradients cannot be the issue, because I am evaluating the model immediately after initialization, before any backward pass has run.
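For context, here is a minimal, self-contained reconstruction of the setup described above. All names, dimensions, and the sinusoidal positional encoding are my assumptions (not the actual code), and this sketch does not necessarily reproduce the NaN issue:

```python
import math
import torch
import torch.nn as nn

# Hypothetical reconstruction: Linear projection -> fixed positional
# encoding (no trainable parameters) -> nn.TransformerEncoder.
class ProjectedTransformer(nn.Module):
    def __init__(self, feat_dim=8, d_model=16, nhead=2, num_layers=2, max_len=100):
        super().__init__()
        self.project = nn.Linear(feat_dim, d_model)
        # Fixed sinusoidal positional encoding, stored as a buffer
        # so it has no trainable parameters.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(pos * div)
        pe[:, 0, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):  # x: (seq_len, batch, feat_dim)
        x = self.project(x) + self.pe[: x.size(0)]
        return self.encoder(x)

def create_model():
    return ProjectedTransformer()
```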
Now, I am monitoring the activations using hooks:
import torch

def recursively_hook(model, hook_fn):
    for name, module in model.named_children():
        if len(list(module.children())) > 0:  # not a leaf node: recurse into it
            recursively_hook(module, hook_fn)
        else:  # leaf module: attach the hook
            module.register_forward_hook(hook_fn)

def hook_fn(m, inp, out):
    inp = inp[0]  # forward hooks receive the input as a tuple of tensors
    if isinstance(out, tuple):  # some modules (e.g. nn.MultiheadAttention) return a tuple
        out = out[0]
    if torch.isnan(out).any() and not torch.isnan(inp).any():
        print(m)
        print("Input: [ Shape {} ]\n{}".format(inp.shape, inp))
        print("Output: [ Shape {} ]\n{}".format(out.shape, out))

model = create_model()
recursively_hook(model, hook_fn)
Based on the activations, I have observed that:
- the 2 aforementioned layers produce NaN-free output, and NaN first appears inside the nn.TransformerEncoder.
- apparently, no module produces NaN output without first having received NaN in its input, so the hook with the rule above never prints anything!
I am wondering how this second point can ever be possible (given that NaN is never in the input to the model). Am I doing something wrong when using the hooks?
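One pitfall worth double-checking here: register_forward_hook passes the module's input as a *tuple* of tensors, not a tensor, so a condition like `torch.isnan(inp)` on the raw argument would raise a TypeError rather than quietly evaluate to False. A minimal check on a throwaway nn.Linear:

```python
import torch
import torch.nn as nn

# Record what a forward hook actually receives.
seen = {}
def probe(m, inp, out):
    seen["inp_type"] = type(inp)  # tuple, not a Tensor
    seen["inp_has_nan"] = torch.isnan(inp[0]).any().item()
    seen["out_has_nan"] = torch.isnan(out).any().item()

lin = nn.Linear(3, 3)
lin.register_forward_hook(probe)
lin(torch.randn(2, 3))
```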
At the same time, model parameter values also appear to be okay (not NaN or “abnormally” large).
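This is roughly how I am sweeping the parameters (the 1e3 threshold for "abnormally large" is an arbitrary choice on my part):

```python
import torch

# Return the names of parameters containing NaN or unusually large values.
def check_parameters(model, max_abs=1e3):
    bad = []
    for name, p in model.named_parameters():
        if torch.isnan(p).any() or p.abs().max() > max_abs:
            bad.append(name)
    return bad
```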
Weirdly enough, when I try to train the model, NaN now appears immediately in the output of the first nn.Linear layer, i.e. before the nn.TransformerEncoder (so the hook function above does print, but only for this first layer). Does this sound like an initialization problem?
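One way I thought of to distinguish an initialization problem from a training-step problem is to check whether the gradients, rather than the weights, are the first thing to go NaN after a single backward pass. A sketch, shown on a stand-in nn.Linear so it is self-contained (in my case `model` and the batch would be the real ones):

```python
import torch
import torch.nn as nn

# After one backward pass, report which parameters received NaN gradients.
# NaN here would point at the loss/gradients rather than the initialization.
model = nn.Linear(8, 8)
x = torch.randn(4, 8)
loss = model(x).pow(2).mean()
loss.backward()
nan_grads = [name for name, p in model.named_parameters()
             if p.grad is not None and torch.isnan(p.grad).any()]
print(nan_grads)
```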