Getting NaNs from dropout layer

My model is throwing NaNs intermittently. While debugging, I found that on every occasion dropout was the first layer whose output was NaN. Why is dropout outputting NaNs?

The model is being trained in mixed precision.

I added this hook:

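import torch

def nan_hook(module, inp, output):
    # Forward hook: report any module whose output contains a NaN.
    # A module may return a single tensor or a tuple of outputs.
    outputs = output if isinstance(output, tuple) else (output,)
    for out in outputs:
        # Skip non-tensor entries (e.g. None), which torch.isnan cannot handle.
        if torch.is_tensor(out) and torch.isnan(out).any():
            print("Hook: Nan occured In", module.__class__.__name__)

# self.model is the model being trained; hook every submodule.
for submodule in self.model.modules():
    submodule.register_forward_hook(nan_hook)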

Output

70_driver_log_9.txt-SystemLog: 2020-02-22 17:13:45,253:DEBUG : transformers_pretraining.trainer.apexDDP : 159 : ***** Training step 172 *****
70_driver_log_9.txt-SystemLog: 2020-02-22 17:13:45,253:DEBUG : transformers_pretraining.utils : 47 : Inside <function Singleprocess._forward at 0x7f773402fb70>
70_driver_log_9.txt:Hook: Nan occured In Dropout
70_driver_log_9.txt:Hook: Nan occured In BertSelfAttention
70_driver_log_9.txt:Hook: Nan occured In Linear
70_driver_log_9.txt:Hook: Nan occured In Dropout
70_driver_log_9.txt:Hook: Nan occured In LayerNorm
70_driver_log_9.txt:Hook: Nan occured In BertSelfOutput
70_driver_log_9.txt:Hook: Nan occured In BertAttention

Model architecture:

BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=1024, out_features=1024, bias=True)
      (key): Linear(in_features=1024, out_features=1024, bias=True)
      (value): Linear(in_features=1024, out_features=1024, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=1024, out_features=1024, bias=True)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=1024, out_features=4096, bias=True)
  )
  (output): BertOutput(
    (dense): Linear(in_features=4096, out_features=1024, bias=True)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

That is strange. Dropout should be close to a straight copy (it only zeroes some elements and rescales the rest), so it shouldn’t introduce NaNs on its own. It’s also considered a neutral op for mixed precision (its inputs aren’t cast to a different type). Are you sure dropout is the most recent operation that occurred? I’d put a print statement right before and right after the offending dropout call to be sure.
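If editing the model source is inconvenient, the same before/after check can be sketched with hooks. The snippet below is only an illustration under assumptions: the helper names `attach_dropout_checks` and `_has_nonfinite` are made up for this example, and the tiny `nn.Sequential` model stands in for the real one. It checks the input as well as the output of every `nn.Dropout`, and it looks for inf as well as NaN. Keep in mind that forward hooks only fire at module boundaries, so an operation called functionally between two modules would not be reported; a non-finite value at dropout's input would suggest the problem starts upstream.

```python
import torch
from torch import nn


def _tensors(obj):
    """Yield every tensor inside a tensor or a (nested) tuple/list."""
    if torch.is_tensor(obj):
        yield obj
    elif isinstance(obj, (tuple, list)):
        for item in obj:
            yield from _tensors(item)


def _has_nonfinite(obj):
    """True if any floating-point tensor in obj contains NaN or inf."""
    return any(
        t.is_floating_point() and not torch.isfinite(t).all()
        for t in _tensors(obj)
    )


def attach_dropout_checks(model, log):
    """Hypothetical helper: record (module_name, 'input' | 'output') in
    `log` whenever a Dropout module sees or produces a non-finite value."""
    handles = []
    for name, module in model.named_modules():
        if not isinstance(module, nn.Dropout):
            continue

        def pre_hook(mod, inputs, name=name):
            # Runs right before the dropout forward call.
            if _has_nonfinite(inputs):
                log.append((name, "input"))

        def post_hook(mod, inputs, output, name=name):
            # Runs right after the dropout forward call.
            if _has_nonfinite(output):
                log.append((name, "output"))

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return handles


# Stand-in model (an assumption for the demo, not the BERT model above).
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.1))
log = []
handles = attach_dropout_checks(model, log)

# NaN input propagates through Linear, so dropout's input is already NaN.
model(torch.full((2, 4), float("nan")))
print(log)

# A finite input should leave a fresh log empty.
clean_log = []
log_backup = list(log)
log.clear()
model(torch.ones(2, 4))
clean_log, log = list(log), log_backup

for h in handles:
    h.remove()
```

If only the "output" entry ever shows up for a given dropout, the NaN really is created inside it; if the "input" entry shows up too, dropout is just the first module to observe it.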

In any case, if you’re using Apex Amp, you should be aware that it will soon be deprecated. I’m working on native PyTorch support for mixed precision, targeting the upcoming 1.5 release:
https://github.com/pytorch/pytorch/pull/32140
https://github.com/pytorch/pytorch/pull/33366
Both PRs have upstream approval and are undergoing end-to-end testing locally. It may be preferable to wait for the native integration and see if that resolves your issue.

Hi,
Yes, I am sure about this. My hook always prints dropout as the first layer where NaN appears in the output. This has happened more than once.

It’s hard to add print statements because I am using Hugging Face BERT. That’s why I used the hooks.

This NaN issue is very frequent at the 217th epoch, which drives the Amp loss scale down to 0 and results in a division-by-zero error.
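To illustrate how repeated overflows can push the scale all the way to zero, here is a minimal pure-Python sketch. It assumes a dynamic scaler that halves the scale on every overflowing step and doubles it after a window of clean steps; the default values (initial scale of 2^16, window of 2000) and the `min_scale` floor are assumptions made for this example, not values read from my configuration, and `simulate_loss_scale` is a made-up name.

```python
def simulate_loss_scale(overflow_steps, total_steps,
                        init_scale=2.0 ** 16, scale_window=2000,
                        min_scale=None):
    """Return the loss scale after each step of a simplified dynamic scaler.

    Assumed behaviour: halve on overflow, double after `scale_window`
    consecutive clean steps. `min_scale` is a hypothetical floor.
    """
    overflow_steps = set(overflow_steps)
    scale = init_scale
    clean_steps = 0
    history = []
    for step in range(total_steps):
        if step in overflow_steps:
            # Overflow (inf/NaN gradients): the step is skipped, scale halves.
            scale /= 2.0
            if min_scale is not None:
                scale = max(scale, min_scale)
            clean_steps = 0
        else:
            clean_steps += 1
            if clean_steps == scale_window:
                scale *= 2.0
                clean_steps = 0
        history.append(scale)
    return history


# Every step overflows: the scale keeps halving until it underflows to 0.0.
unbounded = simulate_loss_scale(range(1100), 1100)

# Same run with a floor: the scale stops shrinking at min_scale.
floored = simulate_loss_scale(range(1100), 1100, min_scale=1.0)

print(unbounded[0], unbounded[-1], floored[-1])
```

In this simplified model, a long enough run of consecutive overflows is enough to reach a scale of exactly 0.0, after which dividing by the scale fails. A floor keeps the scale positive, but it would not remove whatever is producing the NaNs in the first place.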

We are not sure how to solve this. What would you suggest for now?
