Hi, I’m trying out PyTorch AMP now that it’s been released, but I’m getting NaN values in my model output. I traced the problem to the output of a matmul containing some infinite values; the inputs to the matmul operation are fine.
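For reference, this is roughly the check I used to narrow it down (the shapes below are placeholders for my actual activations):

```python
import torch

theta_x = torch.randn(128, 256, device="cuda")  # placeholder activations
phi_x = torch.randn(256, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    f = torch.matmul(theta_x, phi_x)  # matmul runs in FP16 under autocast
    print(torch.isfinite(theta_x).all(), torch.isfinite(phi_x).all())  # inputs are finite
    print(torch.isfinite(f).all())  # False on my real data: the output contains inf
```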
Does the same error happen in normal FP32 training?
If not, I think running the block in an autocast(enabled=False) context is one option. But when using this context, you’ll need to convert some input tensors to FP32.
“autocast(enabled=False) subregions can be nested in autocast-enabled regions. Locally disabling autocast can be useful, for example, if you want to force a subregion to run in a particular dtype. Disabling autocast gives you explicit control over the execution type. In the subregion, inputs from the surrounding region should be cast to dtype before use.” (quote from Automatic Mixed Precision package - torch.amp — PyTorch 2.1 documentation)
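A minimal sketch of what that could look like (the tensors and shapes are placeholders standing in for your theta_x and phi_x):

```python
import torch

x = torch.randn(128, 256, device="cuda")       # placeholder input
w_theta = torch.randn(256, 64, device="cuda")  # placeholder weights
w_phi = torch.randn(256, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    theta_x = x @ w_theta      # produced under autocast, so FP16
    phi_x = (x @ w_phi).t()    # also FP16
    with torch.autocast(device_type="cuda", enabled=False):
        # autocast is disabled here, so cast the FP16 inputs from the
        # surrounding region to FP32 before the matmul that overflows
        f = torch.matmul(theta_x.float(), phi_x.float())
```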
A possible solution is to scale down the values of one of the two matrices, or both of them, before the matmul operation; you can try something similar to the scaling used in softmax attention.
For example, you can try doing it this way:

```python
import math

# divide one operand by sqrt of the hidden (shared) dimension before
# the matmul, similar to the scaling in softmax attention
f = torch.matmul(theta_x / math.sqrt(your_hidden_size), phi_x)
```
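The idea is the same as the 1/sqrt(d_k) factor in scaled dot-product attention: FP16 tops out at 65504, so a sum of products over the hidden dimension can easily overflow, and dividing one operand by sqrt(your_hidden_size) keeps the accumulated values within range.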