I am training with mixed precision via `torch.cuda.amp.autocast(enabled=True)` and I am having trouble with a masking operation in my transformer model. I need to mask out all the 0 positions before the softmax computation, so I fill the masked logits with a very negative value, which forces the softmax to assign them zero attention. Unfortunately, under mixed precision this value overflows. The operation is:
```python
_MASKING_VALUE = -1e30
masked_attn_logits = attn_logits.masked_fill(attn_mask == 0, value=_MASKING_VALUE)
```
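For reference, here is a quick NumPy check of the half-precision range that I believe is the culprit (NumPy's `float16` follows the same IEEE half-precision format that autocast casts matmul inputs to, so it should illustrate the same overflow):

```python
import numpy as np

# float16 can only represent magnitudes up to 65504
print(np.finfo(np.float16).min)   # -65504.0
print(np.float16(-1e30))          # -inf: overflows
print(np.float16(-1e6))           # -inf: still overflows
print(np.float16(-1e4))           # -10000.0: representable
```

So any masking constant with magnitude above ~6.5e4 becomes `-inf` once the logits are in fp16.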
I already tried reducing the masking value to -1e6, but without success. Can you help me?