Autocast / mixed precision: unexpected dtype behavior when converting to half (float16)


import torch
from torch.cuda.amp import autocast

# Input tensor explicitly cast to float16 on the GPU
x = torch.randn(5).cuda()
x = x.to(torch.float16)

print('x type: ', x.dtype)

with autocast(enabled=True, dtype=torch.float16):
    y = torch.exp(x)
    print('y type: ', y.dtype)

    z = torch.sum(x)
    print('z type: ', z.dtype)

    w = x + 1
    print('w type: ', w.dtype)

This gives the following output:
[screenshot: x prints as torch.float16, while y and z print as torch.float32]

(1) Why do exp/sum produce a 32-bit floating point result here?
(2) When kernels are called during mixed-precision training, some operations may call 32-bit kernels rather than 16-bit ones. How can I find out the expected behavior for a given operation? (See the sketch below.)
(3) Does anybody know whether, even for float16 inputs, these ops compute in float32 first, write the result, and then convert it to 16 bits with another kernel (e.g. for matrix multiplication, GEMM, SpMM)?
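
Regarding (2), besides the op lists in the documentation, one way to check a specific op is to probe its output dtype under autocast directly. A minimal sketch (the helper result_dtype_under_autocast is just an illustrative name, not a PyTorch API):

import torch
from torch.cuda.amp import autocast

def result_dtype_under_autocast(op, *args):
    # Run the op inside an autocast region and report the dtype it produces.
    with autocast(enabled=True, dtype=torch.float16):
        return op(*args).dtype

x = torch.randn(5, device='cuda', dtype=torch.float16)

print(result_dtype_under_autocast(torch.exp, x))        # expected: torch.float32 ("can autocast to float32" list)
print(result_dtype_under_autocast(torch.sum, x))        # expected: torch.float32 ("can autocast to float32" list)
print(result_dtype_under_autocast(torch.matmul, x, x))  # expected: torch.float16 ("can autocast to float16" list)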

This section of the docs might be helpful, as it describes which operators are eligible for autocasting.

@ptrblck There are two categories in the documentation. One of the listed ones for the GPU is "can autocast to float32", but it does not explain when an op goes to float32 and when it does not. I am a little confused about the selection criteria. Or does it mean these ops do not go into float16 at all?
I noticed exp does not go into float16 even when the input is float16, as shown in the code above. exp is listed under "can autocast to float32":

https://pytorch.org/docs/stable/amp.html#cuda-ops-that-can-autocast-to-float32
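
For comparison, the upcast seems to happen only inside the autocast region; outside of it, exp on a float16 tensor stays in float16. A quick check (assuming a CUDA device is available):

import torch
from torch.cuda.amp import autocast

x = torch.randn(5, device='cuda', dtype=torch.float16)

# Outside autocast: the op runs in the input dtype.
print(torch.exp(x).dtype)       # torch.float16

# Inside autocast: exp is on the "can autocast to float32" list,
# so its input is cast up and the result is float32.
with autocast(enabled=True, dtype=torch.float16):
    print(torch.exp(x).dtype)   # torch.float32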