```python
import torch
from torch.cuda.amp import autocast

x = torch.randn(5).cuda()
x = x.to(torch.float16)
print('x type: ', x.dtype)

with autocast(enabled=True, dtype=torch.float16):
    y = torch.exp(x)
    print('y type: ', y.dtype)
    z = torch.sum(x)
    print('z type: ', z.dtype)
    w = x + 1
    print('w type: ', w.dtype)
```
This gives the following output:

```
x type:  torch.float16
y type:  torch.float32
z type:  torch.float32
w type:  torch.float16
```
(1) Why do `exp` and `sum` produce a 32-bit floating-point result here, even though the input is float16? (See the first sketch below.)
(2) During mixed-precision training, some operations end up calling 32-bit kernels rather than 16-bit ones. How can I find out the expected behavior of each operation? (The second sketch below is an empirical probe.)
(3) Even for float16 inputs, are the operations (matrix multiplication: GEMM, SpMM) computed in float32 first and written out, then converted to 16 bits by a separate kernel? (See the third sketch below.)
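For (1), my current understanding (an assumption based on the AMP docs, not verified in the dispatcher source) is that autocast keeps per-op lists: ops like `matmul` are cast to float16, numerically sensitive ops like `exp` and `sum` are cast to float32, and unlisted ops simply run in their input dtype. A minimal sketch illustrating the three cases, assuming a CUDA device:

```python
import torch
from torch.cuda.amp import autocast

a = torch.randn(4, 4, device='cuda')  # float32 inputs
b = torch.randn(4, 4, device='cuda')

with autocast(dtype=torch.float16):
    # matmul is on the float16 list: inputs are cast down to half.
    print(torch.matmul(a, b).dtype)  # torch.float16
    # exp and sum are on the float32 list: results come back as float32,
    # which matches the y/z prints above.
    print(torch.exp(a).dtype)        # torch.float32
    print(a.sum().dtype)             # torch.float32
    # add is on neither list: it runs in the input dtype.
    print((a + 1).dtype)             # torch.float32 (a is float32 here)
```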
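For (2), the documented answer appears to be the "Autocast Op Reference" page in the PyTorch AMP docs, which lists which CUDA ops autocast to float16, which to float32, and which promote to the widest input type. When in doubt, an empirical check works too; `probe` below is a hypothetical helper name of mine:

```python
import torch
from torch.cuda.amp import autocast

def probe(name, fn):
    """Run `fn` on a float32 CUDA tensor under autocast and report the result dtype."""
    x = torch.randn(8, 8, device='cuda')
    with autocast(dtype=torch.float16):
        out = fn(x)
    print(f'{name:>6}: {out.dtype}')

probe('matmul', lambda t: t @ t)         # expect torch.float16 (float16 list)
probe('exp',    lambda t: torch.exp(t))  # expect torch.float32 (float32 list)
probe('sum',    lambda t: t.sum())       # expect torch.float32 (float32 list)
probe('add',    lambda t: t + 1)         # expect torch.float32 (unlisted: input dtype)
```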
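For (3), my understanding (again an assumption, not verified against the kernel source) is that fp16 GEMMs keep inputs and outputs in float16 and accumulate in float32 inside the kernel (e.g., on Tensor Cores), so there is no separate down-conversion kernel for the output. The real flag `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction` controls whether the reduction itself may use reduced precision. A rough sketch comparing an fp16 GEMM against a float32 reference:

```python
import torch

a = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
b = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)

# Forbid reduced-precision reductions inside fp16 GEMMs (PyTorch >= 1.13),
# which should force float32 accumulation.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

out_fp16 = a @ b                 # float16 in, float16 out
out_ref = a.float() @ b.float()  # float32 reference
print(out_fp16.dtype)            # torch.float16
# If this error stays small, the fp16 GEMM accumulated at higher precision.
print((out_fp16.float() - out_ref).abs().max())
```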