Matrix multiplication of two fp16 tensors can sometimes overflow. Does the PyTorch matmul kernel support fp16 inputs with an fp32 output?