Possibly a noob question here. Say I have two bf16 tensors. Is there a way to do matmul on them and return the accumulator in FP32?
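This might be implementation-dependent, but in almost all cases the accumulation itself is done in fp32 by default. What usually happens is that the result gets rounded back to bf16 on output, so getting an fp32 tensor back takes an extra step. You can try out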
variant that might work
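
To make the accumulator point concrete, here is a small framework-agnostic sketch in plain Python. It is not any library's matmul. It emulates bf16 and fp32 rounding with `struct` and compares a dot product accumulated in fp32 against one accumulated in bf16. The helper names (`to_f32`, `to_bf16`, `dot_fp32_acc`, `dot_bf16_acc`) are made up for illustration, and the rounding helpers assume small finite values (no NaN, inf, or overflow handling).

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to bf16: keep the top 16 bits of the fp32 pattern,
    using round-to-nearest-even. Sketch only: finite inputs assumed."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_fp32_acc(a, b):
    """Dot product of bf16 inputs with an fp32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_bf16_acc(a, b):
    """Same dot product, but every partial sum is rounded to bf16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_bf16(acc + to_bf16(x * y))
    return acc

# One large term followed by many small ones; all values are exact in bf16.
a = [1.0] + [2.0 ** -9] * 257
b = [1.0] * 258

fp32_out = dot_fp32_acc(a, b)   # fp32 accumulate, fp32 result
bf16_out = to_bf16(fp32_out)    # fp32 accumulate, result rounded to bf16
bf16_acc = dot_bf16_acc(a, b)   # bf16 accumulate (small terms get absorbed)

print(fp32_out, bf16_out, bf16_acc)  # 1.501953125 1.5 1.0
```

The middle value is the usual default behaviour described above: the sum is computed accurately, but the last bits are dropped when the output is stored as bf16. Asking for the fp32 result keeps them.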