FP16*FP16 in convolution generating FP32

#1

@qcolombet

Looking at the FP16 implementation of convolution:

    sum += float(
        filterW.at({d, fx, fy, fd}) *
        inW.at({n, (size_t)ox, (size_t)oy, g * inCperG + fd}));

filterW and inW are FP16; sum is a float.

The * defined in glow/Support/Float16.h returns FP16.

Shouldn't we define a version of * that returns float when multiplying two FP16 quantities, since we are accumulating in FP32?

(Quentin) #2

There is no right or wrong answer here. I can see some hardware doing the full computation in float, while other hardware does each multiplication in fp16 and only accumulates in fp32. The interpreter does the latter.

The idea for the basic fp16 operators is that all intermediate results are cast back to fp16.
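To make the difference between the two behaviours concrete, here is a minimal standalone sketch. It is not Glow code: `roundToHalfPrecision`, `dotRoundedProducts` and `dotFloatProducts` are hypothetical names, and the fp16 rounding is emulated on plain floats by keeping an 11-bit significand. The emulation assumes values stay in the fp16 normal range (no overflow or subnormal handling). Because the product of two 11-bit significands fits exactly in a float's 24-bit significand, rounding the float product once should give the same result as a correctly rounded fp16 multiply in that range.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: round x to the nearest value with an 11-bit
// significand, which is what fp16 stores for normal numbers.
// Assumption: x is within the fp16 normal range, so overflow and
// subnormals are ignored here.
float roundToHalfPrecision(float x) {
  if (x == 0.0f || !std::isfinite(x)) {
    return x;
  }
  int exp = 0;
  float mantissa = std::frexp(x, &exp); // x = mantissa * 2^exp, 0.5 <= |mantissa| < 1
  // Scale to an 11-bit integer, round (to nearest even by default), scale back.
  float rounded = std::nearbyint(std::ldexp(mantissa, 11));
  return std::ldexp(rounded, exp - 11);
}

// Each product is rounded to fp16 precision, then accumulated in fp32.
// This mirrors an operator* that returns FP16.
float dotRoundedProducts(const std::vector<float> &a,
                         const std::vector<float> &b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += roundToHalfPrecision(a[i] * b[i]);
  }
  return sum;
}

// Each product is kept in fp32 and accumulated in fp32.
// This mirrors an operator* that returns float.
float dotFloatProducts(const std::vector<float> &a,
                       const std::vector<float> &b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

With four inputs all equal to 1 + 2^-10 (exactly representable in fp16), each exact product is 1 + 2^-9 + 2^-20. The rounded-product path drops the 2^-20 term every time and returns 4 + 2^-7, while the float-product path returns 4 + 2^-7 + 2^-18. The gap is small per element but grows with the number of accumulated terms, which is why the choice is visible when comparing against hardware that multiplies in full float.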