There is no right or wrong answer in that case. I can see some hardware doing the whole computation in fp32, while other hardware only accumulates in fp32 and effectively does everything else in fp16. The interpreter does the latter.
The idea for the basic fp16 operators is that every intermediate result is cast back to fp16.
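
To make the difference concrete, here is a minimal NumPy sketch. It is not the interpreter's code, and the function names are made up for illustration. It sums the same fp16 inputs twice: once rounding every partial sum back to fp16, and once keeping the running sum in fp32 and rounding only the final result.

```python
import numpy as np

def sum_fp16_intermediates(values):
    """Hypothetical helper: every partial sum is rounded back to fp16."""
    acc = np.float16(0.0)
    for v in values:
        # The addition result is cast to fp16 before the next step.
        acc = np.float16(acc + np.float16(v))
    return acc

def sum_fp32_accumulate(values):
    """Hypothetical helper: fp16 inputs, fp32 running sum, one final rounding."""
    acc = np.float32(0.0)
    for v in values:
        # Inputs are still fp16 values, but the accumulator stays in fp32.
        acc = acc + np.float32(np.float16(v))
    return np.float16(acc)

ones = [1.0] * 4096

# fp16 has an 11-bit significand, so above 2048 the spacing between
# representable values is 2 and adding 1.0 no longer changes the sum.
print(sum_fp16_intermediates(ones))  # 2048.0
print(sum_fp32_accumulate(ones))     # 4096.0
```

Both results are valid fp16 values computed from the same inputs, which is why neither can really be called wrong.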