When entering an autocast-enabled region, Tensors may be of any type. You should not call half() or bfloat16() on your model(s) or inputs when using autocasting.
But how should I write my code so that model(x) runs in bfloat16 while loss_fun(y_pred, y_true) later runs in float32?
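One way to sketch this (using a placeholder model and shapes, not the original code): run only the forward pass inside the autocast context, then cast the predictions back to float32 before computing the loss outside it.

```python
import torch
import torch.nn as nn

# Placeholder binary classifier; the real model and shapes will differ.
model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
loss_fun = nn.BCELoss()

x = torch.randn(4, 8)
y_true = torch.rand(4, 1)  # float32 targets

# Only the forward pass is autocast: matmuls run in bfloat16 on CPU.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_pred = model(x)  # y_pred comes out as bfloat16

# Outside the autocast region, cast predictions to float32 so both
# BCELoss inputs have matching float32 dtypes.
loss = loss_fun(y_pred.float(), y_true)
print(loss.dtype)  # torch.float32
```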
I just realized I made a silly mistake: I forgot that my training code has a utility module that consumes y_pred and defines a loss_fun to re-compute the loss together with other metrics for tracking/logging. I did not wrap that module's loss computation under autocast, even though it consumes the same bfloat16 y_pred.
The errors I reported in my original post actually came from that utility module. If I don't explicitly convert y_true to bfloat16, loss_fun receives two input arguments of different dtypes. If I do explicitly convert y_true to bfloat16, both arguments are bfloat16, but binary_cross_entropy cannot process bfloat16 inputs because it is not running under autocast.
In conclusion, autocast should work well with BCELoss() on CPU, as long as the loss computation itself stays inside the autocast region.
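The conclusion above can be sketched as follows (again with a placeholder model): when both the forward pass and the loss are inside the autocast region, CPU autocast casts the binary_cross_entropy inputs up to float32 automatically, so no manual dtype conversion is needed.

```python
import torch
import torch.nn as nn

# Placeholder binary classifier illustrating the conclusion.
model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
loss_fun = nn.BCELoss()

x = torch.randn(4, 8)
y_true = torch.rand(4, 1)  # float32 targets, left as-is

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y_pred = model(x)               # runs in bfloat16
    loss = loss_fun(y_pred, y_true)  # autocast promotes inputs to float32

print(y_pred.dtype, loss.dtype)  # torch.bfloat16 torch.float32
```

Note the contrast with the failure described above: calling the same loss_fun on bfloat16 inputs *outside* autocast raises an error, because binary_cross_entropy is only up-cast to float32 when autocast is active.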