Given an arbitrary fp32 nn.Module that fits on a single GPU, is there a full enumeration of the differences between
- MixedPrecision(torch.bfloat16, torch.float32)
- torch.autocast("cuda", dtype=torch.bfloat16)
in computation?
I noticed that certain modules/methods do not run in the expected precision under FSDP MixedPrecision, so there is evidently some difference between the two.
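
For concreteness, this is a minimal sketch of the two setups I am comparing (assumes a process group has already been initialized with torch.distributed.init_process_group; `build_model` and `batch` are placeholder names for an arbitrary fp32 module and input):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision


def build_model():
    # Toy fp32 model standing in for an arbitrary nn.Module.
    return nn.Sequential(
        nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)
    ).cuda()


batch = torch.randn(8, 1024, device="cuda")

# (1) FSDP-native mixed precision: gathered parameters (and hence the
#     forward/backward compute) are cast to bf16, while gradient
#     reduction is kept in fp32 via reduce_dtype.
fsdp_mp = FSDP(
    build_model(),
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
    ),
)
out_a = fsdp_mp(batch)

# (2) autocast around the forward of an fp32 FSDP model: parameters stay
#     fp32 and each op chooses its dtype from autocast's per-op casting
#     rules (e.g. matmuls in bf16, many reductions/normalizations in fp32).
fsdp_fp32 = FSDP(build_model())
with torch.autocast("cuda", dtype=torch.bfloat16):
    out_b = fsdp_fp32(batch)
```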