DDP + fp16 + gradient accumulation

Yes. When I disable fp16 in single-GPU training, the model performs better over the first few logging steps, but eventually there is no significant difference with or without fp16. Training is much slower without fp16, though.