Training loss with DDP is noticeably worse than with DP

Hello everyone, I have a question that has been puzzling me for a long time.
The model I trained in DDP mode is much worse than the one trained in DP mode.
The specific settings are as follows: I keep the total batch_size the same, each run uses 4 GPUs, both use the Adam optimizer, and the learning rate is 1e-3.
I don’t know what’s wrong. Can anyone help me?
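For context, the DDP side looks roughly like this. This is only a simplified sketch with a placeholder model and dataset (launched with `torchrun --nproc_per_node=4`); the real model and data loading are omitted:

```python
# Simplified sketch of the DDP run; model, dataset, and batch numbers are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def main():
    # One process per GPU, launched with torchrun; LOCAL_RANK is set by the launcher.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    model = nn.Linear(128, 10).cuda(local_rank)          # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Placeholder dataset; DistributedSampler gives each process a disjoint shard.
    dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)
    # batch_size here is per process, so the total batch is 4x this value
    # (the actual numbers are placeholders, not my real configuration).
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(5):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()       # gradients are all-reduced across processes here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```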

@ioyy at the outset, the main difference between DataParallel and DistributedDataParallel is that DP runs in a single process (one thread per GPU), while DDP runs one process per GPU, possibly across machines. Both synchronize gradients so that the replicas trained on different threads/processes/machines end up applying the same update.
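As a rough illustration of that difference (a minimal sketch, not your setup; the model and launcher details are placeholders):

```python
# Hedged sketch contrasting the two wrappers in PyTorch; names are illustrative.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_dp(model: nn.Module) -> nn.Module:
    # DataParallel: a single process; the model is replicated onto each GPU
    # every forward pass and gradients are summed back into the original
    # module on the default GPU.
    return nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])


def wrap_ddp(model: nn.Module) -> nn.Module:
    # DistributedDataParallel: one process per GPU (e.g. launched with
    # torchrun --nproc_per_node=4). After backward(), gradients are
    # all-reduced (averaged) across processes, so every replica applies
    # the same update.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])
```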

Please share your code so that we can look into it further. Also, please make sure you use the same seed in every process/machine when running DDP.
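For example, something along these lines, called in every process before the model is built (`set_seed` is just an illustrative helper, not from your code):

```python
# Set identical seeds in every DDP process so all ranks build the same initial weights.
import random
import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```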