Hugging Face Transformers with multiple GPUs: torch.cuda.empty_cache() gets stuck

https://github.com/huggingface/transformers/blob/835de4c8335f72a9c53178f54cc3b4c0688960ec/src/transformers/trainer.py#L3219

torch.cuda.empty_cache()

When training with multiple GPUs, the run hangs at this line forever. At that point there is no GPU usage (nvidia-smi shows 0% utilization), but the processes still show CPU usage.
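Here is a minimal sketch of the situation, assuming the job is launched with `torchrun --nproc_per_node=<num_gpus>` and uses NCCL-backed DDP; the script structure and names here are placeholders, not my exact training code:

```python
import os
import torch
import torch.distributed as dist

def main():
    # Standard DDP setup when launched via torchrun (LOCAL_RANK is set by the launcher).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # ... training / evaluation step on each rank would go here ...

    # The call that hangs for me inside Trainer (trainer.py line linked above).
    # With multiple GPUs it never returns: no GPU utilization, but CPU usage remains.
    torch.cuda.empty_cache()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```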

Does anyone have any idea what might cause this?