Training slows down with PowerSGD

Training Info:
GPU Device Type: A100
Number of GPUs: 8

Code snippet:
process_group = dist.new_group(ranks=None, backend="nccl")
state = PowerSGDState(process_group=process_group, matrix_approximation_rank=1)
Reference: DDP Communication Hooks — PyTorch 1.11.0 documentation

Hi @jinyuan.feng, are you saying that training slows down when you enable PowerSGD? How are you measuring the slowdown, and what baseline script are you comparing against?
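
One way to make the comparison concrete is to time the same training step with and without the hook registered. The helper below is only a minimal sketch: `time_steps`, `step_fn`, and `sync` are hypothetical names that do not come from your script, and `step_fn` stands in for one forward/backward/optimizer step.

```python
import time
from statistics import mean


def time_steps(step_fn, warmup=5, iters=20, sync=None):
    """Return the mean wall-clock seconds per call of step_fn.

    step_fn: hypothetical callable that runs one training step.
    sync:    optional callable run after each step; on GPU this would
             typically be torch.cuda.synchronize, so that queued kernels
             and collectives are included in the measurement.
    """
    def run_once():
        step_fn()
        if sync is not None:
            sync()

    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        run_once()

    durations = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Run it once on the DDP model with the PowerSGD hook and once on the same model without it, passing `sync=torch.cuda.synchronize` in both cases. Note that `PowerSGDState` also takes a `start_powerSGD_iter` argument that defers compression for the first iterations, so the timed iterations should fall after that point if the goal is to measure PowerSGD itself.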