Are Distributed Optimizers supported for CUDA?

The PyTorch documentation page for Distributed Optimizers says:

WARNING: Distributed optimizer is not currently supported when using CUDA tensors

Link to a snapshot of the documentation page, captured by the Web Archive at the time of writing this post: Distributed Optimizers — PyTorch master documentation

However, the tutorial on Distributed optimizers (Shard Optimizer States with ZeroRedundancyOptimizer — PyTorch Tutorials 1.11.0+cu102 documentation) uses models on CUDA devices with no issues at all.

I tried using the ZeroRedundancyOptimizer with my model too and it does seem to be working even when models are on CUDA/GPUs. Is the documentation page inaccurate/out of date, or am I misinterpreting what the warning means?


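Distributed optimizers are a little bit different: DistributedOptimizer is typically used with the distributed RPC framework, so the warning presumably applies to that class rather than to every optimizer under torch.distributed.optim.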

The ZeroRedundancyOptimizer is a sharded optimizer, typically used with DDP (DistributedDataParallel), and it does support CUDA tensors.
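
For reference, here is a minimal sketch of the DDP + ZeroRedundancyOptimizer combination. It is not taken from the tutorial: the function name train_one_step, the layer sizes, the learning rate, and the file-based rendezvous are all assumptions made to keep the example self-contained. It runs as a single process (world_size=1), using CUDA if a GPU is available and falling back to CPU with the gloo backend otherwise.

```python
import os
import tempfile
from typing import Optional

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_step(rank: int = 0, world_size: int = 1,
                   init_method: Optional[str] = None) -> float:
    """Run one DDP + ZeroRedundancyOptimizer step and return the loss value."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda", rank) if use_cuda else torch.device("cpu")
    backend = "nccl" if use_cuda and dist.is_nccl_available() else "gloo"
    if use_cuda:
        torch.cuda.set_device(device)

    if init_method is None:
        # Assumption for this demo: a fresh file-based rendezvous, which only
        # makes sense for world_size=1. A multi-process job must pass the
        # same init_method to every rank.
        init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
        init_method = f"file://{init_file}"

    dist.init_process_group(backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    try:
        # Toy model; the sizes are arbitrary.
        model = nn.Linear(8, 4).to(device)
        ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)

        # Optimizer state is sharded across ranks; each rank updates only
        # its own shard and the updated parameters are then synchronized.
        optimizer = ZeroRedundancyOptimizer(
            ddp_model.parameters(),
            optimizer_class=torch.optim.Adam,
            lr=0.01,
        )

        inputs = torch.randn(16, 8, device=device)
        targets = torch.randn(16, 4, device=device)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(f"loss after one step: {train_one_step():.4f}")
```

With several GPUs you would launch one process per device (for example with torch.multiprocessing.spawn or torchrun) and pass each its own rank plus a shared init_method; the optimizer construction stays the same.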
