How to use torch.distributed.optim.ZeroRedundancyOptimizer with overlap_with_ddp=True?

The optional parameter overlap_with_ddp is not well illustrated in the documentation.
I don't understand this requirement from the docs:

(2) registering a DDP communication hook constructed from one of the functions in

So I just ignored it.

When I tried to test it, this warning appeared:

WARNING:root:step() should not be included in the training loop when overlap_with_ddp=True

After I removed optimizer.step(), the warning disappeared, but the parameters were never updated; they stayed fixed.

I am confused about how to use this parameter properly.

Is it possible to add an example to explain this?
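For reference, here is my current attempt, based on my reading of the docs: requirement (2) seems to mean wrapping an allreduce hook with hook_with_zero_step from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook and registering it on the DDP model, so that the optimizer step runs inside the communication hook instead of in the training loop. I am not sure this is correct (the NCCL requirement and the warmup behavior are my assumptions from the docstrings), and I would run it with torchrun across multiple GPUs:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step


def main():
    # Launched via `torchrun --nproc_per_node=N script.py`; my understanding
    # is that the overlapped hook currently requires the NCCL backend.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])
    opt = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.SGD,
        overlap_with_ddp=True,  # step() must NOT be called in the loop
        lr=0.01,
    )
    # Requirement (2), as I understand it: register a DDP comm hook
    # constructed from one of the functions in ddp_zero_hook.
    model.register_comm_hook(None, hook_with_zero_step(allreduce_hook, model, opt))

    loss_fn = torch.nn.MSELoss()
    for _ in range(10):
        inp = torch.randn(4, 10, device=rank)
        loss_fn(model(inp), torch.zeros(4, 10, device=rank)).backward()
        # No opt.step() here: the hook applies the update during backward.
        # The docs suggest the first couple of iterations are warmup and
        # do not apply an optimizer step, which may explain seeing
        # unchanged parameters in very short tests.

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With this setup the warning about step() goes away for me, but I would still like confirmation that this is the intended usage.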

cc @awgu for using Zero optimizer