Solutions to both problems:
- Updating the NVIDIA driver from 525.89.02 to 525.105.17 solved the NCCL problem.
- For model synchronization to work, forward and backward passes must alternate. In the test I performed, this was not the case.
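
The alternation requirement can be checked in a test with a small guard. This is only a sketch in plain Python with no training framework involved: `AlternationGuard` and its `forward`/`backward` methods are hypothetical names made up for illustration, not part of any library. The idea is to call them next to the real forward and backward calls so that two forward passes in a row fail loudly instead of silently breaking synchronization.

```python
class AlternationGuard:
    """Hypothetical helper: checks that forward and backward passes alternate."""

    def __init__(self):
        self.pending_forward = False  # True after a forward pass, until backward runs
        self.steps = 0                # number of completed forward/backward pairs

    def forward(self):
        # A second forward pass before any backward pass breaks the alternation.
        if self.pending_forward:
            raise RuntimeError("two forward passes in a row; expected a backward pass in between")
        self.pending_forward = True

    def backward(self):
        # A backward pass is only valid right after a forward pass.
        if not self.pending_forward:
            raise RuntimeError("backward pass without a preceding forward pass")
        self.pending_forward = False
        self.steps += 1


# Usage: one forward pass followed by one backward pass, repeated.
guard = AlternationGuard()
for _ in range(3):
    guard.forward()   # call next to the real forward pass
    guard.backward()  # call next to the real backward pass
```

A test that runs several forward passes before a single backward pass would raise on the second `guard.forward()` call.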