Different training speed with 1 node × 8 GPUs vs. 8 nodes × 1 GPU on the same 8-V100 machine

When training models on an 8-GPU machine with Docker, I have tried the following two setups:

  1. start 1 Docker container with all 8 GPUs and run DDP with nnodes=1 and nproc-per-node=8
  2. start 8 Docker containers with 1 GPU each and run DDP with nnodes=8 and nproc-per-node=1

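For reference, the two setups correspond roughly to the following torchrun invocations (a sketch: `train.py`, `NODE_RANK`, and `MASTER_ADDR` are placeholders, not my actual script or addresses):

```shell
# Setup 1: one container holding all 8 GPUs, one process per GPU
torchrun --nnodes=1 --nproc-per-node=8 train.py

# Setup 2: run once in each of the 8 containers (NODE_RANK = 0..7);
# each container sees exactly one GPU
torchrun --nnodes=8 --nproc-per-node=1 \
    --node-rank=$NODE_RANK \
    --master-addr=$MASTER_ADDR --master-port=29500 \
    train.py
```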
The first setup is noticeably faster than the second (17 steps/s vs. 13 steps/s).
In theory both setups can use NCCL to communicate, so what causes this training speed gap?
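One thing I thought of checking (an assumption on my part, not a confirmed cause) is which transport NCCL actually selects in each setup; `NCCL_DEBUG=INFO` makes NCCL log its chosen communication path during process-group initialization:

```shell
# Log NCCL's transport selection at init time; in the output, look for
# channel lines mentioning a fast intra-node path (e.g. P2P or shared
# memory) vs. a socket-based network path
NCCL_DEBUG=INFO torchrun --nnodes=1 --nproc-per-node=8 train.py 2>&1 | grep -i nccl
```

If the single-container run uses a direct GPU-to-GPU path while the 8-container run falls back to sockets, that could plausibly explain the gap, but I have not verified this.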