Performance difference: DataParallel vs Distributed

I am going to build a computer with 4 GPUs. As far as I know, I cannot run all 4 GPUs at x16 PCIe lanes each in a single machine, so I am considering two options:

  1. Two machines: each has 1 CPU and 2 GPUs, so that every GPU can operate at x16 lanes. The two machines are then connected using the “Distributed” package.

  2. One machine: it has 1 CPU and 4 GPUs, so the GPUs operate at x8 lanes. “DataParallel” is then used to utilize all 4 GPUs.
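
To make the trade-off concrete, here is the rough arithmetic I have in mind. It is only a back-of-the-envelope sketch: it assumes PCIe 3.0 (about 0.985 GB/s usable per lane), a hypothetical 10 Gb/s Ethernet link between the two machines, and a hypothetical model with 25 million float32 parameters. None of these numbers are measurements, and the sketch ignores the actual all-reduce algorithm, latency, and any overlap of communication with computation.

```python
# Back-of-the-envelope estimate of the time to move one full set of gradients
# over each link. All constants below are assumptions, not measurements.

PCIE3_GB_PER_S_PER_LANE = 0.985   # assumed: PCIe 3.0, approx. usable GB/s per lane
ETHERNET_GB_PER_S = 10 / 8        # assumed: 10 Gb/s link between the two machines
NUM_PARAMS = 25_000_000           # hypothetical model size
BYTES_PER_PARAM = 4               # float32 gradients


def pcie_bandwidth(lanes):
    """Nominal PCIe 3.0 bandwidth in GB/s for the given lane count."""
    return lanes * PCIE3_GB_PER_S_PER_LANE


def transfer_time_ms(payload_bytes, bandwidth_gb_per_s):
    """Time in milliseconds to move payload_bytes at bandwidth_gb_per_s."""
    seconds = payload_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1000


gradient_bytes = NUM_PARAMS * BYTES_PER_PARAM

links = {
    "PCIe x16 (option 1, inside each machine)": pcie_bandwidth(16),
    "PCIe x8 (option 2, single machine)": pcie_bandwidth(8),
    "Ethernet (option 1, between machines)": ETHERNET_GB_PER_S,
}

for name, bandwidth in links.items():
    ms = transfer_time_ms(gradient_bytes, bandwidth)
    print(f"{name}: {bandwidth:.2f} GB/s, {ms:.1f} ms per gradient copy")
```

Under these assumptions, halving the lanes doubles the PCIe transfer time, but the link between the two machines in option 1 would be slower than x8 PCIe. Whether either cost matters in practice depends on the model and the real interconnect, which is what I cannot test.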

I currently have no machines on which to test these two setups.
If anyone has experience with either of them, please share it.
Thanks!