Model sharding, data parallelism and NVLink

Hi guys,
I read the following thread:
https://discuss.pytorch.org/t/split-single-model-in-multiple-gpus/13239

and also watched the following video:
https://youtu.be/_d3xs1L4jeA

You guys provided great information, but I still have some questions. My group is interested in buying a server with 8 Nvidia A40 GPUs, where the 8 GPUs are split into 4 groups of 2 and each pair of GPUs is physically connected with an NVLink bridge.
I wonder how having 4 NVLink-bridged pairs will affect data parallelism and model sharding. Will the code snippets provided in the thread above work as-is? For example, will I be able to achieve model sharding across two pairs of GPUs?
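
To make the question concrete, this is the kind of sharding pattern from the thread above that I have in mind, shown here across a single pair of GPUs (a minimal sketch; the layer sizes and names are placeholders, not our actual model):

```python
import torch
import torch.nn as nn

# Minimal two-GPU model sharding, in the spirit of the linked thread.
class ShardedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations cross the GPU-to-GPU link here (NVLink if the two
        # devices are bridged, PCIe otherwise).
        x = self.part2(x.to("cuda:1"))
        return x

model = ShardedModel()
out = model(torch.randn(32, 1024))
print(out.device)  # cuda:1
```

My question is whether this pattern still makes sense when the two devices I pick are in different NVLink pairs, i.e. only connected over PCIe.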
More generally, I wonder how much NVLink accelerates training and inference in common multi-GPU scenarios such as data parallelism and model sharding, compared to the same setup without NVLink at all.
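
For the data-parallel side, this is the standard DDP setup I would plan to use (a sketch assuming one process per GPU launched with torchrun; the model is a placeholder). My understanding is that the NCCL backend uses NVLink between GPU pairs that have it and falls back to PCIe otherwise, but I'd like to confirm how much that matters in practice:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; LOCAL_RANK is set by torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).to(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = ddp_model(x).sum()
        optimizer.zero_grad()
        # Gradient all-reduce happens here via NCCL, over NVLink where
        # available (my assumption), otherwise over PCIe.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

I would launch this with `torchrun --nproc_per_node=8 train.py` and check the interconnect topology with `nvidia-smi topo -m`.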

Thanks!