I’m comparing the performance of Gloo and NCCL by profiling the time of sending/recving tensors between two AWS instances.
Since Gloo is usually recommended for CPU tensors, I send/recv CPU tensors when using Gloo and GPU tensors when using NCCL.
I tested two kinds of AWS instances: g4dn.xlarge and p3.2xlarge. The advertised network bandwidth is up to 25 Gbps for g4dn.xlarge and 10 Gbps for p3.2xlarge.
Results on the g4dn instance:
data size (MB) | Gloo (GB/s) | NCCL (GB/s)
4              | 0.59        | 0.91
8              | 0.58        | 1.14
16             | 0.58        | 1.4
32             | 0.58        | 1.34
64             | 0.58        | 1.67
128            | 0.57        | 1.32
256            | 0.57        | 1.3
512            | 0.57        | 1.51
1024           | 0.57        | 1.46
Results on the p3 instance:
data size (MB) | Gloo (GB/s) | NCCL (GB/s)
4              | 1.11        | 1.19
8              | 1.11        | 1.17
16             | 1.11        | 1.16
32             | 1.11        | 1.15
64             | 1.11        | 1.16
128            | 1.11        | 1.16
256            | 1.11        | 1.15
512            | 1.11        | 1.15
1024           | 1.11        | 1.15
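For reference, here is the advertised link speed converted into the same units as the tables (just dividing by 8; this is context, not a measurement):

def gbps_to_gb_per_s(gbps):
    # Instance network bandwidth is advertised in gigabits per second,
    # while the tables above report gigabytes per second.
    return gbps / 8

print(gbps_to_gb_per_s(25))  # g4dn.xlarge: 25 Gbps -> ~3.1 GB/s
print(gbps_to_gb_per_s(10))  # p3.2xlarge: 10 Gbps -> 1.25 GB/s

So the p3 numbers sit close to that instance's 10 Gbps line rate, while on the g4dn both backends stay well below 25 Gbps.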
On the g4dn instance, Gloo reaches only ~30%-60% of NCCL's bandwidth, while on the p3 instance the two are very close. I don't understand why Gloo is so much slower than NCCL on the g4dn instance. What could be limiting Gloo's performance?
By the way, is Gloo still the recommended way to send/recv tensors on CPU?
I think Gloo is still the recommended backend for CPU tensors. Also, Gloo is synchronous (blocking) while NCCL is asynchronous (non-blocking). How did you get your results? Are both runs using the GPU (with Gloo or with NCCL)?
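To illustrate what I mean by blocking vs. non-blocking, here is a minimal sketch (illustrative only, with a CPU tensor and the Gloo backend, assuming a two-rank process group is already initialized):

import torch
import torch.distributed as dist

# Assumes dist.init_process_group('gloo', ...) has already been called with two ranks.
tensor = torch.ones(1024 * 1024, dtype=torch.float16)

if dist.get_rank() == 0:
    # Blocking point-to-point: returns only once the send has completed.
    dist.send(tensor, dst=1)

    # Non-blocking variant: returns a request handle immediately.
    req = dist.isend(tensor, dst=1)
    req.wait()  # block here until the transfer is done
else:
    dist.recv(tensor, src=0)
    req = dist.irecv(tensor, src=0)
    req.wait()

With the NCCL backend, send/recv also return before the transfer has actually finished on the GPU, so host-side timing additionally needs torch.cuda.synchronize().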
Hi Hugo, thanks for the reply. I use Gloo for sending/recving CPU tensors and NCCL for GPU tensors. Here is my code; the environment variable CPU_TENSOR controls whether CPU or GPU tensors are used, and I call torch.cuda.synchronize() to make sure I measure the actual time of the asynchronous NCCL operations.
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as multiproc

# CPU_TENSOR=1 selects CPU tensors + Gloo; otherwise GPU tensors + NCCL.
CPU_TENSOR = os.environ.get("CPU_TENSOR", "0") == "1"


def run(local_rank, global_rank, cpu_tensor):
    warmup_times = 20
    repeat_times = 100
    if not cpu_tensor:
        torch.cuda.set_device(local_rank)
    for i in range(1, 11):
        data_size_in_mb = 2 ** i
        # Number of fp16 elements (2 bytes each) for data_size_in_mb MB.
        data_size = data_size_in_mb * 1024 * 1024 // 2
        if cpu_tensor:
            tensor = torch.ones(data_size, dtype=torch.float16)
        else:
            tensor = torch.ones(data_size, dtype=torch.float16).cuda()
        if global_rank == 0:
            # warmup
            for _ in range(warmup_times):
                dist.send(tensor, dst=1 - global_rank)
            time_list = []
            for _ in range(repeat_times):
                # synchronize so the timer captures the actual NCCL transfer,
                # not just the kernel launch (cheap in the CPU/Gloo case)
                torch.cuda.synchronize()
                start = time.time()
                dist.send(tensor, dst=1 - global_rank)
                torch.cuda.synchronize()
                end = time.time()
                time_list.append((end - start) * 1000)
        elif global_rank == 1:
            # warmup
            for _ in range(warmup_times):
                dist.recv(tensor, src=1 - global_rank)
            time_list = []
            for _ in range(repeat_times):
                torch.cuda.synchronize()
                start = time.time()
                dist.recv(tensor, src=1 - global_rank)
                torch.cuda.synchronize()
                end = time.time()
                time_list.append((end - start) * 1000)
        avg_time_result_in_ms = sum(time_list) / repeat_times
        bandwidth_in_gb_per_second = (data_size_in_mb / 1024) / (avg_time_result_in_ms / 1000)
        device = "CPU" if cpu_tensor else "GPU"
        print(f'({device}) Rank {global_rank} | '
              f'Time (averaged over {repeat_times} times) = {avg_time_result_in_ms:.2f} ms, '
              f'data_size = {data_size_in_mb:.2f} MB, '
              f'bandwidth = {bandwidth_in_gb_per_second:.2f} GB/s')


def init_process(local_rank, global_rank, world_size, fn):
    if CPU_TENSOR:
        backend = 'gloo'
    else:
        backend = 'nccl'
    init_method = 'tcp://'
    master_ip = os.getenv('MASTER_ADDR', 'localhost')
    master_port = os.getenv('MASTER_PORT', '6000')
    init_method += master_ip + ':' + master_port
    dist.init_process_group(
        backend=backend,
        world_size=world_size, rank=global_rank,
        init_method=init_method)
    fn(local_rank, global_rank, CPU_TENSOR)


if __name__ == "__main__":
    gpus_per_node = 1
    num_nodes = 2
    node_rank = int(os.getenv('NODE_RANK', 0))
    multiproc.set_start_method("spawn")
    world_size = gpus_per_node * num_nodes
    processes = []
    for local_rank in range(gpus_per_node):
        global_rank = gpus_per_node * node_rank + local_rank
        p = multiproc.Process(target=init_process,
                              args=(local_rank, global_rank, world_size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
More insights on the performance gap are welcome, thanks!
Well, IIUC, we are comparing CPU+Gloo vs. GPU+NCCL here? I'm not sure that is an apples-to-apples comparison of the communication libraries, since the hardware path is different as well.
Yes, I understand the hardware is different when we use different libraries, e.g., NCCL uses NVLink for communication within a node while Gloo goes over PCIe.
However, in my experiments both Gloo and NCCL communicate across instances, so I expect both of them to go through the same network interface (e.g., Ethernet) and therefore to see the same bandwidth. (Is that correct on a physical machine?)
My current guess is that AWS does some virtualization of the network interface, so Gloo and NCCL might end up on different hardware paths, but I haven't found any evidence for that.
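A way to check this would be to pin both backends to the same NIC and have NCCL log which interface/transport it actually selects, e.g. by setting these environment variables before dist.init_process_group() in the script above:

import os

# Pin both backends to the same NIC; 'eth0' is a placeholder -- check the real
# interface name with `ip addr` on the instance.
os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'

# Have NCCL print which interface and transport it picks at init time.
os.environ['NCCL_DEBUG'] = 'INFO'

If both backends report the same interface and the gap is still there, the difference is more likely in the libraries themselves than in the virtualized NIC.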