Why is GLOO's performance much worse than NCCL's?

I’m comparing GLOO’s and NCCL’s performance by profiling the time it takes to send/recv tensors between two AWS instances.

As GLOO is usually recommended for CPU tensors, I send/recv CPU tensors when using GLOO and GPU tensors when using NCCL.
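
Concretely, the two setups differ only in the backend and the device the tensor lives on. A stripped-down sketch of what each run does (the full benchmark script is in my reply below; this sketch assumes the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables are set):

import os
import torch
import torch.distributed as dist

use_cpu = os.environ.get("CPU_TENSOR", "0") == "1"
backend = "gloo" if use_cpu else "nccl"
dist.init_process_group(backend=backend)  # env:// rendezvous by default
rank = dist.get_rank()

# 4 MB worth of fp16 elements, on CPU for Gloo and on GPU for NCCL
tensor = torch.ones(4 * 1024 * 1024 // 2, dtype=torch.float16)
if not use_cpu:
    torch.cuda.set_device(0)
    tensor = tensor.cuda()

if rank == 0:
    dist.send(tensor, dst=1)  # Gloo blocks here; NCCL only enqueues the transfer on the CUDA stream
else:
    dist.recv(tensor, src=0)

if not use_cpu:
    torch.cuda.synchronize()  # wait for the NCCL transfer to actually complete
dist.destroy_process_group()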

I tested on two kinds of AWS instances: g4dn.xlarge and p3.2xlarge. The claimed network bandwidth is up to 25 Gbps for g4dn.xlarge and 10 Gbps for p3.2xlarge.
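
For reference, converting those advertised line rates into the units reported in the tables below (the benchmark divides MiB by 1024, so the reported “GB/s” are effectively GiB/s):

# Convert the advertised NIC line rates (Gbit/s) into GB/s and GiB/s
for name, gbps in [("g4dn.xlarge (up to)", 25), ("p3.2xlarge", 10)]:
    bytes_per_s = gbps * 1e9 / 8
    print(f"{name}: {bytes_per_s / 1e9:.2f} GB/s = {bytes_per_s / 2**30:.2f} GiB/s")
# g4dn.xlarge (up to): 3.12 GB/s = 2.91 GiB/s
# p3.2xlarge: 1.25 GB/s = 1.16 GiB/s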

Results on the g4dn instance:

| data size (MB) | GLOO (GB/s) | NCCL (GB/s) |
|---:|---:|---:|
| 4 | 0.59 | 0.91 |
| 8 | 0.58 | 1.14 |
| 16 | 0.58 | 1.4 |
| 32 | 0.58 | 1.34 |
| 64 | 0.58 | 1.67 |
| 128 | 0.57 | 1.32 |
| 256 | 0.57 | 1.3 |
| 512 | 0.57 | 1.51 |
| 1024 | 0.57 | 1.46 |

Results on the p3 instance:

| data size (MB) | GLOO (GB/s) | NCCL (GB/s) |
|---:|---:|---:|
| 4 | 1.11 | 1.19 |
| 8 | 1.11 | 1.17 |
| 16 | 1.11 | 1.16 |
| 32 | 1.11 | 1.15 |
| 64 | 1.11 | 1.16 |
| 128 | 1.11 | 1.16 |
| 256 | 1.11 | 1.15 |
| 512 | 1.11 | 1.15 |
| 1024 | 1.11 | 1.15 |

On the g4dn instance, GLOO achieves only ~30%-60% of NCCL’s bandwidth, while on the p3 instance the two are very close. I don’t understand why GLOO is so much slower than NCCL on the g4dn instance. What may affect GLOO’s performance?

By the way, is GLOO still the recommended way to send/recv tensors on CPU?

I think Gloo is still recommended on CPU. Also, Gloo is synchronous (blocking) while NCCL is non-blocking. How did you get your results? Are they both using the GPU (with Gloo as well as NCCL)?

Hi Hugo, thanks for the reply. I use GLOO for sending/recving CPU tensors and NCCL for GPU tensors. Here is my code; the environment variable CPU_TENSOR controls whether CPU or GPU tensors are used, and torch.cuda.synchronize() makes sure I measure the actual completion time of the asynchronous NCCL calls.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as multiproc
import time

CPU_TENSOR = os.environ.get("CPU_TENSOR", "0") == "1"

def run(local_rank, global_rank, cpu_tensor):
    warmup_times = 20
    repeat_times = 100
    if not cpu_tensor:
        torch.cuda.set_device(local_rank)

    # sweep the message size, doubling each iteration up to 1024 MB
    for i in range(1, 11):
        data_size_in_mb = 2**i
        # number of fp16 elements (2 bytes each) that make up data_size_in_mb MiB
        data_size = data_size_in_mb * 1024 * 1024 // 2

        if cpu_tensor:
            tensor = torch.ones(data_size, dtype=torch.float16)
        else:
            tensor = torch.ones(data_size, dtype=torch.float16).cuda()

        if global_rank == 0:
            ## warmup sends so that connection setup is not included in the timing
            for _ in range(warmup_times):
                dist.send(tensor, dst=1-global_rank)

            time_list = []
            for _ in range(repeat_times):
                torch.cuda.synchronize()  # make sure the GPU is idle before starting the clock
                start = time.time()
                dist.send(tensor, dst=1-global_rank)
                torch.cuda.synchronize()  # NCCL ops are asynchronous on the host; wait for the transfer to finish
                end = time.time()
                time_list.append((end - start) * 1000)  # milliseconds

        elif global_rank == 1:
            ## warmup recvs
            for _ in range(warmup_times):
                dist.recv(tensor, src=1-global_rank)

            time_list = []
            for _ in range(repeat_times):
                torch.cuda.synchronize()
                start = time.time()
                dist.recv(tensor, src=1-global_rank)
                torch.cuda.synchronize()
                end = time.time()
                time_list.append((end - start) * 1000)

        avg_time_result_in_ms = sum(time_list) / repeat_times
        # data_size_in_mb is in MiB, so this effectively reports GiB per second
        bandwidth_in_gb_per_second = (data_size_in_mb / 1024) / (avg_time_result_in_ms / 1000)

        device = 'cpu' if cpu_tensor else 'GPU'
        print(f'({device}) Rank {global_rank} | Time (averaged over {repeat_times} runs) = {avg_time_result_in_ms:.2f} ms, data_size = {data_size_in_mb:.2f} MB, bandwidth = {bandwidth_in_gb_per_second:.2f} GB/s')

def init_process(local_rank, global_rank, world_size, fn):

    # the backend follows the tensor placement: Gloo for CPU tensors, NCCL for GPU tensors
    if CPU_TENSOR:
        backend = 'gloo'
    else:
        backend = 'nccl'

    # TCP rendezvous at MASTER_ADDR:MASTER_PORT (the rank-0 node)
    init_method = 'tcp://'
    master_ip = os.getenv('MASTER_ADDR', 'localhost')
    master_port = os.getenv('MASTER_PORT', '6000')
    init_method += master_ip + ':' + master_port

    dist.init_process_group(
        backend=backend,
        world_size=world_size, rank=global_rank,
        init_method=init_method)

    fn(local_rank, global_rank, CPU_TENSOR)

if __name__ == "__main__":

    gpus_per_node = 1  # one process (and at most one GPU) per node
    num_nodes = 2
    node_rank = int(os.getenv('NODE_RANK', 0))
    multiproc.set_start_method("spawn")
    world_size = gpus_per_node * num_nodes

    processes = []
    for local_rank in range(gpus_per_node):
        global_rank = gpus_per_node * node_rank + local_rank
        p = multiproc.Process(target=init_process, args=(local_rank, global_rank, world_size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
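
For completeness, the script is driven entirely by environment variables. On each node I run something like CPU_TENSOR=1 MASTER_ADDR=<node0 ip> MASTER_PORT=6000 NODE_RANK=0 python send_recv_bench.py (the script name here is just a placeholder), with NODE_RANK=1 on the second node, and CPU_TENSOR=0 (the default) for the NCCL/GPU runs.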

More insights on the performance gap are welcome, thanks!

Well, IIUC, we are comparing CPU+Gloo vs. GPU+NCCL here? I’m not sure this is an apples-to-apples comparison (in terms of the comm library, since the HW is different as well?).

Yes, I understand the HW is different when we use different libraries, e.g., NCCL uses NVLink for communication within a node while GLOO uses PCIe.
However, in my experiments both GLOO and NCCL perform cross-instance communication, so I expect both of them to go through the network interface (e.g., Ethernet), and I expect that bandwidth to be the same for both. (Would that be correct on a physical machine?)
My current guess is that AWS does some virtualization of the network interface, so GLOO and NCCL may end up on different hardware paths, but I haven’t found any evidence for that.
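
One thing I still want to try, to rule out the two backends binding to different (virtual) interfaces, is pinning both of them to the same NIC before init_process_group. A rough, untested sketch (the interface name ens5 is only an example; the real name depends on the instance):

import os

# Force Gloo and NCCL to use the same network interface, so a remaining gap
# cannot come from them picking different (virtual) NICs.
# "ens5" is only an example; check `ip addr` on the instance for the real name.
os.environ.setdefault("GLOO_SOCKET_IFNAME", "ens5")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens5")

# NCCL_DEBUG=INFO makes NCCL log which transport and interface it actually selects.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# ... then call dist.init_process_group(...) exactly as in the script above.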