loss.item() causes incorrect time measurements

Hello everyone!

I am using NCCL as the backend for PyTorch distributed training across four compute nodes, but I ran into a strange problem when measuring the time consumption.

Part of the code is as follows.

def average_gradients(model):
    # Sum the gradients from all processes and divide by the world size.
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad /= size

for epoch in range(args.epoch):
    begin_time = time.time()
    print('epoch', str(epoch))
    model.train()
    loss_sum = 0
    time_sum = 0
    for batch_idx, (images, labels) in enumerate(trainloader):
        batch_1 = time.time()
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss_sum += loss.item()
        loss.backward()
        average_gradients(model)
        optimizer.step()
        batch_2 = time.time()
        print("batch:{}\tloss:{}".format(batch_idx, loss.item()))
        time_sum += (batch_2 - batch_1)
        print(time_sum)
    end_time = time.time()
    print("epoch:{}\tloss:{}\ttime:{}".format(epoch, loss_sum / len(trainloader), end_time - begin_time))

In each epoch, time_sum is 6.3 s, but (end_time - begin_time) is 18.8 s, which is a big difference. I guess that loss.item() is causing this problem, but I’m still confused. I would be very grateful if you could give me some advice.

loss.item() will synchronize the device for you, as the value stored on the GPU has to be moved to the host.
Since CUDA operations are executed asynchronously, you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the timers. In your current code the profiling is thus invalid.
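For illustration, here is a minimal standalone sketch of timing with explicit synchronization (not your script; the tiny linear model, tensor shapes, and learning rate are placeholders):

import time
import torch
import torch.nn as nn

device = torch.device("cuda")                # assumes a CUDA device is available
model = nn.Linear(1024, 10).to(device)       # placeholder model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(256, 1024, device=device)
labels = torch.randint(0, 10, (256,), device=device)

torch.cuda.synchronize()                     # wait for pending kernels before starting the timer
batch_1 = time.time()

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()

torch.cuda.synchronize()                     # wait for the backward/step kernels to finish
batch_2 = time.time()
print(f"batch time: {batch_2 - batch_1:.4f}s")

Alternatively, torch.cuda.Event(enable_timing=True) with record() and elapsed_time() can be used to time GPU work directly.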

Thank you for your reply, but I still have a question.
Why does this time mismatch not occur when training on a single node? When I run similar code on a single node, time_sum is approximately equal to (end_time-begin_time).
Is torch.cuda.synchronize related to distributed training?
Looking forward to your reply.

It’s hard to tell, as it depends on whether and when the CPU can run ahead. I.e. your CPU might be blocked by something in the distributed run, which would slow it down and could cause the times to approximately match.
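To make the “CPU runs ahead” effect concrete, here is a small standalone sketch (assuming a CUDA device; the matrix sizes are arbitrary):

import time
import torch

device = torch.device("cuda")                # assumes a CUDA device is available
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

start = time.time()
for _ in range(50):
    c = a @ b                                # matmul kernels are only enqueued here
print(f"without sync: {time.time() - start:.4f}s")   # small: the CPU ran ahead

torch.cuda.synchronize()                     # block until the enqueued matmuls finish
print(f"with sync:    {time.time() - start:.4f}s")   # the actual GPU time

Whether the two timers agree then depends on where, if anywhere, the CPU happens to be blocked, which can differ between single-node and distributed runs.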

I see. Thank you for your help!

@ptrblck I’m sorry to bother you again.
I found another phenomenon under this topic.
If I add multiprocessing.set_start_method('spawn') at the beginning of the code, there is still a time difference between time_sum and (end_time-begin_time), even when I use torch.cuda.synchronize().
I’m very confused. Could you give me some advice? Thank you!

I don’t know the reason for it. Could you post a minimal, executable code snippet which would show this behavior?

@ptrblck Thank you for your quick reply!
I used three nodes for this distributed training experiment.

#!/usr/bin/env python
import argparse
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
from math import ceil
from random import Random
from torch.multiprocessing import Process
from torch.autograd import Variable
from torchvision import datasets, transforms
# from torchvision import models
from models import resnet
import time
from multiprocessing import Pool
import multiprocessing

def average_gradients(model):
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad /= size

def test(model):
    correct_sum = 0
    model.eval()
    with torch.no_grad():
        for batch_idx,(images,labels) in enumerate(testloader):
            images, labels = images.to(device),labels.to(device)
            output = model(images)
            pred = output.argmax(dim=1, keepdim=True)
            pred = pred.to(device)
            correct = pred.eq(labels.view_as(pred)).sum().item()
            correct_sum += correct
    print("Acc:{}".format(correct_sum/len(testloader.dataset)))

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument('--rank', type=int)
    parser.add_argument('--world_size', type=int)
    parser.add_argument('--batch_size', type=int, default=256)
    parser.add_argument('--epoch', type=int, default=100)
    parser.add_argument('--lr', type=float, default=0.1)
    args = parser.parse_args()
    multiprocessing.set_start_method("spawn")
    os.environ["MASTER_ADDR"] = "master_ip"
    os.environ["MASTER_PORT"] = "master_port"
    dist.init_process_group('nccl',rank=args.rank, world_size=args.world_size)
    print("init end...")
    cudnn.benchmark = True
    torch.manual_seed(1000)
    torch.cuda.manual_seed(2000)
    model = resnet.ResNet18()
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    model.to(device)
    print("model end...")
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=0.9, weight_decay=1e-4)
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    transform_test = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    trainset = datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform_train)
    sampler = torch.utils.data.DistributedSampler(trainset,num_replicas=args.world_size,rank=args.rank)
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=args.batch_size, shuffle=(sampler is None), sampler=sampler, num_workers=2)
    testset = datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform_test)
    testloader = torch.utils.data.DataLoader(
        testset, batch_size=args.batch_size, shuffle=True, num_workers=2)

    for epoch in range(args.epoch):
        begin_time = time.time()
        print('epoch', str(epoch))
        model.train()
        loss_sum = 0
        time_sum = 0
        for batch_idx,(images,labels) in enumerate(trainloader):
            torch.cuda.synchronize()
            batch_1 = time.time()
            images = images.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            output = model(images)
            loss = criterion(output,labels)
            loss_sum += loss.item()
            loss.backward()
            average_gradients(model)
            optimizer.step()
            torch.cuda.synchronize()
            batch_2 = time.time()
            time_sum += (batch_2 - batch_1)
        end_time = time.time()
        print(f"time_sum:{time_sum}")
        print(f"end_time-begin_time:{end_time-begin_time}")
        test(model)
        time.sleep(2)

I launched the script on the three nodes as follows:
node1: python3 main.py --rank 0 --world_size 3
node2: python3 main.py --rank 1 --world_size 3
node3: python3 main.py --rank 2 --world_size 3

If I delete the multiprocessing.set_start_method call, time_sum and (end_time-begin_time) are similar.

Looking forward to your reply.

Thanks for the code. My guess is that the teardown of the DataLoader worker processes differs, but that’s just a guess; you could create a profile to see which operations are executed between these timers.

Thank you for your reply.
When I set num_workers in the DataLoader to 1, the time difference still exists.
By the way, how do I create a profile to see which operations are executed between these timers? I’ve never done this before.

num_workers=1 would still spawn a new process, while num_workers=0 would use the main process for the data loading.
You could use the PyTorch profiler and check the timeline, or e.g. Nsight Systems.