How can I print the loss of a distributed model while training?

When I print loss.item() inside the training function, a separate loss is printed for every GPU on every node.

2 GPUs
GPU :  1
2 GPUs
GPU :  0
Loss in epoch 0  =  2.16241826098016
Loss in epoch 0  =  2.1587327145515602
Loss in epoch 1  =  1.2776704367170943
Loss in epoch 1  =  1.2715401794048067
Loss in epoch 2  =  0.8121602121819841
Loss in epoch 2  =  0.8152065771691342
Loss in epoch 3  =  0.663035316162921
Loss in epoch 3  =  0.6598308266477382

My training function looks like this:

def train(model, device, train_loader, optimizer, criterion, epochs):
    model.train()
    print("2 GPUs")
    print("GPU : ", args.local_rank)
    # Wrap the model in DistributedDataParallel for this process's GPU
    model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[args.local_rank],
        output_device=args.local_rank,
    )
    for epoch in range(epochs):
        LOSS = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            LOSS += loss.item()
        # This print runs in every process, so each rank reports its own loss
        print("Loss in epoch", epoch, " = ", LOSS / len(train_loader))

Is there a way to print the loss averaged over all GPUs/nodes once per epoch?
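For reference, this is a minimal sketch of the kind of thing I have in mind, assuming torch.distributed (imported as dist) has already been initialized via init_process_group and that averaging the per-rank epoch loss with all_reduce is the right approach. The helper name average_loss_across_ranks is just something I made up for illustration:

import torch
import torch.distributed as dist

def average_loss_across_ranks(epoch_loss, device):
    # Put the local average loss into a tensor so it can be all-reduced
    loss_tensor = torch.tensor([epoch_loss], device=device)
    # Sum the per-rank losses across all processes, then divide by world size
    dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
    loss_tensor /= dist.get_world_size()
    return loss_tensor.item()

# At the end of each epoch, every rank calls the helper, but only rank 0 prints:
# avg = average_loss_across_ranks(LOSS / len(train_loader), device)
# if dist.get_rank() == 0:
#     print("Avg loss in epoch", epoch, " = ", avg)

Is this roughly how it should be done, or is there a more idiomatic way?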
