Average loss in DP and DDP

Hi. I have a question regarding data parallel (DP) and distributed data parallel (DDP).

I have read many articles about DP and understand that the gradients are reduced automatically. However, I could not find an article explaining whether the loss is also reduced. For example, I believe the following code is a typical main routine of a DP program.

outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()

I understand that the inputs are split and the model is replicated on each GPU, a forward pass is computed concurrently to yield the loss, then a backward pass is also computed concurrently, and finally all gradients are reduced to one.

Is the loss obtained by the above code averaged over all the GPUs, i.e., exactly the same as the loss computed by a serial program? Or is the loss a value from just one GPU (gpu0)? I need to plot a loss chart, so I wonder whether the loss is averaged over the GPUs.

The same question applies to outputs. I also need to compute the training accuracy using outputs in the above code. Does it hold the results from all the GPUs? If so, in what tensor structure are they stored?

Regarding DDP, the above code is written in each process running on its respective GPU. In this case, how can I access the values on all the GPUs to plot the averaged loss and total accuracy?

I appreciate any sources of information. Thank you in advance.


No, the loss is not reduced, because there is only one loss tensor with DP. The gradients, however, are accumulated automatically by the autograd engine. Since DP is single-process multi-thread, all threads share the same autograd engine, and hence ops on different threads are added to the same autograd graph.

Is the loss obtained by the above code averaged over all the GPUs, i.e., exactly the same as the loss computed by a serial program? Or is the loss a value from just one GPU (gpu0)? I need to plot a loss chart, so I wonder whether the loss is averaged over the GPUs.

DP’s forward function gathers all outputs to cuda:0 (by default) and then returns the gathered result. So in the code above, outputs is on one GPU, and hence loss is also on one GPU.

The same question applies to outputs. I also need to compute the training accuracy using outputs in the above code. Does it hold the results from all the GPUs? If so, in what tensor structure are they stored?

Below is DP’s forward function. The outputs variable on line 161 holds the outputs on different GPUs, but the gather function on line 162 copies them to one GPU.

If you want to access the individual outputs on the different GPUs, you can do so in the forward function of your model (the one you passed to the DP constructor). E.g.,

import torch.nn as nn
from torch.nn import DataParallel

class MyModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.fc = nn.Linear(10, 10)

  def forward(self, input):
    output = self.fc(input)
    print("per-GPU output ", output)  # runs once per replica/thread
    return output


dp = DataParallel(MyModel())
outputs = dp(inputs)  # this outputs is gathered onto one GPU

Regarding DDP, the above code is written in each process running on its respective GPU. In this case, how can I access the values on all the GPUs to plot the averaged loss and total accuracy?

You can use gather, all_gather, or all_reduce to communicate the loss to one process and print it.
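For illustration, here is a minimal sketch of the all_reduce option. It assumes dist.init_process_group() has already been called in every DDP process; the helper name global_average is mine, not part of the PyTorch API.

```python
import torch
import torch.distributed as dist

# Hypothetical helper: average a per-process scalar (e.g. loss.item())
# across all processes. Assumes the process group is already initialized.
def global_average(local_value, device="cpu"):
    t = torch.tensor([float(local_value)], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sum across processes
    return (t / dist.get_world_size()).item()  # then divide by world size
```

In the training loop, each process would call something like avg = global_average(loss.item(), device), and rank 0 can log or plot the result.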

BTW, could you please add a “distributed” tag to distributed training related questions? People working on distributed training monitor that tag and can get back to you promptly.


Thank you Shen Li for your detailed explanation. It is very helpful, and now I understand what’s going on in DP and DDP. I modified your code to check the order of the data, so that I can make sure the output is correctly matched to the corresponding labels in the loss function.

import torch
import torch.nn as nn

device = "cuda:0"

class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
    
    # forward() outputs the input as it is. 
    def forward(self, input):
        output = input
        print("per-GPU output ", output)
        return output

model = Model()
model = nn.DataParallel(model)
model.to(device)

# input is a 2D tensor of sequential integers.
input = torch.arange(20 * 5).reshape(20, 5)
input = input.to(device)
print("total input ", input)
output = model(input)
print("total output ", output)

I was not sure about the “tag” that you pointed out, but I added “distributed” to “Categories”.

I still have a related question about DDP.
In my understanding, the gradient is a vector that points in the direction where the loss increases the most. I learned from your explanation that we don’t have the “total” loss until we gather, all_gather, or all_reduce the losses computed on each GPU. If we use the loss in each process instead of the total loss to compute each gradient and then average all the gradients, will the result be the correct “total” gradient of the total loss?

In other words, I wonder whether it is mathematically correct that averaging the gradients, each of which increases its respective local loss, produces a total gradient that increases the averaged loss.

If it is not correct, I think it means we need to all_reduce the loss before calling loss.backward, in order to hand the total loss to each process for computing correct gradients. Is my thinking correct?

Thank you again for your kind assistance.

Good question. Instead of communicating the loss, DDP communicates gradients. So the loss is local to every process, but after the backward pass the gradients are globally averaged, so that all processes see the same gradient. This is a brief explanation; there is a full paper describing the algorithm.

If it is not correct, I think it means we need to all_reduce the loss before calling loss.backward, in order to hand the total loss to each process for computing correct gradients. Is my thinking correct?

The reason we don’t communicate the loss is that it’s not sufficient. When computing gradients, we need both the loss and the activations, and the activations depend on the local inputs. So we would need to communicate either loss + activations or gradients. DDP does the latter.


Thank you again.

Maybe you have fully answered my question, but I still feel that my point is missing. As I understand it, a gradient is computed by back-propagation using the chain rule and the first derivatives of the functions in the model network. Also, as you mentioned, we need the function values within the network, as well as the loss.

Since the method existed long before the parallelism era, back-propagation naturally started from a single “total” or “global” loss on a single-processor platform. In that case, we use a loss that is already averaged over a batch of inputs. On a multi-GPU platform, by contrast, a batch is further divided into smaller batches, each of which is used to produce a “local” loss on a GPU. In that case, when computing the local gradient, the functions, inputs, and function values are exactly the same as in the single-processor case. The only difference is using the local loss instead of the global loss.

My question is: does averaging the local gradients computed from the local losses produce exactly the same result as the global gradient computed from the global loss?

If the answer is no, I think we need to average the local losses to produce a global loss and hand it to all the GPUs to compute correct local gradients, which are then averaged to produce a correct global gradient. This might be achieved by performing all_reduce() over the local losses before calling loss.backward() on each GPU.

The answer could be yes, but I don’t know the mathematical explanation for it.

That is my point.
If I misunderstand something, please point it out. Thank you.


In that case, when computing the local gradient, the functions, inputs, and function values are exactly the same as in the single-processor case.

This is actually not true. Say we have a function f(x) = w * x, where w is the weight. When you compute the gradient (i.e., dw), you need both df (from the loss, which depends on the local input) and x (the local input or an intermediate local output, which also depends on the local input). So, if we don’t communicate gradients, we need to communicate both the final loss and the intermediate outputs of all layers.
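A tiny autograd check of that f(x) = w * x example (the numbers are arbitrary): the weight gradient equals df * x, so it cannot be computed without the local input x.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # the weight
x = torch.tensor(3.0)                      # the "local input"

f = w * x
f.backward()   # for a scalar output, autograd seeds df with 1
print(w.grad)  # tensor(3.) -- i.e. df * x = 1 * 3, determined by the local input
```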

No, this is not guaranteed to be the same, but for a different reason. If 1) the loss function satisfies loss_fn([x1, x2]) == (loss_fn(x1) + loss_fn(x2)) / 2 and 2) the batch sizes on all processes are the same, then averaging the gradients is correct. Otherwise, averaging won’t produce the same result. For example, if we use .sum() as the loss function, we should sum instead of averaging the gradients.
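A CPU-only sketch of the .sum() counter-example (shapes and seed are arbitrary): splitting a batch of 6 into two "per-GPU" halves, the full-batch gradient of a sum loss equals the sum of the per-shard gradients, while their average is off by a factor of 2.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x = torch.randn(6, 4)

(x @ w).sum().backward()      # serial: sum loss over the full batch
g_full = w.grad.clone()

w.grad = None
(x[:3] @ w).sum().backward()  # "GPU 0": first half of the batch
g0 = w.grad.clone()

w.grad = None
(x[3:] @ w).sum().backward()  # "GPU 1": second half of the batch
g1 = w.grad.clone()

print(torch.allclose(g_full, g0 + g1))        # True: summing matches
print(torch.allclose(g_full, (g0 + g1) / 2))  # False: averaging does not
```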

If the answer is no, I think we need to average the local losses to produce a global loss and hand it to all the GPUs to compute correct local gradients, which are then averaged to produce a correct global gradient. This might be achieved by performing all_reduce() over the local losses before calling loss.backward() on each GPU.

I might be missing something. If we do the above, it means we compute the gradients using the global loss and the local activations (i.e., global df and local x in the f(x) = w * x example above). In that case, what does this gradient mean?

Thank you for your further explanation.

So, if we don’t communicate gradients, we need to communicate both the final loss and the intermediate outputs of all layers.

Yes, I agree that we must communicate gradients to obtain a global gradient. My question is about the relationship between the global loss and the local gradients, not about communicating losses instead of gradients.

If 1) the loss function satisfies loss_fn([x1, x2]) == (loss_fn(x1) + loss_fn(x2)) / 2 and 2) the batch sizes on all processes are the same, then averaging the gradients is correct.

I understand that, in a parallel process, the losses are locally averaged on each GPU, and the resulting losses can then be globally averaged. That is why the condition you explained must hold for the “average of averages” to equal the global average.

My point is that a parallel process simply does the same thing in parallel as a serial process does, and both are supposed to produce identical results.

What I am wondering about is that the backward pass of the computational graph in a DDP process starts from a local loss, while in a serial process it starts from the global loss, yet they are supposed to produce the same result.

From your earlier explanation, I learned that the backward pass starts from the global loss in DP, but not in DDP. So I believe DP will produce the same results as a serial process, but I wonder about DDP.

One thing I have noticed is that, if the global loss is computed as sum() / batch_size, the backward pass might start from 1 divided by batch_size. If that is true, the only difference between starting from the global loss and the local loss should be dividing by the global batch size versus the local per-GPU batch size.

So I suspect that the gradients in those cases have the same direction but different magnitudes. In particular, the gradient from DDP might be n_gpu times larger than that from DP, where n_gpu is the number of GPUs. Even if this is true, it will not be a big problem, but DDP may require a different learning rate from DP. That is just my thinking, and it needs confirmation.

Is this correct? I appreciate your assistance. Thank you.

Yep, this is true for the sum() / batch_size case you mentioned, on the condition that all processes use the same batch size. Here is the test to verify that:
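The linked test itself is not reproduced here, but a minimal stand-in (my own sketch, with arbitrary shapes) shows the claim: with a mean-style loss and equal shard sizes, averaging the per-shard gradients reproduces the serial full-batch gradient.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x = torch.randn(8, 4)

(x @ w).pow(2).mean().backward()  # serial: mean loss over the full batch of 8
g_serial = w.grad.clone()

grads = []
for shard in x.chunk(2):          # two equal "per-GPU" batches of 4
    w.grad = None
    (shard @ w).pow(2).mean().backward()
    grads.append(w.grad.clone())

g_ddp = (grads[0] + grads[1]) / 2  # DDP-style all-reduce average
print(torch.allclose(g_serial, g_ddp))  # True
```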

In particular, the gradient from DDP might be n_gpu times larger than that from DP, where n_gpu is the number of GPUs. Even if this is true, it will not be a big problem, but DDP may require a different learning rate from DP. That is just my thinking, and it needs confirmation.

DDP computes the average of the gradients from all processes, so for the sum() / batch_size case the gradient should have the same value as in local training. What might affect the learning rate is the batch size you configure for each DDP process. If each process uses the same batch_size as local training, it means that in each iteration the DDP gang collectively processes world_size * batch_size input samples, so you might be more confident in the resulting gradient compared to local training and might want to set a larger learning rate. But this is not guaranteed. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel

Thank you, Shen Li.

DDP computes the average of the gradients from all processes, so for the sum() / batch_size case the gradient should have the same value as in local training.

I interpret this as meaning that the difference is taken care of when computing the global gradient from the local gradients, so we will see no difference from the serial case.

What might affect the learning rate is the batch size you configure for each DDP process.

I think that whether or not we expand the global batch size is a choice between computation speed per iteration and the algorithmic efficiency of overall convergence, together with the larger learning rate you mentioned. Besides, we can make better use of GPU memory if we choose a large batch size. I feel that a larger batch brings faster convergence even in wall-clock time, if we can efficiently utilize multiple GPUs. That’s what I’m trying to do.

Thank you very much. I appreciate your time for this long discussion.


Hi, Shen Li. If I use loss = loss.sum(), how can I avoid averaging the gradients in DDP?