DDP and Dropout behave the same across GPUs

Hi, correct me if I’m wrong, but I found that Dropout behaves identically (its masks are correlated) across different GPUs when using DDP. In other words, cells at the same position in the tensor are either all dropped out or all kept on every GPU.

I believe this might make the training loss decrease more slowly than in single-GPU training.

Is this a bug? And is there a quick fix to make dropout layers operate independently across GPUs?

Example code

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
import random
import torch.distributed as dist
import os

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

class CustomDataset(Dataset):
    def __init__(self, size=10) -> None:
        super().__init__()
        self.size = size  # use the passed-in size instead of a hard-coded 10

    def __getitem__(self, i):
        return torch.Tensor([i]), torch.Tensor([1, i, 2])

    def __len__(self):
        return self.size

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(1, 3)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        return self.dropout(self.linear1(x))


def main():
    local_process_index = int(os.environ.get("LOCAL_RANK", -1))
    dist.init_process_group(backend="nccl",
                            world_size=2, # Use 2 GPUs
                            rank=local_process_index)
    set_seed(66)
    device = torch.device("cuda", local_process_index)
    dataset = CustomDataset(size=10)
    dataloader = DataLoader(dataset, batch_size=5, shuffle=False,
                            sampler=DistributedSampler(dataset))
    model = DDP(CustomModel().to(device), 
                device_ids=[local_process_index],
                output_device=local_process_index)
    model.train()
    for input_, output_ in dataloader:
        input_, output_ = input_.to(device), output_.to(device)
        res = model(input_)
        print(f'{res} : {input_}')
    dist.destroy_process_group()


if __name__ == '__main__':
    main()
-------
Result:
tensor([[ 0.0000, -0.5181, -0.2358],
        [ 0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000, -8.9689],
        [ 0.2071,  0.0000,  0.0000],
        [ 0.0000, -2.4261,  0.0000]], device='cuda:1',
       grad_fn=<FusedDropoutBackward>)
tensor([[  0.0000,  -6.2421,  -3.5107],
        [  0.0000,   0.0000,   0.0000],
        [  0.0000,   0.0000,  -2.4191],
        [  0.2832,   0.0000,   0.0000],
        [  0.0000, -10.0581,   0.0000]], device='cuda:0',
       grad_fn=<FusedDropoutBackward>)

I think this would be expected, since you are manually seeding the script, wouldn’t it?
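
For illustration, here is a minimal CPU-only sketch (my own, not from the original script) of why that happens: re-seeding torch with the same value before calling dropout reproduces the same mask, which mirrors what every rank does with set_seed(66).

import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)        # a fresh module is in training mode

torch.manual_seed(66)              # what rank 0 effectively does
out_rank0 = dropout(torch.ones(2, 4))

torch.manual_seed(66)              # rank 1 re-uses the very same seed
out_rank1 = dropout(torch.ones(2, 4))

print(torch.equal(out_rank0, out_rank1))   # True: identical dropout masks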

I see, thanks. Then I guess seeding each process with its local rank, e.g. set_seed(local_process_index), would keep things reproducible while letting dropout operate independently across processes. Would there be any hidden issue with this?

I don’t think there would be any hidden issue with it, as each process would then get its own seed.
Nevertheless, I would recommend checking the results with a small code snippet to make sure they are as expected.
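
As a hedged sketch of such a check (assuming the same kind of launch as the script above, which sets LOCAL_RANK), you could offset the seed by the local rank and compare the printed outputs per rank; the masks should now differ:

import os
import torch
import torch.nn as nn

# Hypothetical check: offset the base seed by the local rank so every
# process draws its own dropout mask.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
base_seed = 66
torch.manual_seed(base_seed + local_rank)
torch.cuda.manual_seed_all(base_seed + local_rank)

dropout = nn.Dropout(p=0.5)
print(f"rank {local_rank}: {dropout(torch.ones(1, 8))}")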

Is there any PyTorch convention about which random number generator is used?
E.g., do all PyTorch operations use the torch RNG by default, or are there operations that use Python’s built-in random RNG? And how can I find out which RNG a given operation uses?

Thanks

You can grep -r "import random" in the PyTorch source to check the usage in the code base.
Currently a lot of tests use the Python random package (which should be uninteresting for you), the worker_init_fn seeds the random package for each worker in the DataLoader, and the distributed/elastic/rendezvous methods seem to use it for a random delay.
Besides that, the internal methods should use the PyTorch pseudorandom number generator.
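
As a quick way to confirm this for a specific op (a hypothetical experiment, not from the thread), you can vary one RNG at a time and see which one changes the result; for nn.Dropout only the torch seed matters:

import random
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 100)

torch.manual_seed(0); random.seed(0)
a = drop(x)

torch.manual_seed(0); random.seed(123)   # change only Python's RNG
b = drop(x)

torch.manual_seed(1); random.seed(0)     # change only the torch RNG
c = drop(x)

print(torch.equal(a, b))  # True: Python's random does not influence dropout
print(torch.equal(a, c))  # False: the mask is drawn from the torch RNG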

If I seed each process differently, then dropout behaves differently in each process. Conceptually, an activation that is dropped in one process may be kept in another, so when we all-reduce the gradients there is no zeroed-out gradient, which is different from single-GPU training with dropout. So how can we set different manual seeds across the processes when using a dropout layer and still make it conceptually work?

I’m not sure I understand the issue here: why is having no zeroed-out gradient after the allreduce a problem? Wouldn’t this be possible in a single-GPU data-parallel case anyway, as different samples in the same batch can have different parts of their activations dropped out?

I don’t know what you mean by “conceptually” work, but you could simply set a different seed depending on the rank in a distributed training setup to ensure that dropout behaves differently across the different ranks.
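
Concretely, that could look like the following helper (a hypothetical sketch; it assumes the process group is already initialized, as in the script above). In that script it would take the place of set_seed(66):

import torch
import torch.distributed as dist

def seed_per_rank(base_seed: int) -> None:
    # Offset the seed by the global rank; assumes init_process_group()
    # has already been called, as in the script above.
    rank = dist.get_rank() if dist.is_initialized() else 0
    torch.manual_seed(base_seed + rank)
    torch.cuda.manual_seed_all(base_seed + rank)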

Sorry, to explain more clearly: assume 2 GPUs with different seeds.
What I mean is, let’s say the input to dropout has dim=4 with activations [a0, a1, a2, a3].
In the forward pass on GPU 1, a0 and a2 are dropped out:
[0, a1, 0, a3]
In the forward pass on GPU 2, the other two are dropped out:
[a0, 0, a2, 0]

Now when we do the backward pass with allreduce, no gradient entry is dropped out. But this would be different from the single-GPU (no data parallelism) case, right? There, some activations are always dropped out, so wouldn’t the behaviour diverge from here? I am trying to compare the behaviour of dropout with DDP and without DDP.
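
To put numbers on this toy scenario (a hypothetical single-process sketch, not real DDP code): with complementary masks the averaged gradient has no exact zeros, but every entry comes from only one replica and is halved by the averaging:

import torch

grad_gpu1 = torch.tensor([0.0, 1.2, 0.0, 0.4])   # a0 and a2 were dropped
grad_gpu2 = torch.tensor([0.8, 0.0, 0.6, 0.0])   # a1 and a3 were dropped

# DDP averages the gradients across ranks in the allreduce:
avg = (grad_gpu1 + grad_gpu2) / 2
print(avg)   # tensor([0.4000, 0.6000, 0.3000, 0.2000])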

If you mean batch-size one with no data parallelism, then sure, it is different (but this is typical and expected behavior for training).

If you are trying to force, e.g., all GPUs to drop out identically, did you check that

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

didn’t produce the expected behavior?

Yes, if I set the same manual seed, then all GPUs would have the same dropout behaviour. But my question is with respect to the answer to the comment above.

If I consider the case I highlighted, then there is a problem of unstable training, right, when scaling a model with dropout to multiple GPUs?

Wouldn’t this cause unstable training, since during the allreduce operation gradients from dropout layers with different masks would be averaged?

Wouldn’t this be the same case for a single model using a batch size > 1?
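
As a small single-GPU illustration of that point (a hypothetical snippet, not from the thread): within one batch each sample already gets its own dropout mask, and the weight gradient averages over those differently masked samples, much like the allreduce averages over per-rank masks:

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 16)
drop = nn.Dropout(p=0.5)

x = torch.ones(2, 4)      # batch of two identical samples
out = drop(layer(x))
print(out != 0)           # each row gets its own, independent dropout mask

out.mean().backward()     # the weight gradient mixes both masks,
print(layer.weight.grad)  # analogous to the allreduce mixing per-rank masks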

I understand that for a mini-batch each datum has a different mask.

But the backward passes that happen on each mini-batch are independent in each forward-backward pass, and no gradient reduction happens across different mini-batches.

But in the multi-GPU setting, what concerns me is that different masks are used on different GPUs and we are averaging the gradients. So wouldn’t it be like combining gradients from different models at the same time?