DataParallel doesn't work

Hi, I am trying to use DataParallel to get rid of the out-of-memory error that I ran into when training my model. I read the PyTorch docs and found it quite easy, but for some reason it doesn't work in my code.
Here is my code. The model is a 3D ResNet.

```
model = generate_resnet3D(conf.model_depth, conf.in_c)
model = nn.DataParallel(model, device_ids=[0, 4, 5])
model.to(device)

def train(epoch):

    model.train()

    # to track the training loss as the model trains
    train_losses = 0

    for num, (vol1, vol2) in enumerate(train_loader):

        # move the volumes to the device
        vol1 = vol1.to(device)
        vol2 = vol2.to(device)

        # apply the model to get the representations
        x1, h1 = model(vol1)
        x2, h2 = model(vol2)

        print("Outside: input size", vol1.size(),
              "output_size", x1.size())
```

I use a batch size of 3, but when I check the output I see that the batch is not split.

```
Outside: input size torch.Size([3, 1, 128, 128, 128]) output_size torch.Size([3, 2048, 10, 10, 10])
Outside: input size torch.Size([3, 1, 128, 128, 128]) output_size torch.Size([3, 2048, 10, 10, 10])
Outside: input size torch.Size([3, 1, 128, 128, 128]) output_size torch.Size([3, 2048, 10, 10, 10])
Outside: input size torch.Size([3, 1, 128, 128, 128]) output_size torch.Size([3, 2048, 10, 10, 10])
Outside: input size torch.Size([3, 1, 128, 128, 128]) output_size torch.Size([3, 2048, 10, 10, 10])
Outside: input size torch.Size([3, 1, 128, 128, 128]) output_size torch.Size([3, 2048, 10, 10, 10])
```


Here is another example where I checked DataParallel:
```
gpu_usage()

model = nn.DataParallel(model, device_ids=[0, 1])

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model.to(device)

input = torch.randn(2, 1, 128, 128, 128)

input = input.to(device)  # note: .to() is not in-place, so the result has to be reassigned

output1, output2 = model(input)

print("Outside: input size", input.size(),
      "output_size", output2.size())

torch.cuda.empty_cache()

sys.stdout.flush()
```

Again, the output wasn't split into two parts!

```
| ID | GPU | MEM |
------------------
|  0 |  0% |  7% |
|  1 |  0% |  0% |
|  2 | 84% | 64% |
|  3 | 56% | 50% |
|  4 |  0% |  7% |
|  5 |  0% |  7% |
|  6 | 94% | 99% |
|  7 | 92% | 99% |
Outside: input size torch.Size([2, 1, 128, 128, 128]) output_size torch.Size([2, 2048])
```

The table at the top shows the GPU utilization and memory usage. I would be grateful if you could help me fix this problem.

nn.DataParallel will split the input batch before passing the chunks to the forward method and will concatenate the outputs back into the complete batch output after the forward pass is done.
If you print the shapes outside of the forward method, you won't see any difference compared to a single-GPU run.
To see the split size, you could add print statements to the forward method directly.

But in the second example, as you can see, I am printing the input before sending it to the forward method, aren't I? It's the same as the example in the PyTorch docs.

And would you please tell me how I can add a print statement to the forward method directly?

This should work and would print the shape of each chunk of the input batch:

```
def forward(self, x):
    print(x.shape, x.device)
    ...
```

If you are printing the tensor outside of the forward method, you won't see any difference compared to a single-GPU run.
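
For illustration, here is a minimal self-contained sketch (using a placeholder module, not your ResNet) that shows where each chunk ends up:

```
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # placeholder module for the sketch
    def forward(self, x):
        print("Inside:", x.shape, x.device)  # prints the per-replica chunk
        return x * 2

model = nn.DataParallel(TinyNet(), device_ids=[0, 1]).to('cuda:0')
out = model(torch.randn(4, 3, device='cuda:0'))
print("Outside:", out.shape)  # the full, concatenated batch again
```

With two GPUs and a batch size of 4, each replica would print a chunk of shape [2, 3] on its own device, while the output outside is the full [4, 3] batch again.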

Ok, thank you, I got it. So, when I use DataParallel I shouldn't run into the out-of-memory problem, right?

nn.DataParallel can create a memory imbalance, as described in this blog post, which also lists some workarounds.
E.g. if your single-GPU run worked with a batch size of 16, nn.DataParallel could yield an OOM error for a batch size of 32.

That being said, we generally recommend using DistributedDataParallel with a single process per device, as this avoids the memory imbalance and is the fastest approach.
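
A minimal sketch of that setup (placeholder model, one process per GPU launched via torch.multiprocessing.spawn; the address/port values are just examples) could look like this:

```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"  # example rendezvous settings
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 2).cuda(rank)   # placeholder model
    ddp_model = DDP(model, device_ids=[rank])   # one device per process

    x = torch.randn(8, 10, device=f"cuda:{rank}")  # per-process batch
    out = ddp_model(x)
    print(f"rank {rank}: input {x.shape}, output {out.shape}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

Each process then works on its own shard of the data (usually via DistributedSampler), so the per-process batch size stays small and no single GPU becomes the bottleneck.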

Hi @ptrblck, I have a question along similar lines.

I don't think nn.DistributedDataParallel behaves differently (printing the shape before and after the forward pass gives the same value, i.e. the per-GPU batch size). What should I do if I want to concatenate the outputs of the forward passes across all GPUs while using nn.DistributedDataParallel?

Note: I want this because I am using the model as a feature extractor and want to cluster the embeddings (so I need all the data concatenated).

It depends what you are comparing it against.
DDP with multiple devices per process performs similarly to nn.DataParallel; DDP with a single device per process does not, and is the recommended way. Here is also a short explanation, and here is a comparison of the distributed backends.

You don’t have to do it manually and the output of the DDP model should already contain all samples. Have a look at this tutorial for an example.
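
For reference, if you do want to collect the per-rank embeddings yourself (e.g. to run the clustering on the full set), a sketch using torch.distributed.all_gather could look like this; it assumes the process group is already initialized and every rank produces a features tensor of the same shape:

```
import torch
import torch.distributed as dist

def gather_features(features):
    # assumes dist.init_process_group(...) has already been called and
    # `features` has the same shape on every rank
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(features) for _ in range(world_size)]
    dist.all_gather(gathered, features)  # collect the tensor from every rank
    return torch.cat(gathered, dim=0)    # full set of embeddings on every rank
```

Note that all_gather does not propagate gradients, which is fine for a pure feature-extraction use case.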


Thanks for the reply. Yes, I am considering the case of single device per process.