Distributed GPU calculations and CUDA extensions

Hello everyone,

I’m building an app that runs calculations with CUDA (an optimization based on simulated annealing). I successfully followed the Custom C++ and CUDA Extensions tutorial and have a stable version running on a single GPU, so now I would like to use multiple GPUs (some tasks have a huge amount of data that cannot be allocated on a single GPU, and I’d also like to speed up my calculations).

I have several tensors that I would like to split along dim=0 and process in a distributed way (all calculations follow the map pattern, so all records along dim=0 are independent). The best choice for me would be to create a custom nn.Module class with a forward method and use the DistributedDataParallel module, but I don’t have any parameter that requires a gradient, so the module crashes. (Yes, it raises AssertionError: DistributedDataParallel is not needed when a module doesn’t have any parameter that requires a gradient.)

Could you please recommend how to solve this problem, or some other modules/ways to run distributed calculations?

Best regards, Demetry

Would splitting the data and sending each chunk to a specific device work?
Something like this could already solve your use case:

data = torch.randn(4, 100)
chunks = data.chunk(4, 0)

res = []
for idx, chunk in enumerate(chunks):
    res.append(my_fun(chunk.to('cuda:{}'.format(idx))).to('cuda:0'))
res = torch.stack(res)

Thank you, ptrblck,

As I understand it, your approach will calculate sequentially?
I would like to calculate in parallel: my app runs about a million iterations, each one based on the previous, so should I use threading/multiprocessing/concurrent.futures, or is there a better solution?

CUDA operations are asynchronous, so each device should operate on its own.
You could check the GPU utilization during the script execution, which should show that all devices are being used.
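
To make the suggestion above concrete, here is a minimal sketch of the split-and-dispatch loop. It assumes a hypothetical `my_fun` standing in for the real per-row computation, and it falls back to the CPU when no GPU is visible, so the helper name `run_on_devices` and the toy data are illustrative only.

```python
import torch

def my_fun(x):
    # Hypothetical stand-in for the real per-row computation.
    return x * 2 + 1

def run_on_devices(fn, data, devices):
    # Split along dim=0 so that each device gets an independent slice.
    chunks = data.chunk(len(devices), dim=0)
    # Start the work on every device before collecting anything. Kernels
    # launched by fn are queued asynchronously, so the loop can move on to
    # the next device while earlier ones are still computing.
    partial = [fn(chunk.to(dev)) for chunk, dev in zip(chunks, devices)]
    # Copying the partial results to one device is where the host waits.
    return torch.cat([p.to(devices[0]) for p in partial], dim=0)

n_gpus = torch.cuda.device_count()
devices = [f"cuda:{i}" for i in range(n_gpus)] if n_gpus > 0 else ["cpu"]

data = torch.randn(8, 100)
out = run_on_devices(my_fun, data, devices)
print(out.shape, out.device)
```

This sketch uses `torch.cat` to join the results back along dim=0; whether `cat` or `stack` is the right choice depends on what shape the real function returns.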

Thank you, it really helped)

So I’ve done something like this:

import torch
from concurrent.futures import ThreadPoolExecutor

import MyCustomCudaModule as my_module

class MyClass:
    def __init__(self, data):
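        self.gpus = [0, 1]  # devices I'd like to use
        # One worker thread per GPU; calculate() relies on this executor
        self.executor = ThreadPoolExecutor(max_workers=len(self.gpus))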

        # Split some data to chunks and allocate on its own GPU
        self.tensor0 = torch.tensor(data[0], dtype=torch.float64).chunk(len(self.gpus))
        self.tensor0 = [self.tensor0[idx].to(f'cuda:{gpu}') for idx, gpu in enumerate(self.gpus)]

        self.tensor1 = torch.tensor(data[1], dtype=torch.float64).chunk(len(self.gpus))
        self.tensor1 = [self.tensor1[idx].to(f'cuda:{gpu}') for idx, gpu in enumerate(self.gpus)]

    def calculate(self):
        # Prepare input data to use my CUDA method
        chunks = list()
        for idx in range(len(self.gpus)):
            chunk = [self.tensor0[idx], self.tensor1[idx]]
            chunks.append(chunk)

        # Submit one calculation per GPU; map returns the results in input order
        futures = self.executor.map(lambda ch: my_module.calculate(*ch), chunks)

        total_result = 0.0
        for result in futures:
            total_result += result.item()  # return calculations result from GPU to CPU

        return total_result

It splits my data between the GPUs and calculates correctly, but I get no speedup (the speed is the same as when I use 1 GPU).
What should I do to calculate faster?

How large is the data tensor? If it is not large enough, GIL contention across threads and the extra setup overhead could overshadow the speedup from using multiple GPUs. Also, how did you measure the time? Since the computation runs on CUDA, you might need to use CUDA events and elapsed_time to get an accurate measurement.
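
For reference, a minimal sketch of timing with CUDA events. The helper name `time_call` is hypothetical, and the wall-clock fallback is only there so the snippet also runs on a machine without a GPU.

```python
import time
import torch

def time_call(fn, *args):
    """Return (result, elapsed milliseconds) for a single call of fn."""
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # finish earlier work so it is not counted
        start.record()
        result = fn(*args)
        end.record()
        torch.cuda.synchronize()  # wait until the end event has been reached
        return result, start.elapsed_time(end)  # milliseconds
    # CPU fallback (assumption: only used when CUDA is unavailable).
    t0 = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - t0) * 1000.0

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)
_, ms = time_call(torch.matmul, x, x)
print(f"{ms:.3f} ms")
```

Note that the events here are recorded on the current device only. For a multi-GPU run, one option is to call `torch.cuda.synchronize(device)` for each device and then compare wall-clock times.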

If elapsed_time still shows no improvement, can you try:

  1. increase chunk size.
  2. use multiple processes

On average the tensors have about 2000 * 1000 * 100 elements; sometimes they can have about 15000 * 8000 * 100 elements. I split them into chunks along dim=0.

Now my guess is that they are calculated sequentially instead of in parallel. (I thought that ThreadPoolExecutor.map would start the first GPU thread, return to the main CPU thread, start the second GPU thread, and so on, then wait for results from whichever device finished. But as far as I can see, it waits until the first GPU finishes and only then starts the calculations on the second one.)

So what is the best practice for starting my calculations asynchronously? (I cannot use asyncio.)

Multithreading should work, and this is how DataParallel is implemented (search for parallel_apply). But if my_module.calculate is composed of many CPU ops, you might see sequential execution due to the GIL.
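
As a small standard-library check of how `ThreadPoolExecutor.map` behaves: it submits every task up front and then yields the results in input order, so the tasks can overlap as long as each call releases the GIL while it waits. In the sketch below the barrier is only a stand-in for such a call (it is not part of the original code); `barrier.wait` can succeed only if both tasks are running at the same time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N_TASKS = 2
barrier = threading.Barrier(N_TASKS)

def task(idx):
    # wait() returns only once all N_TASKS threads have reached it. If the
    # tasks ran one after another, it would raise BrokenBarrierError after
    # the timeout instead of returning.
    barrier.wait(timeout=5)
    return idx

with ThreadPoolExecutor(max_workers=N_TASKS) as executor:
    results = list(executor.map(task, range(N_TASKS)))

print(results)  # [0, 1]
```

If the same pattern runs sequentially with a custom extension, one thing to check is whether the extension call holds the GIL for its whole duration, for example while it waits for the GPU to finish.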