What does CUDA asynchronous execution really look like?

According to CUDA semantics, GPU operations are asynchronous, which means operations on different GPUs can run simultaneously once their input data is ready. That’s why we can use techniques like pipelining, isn’t it?

I also see this description of CUDA streams:

A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own “default” stream.

which I think is how CUDA does asynchronous execution, because PyTorch can run multiple streams at the same time, right?
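For example, this is a minimal sketch (my own illustration, not from the docs) of putting work on two streams so the kernels can overlap:

import torch

# Minimal sketch: work submitted to two different streams is only ordered
# within each stream, so the two matmuls below may run concurrently.
s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')

# make sure the allocations on the default stream are visible to s1/s2
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):
    c = a @ a              # enqueued on s1; the call returns immediately

with torch.cuda.stream(s2):
    d = b @ b              # enqueued on s2; may overlap with the matmul on s1

torch.cuda.synchronize()   # block the CPU until both streams have finished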

My questions are:

  • Is it possible for CUDA to execute backward and optimizer.step() asynchronously?
  • Is the idea of a “stream” the same as a “process” in torch.multiprocessing?

For question 1, consider the toy model below:

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, device):
        super().__init__()
        self.l1 = nn.Linear(10, 10).to(device)
        self.l2 = nn.Linear(10, 10).to(device)
        self.device = device

    def forward(self, x):
        x_1 = self.l1(x.to(self.device))
        x_2 = self.l2(x_1.to(self.device))
        return x_2, x_1.detach()

During the forward pass, I detach the output of the first linear layer (cutting the gradient) so it can be used for other training.

And I have a model which contains two of these toy models:

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.m1 = ToyModel('cuda:0')
        self.m2 = ToyModel('cuda:1')
        self.opt1 = torch.optim.Adam(self.m1.parameters(), lr=0.01, weight_decay=5e-4)
        self.opt2 = torch.optim.Adam(self.m2.parameters(), lr=0.01, weight_decay=5e-4)

    def forward(self, x):
        r_1, out = self.m1(x)
        r_2, out = self.m2(out)
        return r_1, r_2

Actually, the two ToyModels in MyModel can train independently once self.m2 gets the value of out. If we pipeline the training and assume training one ToyModel takes 10 seconds, I would expect training MyModel to take less than 20 seconds. But it turns out the training time of MyModel is roughly equal to 20 seconds, which means CUDA did not execute asynchronously. Is there anything wrong?
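For reference, a minimal sketch of how such a timing could be measured (train_step and loader are placeholders, not my actual code); both devices need to be synchronized before reading the clock, since kernel launches return immediately:

import time
import torch

# Sketch only: `train_step`, `model` and `loader` stand in for the real
# training code. The synchronize calls make sure the measured time includes
# the GPU work that was queued asynchronously on both devices.
for d in range(torch.cuda.device_count()):
    torch.cuda.synchronize(d)
start = time.time()

for data, target in loader:
    train_step(model, data, target)   # forward / backward / optimizer.step()

for d in range(torch.cuda.device_count()):
    torch.cuda.synchronize(d)
print(f"elapsed: {time.time() - start:.2f}s")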

Thanks to anyone who reads my question.

I don’t think that’s the case, since self.m2 depends on the output (out tensor) of self.m1.
This would mean that the second GPU still has to wait for the first one to finish its forward pass before being able to execute its forward method.

@ptrblck Thanks for your reply! I know it is true that self.m2 depends on self.m1, but suppose a ToyModel takes 2 s for the forward pass, 6 s for the backward pass, 2 s to update its parameters, and 1 s for the data transfer. If I use the pipeline technique, self.m2 should be able to start its forward pass at the 3rd second, shouldn’t it?

By the way, I found that most training that uses the pipeline technique is on image and NLP tasks, because their models are big and deep. I wonder whether training a simple model with a pipeline still saves time, because I actually got a longer training time when I pipelined a model with 2 linear layers.

E.g. if you split the data into micro-batches, e.g. via the pipeline utils from PyTorch, Megatron, etc., then yes.
Your current approach will not save any time (but it will save memory), as it just defers the computation to another device, as seen in the first figure here. I would recommend checking a few pipeline parallel utils and seeing what would work for you.

Does it really matter to split the data into micro-batches in my case? As I understand it, taking GPipe as an example, micro-batches are needed because the next mini-batch can only be fed after the previous one has finished its forward, backward, and parameter-update steps, to keep the gradient computation correct. But in my case, unlike a normal model where each layer depends on both the forward and backward of the previous one, self.m1 and self.m2 are only coupled through the forward step, so maybe I don’t need to split the data into micro-batches and can still get the pipeline effect?
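Roughly, the ordering I imagine looks like the sketch below (not my actual training code; criterion1/criterion2 and target1/target2 are placeholders). Because each device queues work on its own default stream, launching m2's forward on cuda:1 before calling m1's backward on cuda:0 should let the two run concurrently:

# Sketch only: criterion1/criterion2 and target1/target2 are placeholders.
r1, out = model.m1(x)                       # forward on cuda:0
inp2 = out.to('cuda:1', non_blocking=True)  # async device-to-device copy

r2, _ = model.m2(inp2)                      # forward enqueued on cuda:1

loss1 = criterion1(r1, target1.to('cuda:0'))
loss1.backward()                            # backward on cuda:0, launched after
model.opt1.step()                           # m2's forward, so they may overlap

loss2 = criterion2(r2, target2.to('cuda:1'))
loss2.backward()                            # backward on cuda:1
model.opt2.step()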

The code below is how I implement the pipeline, following this:

def train_pipeline(model, x, k, split_size):
    # model.m is assumed to be an nn.ModuleList of k ToyModels, one per GPU
    max_gpu_index = k - 1
    stage = {}
    splitsX = iter(x.split(split_size, dim=0))  # micro-batches
    head = 0
    tail = -1

    # fill / steady phase: feed a new micro-batch and advance every active stage
    for s_nextX in splitsX:
        stage[0] = s_nextX
        for i in range(head, tail, -1):
            result, stage[i + 1] = model.m[i](stage[i])
            model.m[i].backward()  # a method to compute the ToyModel loss and call backward
            model.m[i].update()    # a method to update the ToyModel's parameters
        if head < max_gpu_index:
            head += 1

    # drain phase: no new data; push the remaining stage outputs through
    tail = 0
    while tail != head:
        for i in range(head, tail, -1):
            result, stage[i + 1] = model.m[i](stage[i])
            model.m[i].backward()  # a method to compute the ToyModel loss and call backward
            model.m[i].update()    # a method to update the ToyModel's parameters
        tail += 1

As I understand it, it uses an iterator to fetch the next micro-batch and stores each stage’s output as the next stage’s input. Once the iterator runs out of data, it drains the pipeline by feeding the remaining outputs of each stage to the next stage.

All we need to do is make sure each layer (ToyModel) knows where its next input is; PyTorch will enqueue each step on the specified CUDA device and insert the needed synchronization. In most cases, CUDA synchronizes between each layer’s forward and backward, as in the figure you pointed to. But I think in my case the synchronization only occurs at each layer’s forward, so the timeline may look like F1,1 → (B1,1, F2,1) → (UPDATE1, B2,1) → (F1,2, UPDATE2), and so on.

I’ve also checked some utilities like GPipe, but I’m not sure they fit my case because they only support sequential models with a single output; that’s why I tried to code it myself.

If there is any mistake or problem of my thought, feel free to let me know, thanks!

You don’t need to overlap the computation, but would then end up with the large bubble as shown in the first figure in my link. If you use a more “sophisticated” pipeline parallel approach, you would be able to close the bubble a bit and would not waste compute resources.
The naive approach would still allow you to run larger models, which do not fit onto a single GPU.

You don’t need to overlap the computation, but would then end up with the large bubble as shown in the first figure in my link

Sorry, but I don’t understand what you mean. Do you mean my pipeline implementation would end up with poor performance because it’s too naive? I wrote this implementation because I need to test the performance with different numbers of GPUs.

Besides, could you give me some explanation of why MyModel doesn’t work the way I thought it would, but instead behaves like the figure in your link? I’m still confused: if PyTorch calls CUDA asynchronously, why can’t self.m1 run its backward pass while self.m2 is doing its forward pass? I think there is no gradient dependency between the two.

Yes, I think your MyModel implementation will only use a single GPU at each step, while all the others are idle, if you don’t orchestrate the forward/backward passes using pipeline parallel utils.

Try to write down the executions on a timeline for e.g. 2 GPUs using MyModel and a standard training loop:

for data, target in loader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

and you would end up with the first picture.
It could also be a good idea to just run your code and check the timelines via the PyTorch profiler or e.g. Nsight Systems.
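Something along these lines could serve as a starting point (a minimal sketch; model, data, target, criterion, and optimizer stand in for your objects):

import torch
from torch.profiler import profile, ProfilerActivity

# Sketch only: `model`, `data`, `target`, `criterion`, `optimizer` are
# placeholders. The exported trace can be opened in chrome://tracing or
# Perfetto to check whether the GPU work on the two devices overlaps.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")   # inspect the timeline for overlap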

Thanks for the advice, I hadn’t used the PyTorch profiler before. After using it to trace the timeline of train_pipeline and comparing it to the tutorial in my link, it seems that MyModel doesn’t pipeline as I thought, because the steps don’t overlap. But I still wonder what’s wrong with my implementation in train_pipeline; could anyone help me find it?

By the way, are there any recommended utilities for pipeline parallelism? I have used Pipe in PyTorch, but it requires the model to be an nn.Sequential, which may not be suitable for my model (see the sketch below for the form I think it expects).
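For reference, this is roughly how I understand Pipe is meant to be used (a sketch based on the PyTorch pipeline parallelism tutorial; since my ToyModel returns two tensors, it doesn’t map onto this directly):

import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Sketch of the nn.Sequential form Pipe expects; the single-process RPC
# setup follows the PyTorch pipeline tutorial.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
rpc.init_rpc('worker', rank=0, world_size=1)   # Pipe requires the RPC framework

seq = nn.Sequential(
    nn.Linear(10, 10).to('cuda:0'),
    nn.Linear(10, 10).to('cuda:1'),
)
model = Pipe(seq, chunks=4)          # each batch is split into 4 micro-batches

x = torch.randn(32, 10).to('cuda:0')
out = model(x).local_value()         # forward returns an RRef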