According to CUDA semantics, GPU operations are asynchronous: operations on different GPUs can run simultaneously once their input data is ready, and that is what makes techniques like pipelining possible, right?
I also see this description of CUDA streams:
> A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own "default" stream.
I think this is how CUDA does asynchronous execution, since PyTorch can run multiple streams at the same time, right?
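For example, this is how I understand it (a minimal sketch with placeholder tensors, not my real code):

```python
import torch

# Placeholder tensors: kernels queued on two different streams of the
# same device are allowed to overlap on the GPU.
a = torch.randn(1000, 1000, device='cuda')
b = torch.randn(1000, 1000, device='cuda')

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()
# Make both streams wait for the allocations above on the default stream.
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):
    c = a @ a  # queued on s1; the Python call returns immediately

with torch.cuda.stream(s2):
    d = b @ b  # queued on s2; may run concurrently with the matmul on s1

torch.cuda.synchronize()  # wait for both streams before using c and d
```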
My questions are:
- Is it possible for CUDA to do asynchronous execution during backward() and optimizer.step()?
- Is the idea of a CUDA "stream" the same as a "process" in torch.multiprocessing?
For question 1, consider the toy model below:
```python
class ToyModel(nn.Module):
    def __init__(self, device):
        super().__init__()
        self.l1 = nn.Linear(10, 10).to(device)
        self.l2 = nn.Linear(10, 10).to(device)
        self.device = device

    def forward(self, x):
        x_1 = self.l1(x.to(self.device))
        x_2 = self.l2(x_1.to(self.device))
        return x_2, x_1.detach()
```
During the forward pass, I take the first linear layer's output and detach it from the graph, so it can be used for other training.
And I have a model which contains two of these toy models:
```python
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.m1 = ToyModel('cuda:0')
        self.m2 = ToyModel('cuda:1')
        self.opt1 = torch.optim.Adam(self.m1.parameters(), lr=0.01, weight_decay=5e-4)
        self.opt2 = torch.optim.Adam(self.m2.parameters(), lr=0.01, weight_decay=5e-4)

    def forward(self, x):
        r_1, out = self.m1(x)
        r_2, _ = self.m2(out)
        return r_1, r_2
```
The two ToyModels inside MyModel can actually train independently once self.m2 has received the value of out. If we pipeline the training, and training one ToyModel takes 10 seconds, I would expect training MyModel to take less than 20 seconds. But it turns out that the training time of MyModel is roughly 20 seconds, which suggests CUDA did not execute asynchronously. Is there anything wrong?
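For reference, this is the kind of pipelining I have in mind: a minimal sketch in the spirit of the input-splitting pattern from the PyTorch model-parallel tutorial (the pipelined_forward name and split_size value are just illustrative, not part of my real code):

```python
# Micro-batch pipelining sketch: stage 1 (m1 on cuda:0) starts on the
# next micro-batch while stage 2 (m2 on cuda:1) is still busy with the
# previous one. Since kernel launches are asynchronous, the two GPUs
# can overlap.
def pipelined_forward(model, x, split_size=2):
    splits = iter(x.split(split_size, dim=0))
    nxt = next(splits)
    r1, out = model.m1(nxt)          # stage 1 on cuda:0
    results = []
    for nxt in splits:
        r2, _ = model.m2(out)        # stage 2 on cuda:1 (previous micro-batch)
        results.append((r1, r2))
        r1, out = model.m1(nxt)      # stage 1 on cuda:0 (next micro-batch)
    r2, _ = model.m2(out)            # drain the last micro-batch
    results.append((r1, r2))
    return results
```

(When timing this, I believe torch.cuda.synchronize() has to be called on both devices before stopping the clock; otherwise the measurement only covers the kernel launches.)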
Thanks to anyone who reads my question.