@cbalioglu Thanks for your reply. I have this question because I think that either way, I have to implement model parallelism myself (in other words, place the model and the inputs/outputs on the correct devices). However, I found some discussions saying that Python multiprocessing is not actually compatible with CUDA, and that the GPUs might not do the "multiprocessing" you expect.
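For reference, my understanding from those discussions is that CUDA in child processes only works with the "spawn" start method, roughly like this (a minimal sketch; the worker function and sizes are made up):

```python
import torch
import torch.multiprocessing as mp

def worker(rank):
    # each process initializes its own CUDA context on its own device
    x = torch.randn(4, 4, device=f'cuda:{rank}')
    print(rank, x.sum().item())

if __name__ == '__main__':
    # mp.spawn uses the "spawn" start method; the default "fork" on Linux
    # can break an already-initialized CUDA context in the child process
    mp.spawn(worker, nprocs=2)
```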
On the other hand, I also found a difference between these two docs. The latter uses RPC, which is mainly for multi-machine training, but I still have questions:
- In first doc’s, I found I got more training time when I do the following code (same as tutorial):
```python
import time

import torch
import torch.nn.functional as F

# With model parallelism only
model = Classification_model_parel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
x, y = batchloader(X_train, y_train, batch_size)
start = time.time()
for i in range(epoch):
    for x_batch, y_batch in zip(x, y):
        out = model(x_batch.to(torch.float32))  # model moves data to cuda:0 internally
        loss = F.cross_entropy(out, y_batch.to('cuda:1'))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
end = time.time()

model.eval()
with torch.no_grad():
    pred = model(X_test.to(torch.float32))
correct = (pred.argmax(dim=1) == y_test.to('cuda:1')).sum()
acc = correct / len(y_test)  # fraction of correct predictions (accuracy)
print(f'Acc: {acc:.4f}, training time = {end - start}')
```
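For context, Classification_model_parel splits the network across two GPUs in the same way as the tutorial's ToyModel; a minimal sketch (the layer sizes are placeholders, not my real model):

```python
import torch
import torch.nn as nn

class Classification_model_parel(nn.Module):
    def __init__(self):
        super().__init__()
        # first stage on cuda:0, second stage on cuda:1 (placeholder sizes)
        self.seq1 = nn.Sequential(nn.Linear(128, 64), nn.ReLU()).to('cuda:0')
        self.seq2 = nn.Linear(64, 10).to('cuda:1')

    def forward(self, x):
        x = self.seq1(x.to('cuda:0'))
        return self.seq2(x.to('cuda:1'))
```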
```python
# With pipeline and model parallelism
split_test = [4, 8, 12, 24, 48, 96, 128, 256]
record = []
for splitset in split_test:
    model = Classification_model_parel_pipe(splitset)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

    model.train()
    x, y = batchloader(X_train.to(torch.float32), y_train, batch_size)
    start = time.time()
    for i in range(epoch):
        for x_batch, y_batch in zip(x, y):
            out = model(x_batch.to('cuda:0'))
            loss = F.cross_entropy(out, y_batch.to('cuda:1'))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    end = time.time()

    model.eval()
    with torch.no_grad():
        pred = model(X_test.to('cuda:0'))
    correct = (pred.argmax(dim=1) == y_test.to('cuda:1')).sum()
    acc = correct / len(y_test)
    print(f'Split {splitset}, Acc: {acc:.4f}, training time = {end - start}')
    record.append(end - start)
```
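Classification_model_parel_pipe is the pipelined version: it splits each batch into micro-batches so the two GPUs overlap, following the tutorial's pattern (again a sketch, built on the placeholder stages above):

```python
class Classification_model_parel_pipe(Classification_model_parel):
    def __init__(self, split_size):
        super().__init__()
        self.split_size = split_size  # micro-batch size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next.to('cuda:0')).to('cuda:1')  # stage 1 on micro-batch 0
        ret = []
        for s_next in splits:
            ret.append(self.seq2(s_prev))  # stage 2 on micro-batch i (cuda:1)
            # stage 1 on micro-batch i+1 (cuda:0) runs concurrently with the line above
            s_prev = self.seq1(s_next.to('cuda:0')).to('cuda:1')
        ret.append(self.seq2(s_prev))
        return torch.cat(ret)
```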
The pipelined training time is longer than the non-pipelined one for every split size. Only when I set the split size to the original batch_size and double batch_size does the pipelined version become faster. Is this expected?
- Although RPC is meant for multiple machines, if I init RPC with only one worker and use Pipe from the torch.distributed package as the second doc shows, I believe it is doing the same thing as the first doc. But it turns out that the RPC + Pipe version trains much slower than the first doc's approach. Is my understanding wrong?
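Concretely, what I mean by the single-worker setup looks like this (a sketch per the second doc, using torch.distributed.pipeline.sync.Pipe; the stage sizes are placeholders):

```python
import os

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
rpc.init_rpc('worker', rank=0, world_size=1)  # Pipe requires RPC, even single-process

# Pipe expects an nn.Sequential whose stages already sit on their devices
stage1 = nn.Sequential(nn.Linear(128, 64), nn.ReLU()).to('cuda:0')
stage2 = nn.Sequential(nn.Linear(64, 10)).to('cuda:1')
model = Pipe(nn.Sequential(stage1, stage2), chunks=8)

out = model(torch.randn(32, 128, device='cuda:0')).local_value()  # forward returns an RRef
```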