Difference between pipeline parallelism and multiprocessing?

I’m trying to train a model that contains two sub-models on two GPUs simultaneously, and I’m looking into parallelism and multiprocessing.

As far as I know, parallelism includes data parallelism and model parallelism. My case looks like a fit for model parallelism, and pipelining is usually used together with it to reduce the idle time caused by transferring data between the sub-models. This can be implemented with the torch.distributed.pipeline.sync package.
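
To make that concrete, here is a minimal sketch of the kind of setup I have in mind with torch.distributed.pipeline.sync.Pipe (the layer sizes and the two nn.Sequential halves are placeholders, not my real sub-models; Pipe also needs the RPC framework initialized even on a single machine):

# Sketch only: placeholder two-GPU model wrapped in Pipe
import os

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework to be initialized, even for one process.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
rpc.init_rpc('worker', rank=0, world_size=1)

part_a = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to('cuda:0')  # placeholder
part_b = nn.Sequential(nn.Linear(256, 10)).to('cuda:1')              # placeholder

# chunks = number of micro-batches each input batch is split into
model = Pipe(nn.Sequential(part_a, part_b), chunks=8)

x = torch.randn(64, 128, device='cuda:0')
out = model(x).local_value()  # Pipe's forward returns an RRef; the result lives on cuda:1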

On the other hand, because of the GIL in Python, if we really want to execute multiple threads of work (in my case, training two models) at the same time, one way is to use torch.multiprocessing, which is based on Python’s multiprocessing package and spawns new interpreters to get around the GIL limitation.
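
What I picture for the multiprocessing route is roughly this (just a sketch; train_one_model is a hypothetical stand-in for my real training loop, one process per GPU):

# Sketch only: one training process per GPU via torch.multiprocessing
import torch
import torch.multiprocessing as mp

def train_one_model(rank):
    # Hypothetical training loop: each spawned process has its own Python
    # interpreter (no GIL contention) and drives its own GPU.
    device = f'cuda:{rank}'
    model = torch.nn.Linear(128, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(100):
        x = torch.randn(32, 128, device=device)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

if __name__ == '__main__':
    # spawn (rather than fork) is required when CUDA is involved
    mp.spawn(train_one_model, nprocs=2)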

My questions are below:

  1. What is the difference between parallelism and multiprocessing?
  2. In my case, which one (or which package) should I use?
  3. I found two PyTorch documents related to parallelism:

Single-Machine Model Parallel Best Practices

Training Transformer Models Using Pipeline Parallelism

But I wonder whether the former really does “parallelism”. Doesn’t it just split the model, so that only one part is computing at any given time?
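
To illustrate what I mean, a module like the following (a placeholder in the spirit of the tutorial’s ToyModel, not my real code) only ever has one GPU busy during a plain forward pass:

# Sketch only: naive model parallelism without pipelining
import torch
import torch.nn as nn

class ToyModelParallel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10).to('cuda:0')
        self.net2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        x = torch.relu(self.net1(x.to('cuda:0')))  # cuda:1 is idle here
        return self.net2(x.to('cuda:1'))           # cuda:0 is idle here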

Thanks to anyone who takes a look at my question.

torch.multiprocessing is nothing more than a glorified version of Python’s standard multiprocessing package. It has no inherent knowledge of models or any sort of parallelism. When using torch.multiprocessing, you are responsible for implementing concepts such as model parallelism yourself, which can be quite an undertaking.

Regarding your question about the two docs you referred to: please note that the first doc’s last section, “Speed up by Pipelining Inputs”, is actually where the “magic” happens.

https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html#speed-up-by-pipelining-inputs
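
The idea in that section, roughly sketched (the two sub-networks and the split size here are placeholders): the input batch is cut into micro-batches so that while cuda:1 is still working on one chunk, cuda:0 can already start on the next one, instead of sitting idle.

# Sketch only: manual micro-batch pipelining across two GPUs
import torch

def pipelined_forward(net1, net2, x, split_size):
    # Assumes net1 lives on cuda:0 and net2 on cuda:1.
    splits = iter(x.split(split_size, dim=0))
    s_next = next(splits)
    s_prev = net1(s_next.to('cuda:0')).to('cuda:1')
    outputs = []
    for s_next in splits:
        # cuda:1 runs net2 on the previous chunk ...
        outputs.append(net2(s_prev))
        # ... while cuda:0 already runs net1 on the next chunk
        s_prev = net1(s_next.to('cuda:0')).to('cuda:1')
    outputs.append(net2(s_prev))
    return torch.cat(outputs, dim=0)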

@cbalioglu Thanks for your reply :heart_eyes: I have this question because I think that no matter which way I use, I have to implement model parallelism myself (in other words, put the model and the inputs/outputs on the correct devices). But I found some discussions saying that multiprocessing does not actually play well with CUDA, and the GPUs might not do the “multiprocessing” you expect.

On the other hand, I also noticed the difference between these two docs. The latter uses RPC, which is mainly for training on multiple machines, but I still have questions:

  1. For the first doc, I found I got a longer training time when I ran the following code (same as the tutorial):
# With model parallelism only (Classification_model_parel, batchloader, the
# data tensors, epoch and batch_size come from the rest of my script)
import time

import torch
import torch.nn.functional as F

model = Classification_model_parel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
x, y = batchloader(X_train, y_train, batch_size)

start = time.time()

for i in range(epoch):
    for x_batch, y_batch in zip(x, y):
        out = model(x_batch.to(torch.float32))
        loss = F.cross_entropy(out, y_batch.to('cuda:1'))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

end = time.time()

model.eval()
with torch.no_grad():
    pred = model(X_test)

correct = (pred.argmax(dim=1) == y_test.to('cuda:1')).sum()
auc = correct / len(y_test)  # fraction of correct predictions
print(f'Auc:{auc:.4f}, training time = {end - start}')
# With pipeline and model parallelism (Classification_model_parel_pipe splits
# each batch into `splitset` micro-batches internally)
split_test = [4, 8, 12, 24, 48, 96, 128, 256]
record = []
for splitset in split_test:
    model = Classification_model_parel_pipe(splitset)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

    model.train()
    x, y = batchloader(X_train.to(torch.float32), y_train, batch_size)

    start = time.time()
    for i in range(epoch):
        for x_batch, y_batch in zip(x, y):
            out = model(x_batch.to('cuda:0'))
            loss = F.cross_entropy(out, y_batch.to('cuda:1'))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    end = time.time()

    model.eval()
    with torch.no_grad():
        pred = model(X_test.to('cuda:0'))

    correct = (pred.argmax(dim=1) == y_test.to('cuda:1')).sum()
    auc = correct / len(y_test)  # fraction of correct predictions
    print(f'Set {splitset}, Auc:{auc:.4f}, training time = {end - start}')
    record.append(end - start)
The training time with the pipeline is longer than the training time without it for every split size. Only when I set the split size to the original batch_size and the batch_size to 2*batch_size is the pipelined version faster than the non-pipelined one. Is this correct?

  2. Although RPC is meant for multiple machines, if I init RPC with only one worker and use Pipe from the torch.distributed package as in the second doc, I think it is doing the same thing as the first doc. But it turns out that using RPC and Pipe takes much more training time than the first doc’s approach. Is my concept wrong?