Split Single GPU

Is there any way to split a single GPU and use it as multiple GPUs?
For example, suppose we have two different ResNet18 models and we want to run the forward passes of these two models in parallel on just one GPU (with enough memory, e.g., 12 GB). I mean that the forward passes of the two models should run in parallel and concurrently on a single GPU.

If your code is Torch Distributed compatible, you can spawn two processes on the same device.
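For example, something along these lines might be a starting point (a rough sketch, not tested; it assumes the script is launched via python -m torch.distributed.launch --nproc_per_node=2 script.py, which sets the rendezvous environment variables, and it uses the gloo backend since both ranks share one physical GPU):

import torch
import torch.distributed as dist

# Launched via torch.distributed.launch, which sets MASTER_ADDR, RANK, etc.
dist.init_process_group(backend='gloo')  # gloo, since both ranks share one GPU
rank = dist.get_rank()
device = torch.device('cuda:0')  # every rank is pinned to the same physical GPU
# ... build this rank's model on `device` and run its forward passes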

I am not sure if I understood your question properly.
Can’t you just do
python model1.py
python model2.py
in two different shells?

(Although this slows down both individual processes, I haven't seen any recommended method that maintains the speed!)

I’m not sure if I get what you want, but perhaps my last topic (Couple of models in production) will guide you.

Cheers,
Anton

You could try using DataParallel (as for multiple GPUs) and pass the same GPU ID several times.
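In code, that suggestion would look something like this (a sketch only, untested; newer PyTorch versions may reject duplicate device ids, and model / x stand for an already-built model and input batch):

import torch.nn as nn

# The same physical GPU is listed twice in device_ids
parallel_model = nn.DataParallel(model, device_ids=[0, 0])
output = parallel_model(x)  # the batch is split across the two replicas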

Thanks all for the replies above.
I have read the topic (Couple of models in production) and, based on it, I have implemented the code below.
First Scenario (Sequential Forward Pass):

import torch
import time
from torchvision import models
from torch.autograd import Variable

# Check use GPU or not
use_gpu = torch.cuda.is_available()  # use GPU

torch.manual_seed(123)
if use_gpu:
    torch.cuda.manual_seed(456)


# Define CNN Models:
model1 = models.resnet18(pretrained=True)
model2 = models.resnet50(pretrained=True)

# Eval Mode:
model1.eval()
model2.eval()

# Put on GPU:
if use_gpu:
    model1 = model1.cuda()
    model2 = model2.cuda()

# Create tmp Variable:
x = Variable(torch.randn(10, 3, 224, 224))
if use_gpu:
    x = x.cuda()


# Forward Pass:
tic1 = time.time()
out1 = model1(x)
out2 = model2(x)
tic2 = time.time()

sequential_forward_pass = tic2 - tic1
print('Time = ', sequential_forward_pass)  # example output --> Time =  0.6485

Now I want to perform the forward passes in parallel on just one single GPU.
Second Scenario (Parallel Forward Pass):

import time
import torch
from torchvision import models
import torch.multiprocessing as mp
from torch.autograd import Variable

# Check use GPU or not
use_gpu = torch.cuda.is_available()  # use GPU

torch.manual_seed(123)
if use_gpu:
    torch.cuda.manual_seed(456)


# Define Forward Pass Method:
def forward_pass_method(model, tmp_variable):
    output = model(tmp_variable)
    return output


# Define CNN Models:
model1 = models.resnet18(pretrained=True)
model2 = models.resnet50(pretrained=True)

# Eval Mode:
model1.eval()
model2.eval()

# Put on GPU:
if use_gpu:
    model1 = model1.cuda()
    model2 = model2.cuda()

# Create tmp Variable:
x = Variable(torch.randn(10, 3, 224, 224))
if use_gpu:
    x = x.cuda()

# Parallelized the Forward Passes:
tic1 = time.time()

model1.share_memory()
model2.share_memory()

processes = []
num_processes = 2
for i in range(num_processes):
    if i == 0:
        p = mp.Process(target=forward_pass_method, args=(model1, x))
    else:
        p = mp.Process(target=forward_pass_method, args=(model2, x))
    p.start()
    processes.append(p)
for p in processes:
    p.join()

tic2 = time.time()

parallel_forward_pass = tic2 - tic1
print('Time = ', parallel_forward_pass)

However, the second method fails with the error below:
...RuntimeError: CUDA error (3): initialization error
Would you please kindly help me address this error?
Also, it is worth noting that I am in doubt whether parallelizing on just a single GPU is feasible at all.

Dear @ptrblck,
Do you have any idea about my last post?
I have followed your suggestion.

The error seems to be related to some issues with multiprocessing and CUDA.
Have a look at the doc on Sharing CUDA tensors.

You have to use the “spawn” or “forkserver” start method.

Also, the time measurement in your first script is a bit wrong, because you have to call torch.cuda.synchronize() before getting the end time.
CUDA calls are asynchronous, so the end time might be stored before the CUDA operations are done.
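Applied to your first script, the corrected measurement would look something like this (a minimal sketch, reusing model1, model2, and x from that script):

torch.cuda.synchronize()  # make sure no previously queued work leaks into the timing
tic1 = time.time()
out1 = model1(x)
out2 = model2(x)
torch.cuda.synchronize()  # wait until both forward passes have actually finished
tic2 = time.time()
print('Time = ', tic2 - tic1)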

Here is a small script for your second use case, which might be a starter (I’m not sure if you need modelX.share_memory()):

import torch
import torch.nn as nn
import torch.optim as optim

import torch.multiprocessing as _mp
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader


# Globals
mp = _mp.get_context('spawn')
use_cuda = True


class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return x


def get_model():
    model = nn.Sequential(
            nn.Conv2d(3, 6, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 1, 3, 1, 1),
            nn.MaxPool2d(2),
            Flatten(),
            nn.Linear(28*28, 10),
            nn.LogSoftmax(dim=1)
    )
    
    return model


def train(model, data_loader, optimizer, criterion):
    for data, labels in data_loader:
        labels = labels.long()
        if use_cuda:
            data, labels = data.to('cuda'), labels.to('cuda')
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()


if __name__=='__main__':
    num_processes = 2
    model1 = get_model()
    model2 = get_model()
    if use_cuda:
        model1 = model1.to('cuda')
        model2 = model2.to('cuda')
    
    dataset = datasets.FakeData(transform=transforms.ToTensor())
    data_loader = DataLoader(dataset, batch_size=2,
                             num_workers=0,
                             pin_memory=False)
    
    criterion = nn.NLLLoss()
    optimizer1 = optim.SGD(model1.parameters(), lr=1e-3)
    optimizer2 = optim.SGD(model2.parameters(), lr=1e-3)
    
    #model1.share_memory()
    #model2.share_memory()
    processes = []
    p1 = mp.Process(target=train, args=(model1, data_loader, optimizer1, criterion))
    p1.start()
    processes.append(p1)
    p2 = mp.Process(target=train, args=(model2, data_loader, optimizer2, criterion))
    p2.start()
    processes.append(p2)

    for p in processes:
        p.join()
    
    print('Done')

However, I’m still not sure if you’ll see any performance advantage.
It would be nice if you could time your script and report the results for the sequential and the multiprocessing approach.

Dear @ptrblck,
Thank you for your time & response. I have used the spawn start method, and the errors have been resolved. Now, my modified code is as below:

import time
import torch
from torchvision import models
import torch.multiprocessing as mp
from torch.autograd import Variable

# Check use GPU or not
use_gpu = torch.cuda.is_available()  # use GPU

torch.manual_seed(123)
if use_gpu:
    torch.cuda.manual_seed(456)

# spawn start method:
mp = mp.get_context('spawn')


# Define Forward Pass Method:
def forward_pass_method(model, tmp_variable):
    output = model(tmp_variable)
    return output


# Define CNN Models:
model1 = models.resnet18(pretrained=True)
model2 = models.resnet50(pretrained=True)

# Eval Mode:
model1.eval()
model2.eval()

# Put on GPU:
if use_gpu:
    model1 = model1.cuda()
    model2 = model2.cuda()

# Create tmp Variable:
x = Variable(torch.randn(10, 3, 224, 224))
if use_gpu:
    x = x.cuda()

# model1.share_memory()
# model2.share_memory()

if __name__ == '__main__':

    # Parallelized the Forward Passes:
    tic1 = time.time()
    processes = []
    num_processes = 2
    for i in range(num_processes):
        if i == 0:
            p = mp.Process(target=forward_pass_method, args=(model1, x))
        else:
            p = mp.Process(target=forward_pass_method, args=(model2, x))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

    tic2 = time.time()

    parallel_forward_pass = tic2 - tic1
    print('Time = ', parallel_forward_pass)

However, the computational time increased a lot in comparison to the sequential way. As a result, I came to the conclusion that using multiprocessing on a single GPU doesn’t offer any performance advantage.

Have you tried it with a larger workload, or just a single forward pass?
I think the startup might take much longer in the multiprocessing case, so it might still be faster in the long run.
But as I said, this is a lot of speculation, since I haven’t used this approach yet.
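One way to test this would be to give each process more work, so that the one-time startup cost is amortized (with the 'spawn' start method each child re-imports the module, so e.g. the model construction at module level runs again in every process). A sketch, where n_iters is just an illustrative parameter:

def forward_pass_method(model, tmp_variable, n_iters=100):
    # Repeat the forward pass so the per-process startup cost is amortized
    with torch.no_grad():
        for _ in range(n_iters):
            output = model(tmp_variable)
    torch.cuda.synchronize()  # make sure all queued kernels have finished
    return output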

@ahkarami @ptrblck I failed to use this code to run inference with multiple models on a single GPU on Windows.

RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp:245

Could you please help me solve this problem?

Unfortunately, I haven’t worked with PyTorch on Windows.

Your error seems to be this one.

Actually, you can’t share models across processes on Windows, but you can share tensors instead.

Thanks ptrblck.

I have used the if __name__ == '__main__' protection, but I still run into this issue.

Thanks peterjc123,

Could you share more details?

I’m using three instance segmentation models, which were trained with maskrcnn-benchmark, and I would like to run inference using multiprocessing.

How to share tensors across processes on Windows?

What about starting three web servers with different ports?

That is a way, but I just want to run locally.

That is also possible: use multiprocessing, pass the model paths and two queues (input/output) as arguments, then push tensors into the input queue and retrieve the answers from the output queue.
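A minimal sketch of that producer/consumer pattern (load_model is a hypothetical helper standing in for however you restore a model from its path; only CPU tensors go through the queues, so nothing relies on CUDA sharing, which Windows doesn’t support):

import torch
import torch.multiprocessing as mp


def load_model(path):
    # Hypothetical helper: replace with however you restore your model
    from torchvision import models
    return models.resnet18(pretrained=True)


def worker(model_path, in_queue, out_queue):
    # Each process loads its own model; nothing is shared between processes
    model = load_model(model_path).to('cuda').eval()
    while True:
        x = in_queue.get()
        if x is None:  # sentinel: shut the worker down
            break
        with torch.no_grad():
            out_queue.put(model(x.to('cuda')).cpu())  # back to CPU before queuing


if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    in_q, out_q = ctx.Queue(), ctx.Queue()
    p = ctx.Process(target=worker, args=('model1.pth', in_q, out_q))
    p.start()
    in_q.put(torch.randn(1, 3, 224, 224))  # producer: push an input
    result = out_q.get()                   # consumer: retrieve the answer
    in_q.put(None)                         # tell the worker to stop
    p.join()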

Thanks!

What’s the difference between this solution and multiprocessing.Pool?

Do you mean that I should load a single model in each process and use a producer-consumer pattern with queues for input/output?