Split Single GPU

Is there any way to split a single GPU and use it as multiple GPUs?
For example, suppose we have two different ResNet18 models and we want to run the forward passes of these two models in parallel on just one GPU (with enough memory, e.g., 12 GB). I mean that the forward passes of the two models should run in parallel and concurrently on a single GPU.

If your code is Torch Distributed compatible, you can spawn two processes on the same device.
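For example, something along these lines might be a starting point (a rough sketch, not tested; it assumes the script is launched via python -m torch.distributed.launch --nproc_per_node=2 script.py, which sets the rendezvous environment variables, and it uses the gloo backend since both ranks share one physical GPU):

import torch
import torch.distributed as dist

# Launched via torch.distributed.launch, which sets MASTER_ADDR, RANK, etc.
dist.init_process_group(backend='gloo')  # gloo, since both ranks share one GPU
rank = dist.get_rank()
device = torch.device('cuda:0')  # every rank is pinned to the same physical GPU
# ... build this rank's model on `device` and run its forward passes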

I am not sure if I understood your question properly.
Can’t you just do
python model1.py
python model2.py
in two different shells?

(Although this slows down both individual processes, I haven't seen any recommended method that maintains the speed!)

I’m not sure if I get what you want, but perhaps my last topic (Couple of models in production) will guide you.

Cheers,
Anton

You could try using DataParallel (as for multiple GPUs) and pass the same GPU ID several times.
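In code, that suggestion would look something like this (a sketch only, untested; newer PyTorch versions may reject duplicate device ids, and model / x stand for an already-built model and input batch):

import torch.nn as nn

# The same physical GPU is listed twice in device_ids
parallel_model = nn.DataParallel(model, device_ids=[0, 0])
output = parallel_model(x)  # the batch is split across the two replicas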

Thanks all for the replies above.
I have read the topic (Couple of models in production) and, based on it, I have implemented the code below.
First Scenario (Sequential Forward Pass):

import torch
import time
from torchvision import models
from torch.autograd import Variable

# Check use GPU or not
use_gpu = torch.cuda.is_available()  # use GPU

torch.manual_seed(123)
if use_gpu:
    torch.cuda.manual_seed(456)


# Define CNN Models:
model1 = models.resnet18(pretrained=True)
model2 = models.resnet50(pretrained=True)

# Eval Mode:
model1.eval()
model2.eval()

# Put on GPU:
if use_gpu:
    model1 = model1.cuda()
    model2 = model2.cuda()

# Create tmp Variable:
x = Variable(torch.randn(10, 3, 224, 224))
if use_gpu:
    x = x.cuda()


# Forward Pass:
tic1 = time.time()
out1 = model1(x)
out2 = model2(x)
tic2 = time.time()

sequential_forward_pass = tic2 - tic1
print('Time = ', sequential_forward_pass)  # example output --> Time =  0.6485

Now I want to perform the forward passes in parallel on just one single GPU.
Second Scenario (Parallel Forward Pass):

import time
import torch
from torchvision import models
import torch.multiprocessing as mp
from torch.autograd import Variable

# Check use GPU or not
use_gpu = torch.cuda.is_available()  # use GPU

torch.manual_seed(123)
if use_gpu:
    torch.cuda.manual_seed(456)


# Define Forward Pass Method:
def forward_pass_method(model, tmp_variable):
    output = model(tmp_variable)
    return output


# Define CNN Models:
model1 = models.resnet18(pretrained=True)
model2 = models.resnet50(pretrained=True)

# Eval Mode:
model1.eval()
model2.eval()

# Put on GPU:
if use_gpu:
    model1 = model1.cuda()
    model2 = model2.cuda()

# Create tmp Variable:
x = Variable(torch.randn(10, 3, 224, 224))
if use_gpu:
    x = x.cuda()

# Parallelized the Forward Passes:
tic1 = time.time()

model1.share_memory()
model2.share_memory()

processes = []
num_processes = 2
for i in range(num_processes):
    if i == 0:
        p = mp.Process(target=forward_pass_method, args=(model1, x))
    else:
        p = mp.Process(target=forward_pass_method, args=(model2, x))
    p.start()
    processes.append(p)
for p in processes:
    p.join()

tic2 = time.time()

parallel_forward_pass = tic2 - tic1
print('Time = ', parallel_forward_pass)

However, the second method fails with the error below:
...RuntimeError: CUDA error (3): initialization error
Would you please kindly help me address this error?
Also, it is worth noting that I am in doubt whether parallelizing on just a single GPU is feasible at all.

Dear @ptrblck,
Do you have any idea about my last post?
I have followed your suggestion.

The error seems to be related to some issues with multiprocessing and CUDA.
Have a look at the doc on Sharing CUDA tensors.

You have to use the “spawn” or “forkserver” start method.

Also, the time measurement in your first script is a bit wrong, because you have to call torch.cuda.synchronize() before getting the end time.
CUDA calls are asynchronous, so the end time might be stored before the CUDA operations are done.
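Applied to your first script, the corrected measurement would look something like this (a minimal sketch, reusing model1, model2, and x from that script):

torch.cuda.synchronize()  # make sure no previously queued work leaks into the timing
tic1 = time.time()
out1 = model1(x)
out2 = model2(x)
torch.cuda.synchronize()  # wait until both forward passes have actually finished
tic2 = time.time()
print('Time = ', tic2 - tic1)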

Here is a small script for your second use case, which might be a starter (I’m not sure if you need modelX.share_memory()):

import torch
import torch.nn as nn
import torch.optim as optim

import torch.multiprocessing as _mp
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader


# Globals
mp = _mp.get_context('spawn')
use_cuda = True


class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return x


def get_model():
    model = nn.Sequential(
            nn.Conv2d(3, 6, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 1, 3, 1, 1),
            nn.MaxPool2d(2),
            Flatten(),
            nn.Linear(28*28, 10),
            nn.LogSoftmax(dim=1)
    )
    
    return model


def train(model, data_loader, optimizer, criterion):
    for data, labels in data_loader:
        labels = labels.long()
        if use_cuda:
            data, labels = data.to('cuda'), labels.to('cuda')
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()


if __name__=='__main__':
    num_processes = 2
    model1 = get_model()
    model2 = get_model()
    if use_cuda:
        model1 = model1.to('cuda')
        model2 = model2.to('cuda')
    
    dataset = datasets.FakeData(transform=transforms.ToTensor())
    data_loader = DataLoader(dataset, batch_size=2,
                             num_workers=0,
                             pin_memory=False)
    
    criterion = nn.NLLLoss()
    optimizer1 = optim.SGD(model1.parameters(), lr=1e-3)
    optimizer2 = optim.SGD(model2.parameters(), lr=1e-3)
    
    #model1.share_memory()
    #model2.share_memory()
    processes = []
    p1 = mp.Process(target=train, args=(model1, data_loader, optimizer1, criterion))
    p1.start()
    processes.append(p1)
    p2 = mp.Process(target=train, args=(model2, data_loader, optimizer2, criterion))
    p2.start()
    processes.append(p2)

    for p in processes:
        p.join()
    
    print('Done')

However, I’m still not sure if you’ll see any performance advantage.
It would be nice if you could time your script and report the results for the sequential and the multiprocessing approach.

Dear @ptrblck,
Thank you for your time & response. I have used the spawn start method, and the errors have been resolved. Now, my modified code is as below:

import time
import torch
from torchvision import models
import torch.multiprocessing as mp
from torch.autograd import Variable

# Check use GPU or not
use_gpu = torch.cuda.is_available()  # use GPU

torch.manual_seed(123)
if use_gpu:
    torch.cuda.manual_seed(456)

# spawn start method:
mp = mp.get_context('spawn')


# Define Forward Pass Method:
def forward_pass_method(model, tmp_variable):
    output = model(tmp_variable)
    return output


# Define CNN Models:
model1 = models.resnet18(pretrained=True)
model2 = models.resnet50(pretrained=True)

# Eval Mode:
model1.eval()
model2.eval()

# Put on GPU:
if use_gpu:
    model1 = model1.cuda()
    model2 = model2.cuda()

# Create tmp Variable:
x = Variable(torch.randn(10, 3, 224, 224))
if use_gpu:
    x = x.cuda()

# model1.share_memory()
# model2.share_memory()

if __name__ == '__main__':

    # Parallelized the Forward Passes:
    tic1 = time.time()
    processes = []
    num_processes = 2
    for i in range(num_processes):
        if i == 0:
            p = mp.Process(target=forward_pass_method, args=(model1, x))
        else:
            p = mp.Process(target=forward_pass_method, args=(model2, x))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

    tic2 = time.time()

    parallel_forward_pass = tic2 - tic1
    print('Time = ', parallel_forward_pass)

However, the computational time increased a lot in comparison to the sequential way. As a result, I came to the conclusion that using multiprocessing on a single GPU doesn’t offer any performance advantage.

Have you tried it with a larger workload, or just a single forward pass?
I think the startup might take much longer in the multiprocessing case, so it might still be faster in the long run.
But as I said, this is a lot of speculation, since I haven’t used this approach yet.
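One way to test this would be to give each process more work, so that the one-time startup cost is amortized (with the 'spawn' start method each child re-imports the module, so e.g. the model construction at module level runs again in every process). A sketch, where n_iters is just an illustrative parameter:

def forward_pass_method(model, tmp_variable, n_iters=100):
    # Repeat the forward pass so the per-process startup cost is amortized
    with torch.no_grad():
        for _ in range(n_iters):
            output = model(tmp_variable)
    torch.cuda.synchronize()  # make sure all queued kernels have finished
    return output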

@ahkarami @ptrblck I failed to use this code to run inference with multiple models on a single GPU on Windows.

RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp:245

Could you please help me solve this problem?

Unfortunately, I haven’t worked with PyTorch on Windows.

Your error seems to be this one.

Actually, you can’t share models across processes on Windows, but you can share tensors instead.

Thanks ptrblck.

I have used the if __name__ == '__main__' protection, but I still run into this issue.

Thanks peterjc123,

Could you share more details?

I’m using three instance segmentation models, which were trained with maskrcnn-benchmark, and I would like to run inference using multiprocessing.

How to share tensors across processes on Windows?

What about starting three web servers with different ports?

That is a way, but I just want to run locally.

That is also possible: use multiprocessing, pass the model paths and two queues (input/output) as arguments, then push tensors into the input queue and retrieve the answers from the output queue.
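A minimal sketch of that producer/consumer pattern (load_model is a hypothetical helper standing in for however you restore a model from its path; only CPU tensors go through the queues, so nothing relies on CUDA sharing, which Windows doesn’t support):

import torch
import torch.multiprocessing as mp


def load_model(path):
    # Hypothetical helper: replace with however you restore your model
    from torchvision import models
    return models.resnet18(pretrained=True)


def worker(model_path, in_queue, out_queue):
    # Each process loads its own model; nothing is shared between processes
    model = load_model(model_path).to('cuda').eval()
    while True:
        x = in_queue.get()
        if x is None:  # sentinel: shut the worker down
            break
        with torch.no_grad():
            out_queue.put(model(x.to('cuda')).cpu())  # back to CPU before queuing


if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    in_q, out_q = ctx.Queue(), ctx.Queue()
    p = ctx.Process(target=worker, args=('model1.pth', in_q, out_q))
    p.start()
    in_q.put(torch.randn(1, 3, 224, 224))  # producer: push an input
    result = out_q.get()                   # consumer: retrieve the answer
    in_q.put(None)                         # tell the worker to stop
    p.join()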

Thanks!

What’s the difference between this solution and multiprocessing.Pool?

Do you mean that I should load a single model in each process and use a producer-consumer pattern with queues for input/output?