Extend convolution and run in parallel on a single GPU

Hi, I'm a PyTorch newbie.

I have a problem in my work and I can't find a solution for it.

I want to implement a model like the image below. It will run on a single GPU.
My inputs have one extra dimension (5 dimensions instead of 4). The model contains 2 submodels, and each submodel takes one input as shown in the image.

I need the 2 submodels to run in parallel to improve performance.

I tried multithreading on one GPU, but the time is the same as running the models sequentially.
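
For reference, here is a minimal sketch of the kind of structure I have in mind (the layer shapes and the way the two branches are combined are placeholders, not my real model):

import torch
import torch.nn as nn

class SubModel(nn.Module):
    # One branch: takes a 5-D input (batch, channels, depth, height, width).
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x))).flatten(1)
        return self.fc(x)

class TwoBranchModel(nn.Module):
    # Two submodels whose outputs are concatenated.
    def __init__(self):
        super().__init__()
        self.branch1 = SubModel()
        self.branch2 = SubModel()

    def forward(self, x1, x2):
        return torch.cat([self.branch1(x1), self.branch2(x2)], dim=1)

model = TwoBranchModel().cuda()
x1 = torch.randn(2, 1, 8, 32, 32).cuda()
x2 = torch.randn(2, 1, 8, 32, 32).cuda()
out = model(x1, x2)   # shape (2, 8)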

Thanks for helping.

Hi,

The CUDA API is asynchronous by default. This means that if you do

out1 = net1(in1)
out2 = net2(in2)

these will run in parallel on the GPU if it can (i.e. if enough compute is available).
So I would expect no improvement from using multithreading on a single GPU :slight_smile:
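
For example, a minimal self-contained version of that pattern looks like this (the nets and inputs here are just dummies for illustration):

import torch
import torch.nn as nn

# Two independent nets and two independent inputs: no dependency between them.
net1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
net2 = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
in1 = torch.randn(64, 128).cuda()
in2 = torch.randn(64, 128).cuda()

# Both calls return as soon as the work is queued on the GPU;
# the GPU is free to overlap them if it has spare compute.
out1 = net1(in1)
out2 = net2(in2)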


Thanks for the reply.
But does it still run in parallel with the backward() function?

Hi,
I also have another question. With the same batch input, like this:
out1 = net1(in1)
out2 = net2(in1)
Does this run in parallel?
If not, how can I do it?

Hi,

It's the same for the backward method: it uses CUDA calls that are asynchronous, so it will run in parallel if your GPU allows it.

On a single GPU, whenever you ask for something to be done on the GPU, it is added to a stack of pending work and the call returns immediately. So if you launch a lot of operations, all you are actually doing is adding them to that stack. As soon as the GPU has free compute, it takes the next item from the stack and executes it. So if your GPU has spare compute, it will always run as many things at the same time as possible.
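
You can see this queuing behaviour by timing a launch before and after a synchronize; here is a small sketch with a dummy matmul standing in for a real net:

import time
import torch

a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')

t0 = time.time()
c = a @ b                      # only queues the kernel and returns
t1 = time.time()
torch.cuda.synchronize()       # actually wait for the GPU to finish
t2 = time.time()

print(f'launch took {t1 - t0:.6f}s, full execution took {t2 - t0:.6f}s')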

Thanks for replying.

About the second question, with the same input for the 2 submodels: I think that net2 has to wait for the input in1 that net1 is using (even if the GPU has enough compute available). Then it is not parallel. Is that true?

Why would it need to wait? You can read in1 twice without issue.
If you were doing

out1 = net1(in1)
out2 = net2(out1)

then it would wait for out1 to be computed, but in that case you cannot run in parallel anyway.

I am sorry about the naive question. For example:
if the GPU has enough compute available,
and if this code runs in parallel, then the time to run these 3 lines (

out1 = net1(in1)
out2 = net2(in2)
out3 = net3(in3)

) equals the time to run 1 line (out1 = net1(in1)).
Is that true?

In theory, if your GPU has enough compute, then running all 3 will take the same time as running 1, yes.
It is really unlikely that your GPU has enough spare compute to do that, though.
Also, if you start measuring time by looking at how long a Python line took to run, keep in mind that the API is asynchronous: the fact that the line has executed does not mean the code actually ran on the GPU. If you want to wait for all operations on the GPU to finish, use torch.cuda.synchronize().
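
As a side note (this is an alternative not discussed above), you can also time GPU work with CUDA events, which are recorded in the same queue as the kernels; a minimal sketch:

import torch
import torch.nn as nn

net = nn.Linear(1024, 1024).cuda()
x = torch.randn(512, 1024, device='cuda')

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()            # recorded in the GPU queue, not on the CPU clock
out = net(x)
end.record()

torch.cuda.synchronize()  # wait until both events have actually happened
print(start.elapsed_time(end), 'ms on the GPU')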


I am using a GeForce GTX 1080 Ti. I test with small models (about 40,000 parameters), so I think my GPU has enough compute. I also use the @profile decorator to measure time. The result:
Timer unit: 1e-06 s

Line #      Hits    Time         Per Hit   % Time          Line Contents
   204      4689    5770628.0   1230.7     13.0            output_2 = net_1(data);
   205      4689    2219502.0    473.3      5.0            output_1 = net(data);
   206      4689    2166653.0    462.1      4.9            output_3 = net_2(data);
   207      4689    2156971.0    460.0      4.9            output_4 = net_3(data);

Running with 1 command line:

Line #      Hits    Time         Per Hit   % Time          Line Contents 
204         4689    5422347.0   1156.4     18.9          output_2 = net_1(data);

I think the total time is not the same.
This means it doesn't run in parallel.

Have you seen my comment in the previous post about being careful when you start timing your Python code?

Thanks for replying quickly.
I read it. But now I want to run in parallel in Python to reduce the running time, and I can't find a way to do it.

The timings you just did are not correct: you just measured how long it takes to add the task to the stack of jobs for the GPU, which is going to be more or less the same for all your nets.
So the conclusion "this doesn't run in parallel" is wrong: you don't actually measure the execution time. To measure the proper runtime, you need to add torch.cuda.synchronize() to force the Python code to wait for the GPU to finish executing everything.


Thanks!!
I added this line. My code:

import torch
from torch.autograd import Variable
from Net import Net   # assuming the Net class is defined in Net.py

net = Net();
torch.cuda.synchronize()

@profile
def train(epoch):
    # trainloader and optimizer are defined elsewhere in my script
    net.train();
    for batch_idx, (data1, label) in enumerate(trainloader, 0):
        data  = Variable(data1).cuda();
        label = Variable(label).cuda();
        optimizer.zero_grad();
        output_2 = net(data);
        output_1 = net(data);
        output_3 = net(data);

But the result is the same as without torch.cuda.synchronize().

Please check the doc here for synchronize.
In particular, if you want to check how long a given net takes to run on the GPU, you need to do:

import time
import torch

torch.cuda.synchronize()   # make sure nothing is still running on the GPU
start = time.time()
out = net(inp)             # inp is your input batch (note: `in` is a reserved word in Python)
torch.cuda.synchronize()   # wait for this forward pass to actually finish
end = time.time()
elapsed_time = end - start

This makes sure that nothing is still running on the GPU when you start, and then waits for your work to finish before measuring the end time.

out1 = net1(in1)
out2 = net2(out1)

If I understand correctly, you said that operations are added to a stack and then given to the GPU when computing resources become available. So when you say that the CUDA API is asynchronous by default, it would only run things in parallel if the inputs don't have dependencies in the graph that haven't already been satisfied?
Or in other words, in the code scenario above, even with 2 GPUs it would wait until out1 is computed before out2 gets computed. That makes sense.
So, basically, the advantage of using 2 GPUs in PyTorch is that it simply extends the pool of available computing resources but does not affect the order in which the operations are processed? For instance, this could be useful if one giant matrix multiplication were too large for one GPU, so the computing resources of the 2 GPUs could be pooled to compute the dot products in parallel?

All the discussion here was for 1 GPU.
If you have 2 GPUs, then it is more complex, as you will need to manually assign the different tasks to the different devices. This cannot be done automatically.
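
For example, a manual assignment could look roughly like this (net1, net2 and the inputs below are just placeholders):

import torch
import torch.nn as nn

# Put each submodel on its own device explicitly.
net1 = nn.Linear(128, 10).to('cuda:0')
net2 = nn.Linear(128, 10).to('cuda:1')

in1 = torch.randn(64, 128).to('cuda:0')
in2 = torch.randn(64, 128).to('cuda:1')

# Each forward runs on the device its net and input live on;
# the launches are asynchronous, so the two GPUs can work at the same time.
out1 = net1(in1)
out2 = net2(in2)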

I am sorry about the naive question.

# Version 1: three forward passes timed together
def train(epoch):
    net.train();
    elapsed_time = 0;
    for batch_idx, (data1, label) in enumerate(trainloader, 0):
        data  = Variable(data1).cuda();
        label = Variable(label).cuda();
        optimizer.zero_grad();
        torch.cuda.synchronize()
        start = time.time()
        output_2 = net(data);
        output_1 = net(data);
        output_3 = net(data);
        torch.cuda.synchronize()
        end = time.time()
        elapsed_time += end - start
    print(elapsed_time);

# Version 2: a single forward pass timed the same way
def train(epoch):
    net.train();
    elapsed_time = 0;
    for batch_idx, (data1, label) in enumerate(trainloader, 0):
        data  = Variable(data1).cuda();
        label = Variable(label).cuda();
        optimizer.zero_grad();
        torch.cuda.synchronize()
        start = time.time()
        output_2 = net(data);
        torch.cuda.synchronize()
        end = time.time()
        elapsed_time += end - start
    print(elapsed_time);

The elapsed_time is not the same in the 2 cases!!
Can you explain the reason to me?

Oh I see, I was somehow getting threads mixed up. What you say is also what I assumed, so I was briefly (and positively) surprised by the possibility that it would automatically extend the pool of computing resources (i.e., as if you had one GPU with 2x the computing resources) :stuck_out_tongue:

Then that means that your GPU is not able to run all 3 at the same time. You can check that by using nvidia-smi and looking at the GPU compute usage while running just one of these nets. If it is already at 100%, then I would not expect much speedup.
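
If you prefer to check that from Python instead of watching nvidia-smi by hand, something like this should work (a small sketch that just shells out to nvidia-smi, assuming it is on your PATH):

import subprocess

# Ask nvidia-smi for the current compute utilization of each GPU.
result = subprocess.run(
    ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader'],
    capture_output=True, text=True,
)
print(result.stdout)   # one line per GPU, e.g. "97 %"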