Extend convolution and parallel on single GPU

Thanks for helping.
I checked, and my model still only uses around 7% of the GPU. That is very low, and I still haven't found the cause.

Oh, sorry, I assumed you had a big enough model.
If your model really is that small, then the overhead of launching the work on the GPU might actually be what takes most of the time here. Is the model faster if you run it on the CPU?

Right now I am testing with a small model. Since the CUDA API is asynchronous by default, my small model already runs well (its operations run in parallel). Next I will test with bigger models. I need to reduce the training time on a single GPU by computing several models in parallel (at the same time) on one GPU.
I don't know how to do that.

I tested with bigger models. The situation is the same as above, even though I only use about 70% of the GPU. How can I train two models at the same time on a single GPU?

I think this is only true if you run them on different streams.


I don't know how to run on different streams on a single GPU. Can you give me some ideas on how to do it?
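As a rough sketch (assuming PyTorch; `model_a` and `model_b` are placeholder models, not from this thread), each model's forward pass can be launched on its own `torch.cuda.Stream`, which allows the kernels to overlap on one GPU when neither model saturates it alone:

```python
import torch
import torch.nn as nn

# Two independent placeholder models; one CUDA stream per model lets
# their kernels overlap on a single GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_a = nn.Linear(64, 32).to(device)
model_b = nn.Linear(64, 32).to(device)
x = torch.randn(128, 64, device=device)

if device == "cuda":
    stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(stream_a):
        out_a = model_a(x)
    with torch.cuda.stream(stream_b):
        out_b = model_b(x)
    torch.cuda.synchronize()  # wait for both streams before using results
else:
    # CPU fallback: no streams, just run the two models one after another
    out_a, out_b = model_a(x), model_b(x)

print(out_a.shape, out_b.shape)
```

Note this is only a sketch: in real training you also need to be careful that tensors created on the default stream are properly synchronized with the side streams (e.g. via `Tensor.record_stream` or stream `wait_stream` calls) before being freed or reused.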


Hi, I have another scenario.

Let’s say I have n sequences, each of which has m timesteps.
The calculation at a timestep will depend on the previous timestep.

Therefore, there are 2 ways to forward these sequences through a model.

  • Forward one sequence, then the next, and so on. (sequence -> timestep)
  • At each timestep t, process the whole batch of sequences. (timestep -> sequence)

With the first order, the GPU has to wait until the end of a sequence before another independent operation can be enqueued. With the second, the GPU can enqueue multiple independent operations (one per sequence) at each timestep.
Does that make the second approach better in terms of performance?
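To make the two traversal orders concrete, here is a small sketch (assuming PyTorch; the `nn.RNNCell` and the sizes are illustrative choices, not from this thread). Both orders compute the same final hidden states; they only differ in how the work is batched:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, m, hidden = 4, 10, 32            # n sequences, each with m timesteps
cell = nn.RNNCell(hidden, hidden)
seqs = torch.randn(n, m, hidden)    # (sequence, timestep, features)

# Order 1: sequence -> timestep (one sequence at a time, batch size 1)
outs = []
for i in range(n):
    h = torch.zeros(1, hidden)
    for t in range(m):
        h = cell(seqs[i, t].unsqueeze(0), h)
    outs.append(h)
h1 = torch.cat(outs)

# Order 2: timestep -> sequence (all n sequences as one batch per step)
h2 = torch.zeros(n, hidden)
for t in range(m):
    h2 = cell(seqs[:, t], h2)

# Same math, different schedule: results match up to float tolerance
print(torch.allclose(h1, h2, atol=1e-4))
```

The second order issues one batched kernel per timestep instead of many tiny batch-of-1 kernels, which is usually what lets the GPU stay busy.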


I am not sure; this depends a lot on the size of the different operations. The best way to find out is to run both and check which one is fastest.
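One caveat when timing the two variants: since CUDA launches are asynchronous, you need a synchronize before stopping the clock, or you only measure kernel launch time. A minimal timing helper (my own sketch, not from this thread; the `forward_*` names in the usage comment are placeholders):

```python
import time
import torch

def gpu_time(fn, warmup=3, iters=10):
    """Rough wall-clock timing of fn; synchronizes so asynchronous
    CUDA launches are actually counted. Not a rigorous benchmark."""
    for _ in range(warmup):       # warm up caches / CUDA context
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / iters

# usage (placeholder functions):
#   t1 = gpu_time(forward_per_sequence)
#   t2 = gpu_time(forward_per_timestep)
t = gpu_time(lambda: sum(range(1000)))
print(t > 0)
```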