Extend convolution and parallel on single GPU

Thanks for helping.
I checked, and my model still only uses around 7% of the GPU. That is very low, and I still haven't found the cause.

Oh, sorry, I assumed you had a big enough model.
If your model really is that small, then the overhead of launching the work on the GPU might actually be what takes most of the time here. Is the model faster if you run it on the CPU?

Right now I am testing with a small model. Since the CUDA API is asynchronous by default, my small model already runs well (its operations run in parallel). Next I will test with bigger models. I need to reduce the training time on a single GPU by computing several models in parallel (at the same time) on one GPU.
I don't know how to do that.

I tested with bigger models. The situation is the same as above, even though I only use about 70% of the GPU. How can I train two models at the same time on a single GPU?

I think this is only true if you run them on different streams.


I don't know how to run on different streams on a single GPU. Can you give me some ideas on how to do it?
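As a rough sketch (assuming PyTorch; `model_a` and `model_b` are placeholder models, not from this thread), each model's forward pass can be launched on its own `torch.cuda.Stream`, which allows the kernels to overlap on one GPU when neither model saturates it alone:

```python
import torch
import torch.nn as nn

# Two independent placeholder models; one CUDA stream per model lets
# their kernels overlap on a single GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_a = nn.Linear(64, 32).to(device)
model_b = nn.Linear(64, 32).to(device)
x = torch.randn(128, 64, device=device)

if device == "cuda":
    stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(stream_a):
        out_a = model_a(x)
    with torch.cuda.stream(stream_b):
        out_b = model_b(x)
    torch.cuda.synchronize()  # wait for both streams before using results
else:
    # CPU fallback: no streams, just run the two models one after another
    out_a, out_b = model_a(x), model_b(x)

print(out_a.shape, out_b.shape)
```

Note this is only a sketch: in real training you also need to be careful that tensors created on the default stream are properly synchronized with the side streams (e.g. via `Tensor.record_stream` or stream `wait_stream` calls) before being freed or reused.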


Hi, I have another scenario.

Let’s say I have n sequences, each of which has m timesteps.
The calculation at a timestep will depend on the previous timestep.

Therefore, there are 2 ways to forward these sequences through a model.

  • Forward one sequence, then the next, and so on. (sequence -> timestep)
  • At each timestep t, process the whole batch of sequences. (timestep -> sequence)

With the first order, the GPU has to wait until the end of a sequence before another independent operation can be enqueued. With the second, the GPU can enqueue multiple independent operations (one per sequence) at each timestep.
Does that make the second approach better in terms of performance?
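To make the two traversal orders concrete, here is a small sketch (assuming PyTorch; the `nn.RNNCell` and the sizes are illustrative choices, not from this thread). Both orders compute the same final hidden states; they only differ in how the work is batched:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, m, hidden = 4, 10, 32            # n sequences, each with m timesteps
cell = nn.RNNCell(hidden, hidden)
seqs = torch.randn(n, m, hidden)    # (sequence, timestep, features)

# Order 1: sequence -> timestep (one sequence at a time, batch size 1)
outs = []
for i in range(n):
    h = torch.zeros(1, hidden)
    for t in range(m):
        h = cell(seqs[i, t].unsqueeze(0), h)
    outs.append(h)
h1 = torch.cat(outs)

# Order 2: timestep -> sequence (all n sequences as one batch per step)
h2 = torch.zeros(n, hidden)
for t in range(m):
    h2 = cell(seqs[:, t], h2)

# Same math, different schedule: results match up to float tolerance
print(torch.allclose(h1, h2, atol=1e-4))
```

The second order issues one batched kernel per timestep instead of many tiny batch-of-1 kernels, which is usually what lets the GPU stay busy.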


I am not sure; this depends a lot on the size of the different operations. The best way to find out is to run both and check which one is fastest.
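One caveat when timing the two variants: since CUDA launches are asynchronous, you need a synchronize before stopping the clock, or you only measure kernel launch time. A minimal timing helper (my own sketch, not from this thread; the `forward_*` names in the usage comment are placeholders):

```python
import time
import torch

def gpu_time(fn, warmup=3, iters=10):
    """Rough wall-clock timing of fn; synchronizes so asynchronous
    CUDA launches are actually counted. Not a rigorous benchmark."""
    for _ in range(warmup):       # warm up caches / CUDA context
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / iters

# usage (placeholder functions):
#   t1 = gpu_time(forward_per_sequence)
#   t2 = gpu_time(forward_per_timestep)
t = gpu_time(lambda: sum(range(1000)))
print(t > 0)
```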