Parallelization on a single GPU


I am a newbie. I am trying to train two different models in parallel on a single GPU.
I use multiple threads and compare the result against training the same models sequentially.
With multithreading, training takes longer than training sequentially (same dataset).
I don't understand why the multithreaded version is slower.
I expected it to be faster.

Thank you,


If the GPU is already fully used by a single model, a second model trained at the same time will just have to wait for the first one to finish, and things will slow down because the GPU has to switch from one problem to the other all the time.

Thanks for your reply!
The GPU memory is not full. My implementation uses the same amount of memory (in GB) as the sequential models.

My point above is that even if there is some memory left, the processor is already being used fully, so you cannot do more computation: you can’t run more things at the same time.

Thank you so much!
But I still have a question.
Is there any method (or function) to see how many of the GPU's processors are in use while it is running?
I will try smaller models and report back later.

Not sure if such a method exists. You can check online how many CUDA cores your GPU has.
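For what it's worth, PyTorch does expose some static information about the device. The sketch below reads the device name and the number of streaming multiprocessors (SMs) through `torch.cuda.get_device_properties`. Note that this is the total SM count of the card, not how many are busy at a given moment, and the helper name `describe_gpu` is just made up for illustration. It returns `None` when PyTorch or a CUDA device is not available.

```python
def describe_gpu(index=0):
    """Return (name, sm_count) for a CUDA device, or None if unavailable.

    describe_gpu is a hypothetical helper name used for this sketch only.
    """
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    if not torch.cuda.is_available():
        return None  # no CUDA device visible to PyTorch
    props = torch.cuda.get_device_properties(index)
    # multi_processor_count is the total number of SMs on the device,
    # not the number currently executing work.
    return props.name, props.multi_processor_count


info = describe_gpu()
print(info if info is not None else "no CUDA device visible")
```

To see how busy the GPU actually is while training, watching the utilization column of `nvidia-smi` in a separate terminal is probably the simplest option.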

I checked with a smaller model, and the processor is not fully used.

The parallel version still takes longer than the sequential one.
Below is my function, which uses threads on a single GPU.

import threading

import torch
# get_a_var is not defined in the post; it is assumed to be PyTorch's helper.
from torch.nn.parallel.parallel_apply import get_a_var


def parallel_apply(modules, inputs, kwargs_tup=None, devices=None):
    # One input tuple (and optionally one kwargs dict / device) per module.
    assert len(modules) == len(inputs)
    if kwargs_tup is not None:
        assert len(modules) == len(kwargs_tup)
    else:
        kwargs_tup = ({},) * len(modules)
    if devices is not None:
        assert len(modules) == len(devices)
    else:
        devices = [None] * len(modules)

    lock = threading.Lock()
    results = {}

    def _worker(i, module, input, kwargs, lock, device=None):
        if device is None:
            device = get_a_var(input).get_device()
        try:
            with torch.cuda.device(device):
                output = module.cuda(device)(*input, **kwargs)
            with lock:
                results[i] = output
        except Exception as e:
            # Store the exception so the main thread can re-raise it.
            with lock:
                results[i] = e

    if len(modules) > 1:
        threads = [threading.Thread(target=_worker,
                                    args=(i, module, input, kwargs, lock, device))
                   for i, (module, input, kwargs, device) in
                   enumerate(zip(modules, inputs, kwargs_tup, devices))]

        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    else:
        _worker(0, modules[0], inputs[0], kwargs_tup[0], lock, devices[0])

    # Collect outputs in input order, re-raising any worker exception.
    outputs = []
    for i in range(len(inputs)):
        output = results[i]
        if isinstance(output, Exception):
            raise output
        outputs.append(output)
    return outputs
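One thing worth checking when comparing the threaded and sequential versions is how the time is measured. CUDA kernels are launched asynchronously, so a timer that stops before the GPU has finished can give misleading numbers. Below is a minimal timing sketch under that assumption. The helper name `timed` and its `sync` parameter are made up for illustration; on a GPU you would pass `torch.cuda.synchronize` as `sync`. The demo call uses a CPU-only stand-in so that it runs anywhere.

```python
import time


def timed(fn, sync=None, repeats=3):
    """Return the best wall-clock time of fn() over `repeats` runs.

    sync, if given, is called before starting and before stopping the
    clock (on a GPU this would be torch.cuda.synchronize).
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()  # make sure earlier queued work is finished
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        best = min(best, time.perf_counter() - start)
    return best


# CPU-only stand-in workload, just to show the call pattern.
elapsed = timed(lambda: sum(range(100_000)))
print(f"{elapsed:.6f} s")
```

Timing both variants this way, for example `timed(lambda: parallel_apply(modules, inputs), sync=torch.cuda.synchronize)` against a plain loop over the modules, should make the comparison fairer.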