Parallelization on a single GPU

Hi,

I am a newbie. I am trying to train two different models on a single GPU in parallel, using multiple threads, and I am comparing this against training the same models sequentially on the same dataset.
Training with multiple threads takes longer than training the models sequentially.
I don't understand why the multithreaded version is slower; I expected it to be faster.

Thank you,

Hi,

If the GPU is already fully used by a single model, trying to train a second model at the same time will mostly just wait for the first one to finish, and constantly switching from one problem to the other will actually slow things down.

Thanks for your reply!
The GPU memory is not full; my implementation uses about the same number of GB as the sequential models.
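For what it's worth, this is roughly how the memory use can be checked from PyTorch (a sketch only; device index 0 is assumed):

import torch

# Memory PyTorch is using on the first CUDA device (index 0 assumed).
print("allocated:", torch.cuda.memory_allocated(0) / 1e9, "GB")
print("reserved: ", torch.cuda.memory_reserved(0) / 1e9, "GB")
print("total:    ", torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")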

My point above is that even if there is some memory left, the processor is already being used fully, so you cannot do more computation: you can’t run more things at the same time.
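As a rough illustration (just a sketch: work, x1 and x2 are made-up names and a CUDA device is assumed), running the same two matmul-heavy workloads back-to-back versus from two threads on the same GPU should take roughly the same total time:

import threading
import time

import torch

def work(x):
    # A batch of large matmuls to keep the GPU busy for a while.
    for _ in range(200):
        _ = x @ x
    torch.cuda.synchronize()

x1 = torch.randn(2048, 2048, device="cuda")
x2 = torch.randn(2048, 2048, device="cuda")

# Run the two workloads one after the other.
torch.cuda.synchronize()
start = time.time()
work(x1)
work(x2)
print("sequential:", time.time() - start)

# Run them from two threads: still one GPU, so the kernels mostly queue up
# behind each other and the total time stays about the same, or gets worse.
start = time.time()
threads = [threading.Thread(target=work, args=(x,)) for x in (x1, x2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("threaded:  ", time.time() - start)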


Thank you so much!
But I still have a question.
Is there a method (or function) to see how many of the GPU's processors are being used?
I will run smaller models and report back later.

Not sure such a method exists; you can check online how many CUDA cores your GPU has.
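That said, the static multiprocessor count is available from torch.cuda.get_device_properties, and nvidia-smi shows live utilization while the models train; a rough sketch (assuming device 0):

import torch

# Static properties of the first CUDA device (index 0 assumed).
props = torch.cuda.get_device_properties(0)
print(props.name, "has", props.multi_processor_count, "multiprocessors (SMs)")

# For live usage while training, run `nvidia-smi` (or `watch -n 1 nvidia-smi`)
# in another terminal and look at the GPU-Util column.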

I checked with a smaller model and the processor is not fully used.

The time for the parallel models is still longer than for the sequential models.
Below is my function that runs the modules in threads on a single GPU.

import threading

import torch
# get_a_var is the internal PyTorch helper that picks a tensor out of the
# (possibly nested) input so its device can be inspected.
from torch.nn.parallel.parallel_apply import get_a_var


def parallel_apply(modules, inputs, kwargs_tup=None, devices=None):
    assert len(modules) == len(inputs)
    if kwargs_tup is not None:
        assert len(modules) == len(kwargs_tup)
    else:
        kwargs_tup = ({},) * len(modules)
    if devices is not None:
        assert len(modules) == len(devices)
    else:
        devices = [None] * len(modules)

    lock = threading.Lock()
    results = {}

    def _worker(i, module, input, kwargs, device=None):
        if device is None:
            # Fall back to the device of one of the input tensors.
            device = get_a_var(input).get_device()
        try:
            with torch.cuda.device(device):
                output = module.cuda(device)(*input, **kwargs)
            with lock:
                results[i] = output
        except Exception as e:
            with lock:
                results[i] = e

    if len(modules) > 1:
        # One Python thread per module. All threads share the interpreter (GIL)
        # and, on a single GPU, submit kernels to the same device, so the work
        # is still largely serialized.
        threads = [threading.Thread(target=_worker,
                                    args=(i, module, input, kwargs, device))
                   for i, (module, input, kwargs, device) in
                   enumerate(zip(modules, inputs, kwargs_tup, devices))]

        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    else:
        _worker(0, modules[0], inputs[0], kwargs_tup[0], devices[0])

    # Collect results in input order, re-raising any exception from a worker.
    outputs = []
    for i in range(len(inputs)):
        output = results[i]
        if isinstance(output, Exception):
            raise output
        outputs.append(output)
    return outputs
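
For reference, a hypothetical way to call it (the two nn.Linear models, their sizes, and device index 0 are made up for illustration; each input is a tuple because _worker unpacks it with *input):

import torch
import torch.nn as nn

# Two small models and two inputs, all on GPU 0.
m1 = nn.Linear(128, 64).cuda()
m2 = nn.Linear(128, 10).cuda()
in1 = (torch.randn(32, 128, device="cuda"),)
in2 = (torch.randn(32, 128, device="cuda"),)

out1, out2 = parallel_apply([m1, m2], [in1, in2], devices=[0, 0])
print(out1.shape, out2.shape)  # torch.Size([32, 64]) torch.Size([32, 10])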