Extending this short thread: I am running a grid search over a small LSTM on an 8-GPU instance, and GPU utilization is low. Which is better for making more efficient use of the resources (given constant model complexity):
- Send multiple jobs to all GPUs, or
- Send multiple jobs to specific (non-overlapping) GPUs?
What are the pros/cons of each option (if there are any)?
In my experience GPUs don’t multitask very efficiently. It’s usually better to run jobs on non-overlapping GPUs (e.g. by setting CUDA_VISIBLE_DEVICES), since CUDA kernels are often written to take advantage of the entire GPU.
You may still get some speed-up from, say, running 2 jobs per GPU, but probably not 2x. Of course, this depends on your problem.
If you try multiple jobs per GPU, please report back what you see. I’m curious to know.
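To make the non-overlapping setup concrete, here is a minimal sketch of how one might pin one job per GPU via CUDA_VISIBLE_DEVICES. The script name `train.py` is a placeholder for your own training script, not anything from this thread:

```python
# Sketch: one hyperparameter-search job per GPU, each pinned to a single
# device via CUDA_VISIBLE_DEVICES so no two jobs share a GPU.
# "train.py" is a hypothetical stand-in for your own training script.
def make_commands(n_gpus, script="train.py"):
    """Build one shell command per GPU; each process sees exactly one device."""
    return [f"CUDA_VISIBLE_DEVICES={g} python {script}" for g in range(n_gpus)]

cmds = make_commands(8)
# To actually launch them in parallel, you could do something like:
#   procs = [subprocess.Popen(c, shell=True) for c in cmds]
#   for p in procs: p.wait()
print(cmds[0])  # CUDA_VISIBLE_DEVICES=0 python train.py
```

Because each process only sees its own device, PyTorch inside each job just uses device 0 and the jobs never contend for the same GPU.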
For others who might be interested: I tested running multiple jobs on all 8 GPUs in a single instance (a P2 instance on AWS with K80 GPUs). With a single job running, nvidia-smi shows about 2% usage per GPU. Starting a second model spikes the usage of all GPUs to 100% and the jobs choke: they literally stop running. So nope.
CUDA_VISIBLE_DEVICES works as expected, but I have a couple of questions:
I notice that torch.cuda.current_device() always returns 0, regardless of which GPUs are made visible in the environment. Not sure whether this is a problem, but I would have assumed that if only GPUs 4,5,6,7 are visible, then current_device() would be one of those?
With the exact same model settings as with 8 GPUs, a job on only 2 GPUs uses around 5-40% of each GPU, as expected. But an epoch takes around 7 minutes with 8 GPUs and only 2 minutes with 2. What gives?
If CUDA_VISIBLE_DEVICES=4,5,6,7, then gpu-4 becomes device 0, gpu-5 becomes device 1, and so on: the visible devices are renumbered from 0 inside the process, which is why current_device() reports 0.
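That renumbering can be sketched in pure Python (no GPU required); `visible_to_physical` is a hypothetical helper modelling the mapping, not a torch API:

```python
import os

# Model of the remapping CUDA applies: the devices listed in
# CUDA_VISIBLE_DEVICES are renumbered 0..N-1 in list order, so inside the
# process, logical device 0 (what current_device() returns) is the first
# entry in the list.
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

def visible_to_physical(logical_id):
    """Map an in-process (logical) device id back to the physical GPU id."""
    visible = os.environ["CUDA_VISIBLE_DEVICES"]
    physical_ids = [int(x) for x in visible.split(",")]
    return physical_ids[logical_id]

print(visible_to_physical(0))  # 4: the process's device 0 is physical GPU 4
```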