Setting num_workers>0 increases GPU memory on only 1 GPU


I am training a model on 2 GPUs. In the __getitem__ of my custom dataset, pre-processing is done on the GPU to speed things up.

When I set num_workers to 0, the memory usage of my 2 GPUs is the same: 22k/32k.

However, when I set num_workers=2 with the following code:

trainloader = DataLoader(train, batch_size=4, shuffle=True, num_workers=2, persistent_workers=True)

the memory usage of my first GPU becomes 31k/32k, while the memory usage of my second GPU remains at 22k/32k.

Do you have any idea why?

EDIT: When I print the device used in the dataset, only GPU 0 is used. GPU 1 is never used.

array = np.load(path)
tensor = torch.from_numpy(array).to(self.device)
print(tensor.device)  # shows only cuda:0

where self.device is initialized in __init__ as:

self.device = torch.device('cuda')

You are using the default CUDA device in each worker, so the memory increase on the default GPU is expected. Besides creating additional CUDA contexts, the actual data loading will also use memory.
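A quick way to see why: torch.device('cuda') carries no index, so every tensor moved to it lands on the current CUDA device, which is cuda:0 unless you change it. A minimal sketch (no GPU needed to see the device resolution):

```python
import torch

# 'cuda' without an index means "the current CUDA device".
dev = torch.device('cuda')
print(dev.index)  # None -> resolved to torch.cuda.current_device() at use time

# An explicit index pins tensors to a specific GPU instead.
dev1 = torch.device('cuda:1')
print(dev1.index)  # 1
```

Since every worker process starts with current device 0, all of them allocate on the first GPU.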

Thank you for your answer.

Sorry, I'm new to parallelism, so I'm not sure I understand. What can I change in my implementation to solve this issue?

Thanks in advance!

If you want to use one worker per GPU, you could use the worker ID and call torch.cuda.set_device with it to load the sample onto the corresponding device.
The common approach, however, is to let the CPU load the samples and to move the batch to the GPU inside the training loop.
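A sketch of the first option, assuming a round-robin assignment of workers to GPUs. The helper device_index_for_worker and the worker_init_fn hook are illustrative (the hook itself is a standard DataLoader argument); the DataLoader call mirrors the one from the question:

```python
import torch
from torch.utils.data import DataLoader

def device_index_for_worker(worker_id: int, num_gpus: int) -> int:
    """Map each worker to a GPU round-robin (hypothetical helper)."""
    return worker_id % num_gpus

def worker_init_fn(worker_id: int) -> None:
    # Runs once in each worker process before any __getitem__ call,
    # so tensors created with device='cuda' land on this worker's GPU.
    if torch.cuda.is_available():
        idx = device_index_for_worker(worker_id, torch.cuda.device_count())
        torch.cuda.set_device(idx)

# trainloader = DataLoader(train, batch_size=4, shuffle=True,
#                          num_workers=2, persistent_workers=True,
#                          worker_init_fn=worker_init_fn)
```

For the common approach, keep __getitem__ on the CPU, pass pin_memory=True to the DataLoader, and move each batch inside the training loop with batch.to('cuda', non_blocking=True); this avoids creating a CUDA context in every worker process.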