DataLoader for strings in multi-GPU training


I have data where each sample consists of tensors together with strings, i.e.:
[image c x w x h tensor, image c x w x h tensor, groundtruth number c x w x h tensor, strings],

where the strings are not class labels like “cat”, “dog” etc. that could be encoded as 0, 1, …, but just some information about where the tensor data came from (i.e. “/data/asdf/qwer.png”).

I’ve been successfully using PyTorch’s Dataset and DataLoader to load the data (excluding the strings part) onto the 4 GPUs I have (i.e. with a batch size of 32, each GPU gets the 2 image tensors and the groundtruth number tensor with a batch dimension of 8).
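The tensor-only pipeline described above can be sketched roughly like this (the dataset class, field names, and sizes are hypothetical, not from my actual code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Hypothetical dataset: two image tensors plus a groundtruth tensor per sample."""
    def __init__(self, n=64, c=3, w=8, h=8):
        self.n, self.c, self.w, self.h = n, c, w, h

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        img_a = torch.randn(self.c, self.w, self.h)  # first image tensor
        img_b = torch.randn(self.c, self.w, self.h)  # second image tensor
        gt = torch.randn(self.c, self.w, self.h)     # groundtruth tensor
        return img_a, img_b, gt

loader = DataLoader(PairDataset(), batch_size=32, shuffle=True)
img_a, img_b, gt = next(iter(loader))
print(img_a.shape)  # torch.Size([32, 3, 8, 8])
# Wrapped in nn.DataParallel over 4 GPUs, each replica receives a slice of
# size 8 along the batch dimension of each of these tensors.
```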

Then I wanted to feed in the strings as well, because I need to do some processing based on them in the forward method of my network, and realized that strings cannot be turned into tensors. The DataLoader is able to group 32 randomly sampled strings into a batch successfully, but each of the 4 GPUs receives the exact same 32 strings, instead of 8 strings going to each GPU.
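The behavior can be seen in the default collate step: `default_collate` stacks the tensor fields into batch tensors but passes the string field through as a plain Python list, and (as far as I understand) `nn.DataParallel`'s scatter step only splits tensors along the batch dimension, so the list of strings gets replicated to every replica. A minimal sketch (the sample contents and paths here are made up for illustration):

```python
import torch
from torch.utils.data.dataloader import default_collate

# Four hypothetical samples: (image tensor, groundtruth tensor, source path string).
samples = [
    (torch.randn(3, 8, 8), torch.randn(3, 8, 8), f"/data/img_{i}.png")
    for i in range(4)
]

imgs, gts, paths = default_collate(samples)
print(imgs.shape)  # torch.Size([4, 3, 8, 8]) -- tensor fields are stacked
print(paths)       # a plain list of 4 path strings, not a tensor
# nn.DataParallel scatters only tensor arguments along dim 0; a non-tensor
# list like `paths` is handed whole to every GPU replica, which matches
# the "same 32 strings on all 4 GPUs" behavior described above.
```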

Is there any way this can be done?