Oh, that makes sense. I totally missed that c_uint refers to an unsigned 32-bit integer.
Good to hear you've narrowed it down.
@ptrblck does this approach work for large datasets like ImageNet (1,200,000 images)? It seems that you are reserving a memory block for an array of size (nb_samples*c*h*w) at first:
shared_array_base = mp.Array(ctypes.c_float, nb_samples*c*h*w)
The shared memory approach stores the complete dataset in RAM, so depending on your system memory you might not be able to use it this way.
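For reference, a minimal sketch of that shared-memory caching pattern (nb_samples, c, h, w are placeholder sizes; a real dataset would fill the cache lazily in __getitem__ or up front before training):

import ctypes

import numpy as np
import torch
import torch.multiprocessing as mp
from torch.utils.data import Dataset

class SharedCacheDataset(Dataset):
    def __init__(self, nb_samples, c, h, w):
        # One flat float buffer for the whole dataset; DataLoader workers
        # forked from this process all see the same underlying memory.
        shared_array_base = mp.Array(ctypes.c_float, nb_samples * c * h * w)
        shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
        self.shared_array = shared_array.reshape(nb_samples, c, h, w)
        self.nb_samples = nb_samples

    def __len__(self):
        return self.nb_samples

    def __getitem__(self, idx):
        # A real dataset would decode the image into the cache on first
        # access (or fill it up front) before returning it.
        return torch.from_numpy(self.shared_array[idx])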
Hi,
I am facing a kind of similar issue: I use a custom vision dataset and load the images into RAM in my Dataset class. Here is a snippet:
import json

import torch
from PIL import Image

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, path, preload_images=False):
        self.path = path
        self.preload_images = preload_images
        self.data = json.load(open(path, 'r'))
        self.keys = list(self.data.keys())
        if self.preload_images:
            # Decode every image up front and keep it in RAM.
            self.images = []
            for k in self.keys:
                self.images.append(Image.open(k).convert('RGB'))

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        if self.preload_images:
            image = self.images[idx]
        else:
            # Lazy path: decode from disk on every access.
            image = Image.open(self.keys[idx]).convert('RGB')
        return image
Everything works well, but when I use it (with preload_images=True) with DDP and torchrun --nproc_per_node=$NGPU train.py, each process creates its own Dataset and thus increases the RAM usage by a factor of $NGPU. Even though I have a large amount of memory, this does not scale well…
Do you have an idea of how to load only one copy?
You could lazily load the images in the __getitem__ method, which would then only create copies of the data paths and should reduce the memory usage significantly.
I know, but it would also significantly reduce the speed. Would a trick like this work:

import torch.distributed as dist

if dist.get_rank() == 0:
    dataset = MyDataset(path, True)
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)

so that only one process loads the data in RAM?
The DistributedSampler would split the indices for each rank, but the Dataset.__init__ method would still preload the entire dataset. You could try to pass the corresponding indices into the Dataset initialization and make sure only the needed samples are preloaded. Using the sampler would be too late, so you would need to create the indices beforehand. Maybe you could reuse the DistributedSampler logic to create the indices manually and then pass them to __init__.
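A rough sketch of that idea, assuming MyDataset is changed to accept a hypothetical indices argument that restricts self.keys before preloading (the split below mirrors DistributedSampler's interleaved assignment, without shuffling):

import math

import torch.distributed as dist

def build_rank_dataset(path, num_samples):
    # Compute this rank's indices up front, before the Dataset is built,
    # so __init__ preloads only the samples this process will actually use.
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    num_per_rank = math.ceil(num_samples / world_size)
    indices = list(range(num_samples))
    # Pad so every rank gets the same number of samples, as the sampler does.
    indices += indices[: num_per_rank * world_size - num_samples]
    rank_indices = indices[rank::world_size]

    # `indices` is an assumed extra argument, not part of the original class.
    return MyDataset(path, preload_images=True, indices=rank_indices)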
Why not just use one Python process/interpreter and have it utilize multiple GPUs? Then you can have the Dataset in RAM and all is good. If you are trying to run separate Python processes and have them all share, then you have to use a library that supports IPC/shared memory.
Because torchrun creates as many subprocesses as we have GPUs.
I guess I would (myself) do it monolithically and launch my GPU-distributed work from a single Python process. However, you can use shared-memory libraries in Python to do what you like; it does seem a bit complex, but if the I/O cost of disk reads is high, it could be worth it.
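As a sketch of the single-process route, nn.DataParallel keeps one Python process (and therefore one in-RAM copy of the Dataset) while scattering each batch across GPUs; MyModel and train_loader below are placeholders for your own model and loader:

import torch
import torch.nn as nn

# DataParallel replicates the model onto each visible GPU per batch and
# splits the input along the batch dimension, all from a single process.
model = nn.DataParallel(MyModel().cuda())

for images, targets in train_loader:
    outputs = model(images.cuda(non_blocking=True))
    # ... loss, backward, optimizer step as usual ...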