Hi, and apologies if my question is already answered somewhere; I couldn't really find anything, because questions going in the opposite direction always show up. Also apologies because I cannot really provide my code (as it's probably proprietary).
My training loop looks something like this:

```python
from copy import deepcopy
from torch.utils.data import DataLoader

dataset = Dataset_Class(args)
cache_ds = deepcopy(dataset)
dataloader = DataLoader(dataset, **otherargs)

for e in range(epochs):
    for step, batch in enumerate(dataloader):
        cache_ds.cache_batch(batch)
        optimizer.zero_grad()
        # amp block
        loss = model(batch)
        loss.backward()
        optimizer.step()
    # the same thing is done for validation
    if e == 0:
        dataloader = DataLoader(cache_ds, **otherargs)
```
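Since I can't share the real Dataset_Class, here is a minimal sketch of what I mean by the caching part (the class name, and the assumption that a batch is a single tensor with the batch dimension first, are just for illustration; the real class handles more than that, and the cached entries actually end up as numpy arrays, see the end of this post):

```python
from torch.utils.data import Dataset

class CachingDataset(Dataset):
    """Illustrative stand-in for the caching side of my Dataset_Class."""

    def __init__(self):
        self.samples = []

    def cache_batch(self, batch):
        # split the collated batch back into individual samples along dim 0
        # and keep them, detached from any autograd graph
        for sample in batch.detach().cpu():
            self.samples.append(sample.clone())

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```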
Now, naively, I would think that this caches all the data within the dataset cache_ds, which lives in the main process's RAM, during the first epoch (which it does). I would also assume that caching the data live in this manner (and later using the cache) is faster than going through disk (it is) or through the shared-dict suggestion at pytorch_misc/shared_dict.py at master · ptrblck/pytorch_misc · GitHub, since I have no IPC overhead from the dataloader worker processes accessing shared memory. After one large copying op at the end of epoch 0, every worker should have its own fully cached dataset in its own RAM, and the workers should also be quite fast.
That last set of assumptions is apparently where I am wrong, but I don't understand why. According to free -mh running in a separate shell on the machine I work on, the RAM usage does not increase after the first epoch, despite setting the number of workers to 2 or 4 (the latter should run me out of memory, for the record).
Why does this not happen? Shouldn't the dataset be copied to every worker unless I explicitly design it around a shared dict behind a proxy object, as in the GitHub link?
Is there maybe some hardcoded limit up to which the dataloader will copy your dataset, past which it decides not to? The in-memory size of the dataset is around 500-odd GB; available memory is 2 TB.
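If it helps, the only finer-grained check I can think of would be to log each process's resident memory from inside the dataset's __getitem__, for example with something like this (a psutil-based sketch, not something that is in my actual code):

```python
import os
import psutil

def log_rss(tag=""):
    # Resident set size of the calling process; called from __getitem__ this
    # would report what each dataloader worker actually holds in RAM.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"[pid {os.getpid()}] {tag} RSS: {rss_gb:.1f} GB")
```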
I don't know if any of this is relevant:

- I checked whether I was even running more than one worker process, and I am certain that I am.
- Because I think I read somewhere that tensors have a sort of shared-memory implementation by default, I saved all tensors in the cache as numpy arrays instead of tensors (via tensor.clone().detach().numpy() and then splitting the numpy array along the batch dimension, roughly as sketched below).
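Concretely, that conversion looks roughly like this (simplified; it assumes the batch is a single CPU tensor, and the real code keeps more than just the inputs):

```python
import numpy as np

def batch_to_numpy_samples(batch_tensor):
    # tensors are already on the CPU at this point, so .numpy() is fine
    arr = batch_tensor.clone().detach().numpy()
    # split along the batch dimension into one array per sample
    return [np.squeeze(a, axis=0) for a in np.split(arr, arr.shape[0], axis=0)]
```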