Hey, I want to log large amounts of data (100 GB and more if possible) like activations, gradients, all relevant state_dicts, … per batch for research purposes. The data volume makes most solutions I know of impractical.
The logic behind DataLoaders with multiprocessing, pinned memory, … is quite complex but highly desirable for performant data transfer between the CPU (host) and GPU (device), and thus for better utilization of the GPU. However, as far as I understand it, they are optimized for pre-loading data that is about to be processed, not for transferring data back that has already been processed in order to save it.
I have seen approaches like checkpointing, which doesn't provide the needed flexibility and recalculates e.g. all activations. TensorDict (used mainly in torchrl) supports some async operations and multiple threads, but it is unclear to me how that would be used efficiently and whether it is only performant from host to device or in both directions. Another option might be torchrl.data.replay_buffers, but I would prefer a clean PyTorch/TensorDict-only solution if possible.
What would be a performant way to stream/move data back, something like a DataSaver (an inverse DataLoader) that automatically moves processed data (preferably in chunks, analogous to the prefetch_factor argument of DataLoader) back to the host, where it can be stacked/saved into e.g. memory-mapped tensors?
I guess I would not be the only one asking for such a feature, so I suspect I missed something obvious, but I can't find it …
Happy for any ideas or hints to existing features.
By default you would load and process each sample on the host in the Dataset.__getitem__. The DataLoader will use multiple workers to preload these samples and to create batches.
Inside the DataLoader loop you would then move the processed batch from the host to the GPU, so there is no need to move this data back unless I misunderstand your use case and question.
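A minimal sketch of this usual direction (the dataset, shapes, and the training step are just placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset living on the host
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for inputs, targets in loader:
    # host -> device copy; non_blocking only helps because pin_memory=True
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```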
I want to do the opposite of what DataLoader does. Let me make a sketch of the workflow I had in mind behind this question:
The DataLoader loads data onto the GPU (async, multiple workers) → this already exists.
The data is processed by a model (forward + backward pass), which produces intermediate results (activations, outputs, gradients) that are normally processed further and then discarded (e.g. after the backward pass or optimizer step). Thus there is normally no need to transfer any of these results back, as you mentioned.
Since I want to keep some of these intermediate results (of which I obtain copies e.g. via hooks) and save them permanently, I now want to move them back to the CPU and then write them to disk in bigger chunks, similar to how the loss is often moved to the CPU with loss.detach().cpu() for logging or monitoring purposes in each epoch.
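For context, the capturing itself looks roughly like this right now (module and shapes are just illustrative, and it assumes a GPU is available):

```python
import torch
import torch.nn as nn

captured = []

def save_activation(module, inputs, output):
    # blocking device -> host copy at every hook call
    captured.append(output.detach().cpu())

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
handle = model[0].register_forward_hook(save_activation)

out = model(torch.randn(8, 32, device="cuda"))
handle.remove()
```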
The problem is that the volume and frequency of the data produced in the hook calls (which temporarily save a copy on the GPU) is way too big for blocking result[i].detach().cpu() calls at every hook, which would make the data transfer very inefficient (at least that's what I think happens). Therefore, I was wondering if there is a nice way to stream large amounts of result data back to the CPU and then save it to disk concurrently.
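What I vaguely imagine as the core mechanism is a pre-allocated pinned host buffer plus a separate CUDA stream, so the device-to-host copy can overlap with compute (buffer shape and names are just assumptions on my side):

```python
import torch

copy_stream = torch.cuda.Stream()
# pre-allocated pinned host buffer matching the activation shape
pinned_buf = torch.empty(8, 64, pin_memory=True)

def save_activation_async(module, inputs, output):
    act = output.detach()
    # let the copy stream wait until the producing kernel has finished
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        # device -> host copy that can overlap with ongoing compute
        pinned_buf.copy_(act, non_blocking=True)
    # keep the allocator from reusing act's memory while the copy is in flight
    act.record_stream(copy_stream)

# before the CPU reads pinned_buf (e.g. to write it to disk):
# copy_stream.synchronize()
```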
The main idea I had for solving this is:
save the "result" data into GPU buffers (e.g. using multiple buffers per forward/backward call: one locked for writing the current batch/epoch, plus multiple already written ones that are ready for a non-blocking transfer)
move bigger chunks asynchronously to CPU memory / RAM
use multiple workers on the CPU to save the transferred buffers to disk (to be able to handle data that exceeds the RAM size by orders of magnitude).
However, I am not sure how to implement this efficiently and wondered whether there are already existing modules for parts of the solution.
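A rough end-to-end sketch of how I imagine the pieces above fitting together (the class, its methods, and writing one torch.save file per chunk are all just assumptions for illustration, not an existing PyTorch module; it blocks once per full chunk instead of once per hook call and lets a background thread do the disk I/O):

```python
import queue
import threading
import torch

class ChunkedActivationSaver:
    """Hooks append GPU tensors into a device staging buffer; full chunks are
    copied to a pinned host buffer on a side stream and written to disk by a
    background thread (hypothetical helper, not an existing PyTorch module)."""

    def __init__(self, out_dir, chunk_size, feature_shape, dtype=torch.float32):
        self.out_dir = out_dir
        self.chunk_size = chunk_size
        self.copy_stream = torch.cuda.Stream()
        # device-side staging buffer the hooks write into
        self.dev_buf = torch.empty(chunk_size, *feature_shape, dtype=dtype, device="cuda")
        # pinned host buffer that receives the async device -> host copy
        self.host_buf = torch.empty(chunk_size, *feature_shape, dtype=dtype, pin_memory=True)
        self.fill = 0
        self.chunk_id = 0
        self.write_queue = queue.Queue(maxsize=4)  # backpressure if the disk is too slow
        self.writer = threading.Thread(target=self._writer_loop, daemon=True)
        self.writer.start()

    def append(self, tensor):
        """Called from a hook with a detached GPU tensor of shape feature_shape."""
        self.dev_buf[self.fill].copy_(tensor)  # stays on the GPU, cheap
        self.fill += 1
        if self.fill == self.chunk_size:
            self._flush()

    def _flush(self):
        # copy the whole chunk device -> host on the side stream
        self.copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.copy_stream):
            self.host_buf.copy_(self.dev_buf, non_blocking=True)
        # block once per chunk (not once per hook call) until the host copy is done,
        # then hand an owned CPU copy to the writer thread
        self.copy_stream.synchronize()
        self.write_queue.put((self.chunk_id, self.host_buf.clone()))
        self.chunk_id += 1
        self.fill = 0

    def _writer_loop(self):
        while True:
            chunk_id, chunk = self.write_queue.get()
            if chunk is None:
                break
            # one file per chunk; could also write into a memory-mapped tensor instead
            torch.save(chunk, f"{self.out_dir}/chunk_{chunk_id:06d}.pt")

    def close(self):
        if self.fill:
            self._flush()  # note: the last chunk may contain unused slots
        self.write_queue.put((None, None))
        self.writer.join()
```

A forward/backward hook would then just call saver.append(output.detach()), and close() flushes the remainder at the end of training. Is something like this roughly the right direction, or is there an existing module that already covers parts of it?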