Best way to use a neural network in a DataLoader?

I have a neural network implemented in PyTorch which I wish to use to augment the dataset within a DataLoader. I loaded the neural network in a Python generator and passed it to the DataLoader. However, it seems to initialize multiple copies of the neural network on the CPU. Does anyone know a way to use a neural network with a DataLoader? I keep searching on Google, but it seems nobody has done this before.

The multiple initializations are expected if you are using num_workers>0, since each worker creates a copy of the Dataset and would thus also re-initialize the neural network used to preprocess the data.
Your approach is uncommon, and I would guess the more common use case would be to use the NN to preprocess the data inside the DataLoader loop (in the same way the trainable model would be used).

Thank you for the reply!
Hmm… I did this because my dataset is too large and it takes too much time to augment it offline with a neural network (such as StyleGAN). Therefore, I use the neural network inside the Dataset to augment the data on the fly. I was wondering whether someone else has done this as well.

I’m unsure how to understand this in the context of the question.
If the single neural network is too slow in its performance to transform the data fast enough, I would expect that model clones would be desired in each worker, so that the processing could be done in parallel (on the CPU). However, it seems you would like to use a single neural network inside the Dataset (and all workers), which would then yield the slow performance again, wouldn’t it?
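
To make that concrete, here is a minimal sketch of the per-worker pattern (the AugmentedDataset name and the tiny nn.Linear stand-in for the real augmentation model are made up for this example). Since each worker creates the model lazily in its own process, the augmentation would run in parallel on the CPU:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class AugmentedDataset(Dataset):
    def __init__(self, data):
        self.data = data
        self.augmenter = None  # created lazily, once per worker process

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if self.augmenter is None:
            # Each DataLoader worker executes this in its own process, so
            # every worker gets its own CPU copy of the augmentation model.
            self.augmenter = torch.nn.Linear(16, 16).eval()  # stand-in for e.g. StyleGAN
        with torch.no_grad():
            return self.augmenter(self.data[idx])

loader = DataLoader(AugmentedDataset(torch.randn(100, 16)),
                    batch_size=8, num_workers=2)
for batch in loader:
    pass  # batches arrive already augmented; workers run in parallel on the CPU
```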

Hmm… maybe you are right that even using a single neural network inside the Dataset would be slow, since the processing would not run on parallel CPUs.
I was thinking of using a single neural network in the Dataset and sending it to the GPU to augment the data, if the neural network is accessible from all workers. :thinking: The GPU should process the data faster than the CPU.

One way to do this is to use a class object as the collate_fn of the DataLoader, so that the collate_fn object initializes a neural network on the GPU to augment the data in batches. However, my dataset is an OCR dataset which contains character-level and sentence-level data, and I wish to augment at the character level.
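
Something like this rough sketch is what I had in mind (all names are made up). One caveat: with num_workers>0 the collate_fn also runs in the worker processes, and the PyTorch docs generally advise against CUDA usage inside workers, so this would only be safe with num_workers=0:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class GPUAugmentCollate:
    def __init__(self, device):
        self.device = device
        # stand-in for the real augmentation model (e.g. a StyleGAN)
        self.model = torch.nn.Linear(16, 16).to(device).eval()

    def __call__(self, samples):
        # stack the individual samples into a batch and augment it on the GPU
        inputs = torch.stack([x for x, _ in samples]).to(self.device)
        targets = torch.stack([y for _, y in samples]).to(self.device)
        with torch.no_grad():
            return self.model(inputs), targets

device = "cuda" if torch.cuda.is_available() else "cpu"
dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 10, (100,)))
# num_workers=0 keeps the collate_fn (and thus the CUDA calls) in the main process
loader = DataLoader(dataset, batch_size=8,
                    collate_fn=GPUAugmentCollate(device), num_workers=0)
```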

Maybe we should find a way to do this. Hmm…

If you want to use the model on the GPU, I would probably just use it inside the DataLoader loop, as it would process an entire batch there (while in the __getitem__ of the Dataset it would process a single sample by default). Re-initializing it in the collate_fn wouldn’t yield any gain either (besides adding initialization overhead, of course).
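
Roughly like this minimal sketch (again with a made-up nn.Linear standing in for the actual augmentation network):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
augmenter = torch.nn.Linear(16, 16).to(device).eval()  # stand-in for e.g. StyleGAN
model = torch.nn.Linear(16, 10).to(device)             # the trainable model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 10, (100,)))
loader = DataLoader(dataset, batch_size=8, num_workers=2)

for x, y in loader:
    x, y = x.to(device), y.to(device)
    with torch.no_grad():
        x = augmenter(x)  # augment the entire batch on the GPU
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```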

Thank you for the advice, I will give it a try!
