Still new to PyTorch, and my local machine has no GPU, so I've been prepping my network locally until I'm ready to pay for a cloud instance and use a GPU for speed.
I have a large dataset of non-image data which I am converting into image samples so that I can train a VAE with a CNN. I'm struggling with when and which tensors to move on and off the GPU, which should keep a gradient, and whether that is overkill.
The GPU does seem to be working, as I can see it ramp up and down as a batch moves through the training cycle; however, I suspect most of my time is wasted building these images on the fly (which I need to do for a future use case).
Are there any helpful examples or guidelines I should be following? For example, should I first construct an entire NumPy array for the values of the image and then convert it to a tensor placed on the GPU? Or should I construct the values entirely with tensors to begin with? Since I'm generating batches of images, do I wait to place the whole batch on the GPU, or does that not make any difference? I have hundreds of thousands of data samples, so I cannot keep them all in GPU memory; I end up clearing the GPU memory after each mini-batch, and the images need to be regenerated from scratch on the next training round. There must be a more efficient way to handle all this data.
Does anyone have experience with generating images from sample data during the load phase who could offer advice?
For reference, I'm using this paper as a guide for my approach: non-image-data-classification-using-CNN
One solution could be to preprocess your entire dataset and save it as images or pickled NumPy arrays; then you only need to load it once.
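A minimal sketch of that idea, assuming a hypothetical `raw_to_image` function standing in for your own non-image-to-image conversion: do the conversion once, write each result to disk as a `.npy` file, and have later epochs load from disk instead of rebuilding images.

```python
import os

import numpy as np


def raw_to_image(sample):
    # Hypothetical conversion: reshape a flat 64-value sample into an 8x8
    # "image". Substitute your actual non-image -> image transform here.
    return np.asarray(sample, dtype=np.float32).reshape(8, 8)


def preprocess(samples, out_dir):
    # One-off pass: convert every raw sample and save it as a .npy file,
    # so training epochs just call np.load instead of regenerating images.
    os.makedirs(out_dir, exist_ok=True)
    for i, sample in enumerate(samples):
        np.save(os.path.join(out_dir, f"sample_{i:06d}.npy"), raw_to_image(sample))
```

Your dataset then becomes a directory of files, and the per-epoch cost is a disk read rather than a full regeneration.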
If you want to speed up training on your GPU (along with any other fine-tuning considerations), you should use the maximum batch_size allowed by your GPU memory.
Could you use the DataLoader class with a custom Dataset? It has options for controlling prefetching, memory pinning to speed up moving data to the GPU, and multiple workers if the loading can be parallelised.
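Something like the following sketch, where the image construction inside `__getitem__` is a hypothetical stand-in for your own conversion. The Dataset only knows how to produce one sample; the DataLoader handles batching, workers, pinning, and prefetching:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class SampleImageDataset(Dataset):
    """Builds one image per raw sample on demand; batching is the DataLoader's job."""

    def __init__(self, raw_samples):
        self.raw = raw_samples  # stays on the CPU; keep GPU placement out of here

    def __len__(self):
        return len(self.raw)

    def __getitem__(self, idx):
        # Hypothetical on-the-fly conversion of one flat 64-value sample
        # into a 1x8x8 image tensor.
        return torch.as_tensor(self.raw[idx], dtype=torch.float32).reshape(1, 8, 8)


ds = SampleImageDataset(np.random.rand(100, 64))
loader = DataLoader(
    ds,
    batch_size=16,
    shuffle=True,
    num_workers=2,     # build images in parallel worker processes
    pin_memory=True,   # page-locked host memory for faster GPU transfers
    prefetch_factor=2, # batches each worker prepares ahead of time
)
```

Note that `prefetch_factor` is only valid when `num_workers > 0`.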
I hadn’t known that about the DataLoader. I’ll have to look into that, thank you.
Unrelated to your issue, but if you want to use a free GPU you can do so with Google Colab.
It's not something you'll want to use for a larger project, since everything only exists within the lifetime of your session, but it's a good way to test code and play around with the GPU.
That will certainly help as I convert the code over to use the GPU. It’s amazing how much effort seems to go into making sure tensors are in the correct place and not contributing to GPU memory when they don’t need to be. I’m sure there’s a good reason for this that I just don’t understand yet, but from a newbie’s perspective it would be nice if most of that were automated for you.
You mean "why does so much work go into keeping tensors on or off the GPU"? Memory limitations are the obvious one. The ideal goal for any GPU training is that you find yourself bound only by the processing speed of the GPU: every batch is loaded and transferred to the GPU in time for the next training step, and as much VRAM as possible is used (for larger batch sizes, or with smaller batches to prefetch more). Accessing anything that's on the GPU from the CPU is either impossible or requires syncing and transferring (depending on the method), so that's something to avoid too.
Fortunately, the newbie-friendly option is the DataLoader rather than writing individual batches; then you just need to make sure the model is running on the GPU.
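That pattern looks roughly like this (a sketch with a toy stand-in model, not your actual VAE): the model is moved to the device once, and each batch is moved inside the loop, with `non_blocking=True` so the copy can overlap with compute when the source memory is pinned.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in model; your VAE/CNN goes here. Moved to the device once.
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 10)).to(device)
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

# Dummy data standing in for the generated image batches.
data = TensorDataset(torch.randn(128, 1, 8, 8), torch.randint(0, 10, (128,)))
loader = DataLoader(data, batch_size=32, pin_memory=True)

for imgs, labels in loader:
    # Only the current batch lives on the GPU; non_blocking lets the copy
    # overlap with computation when the host memory is pinned.
    imgs = imgs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    opt.zero_grad()
    loss = loss_fn(model(imgs), labels)
    loss.backward()
    opt.step()
```

Everything else (batching, shuffling, worker processes) stays inside the DataLoader, so the training loop never has to think about where individual samples are.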
Oh, that's a helpful bit of information. I was doing a couple of things wrong with my DataLoader, and one of them seems to be that I was trying to focus attention on placing the data onto the GPU while inside the DataLoader. I was also generating all the images within the DataLoader as well. So I'm restructuring to generate a custom Dataset where the images will be constructed, and my understanding now is that the DataLoader will pull from the custom Dataset, allowing another batch of image data to be built while the network is processing what the DataLoader has just provided.
Yes, you generate images + labels inside your Dataset, and the DataLoader will batch the individual samples and copy each batch into pinned (page-locked) host memory if pin_memory=True, which speeds up the transfer to the GPU (at least, I think it needs to be True for the pinning to happen automatically).
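A quick way to see both behaviours (a sketch using dummy tensors in place of your generated images): the DataLoader stacks the individual `(image, label)` samples into a batch, and with `pin_memory=True` that batch lands in page-locked host memory. The copy to the GPU is still an explicit step you do in the training loop.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy (image, label) samples standing in for the generated data.
ds = TensorDataset(torch.randn(10, 1, 8, 8), torch.arange(10))
loader = DataLoader(ds, batch_size=4, pin_memory=True)

imgs, labels = next(iter(loader))
print(imgs.shape)        # individual samples stacked: torch.Size([4, 1, 8, 8])
print(imgs.is_pinned())  # True when a CUDA device is available; the batch sits
                         # in page-locked host memory, ready for a fast copy
# Moving to the GPU is still explicit, e.g.:
# imgs = imgs.to("cuda", non_blocking=True)
```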
I feel like I'm much closer to getting this all to work, thanks to the advice here. The issue I seem to be having now is that when I set the DataLoader's num_workers > 0, I receive a warning and the loader never returns any data. If num_workers=0, then it works as expected.
I am on a Mac.
Here is my warning/error:
Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
Could this be because I am using a Jupyter notebook, which is perhaps already taking control of the thread process? Or do I need to set some other parameters to prepare the DataLoader to properly multithread?
As an update, I've been able to get a test case of num_workers > 0 to work after setting the environment variable OMP_NUM_THREADS=2, though I'm not sure this had anything to do with it, because when I then attempt to enumerate over the loader in the actual training loop, the same error occurs. So something about being nested in the training loop is causing the thread confusion. I'd love to find a solution to this.
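One common cause of worker trouble on macOS (and in notebooks) is that PyTorch's multiprocessing uses the "spawn" start method there: each worker process re-imports your code, so a Dataset defined in a notebook cell can fail to pickle, and an unguarded entry point can misbehave. A sketch of the usual workaround, with the Dataset body a hypothetical stand-in for the image construction: put the Dataset in an importable `.py` module and guard the entry point.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ImageDataset(Dataset):
    """Define this in an importable .py module (not a notebook cell) so the
    'spawn' start method used on macOS can pickle it for worker processes."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Hypothetical stand-in for the on-the-fly image construction.
        return torch.full((1, 8, 8), float(idx))


if __name__ == "__main__":
    # Guard the entry point: under 'spawn', each worker re-imports this file,
    # and the guard stops workers from recursively starting the loader.
    loader = DataLoader(ImageDataset(100), batch_size=16, num_workers=2)
    batch = next(iter(loader))
    print(batch.shape)  # torch.Size([16, 1, 8, 8])
```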
But just as information for others stumbling on this thread: simply correcting my dataset creation and then drawing from it with the default DataLoader is much faster than trying to build the dataset inside the DataLoader and skipping the custom Dataset, as I was previously attempting.