I want to know how to speed up the dataloader. I am using torch.utils.data.DataLoader (with 8 workers) to train ResNet-18 on my own dataset. My environment is Ubuntu 16.04, 3x Titan Xp, and a 1 TB SSD.
Epoch: [1079][0/232]    Time 5.149 (5.149)    Data 5.056 (5.056)    Loss 0.0648 (0.0648)    Prec@1 98.047 (98.047)
Epoch: [1079][10/232]   Time 0.208 (0.778)    Data 0.001 (0.610)    Loss 0.0904 (0.0833)    Prec@1 95.312 (96.165)
Epoch: [1079][20/232]   Time 0.206 (0.599)    Data 0.001 (0.421)    Loss 0.0771 (0.0877)    Prec@1 96.484 (95.964)
Epoch: [1079][30/232]   Time 0.208 (0.529)    Data 0.074 (0.353)    Loss 0.0681 (0.0886)    Prec@1 98.047 (96.006)
Epoch: [1079][40/232]   Time 1.799 (0.538)    Data 1.697 (0.376)    Loss 0.0647 (0.0860)    Prec@1 96.484 (96.103)
Epoch: [1079][50/232]   Time 0.201 (0.516)    Data 0.001 (0.360)    Loss 0.0578 (0.0836)    Prec@1 96.484 (96.186)
Epoch: [1079][60/232]   Time 0.443 (0.494)    Data 0.328 (0.340)    Loss 0.0602 (0.0835)    Prec@1 97.656 (96.203)
Epoch: [1079][70/232]   Time 0.201 (0.483)    Data 0.001 (0.334)    Loss 0.1140 (0.0858)    Prec@1 94.922 (96.121)
Epoch: [1079][80/232]   Time 0.616 (0.489)    Data 0.521 (0.343)    Loss 0.0932 (0.0857)    Prec@1 95.703 (96.161)
Epoch: [1079][90/232]   Time 0.200 (0.485)    Data 0.001 (0.341)    Loss 0.0596 (0.0847)    Prec@1 98.047 (96.175)
Epoch: [1079][100/232]  Time 0.362 (0.477)    Data 0.269 (0.334)    Loss 0.0877 (0.0854)    Prec@1 94.531 (96.117)
The log shows that the dataloader takes at least 50% of the total training time, so I want to speed up training by reducing the time spent in the dataloader.
I analyzed the time spent in the dataset's __getitem__():
total time: 0.02
load img time: 0.0140, 78.17%
random crop and resize time: 0.0001, 0.68%
random flip time: 0.0001, 0.40%
other time: 22.36%
It shows that reading the image takes most of the time (I read 3 images in this step). I use this loader in my torch.utils.data.Dataset:
from PIL import Image

def pil_loader(path):
    # open path as file to avoid ResourceWarning (https://github.com/python-pillow/Pillow/issues/835)
    with open(path, 'rb') as f:
        with Image.open(f) as img:
            return img.convert('RGB')
Therefore, I tried the approach from "Rookie ask: how to speed up the loading speed in pytorch": I save the images as pickled strings in LMDB and then load them back. I found that it doesn't speed things up much; maybe pickle.loads() still costs too much time.
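Roughly what my LMDB loading looks like, as a sketch (the key scheme here is just an example):

import pickle
import lmdb

env = lmdb.open('train.lmdb', readonly=True, lock=False, readahead=False)

def lmdb_loader(index):
    with env.begin(write=False) as txn:
        raw = txn.get(str(index).encode('ascii'))   # look up the pickled image by index
    return pickle.loads(raw)                         # unpickling still seems to cost a lot of time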
Now I have no idea how to speed up the dataloader. Any hints would help me a lot.
A database is a good optimisation when your data is a huge text file, but for images stored in individual files it is probably overkill and will add some unnecessary overhead.
Another point to consider is that pickle might be a little slow to save/load pytorch Tensors.
As you say, the image decoding seems to take most of the time, so I would suggest writing a small script that loads each image_file.jpg into a torch Tensor, and then uses torch.save to save the result back into a file named image_file.pt or something like that.
Then you would need to modify your loader to torch.load the .pt files instead of loading the image files.
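A rough sketch of both halves (the directory names and the ToTensor conversion are placeholders):

import os
import torch
from PIL import Image
from torchvision import transforms

# one-off preprocessing: decode every JPEG once and cache the tensor on disk
to_tensor = transforms.ToTensor()
src_dir, dst_dir = 'images_jpg', 'images_pt'
os.makedirs(dst_dir, exist_ok=True)
for name in os.listdir(src_dir):
    if not name.lower().endswith('.jpg'):
        continue
    with open(os.path.join(src_dir, name), 'rb') as f:
        img = Image.open(f).convert('RGB')
    torch.save(to_tensor(img), os.path.join(dst_dir, name[:-4] + '.pt'))

# in the Dataset, replace pil_loader with:
def pt_loader(path):
    return torch.load(path)   # no JPEG decoding at training time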
In my experience, I would first build an HDF5 file with all your images, which you can do easily by following the h5py documentation at http://docs.h5py.org/en/latest/. During training, build a class inheriting from Dataset which returns your images, something along these lines:
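(A minimal sketch; it assumes the file holds an 'images' array of shape (N, H, W, 3) in uint8 and a 'labels' array of length N; adapt the names to your data.)

import h5py
from torch.utils.data import Dataset

class dataset_h5(Dataset):
    def __init__(self, h5_path, transform=None):
        # the whole training set lives in one HDF5 file
        self.file = h5py.File(h5_path, 'r')
        self.transform = transform

    def __getitem__(self, index):
        img = self.file['images'][index]          # uint8 HxWx3 numpy array
        label = int(self.file['labels'][index])
        if self.transform is not None:
            img = self.transform(img)             # e.g. ToPILImage + crops + ToTensor
        return img, label

    def __len__(self):
        return self.file['labels'].shape[0]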
Once I save the preprocessed images in .pt format, the loader will read the tensors directly. How can I then do random crop and resize? Do I need to convert them back to PIL.Image?
I hadn't thought of that problem... I can think of two approaches, but I can't tell you which will be fastest:
1. Convert each image to .bmp format instead of .jpg, then use your original loader on the .bmp files, which will decompress much faster than .jpg.
2. Use torchvision.transforms.ToPILImage, which I think should run pretty fast (see the sketch below).
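For approach 2, a minimal sketch (assuming the cached .pt files hold float tensors in [0, 1] with shape 3xHxW):

import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.ToPILImage(),              # tensor -> PIL.Image
    transforms.RandomResizedCrop(224),    # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                # back to a float tensor
])

img = train_transform(torch.load('image_file.pt'))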
I have implemented dataset_h5. It runs quite well when I set num_workers to 1 or 2, but I run into problems when I set num_workers higher than 2. It seems related to the h5py version; my h5py is 2.7.1.
File "/home/titan/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
File "/home/titan/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
KeyError: 'Traceback (most recent call last):
  File "/home/titan/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/titan/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 55, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/titan/code/res-pytorch/AdobeData/AdobeData.py", line 595, in __getitem__
    fgimg = self.fgfile['img'][index, ...]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/titan/anaconda3/lib/python3.6/site-packages/h5py/_hl/group.py", line 167, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (bad object header version number)"'
It looks like HDF5 has some concurrency issues, so my suggestion is probably not appropriate when you use several workers. I often use one worker because my networks are computationally heavy and I'm not limited by the data iterator. Perhaps you should try another approach like zarr (http://zarr.readthedocs.io/en/stable/), which has been designed to be thread-safe.
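If you try zarr, a rough sketch of the same kind of Dataset backed by a zarr directory store (the array names are assumptions, matching the HDF5 sketch above):

import zarr
from torch.utils.data import Dataset

class dataset_zarr(Dataset):
    def __init__(self, store_path, transform=None):
        # directory store with 'images' (N, H, W, 3) uint8 and 'labels' (N,)
        self.root = zarr.open(store_path, mode='r')
        self.transform = transform

    def __getitem__(self, index):
        img = self.root['images'][index]
        label = int(self.root['labels'][index])
        if self.transform is not None:
            img = self.transform(img)
        return img, label

    def __len__(self):
        return self.root['labels'].shape[0]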
I'm not sure if your problem is the same as mine, but I had trouble reading very big images (3k by 3k).
If this is your case, here is my advice:
1. Can you divide the images into smaller sub-regions, e.g. 1k by 1k? If yes, just crop the images to make them smaller; the reading speed should then be much faster.
2. You can use jpeg4py, a library dedicated to decoding big JPEG files much faster than PIL. Just read the image with this library, then convert it to PIL.
The fastest option I found is using the jpeg4py library together with OpenCV data augmentation (so no PIL image); I used the OpenCV technique from this pull request. See the sketch below.
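Roughly what that loading path looks like, as a sketch (the resize size and crop size are just examples):

import random

import cv2
import jpeg4py as jpeg
import numpy as np
import torch

def load_and_augment(path, crop_size=224):
    img = jpeg.JPEG(path).decode()                 # libjpeg-turbo decode -> uint8 RGB ndarray
    img = cv2.resize(img, (256, 256))              # make sure the crop fits
    top = random.randint(0, img.shape[0] - crop_size)
    left = random.randint(0, img.shape[1] - crop_size)
    img = img[top:top + crop_size, left:left + crop_size]   # random crop via slicing
    if random.random() < 0.5:
        img = cv2.flip(img, 1)                     # horizontal flip
    img = img.astype(np.float32) / 255.0
    return torch.from_numpy(img).permute(2, 0, 1)  # HWC -> CHW float tensor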
After reading some of the code of torch.utils.data.dataloader, I find it does not work like Caffe, which prefetches the next batch of data while the GPUs are working. I found a blog post that tries to do this; I will try it.
After reading the blog, I realized I was mistaken: the dataloader does try to prefetch the next batch of data. But I find that it can NOT make full use of the CPUs (it only uses about 60% of CPU capacity). The blog shows that data processing takes less than 18% of the time. Actually, if it fully achieved its goal, made full use of the CPUs, and the disk were fast enough, that figure should be near 0%, not 18%.
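For reference, the kind of GPU-side prefetcher such posts describe looks roughly like this (my sketch, not the blog's code; it assumes the DataLoader was created with pin_memory=True):

import torch

class DataPrefetcher:
    """Copy the next batch to the GPU on a side stream while the current batch is computing."""

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input, self.next_target = None, None
            return
        with torch.cuda.stream(self.stream):
            # async host-to-device copies; needs pinned memory to actually overlap
            self.next_input = self.next_input.cuda(non_blocking=True)
            self.next_target = self.next_target.cuda(non_blocking=True)

    def next(self):
        if self.next_input is None:
            return None, None
        # make the default stream wait until the async copies have finished
        torch.cuda.current_stream().wait_stream(self.stream)
        batch, target = self.next_input, self.next_target
        # tell the allocator these tensors are now used on the default stream
        batch.record_stream(torch.cuda.current_stream())
        target.record_stream(torch.cuda.current_stream())
        self._preload()
        return batch, target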
@Hou_Qiqi
have you finally fixed this annoying problem of "KeyError: 'Unable to open object (bad object header version number)'" in h5py.h5o.open?
I have the same issue as you when I set num_workers greater than 2.
For anyone reading this, NVIDIA DALI is a great solution: it has simple-to-use PyTorch integration.
I was running into the same problems with the PyTorch dataloader. On ImageNet, I couldn't seem to get above about 250 images/sec. On a Google Cloud instance with 12 cores and a V100, I could get just over 2000 images/sec with DALI. However, in cases where the dataloader isn't the bottleneck, I found that using DALI hurt performance by 5-10%. This makes sense, I think, since you're using the GPU to do some of the decoding and preprocessing.
Edit: DALI also has a CPU-only mode, meaning no GPU performance hit.
Just found this thread a few days ago and implemented NVIDIA DALI to load my data while doing some transfer learning with AlexNet.
On an AWS p2.8xlarge instance (8 Tesla K80 GPUs), using DALI speeds up my epoch time from 480 s/epoch to 10 s/epoch, with no need for any code that explicitly prepares data while the GPUs are running. One important thing to note is that if you're using a DALI dataloader from an external source (i.e. if you have your image classes grouped by folder), you have to manually reset the dataloader using loader_name.reset() at the end of every training epoch. That's not how the PyTorch dataloader works, so it took me a while to realize what was going on here.
The only irritating thing I've found about DALI is that there is no immediately obvious way (to me, anyway) to convert pixel values from uint8 with a 0-255 range to float with a 0-1 range, which is needed for transfer learning with PyTorch's pretrained models. Dividing by 255.0 within a pipeline runs into a data type mismatch, the ops.Cast() function only converts to float but doesn't rescale, and none of the various flavors of normalize functions allow for it either. The only way I was able to do it was by manually scaling the mean and std values given by PyTorch.
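For reference, the scaling is just this (a sketch of the arithmetic; how you pass the values to DALI's normalize op depends on your DALI version):

# torchvision's pretrained models expect inputs normalized with these constants
# after the pixels have been scaled to [0, 1]
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# DALI normalizes the uint8 pixels (0-255) directly, so pass scaled constants instead;
# the result is equivalent to dividing by 255 and then normalizing
dali_mean = [m * 255 for m in mean]   # [123.675, 116.28, 103.53]
dali_std = [s * 255 for s in std]     # [58.395, 57.12, 57.375]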
Other than that, I agree that the PyTorch integration is simple and fairly clean.
I am facing a similar problem, but I am more interested in knowing how to make my dataloader efficient.
Ideally, I want to read multiple temporal blocks from a video, perform transformations like random crop and rescale, extract I3D features for them, concatenate them into a tensor, and return it along with other information like text embeddings for the question and multiple answers.
For now, I am preprocessing my videos and extracting features beforehand, which does not allow me to use multiple random crops and is not a good approach. The video frames are 60 GB in total. Any tips?
I am using Ubuntu 16.04, 24 GB RAM, 245 GB of 959 GB disk space free right now, 1 Titan Xp GPU, Python 3.6.4, PyTorch 0.4.1, CUDA 8.0.61, and cuDNN 7102.
The dataset is on a local disk.