Fast data loader for ImageNet

It is really slow for me to load the ImageNet dataset for training :cold_sweat:.
I use the official example to train a model on ImageNet classification 2012, and almost all of the time goes into loading the images from disk.
I also tried using Fuel to save all the images to an HDF5 file before training, but it still seems very slow: a mini-batch of size 128 takes about 3.6 s, of which 3.2 s is spent on data loading.
Is there any way to load the data faster? Thanks~
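
For reference, this is roughly the input pipeline I'm using from the official example; `num_workers` and `pin_memory` are the main knobs on the PyTorch side (the dataset path below is a placeholder):

```python
import torch.utils.data
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Standard ImageNet training transforms from the official example.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_dataset = datasets.ImageFolder('/path/to/imagenet/train', train_transform)

# num_workers controls how many subprocesses decode and transform images in
# parallel; pin_memory speeds up the host-to-GPU copies.
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=128, shuffle=True,
    num_workers=5, pin_memory=True)
```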


How many workers are you using? What kind of disk are you using to store the dataset?

I use 5 workers and store the dataset on an HDD.
I found that part of the time is spent in the data transforms, so disk I/O may not be the only problem.
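
A rough way to split the per-batch time into "waiting on data" versus forward/backward, to see which one dominates (a sketch; `loader`, `model`, `criterion` and `device` stand in for your own objects):

```python
import time
import torch

def profile_loader(loader, model, criterion, device, num_batches=50):
    # Rough split of per-batch time into "blocked on the DataLoader" vs.
    # "data wait + forward/backward".
    data_time, total_time = 0.0, 0.0
    model.train()
    end = time.time()
    for i, (images, target) in enumerate(loader):
        data_time += time.time() - end          # time spent waiting for data
        images, target = images.to(device), target.to(device)
        loss = criterion(model(images), target)
        loss.backward()
        model.zero_grad()
        if device.type == 'cuda':
            torch.cuda.synchronize()            # make the GPU timing meaningful
        total_time += time.time() - end         # data wait + compute for this batch
        end = time.time()
        if i + 1 == num_batches:
            break
    print('data: %.2fs/batch, total: %.2fs/batch'
          % (data_time / num_batches, total_time / num_batches))
```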

Yes, the data transforms can be quite expensive.
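
A quick way to see how much the CPU-side transform alone costs is to time the standard pipeline on one already-decoded image (a sketch; the file path is a placeholder):

```python
import time
from PIL import Image
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# JPEG decoding, which the DataLoader also does per image, comes on top of this.
img = Image.open('/path/to/some_image.JPEG').convert('RGB')
start = time.time()
for _ in range(100):
    transform(img)
print('transform: %.1f ms per image' % ((time.time() - start) / 100 * 1000))
```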


Unfortunately, all of those mass-consumption dataset loaders and feeders are far from optimal. Everyone serious about this ends up writing their own performant code (or compensating by buying expensive SSDs and CPUs), but they don't publish it because it's embarrassingly written.

People often say you need SSDs and whatnot, but I use a couple of ancient HDDs and it's enough to feed a dual-1080 setup. It's just troublesome to write a properly multi-threaded application for this; you don't even need to do it in C if you have a good CPU.
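
The core of the idea is just keeping a background thread a few batches ahead of the GPU. A minimal sketch (not the code I actually use):

```python
import queue
import threading

class BackgroundPrefetcher:
    # Wraps any iterable (e.g. a DataLoader) and keeps up to `max_prefetch`
    # batches ready in a queue while the GPU is busy. One-shot: build a new
    # wrapper at the start of each epoch.

    def __init__(self, loader, max_prefetch=4):
        self.queue = queue.Queue(maxsize=max_prefetch)
        self.thread = threading.Thread(target=self._worker, args=(loader,),
                                       daemon=True)
        self.thread.start()

    def _worker(self, loader):
        for batch in loader:
            self.queue.put(batch)   # blocks when the queue is full
        self.queue.put(None)        # sentinel: the epoch is finished

    def __iter__(self):
        while True:
            batch = self.queue.get()
            if batch is None:
                return
            yield batch

# for images, target in BackgroundPrefetcher(train_loader): ...
```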


More workers seem to make it slower for me. The data loading is extremely slow while the CPU load stays low, which suggests the DataLoader can't make full use of the CPU.

I had this same problem; the admins at our university thought they should save money on a server with 4 Titan X's and not get an SSD for it. I made tensorpack's sequential loader even easier to use in PyTorch: https://github.com/BayesWatch/sequential-imagenet-dataloader
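
It's meant to drop into the usual training loop in place of the DataLoader, roughly like this (a sketch from memory; check the README for the exact class name and arguments):

```python
# Sketch only: the class and argument names here may not match the current
# repo exactly -- see the README.
from imagenet_seq.data import Loader

train_loader = Loader('train', batch_size=256, shuffle=True,
                      num_workers=4, cuda=True)

for images, target in train_loader:
    ...  # same loop body as with torch.utils.data.DataLoader
```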


Did you store the image files after preprocessing?

Yeah, but it’s a one-off cost. I just copied it to the other servers after.
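
The preprocessing is basically tensorpack's "Efficient DataFlow" recipe: read the raw JPEG bytes and dump them sequentially into one big LMDB file. Roughly (a sketch; tensorpack's names have shifted between versions, so check its docs):

```python
import numpy as np
from tensorpack.dataflow import dataset, LMDBSerializer

class BinaryILSVRC12(dataset.ILSVRC12Files):
    # Yield [raw_jpeg_bytes, label] instead of decoded images, so decoding
    # is deferred to the training-time workers.
    def __iter__(self):
        for fname, label in super(BinaryILSVRC12, self).__iter__():
            with open(fname, 'rb') as f:
                jpeg = np.frombuffer(f.read(), dtype='uint8')
            yield [jpeg, label]

ds = BinaryILSVRC12('/path/to/ILSVRC12', 'train')
# Sequential writes, so later reads stream nicely even from an HDD.
LMDBSerializer.save(ds, '/path/to/ILSVRC12-train.lmdb')
```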

I ran the script, but hit this problem:

```
[1202 12:58:13 @concurrency.py:236] WRN Command failed: 127
[1202 12:58:13 @concurrency.py:237] WRN /bin/sh: 1: protoc: not found
```

Could be that you're missing protobuf. You could install it with conda. If that works, I'll add it to the README; I don't have time to test it myself right now.

I fixed it a few days ago; I think it was indeed the missing protobuf.

Hi, I've used your dataloader, but the loading speed is still about 4 s per batch with batch_size=256 and num_workers=1, which I don't think is fast enough. Is that normal?

I used 4 workers, and each mini-batch took 0.59 s to process, including the forward and backward passes. Maybe you should try more workers?
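
If you're not sure how many workers your machine wants, a quick sweep while timing a few batches usually settles it (a sketch; `make_loader` is a placeholder for however you construct your loader):

```python
import time

def time_loader(loader, num_batches=20):
    # Average seconds per mini-batch spent purely on data loading.
    it = iter(loader)
    next(it)                        # skip the first batch (worker startup cost)
    start = time.time()
    for _ in range(num_batches):
        next(it)
    return (time.time() - start) / num_batches

for workers in (1, 2, 4, 8):
    loader = make_loader(num_workers=workers)   # placeholder factory for your loader
    print('num_workers=%d: %.2fs per batch' % (workers, time_loader(loader)))
```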

Hi, I'm using your method, but when I train on ImageNet I hit this problem:

```
Traceback (most recent call last):
  File "main.py", line 388, in <module>
    main()
  File "main.py", line 154, in main
    num_workers=args.workers,)
  File "/home/jcz/github/pytorch_examples/imagenet/sequential_imagenet_dataloader/imagenet_seq/data.py", line 166, in __init__
    ds = td.LMDBData(lmdb_loc, shuffle=False)
  File "/home/jcz/github/tensorpack/tensorpack/dataflow/format.py", line 91, in __init__
    self._set_keys(keys)
  File "/home/jcz/github/tensorpack/tensorpack/dataflow/format.py", line 109, in _set_keys
    self.keys = loads(self.keys)
  File "/home/jcz/github/tensorpack/tensorpack/utils/serialize.py", line 29, in loads_msgpack
    return msgpack.loads(buf, raw=False, max_bin_len=1000000000)
  File "/home/jcz/Venv/pytorch/lib/python3.5/site-packages/msgpack_numpy.py", line 214, in unpackb
    return _unpackb(packed, **kwargs)
  File "msgpack/_unpacker.pyx", line 187, in msgpack._cmsgpack.unpackb
ValueError: 1281167 exceeds max_array_len(131072)
```

I fixed it. You just need to downgrade msgpack to 0.5.6.
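
For context: the ImageNet train index has 1,281,167 entries, and newer msgpack releases enforce a much smaller default max_array_len, which is exactly what the ValueError complains about; downgrading restores the old defaults. If you would rather keep a newer msgpack, raising the limit explicitly should also work (a sketch, not what I actually did):

```python
import msgpack

def loads_with_big_limits(buf):
    # Variant of tensorpack's loads_msgpack that also raises max_array_len,
    # so the 1,281,167-key index can be unpacked with msgpack >= 0.6.
    return msgpack.loads(buf, raw=False,
                         max_bin_len=1000000000,
                         max_array_len=2000000)
```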