Fast data loader for ImageNet

It is really slow for me to load the ImageNet dataset for training :cold_sweat:.
I use the official example to train a model on ImageNet classification 2012, and almost all of the time goes into loading the images from disk.
I also tried using Fuel to save all the images to an HDF5 file before training, but it still seems very slow: a mini-batch of size 128 takes about 3.6 s, of which 3.2 s is spent on data loading.
Is there any way to load the data faster? Thanks~
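
For reference, this is roughly the input pipeline I'm using from the official example; `num_workers` and `pin_memory` are the main knobs on the PyTorch side (the dataset path below is a placeholder):

```python
import torch.utils.data
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Standard ImageNet training transforms from the official example.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_dataset = datasets.ImageFolder('/path/to/imagenet/train', train_transform)

# num_workers controls how many subprocesses decode and transform images in
# parallel; pin_memory speeds up the host-to-GPU copies.
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=128, shuffle=True,
    num_workers=5, pin_memory=True)
```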


How many workers are you using? What kind of disk are you using to store the dataset?

I use 5 workers and store the dataset on an HDD.
I found that part of the time is spent in the data transforms, so disk I/O may not be the only problem.
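
A rough way to split the per-batch time into "waiting on data" versus forward/backward, to see which one dominates (a sketch; `loader`, `model`, `criterion` and `device` stand in for your own objects):

```python
import time
import torch

def profile_loader(loader, model, criterion, device, num_batches=50):
    # Rough split of per-batch time into "blocked on the DataLoader" vs.
    # "data wait + forward/backward".
    data_time, total_time = 0.0, 0.0
    model.train()
    end = time.time()
    for i, (images, target) in enumerate(loader):
        data_time += time.time() - end          # time spent waiting for data
        images, target = images.to(device), target.to(device)
        loss = criterion(model(images), target)
        loss.backward()
        model.zero_grad()
        if device.type == 'cuda':
            torch.cuda.synchronize()            # make the GPU timing meaningful
        total_time += time.time() - end         # data wait + compute for this batch
        end = time.time()
        if i + 1 == num_batches:
            break
    print('data: %.2fs/batch, total: %.2fs/batch'
          % (data_time / num_batches, total_time / num_batches))
```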

Yes, the data transforms can be quite expensive.
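
A quick way to see how much the CPU-side transform alone costs is to time the standard pipeline on one already-decoded image (a sketch; the file path is a placeholder):

```python
import time
from PIL import Image
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# JPEG decoding, which the DataLoader also does per image, comes on top of this.
img = Image.open('/path/to/some_image.JPEG').convert('RGB')
start = time.time()
for _ in range(100):
    transform(img)
print('transform: %.1f ms per image' % ((time.time() - start) / 100 * 1000))
```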


Unfortunately, all of those mass-consumption dataset loaders and feeders are far from optimal. Everyone serious about this ends up writing their own performant code (or compensating by buying expensive SSDs and CPUs), but they don't publish it because it's embarrassingly written.

People often say you need SSDs and whatnot, but I use a couple of ancient HDDs and it's enough to feed a dual-1080 setup. It's just troublesome to write a properly multi-threaded application for this; you don't even need to do it in C if you have a good CPU.
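
The core of the idea is just keeping a background thread a few batches ahead of the GPU. A minimal sketch (not the code I actually use):

```python
import queue
import threading

class BackgroundPrefetcher:
    # Wraps any iterable (e.g. a DataLoader) and keeps up to `max_prefetch`
    # batches ready in a queue while the GPU is busy. One-shot: build a new
    # wrapper at the start of each epoch.

    def __init__(self, loader, max_prefetch=4):
        self.queue = queue.Queue(maxsize=max_prefetch)
        self.thread = threading.Thread(target=self._worker, args=(loader,),
                                       daemon=True)
        self.thread.start()

    def _worker(self, loader):
        for batch in loader:
            self.queue.put(batch)   # blocks when the queue is full
        self.queue.put(None)        # sentinel: the epoch is finished

    def __iter__(self):
        while True:
            batch = self.queue.get()
            if batch is None:
                return
            yield batch

# for images, target in BackgroundPrefetcher(train_loader): ...
```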


More workers seem to make it slower for me. The data loading is extremely slow while the CPU load stays low, which suggests the DataLoader can't make full use of the CPU.

I had this same problem; the admins at our university thought they should save money on a server with 4 Titan X's and not get an SSD for it. I made tensorpack's sequential loader even easier to use in PyTorch: https://github.com/BayesWatch/sequential-imagenet-dataloader
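
It's meant to drop into the usual training loop in place of the DataLoader, roughly like this (a sketch from memory; check the README for the exact class name and arguments):

```python
# Sketch only: the class and argument names here may not match the current
# repo exactly -- see the README.
from imagenet_seq.data import Loader

train_loader = Loader('train', batch_size=256, shuffle=True,
                      num_workers=4, cuda=True)

for images, target in train_loader:
    ...  # same loop body as with torch.utils.data.DataLoader
```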


Did you store the image files after preprocessing?

Yeah, but it’s a one-off cost. I just copied it to the other servers after.
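
The preprocessing is basically tensorpack's "Efficient DataFlow" recipe: read the raw JPEG bytes and dump them sequentially into one big LMDB file. Roughly (a sketch; tensorpack's names have shifted between versions, so check its docs):

```python
import numpy as np
from tensorpack.dataflow import dataset, LMDBSerializer

class BinaryILSVRC12(dataset.ILSVRC12Files):
    # Yield [raw_jpeg_bytes, label] instead of decoded images, so decoding
    # is deferred to the training-time workers.
    def __iter__(self):
        for fname, label in super(BinaryILSVRC12, self).__iter__():
            with open(fname, 'rb') as f:
                jpeg = np.frombuffer(f.read(), dtype='uint8')
            yield [jpeg, label]

ds = BinaryILSVRC12('/path/to/ILSVRC12', 'train')
# Sequential writes, so later reads stream nicely even from an HDD.
LMDBSerializer.save(ds, '/path/to/ILSVRC12-train.lmdb')
```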

I ran the script, but hit this problem:

```
[1202 12:58:13 @concurrency.py:236] WRN Command failed: 127
[1202 12:58:13 @concurrency.py:237] WRN /bin/sh: 1: protoc: not found
```

Could be that you're missing protobuf. You could install it with conda. If that works, I'll add it to the README; I don't have time to test it myself right now.

I fixed it a few days ago; I think it was indeed the missing protobuf.

Hi, I've used your dataloader, but the loading speed is still about 4 s per batch with batch_size=256 and num_workers=1, which I don't think is fast enough. Is that normal?

I used 4 workers, and each mini-batch took 0.59 s to process, including the forward and backward passes. Maybe you should try more workers?
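
If you're not sure how many workers your machine wants, a quick sweep while timing a few batches usually settles it (a sketch; `make_loader` is a placeholder for however you construct your loader):

```python
import time

def time_loader(loader, num_batches=20):
    # Average seconds per mini-batch spent purely on data loading.
    it = iter(loader)
    next(it)                        # skip the first batch (worker startup cost)
    start = time.time()
    for _ in range(num_batches):
        next(it)
    return (time.time() - start) / num_batches

for workers in (1, 2, 4, 8):
    loader = make_loader(num_workers=workers)   # placeholder factory for your loader
    print('num_workers=%d: %.2fs per batch' % (workers, time_loader(loader)))
```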

Hi, I'm using your method, but when I train on ImageNet I hit this problem:

```
Traceback (most recent call last):
  File "main.py", line 388, in <module>
    main()
  File "main.py", line 154, in main
    num_workers=args.workers,)
  File "/home/jcz/github/pytorch_examples/imagenet/sequential_imagenet_dataloader/imagenet_seq/data.py", line 166, in __init__
    ds = td.LMDBData(lmdb_loc, shuffle=False)
  File "/home/jcz/github/tensorpack/tensorpack/dataflow/format.py", line 91, in __init__
    self._set_keys(keys)
  File "/home/jcz/github/tensorpack/tensorpack/dataflow/format.py", line 109, in _set_keys
    self.keys = loads(self.keys)
  File "/home/jcz/github/tensorpack/tensorpack/utils/serialize.py", line 29, in loads_msgpack
    return msgpack.loads(buf, raw=False, max_bin_len=1000000000)
  File "/home/jcz/Venv/pytorch/lib/python3.5/site-packages/msgpack_numpy.py", line 214, in unpackb
    return _unpackb(packed, **kwargs)
  File "msgpack/_unpacker.pyx", line 187, in msgpack._cmsgpack.unpackb
ValueError: 1281167 exceeds max_array_len(131072)
```

I fixed it. You just need to downgrade msgpack to 0.5.6.
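
For context: the ImageNet train index has 1,281,167 entries, and newer msgpack releases enforce a much smaller default max_array_len, which is exactly what the ValueError complains about; downgrading restores the old defaults. If you would rather keep a newer msgpack, raising the limit explicitly should also work (a sketch, not what I actually did):

```python
import msgpack

def loads_with_big_limits(buf):
    # Variant of tensorpack's loads_msgpack that also raises max_array_len,
    # so the 1,281,167-key index can be unpacked with msgpack >= 0.6.
    return msgpack.loads(buf, raw=False,
                         max_bin_len=1000000000,
                         max_array_len=2000000)
```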