Bottleneck when loading a sequence of images to compose the input tensor

Hi everyone,

I’m implementing the TTNet paper. The input tensor of the model consists of N=9 sequential images, so it has a size of (batch_size, 27, H, W).

To load an image sequence, I use the code below:

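import cv2
import numpy as np
origin_imgs = []
for img_path_idx, img_path in enumerate(img_path_list):
    origin_imgs.append(cv2.imread(img_path))  # loop body missing in the post; cv2.imread is an assumption
origin_imgs = np.dstack(origin_imgs)  # (1080, 1920, 27)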

I train the model on 4 GPUs using torch.nn.parallel.DistributedDataParallel(), with batch_size=32 and n_workers=8.

However, the data loading phase is a bottleneck. This makes training very slow (around 1 hour/epoch).

With N=1, the input tensor has a shape of (batch_size, 3, H, W) and training 1 epoch takes only around 4 minutes.

Could you suggest ways to speed up the data loading phase?
Thank you.

You do have your data on an SSD, right?

Is the bottleneck the loading itself or the augmentation? I usually try to move all augmentation to the GPU, except gathering statistics, which I do beforehand and store in a tensor.
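
A minimal sketch of what GPU-side augmentation could look like in PyTorch. The function name `gpu_augment`, the choice of a random horizontal flip, and the `mean`/`std` tensors (the statistics gathered beforehand) are illustrative assumptions, not code from this thread or from TTNet.

```python
import torch

def gpu_augment(batch_u8, mean, std, flip_prob=0.5):
    """Augment a uint8 batch of shape (B, C, H, W) on whatever device it lives on.

    mean and std are per-channel tensors of shape (C,), computed beforehand.
    """
    # Convert to float in [0, 1]; .float() makes a copy, so div_ is safe.
    x = batch_u8.float().div_(255.0)
    # Draw one flip decision per sample, directly on the batch's device.
    flip = torch.rand(x.shape[0], device=x.device) < flip_prob
    # Flip the width axis only for the selected samples.
    x = torch.where(flip[:, None, None, None], x.flip(-1), x)
    # Normalize with the precomputed statistics.
    return (x - mean[None, :, None, None]) / std[None, :, None, None]
```

The DataLoader workers would then only decode and stack uint8 images; the batch is moved to the GPU first and `gpu_augment` is called in the training loop.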

For the loading:

Finally, you ask where your training spends its time. This is important, and fixing it reduces the time per epoch.
You are probably aware, but it is worth reminding ourselves every now and then that other things (most prominently the learning rate) affect the overall runtime to a given target metric by changing the number of epochs needed. (And I say this as someone who does quite a bit of client work of “benchmark-and-optimize-time-per-epoch”.)

Best regards



Yes, I load my data from an SSD. I’m using OpenCV to load images.

I’d like to provide an example of the loading time:

  • Loading the sequence of 9 images (resolution: 1080x1920x3): 0.8175349235534668 seconds
  • Augmentation: 0.05149197578430176 seconds

That’s why I think loading the raw images is the bottleneck.

Right. :)

  • Is TorchVision+pillow-simd or the other backend faster or slower?
  • Is it IO or CPU bound?
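
One rough way to answer the IO-vs-CPU question is to time reading the raw file bytes separately from decoding them. This is only a sketch: the helper name `split_io_and_decode_time` is made up for illustration, and the `decode` callable is whatever decoder you use (for OpenCV that could be `lambda b: cv2.imdecode(np.frombuffer(b, np.uint8), cv2.IMREAD_COLOR)`).

```python
import time
from pathlib import Path

def split_io_and_decode_time(paths, decode):
    """Time reading raw bytes (IO) separately from decoding them (CPU)."""
    # Phase 1: pull the encoded files off the disk without decoding.
    start = time.perf_counter()
    blobs = [Path(p).read_bytes() for p in paths]
    io_seconds = time.perf_counter() - start

    # Phase 2: decode from memory, so no disk access is involved.
    start = time.perf_counter()
    decoded = [decode(b) for b in blobs]
    decode_seconds = time.perf_counter() - start
    return io_seconds, decode_seconds, decoded
```

Keep in mind that the OS page cache makes repeated reads of the same files look much faster than a cold read, so the IO number is only meaningful on files that have not been read recently.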

You could check whether writing tensors of 9 images to disk saves time.
However, in my experience, this can turn the disk access into the problem. One thing to do is to keep the images as byte tensors (if the values are 0…255) to cut the disk use.
(Also, you aren’t measuring just the first batch, right? It is known to be slower.)
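
A sketch of the "write the 9 stacked images to disk as bytes" idea using NumPy. The function names are hypothetical. For scale: one full-resolution sample of shape (1080, 1920, 27) is about 56 MB as uint8 versus about 224 MB as float32, which is why the dtype matters and why disk access can become the new problem.

```python
import numpy as np

def save_packed_sequence(frames, out_path):
    """Stack HxWx3 uint8 frames along the channel axis and save one .npy file."""
    stacked = np.dstack(frames)
    if stacked.dtype != np.uint8:
        raise TypeError("expected uint8 frames with values in 0..255")
    # Pass a path ending in .npy; np.save appends the suffix otherwise.
    np.save(out_path, stacked)
    return stacked.shape

def load_packed_sequence(path):
    """Memory-map the file so data is paged in on access rather than read up front."""
    return np.load(path, mmap_mode="r")
```

Whether this beats decoding JPEGs depends on the disk: the raw arrays are far larger than the compressed files, so it trades CPU time for read bandwidth.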

Of course, you can also look at the hardware: for IO, M.2 SSDs are significantly faster than SATA ones; for CPU, more cores help (I always want to get one of these fancy CPUs…).

Best regards



I have solved the problem.
My solutions:

  • Using the TurboJPEG library to load images instead of OpenCV
  • Resizing the images to a smaller size before stacking them to compose the input tensor

Thank you so much!
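
For anyone landing here later, a minimal sketch of how the two changes could be combined, assuming the PyTurboJPEG package (`turbojpeg`) and libjpeg-turbo are installed. This is not the original poster's code: the function names are hypothetical, and downscaling through `scaling_factor` at decode time is swapped in here in place of a separate resize call.

```python
import numpy as np

def make_turbojpeg_decoder(scale=(1, 2)):
    """Build a bytes -> HxWx3 uint8 decoder backed by PyTurboJPEG (assumed installed)."""
    # Imported lazily so the rest of the module works without libjpeg-turbo.
    from turbojpeg import TurboJPEG, TJPF_RGB
    jpeg = TurboJPEG()

    def decode(buf):
        # scaling_factor shrinks the image during decoding (here to half size),
        # so no separate resize pass over the full-resolution image is needed.
        return jpeg.decode(buf, pixel_format=TJPF_RGB, scaling_factor=scale)

    return decode

def load_sequence(img_path_list, decode):
    """Decode every image in the sequence and stack them along the channel axis."""
    frames = []
    for img_path in img_path_list:
        with open(img_path, "rb") as f:
            frames.append(decode(f.read()))
    return np.dstack(frames)  # (H, W, 3 * len(img_path_list))
```

Usage would be `decode = make_turbojpeg_decoder()` once per worker, then `load_sequence(img_path_list, decode)` per sample; with 1080x1920 inputs and the (1, 2) scale the result would be (540, 960, 27).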