Data loading takes a very long time with Docker compared to local PyTorch without Docker

Hello

I have a training script. When I run it on my local machine, loading the data for one epoch takes around 30 minutes, but when I run the same script on a much more powerful server inside Docker, loading takes around 5 hours.

I don’t have much experience with Docker. I used the following Dockerfile to build the image.

FROM nvcr.io/nvidia/pytorch:21.08-py3

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y ffmpeg
RUN pip install pandas scikit-video ffmpeg-python scikit-learn opencv-python tqdm torchsummary tensorboardX
CMD ["python", "./train_classifier.py"]

and I run the Docker container with the following command:

sudo docker run --ipc=host -it --rm --gpus=device=0 --name train_container --network=host -v /home/h/data/:/workspace train_image:3.0 bash

I checked the shared memory of the server and it has over 50% free space during training.
In addition, setting the number of workers in the training script to 0 makes no noticeable difference in the loading time.
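One way to narrow this down is to time the disk reads separately from the augmentations inside the container. A minimal sketch with the standard library only; `read_sample` and `augment` are hypothetical placeholders to be replaced by the actual loading and augmentation code from `train_classifier.py`:

```python
import time

# Hypothetical stand-ins for the real dataset steps: swap in the actual
# per-sample read and augmentation functions from train_classifier.py.
def read_sample(i):
    return bytes(1000)  # placeholder for reading one clip from disk

def augment(sample):
    return sample[::-1]  # placeholder for the augmentation pipeline

read_time = aug_time = 0.0
for i in range(1000):
    t0 = time.perf_counter()
    sample = read_sample(i)
    t1 = time.perf_counter()
    augment(sample)
    t2 = time.perf_counter()
    read_time += t1 - t0
    aug_time += t2 - t1

print(f"read: {read_time:.3f}s, augment: {aug_time:.3f}s")
```

Running this both inside and outside the container on the same data directory should show whether the extra 4.5 hours is spent in I/O or in CPU-bound augmentation.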

If you have any suggestions for solving this problem, please let me know.

Your help is much appreciated.

Hello,
Do those 5 hours include the container build, or has the script execution itself slowed down that much? It is good practice to pin the version of every package; that also makes the build process faster and more reproducible.
If the slowdown is only in the code, can you show what happens in the code, i.e. how you load the data?
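For example, pinning versions in the Dockerfile could look like the following (the version numbers are illustrative placeholders, not recommendations):

```dockerfile
# Pinned versions make the image reproducible and let Docker's layer cache
# skip re-resolving dependencies on rebuilds. Versions below are examples only.
RUN pip install pandas==1.3.2 scikit-learn==0.24.2 opencv-python==4.5.3.56
```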

Thanks @Michal_Bogacz for your help.

The script execution just became extremely slow. The 5 hours is the time to load the data and apply some augmentations only, NOT the time to build the container image.

I am trying to re-run the training from the following paper:

This is the data loader code from the paper:

I suspect it could be a limitation of the SSD read speed, or maybe the CPU speed. I don't know exactly, but it could be that Docker imposes some restrictions.
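To test the SSD suspicion directly, you could run a rough sequential-read benchmark inside and outside the container. A stdlib-only sketch; note that reading a file just after writing it mostly hits the page cache, so for a realistic number point `path` at a large existing file on the bind-mounted data directory (or use a file bigger than RAM):

```python
import os
import tempfile
import time

# Write a test file, then time a sequential read of it. Results are an
# upper bound because of the page cache; see the note above.
size_mb = 256
chunk = b"\0" * (1024 * 1024)
path = os.path.join(tempfile.gettempdir(), "io_bench.bin")

with open(path, "wb") as f:
    for _ in range(size_mb):
        f.write(chunk)
    f.flush()
    os.fsync(f.fileno())

start = time.perf_counter()
with open(path, "rb") as f:
    while f.read(1024 * 1024):
        pass
elapsed = time.perf_counter() - start
print(f"read {size_mb} MB in {elapsed:.2f}s (~{size_mb / elapsed:.0f} MB/s)")
os.remove(path)
```

If the in-container number is far lower than the bare-metal one on the same path, the bind mount or storage driver is the bottleneck rather than PyTorch.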

Many Thanks :slight_smile:

Did you test the data loading on the server without Docker, since you seem to think this issue is Docker-related?