Problem in Training Time of model

Hi, I am using the data loader below, but training my model on the GPU takes too much time. I timed the single line of the for loop on its own and it takes about 2 sec per iteration. This leaves the GPU waiting for the next batch: GPU utilization reaches 100% for a second and then drops to zero at every step of the iteration. I set the number of workers to zero. My CPU usage is about 1800%. The code for the data loader and the training loop is given below.
My dataset class is
class myDataSet(Dataset):

    # Inherit from Dataset; override __init__, __getitem__, __len__
    def __init__(self, fileList, transform_mode=None, list_reader=PIL_list_reader, loader=img_loader, cuda=True):
        self.channel_num = channel_num
        self.image_size = img_size
        self.loader = loader
        self.Nz = 50
        self.imgList, self.pose_label, self.id_label, self.Np, self.Nd = list_reader(fileList)
        if transform_mode == 'train_transform': self.transform = multiPIE_train_transform
        elif transform_mode == 'test_transform': self.transform = multiPIE_test_transform
        elif transform_mode == 'trainaug_transform': self.transform = multiPIE_train_aug_transform
        elif transform_mode == 'testaug_transform': self.transform = multiPIE_test_aug_transform
        else: self.transform = None

    def __getitem__(self, index):
        imgPath = self.imgList[index]
        img = self.loader(imgPath)
        if self.transform is not None:
            img = self.transform(img)
        #print("time of get item ",time.time()- start)
        return img, self.id_label[index], self.pose_label[index]

    def __len__(self):
        return len(self.imgList)

the training loop is

for i, [batch_image, batch_id_label, batch_pose_label] in enumerate(dataloader):


        batch_size = batch_image.size(0)
        batch_real_label = torch.ones(batch_size)
        batch_sys_label = torch.zeros(batch_size)
        # generate noise code and pose code, label: LongTensor, input: FloatTensor
        noise = torch.FloatTensor(np.random.uniform(-1,1, (batch_size, Nz)))
        pose_code_label  = np.random.randint(Np, size=batch_size) # get a list of int in range(Np)
        pose_code = np.zeros((batch_size, Np))
        pose_code[range(batch_size), pose_code_label] = 1
        pose_code_label = torch.LongTensor(pose_code_label.tolist())
        pose_code = torch.FloatTensor(pose_code.tolist())
        batch_pose_code = np.zeros((batch_size, Np))
        batch_pose_code[range(batch_size), batch_pose_label] = 1
        batch_pose_code = torch.FloatTensor(batch_pose_code.tolist())
        # use cuda for label and input
        if args.cuda:
            batch_image, batch_id_label, batch_pose_label, batch_real_label, batch_sys_label = \
                batch_image.cuda(), batch_id_label.cuda(), batch_pose_label.cuda(), batch_real_label.cuda(), batch_sys_label.cuda()

            noise, pose_code, pose_code_label, batch_pose_code = \
                noise.cuda(), pose_code.cuda(), pose_code_label.cuda(), batch_pose_code.cuda()
        # use Variable for label and input
        batch_image, batch_id_label, batch_pose_label, batch_real_label, batch_sys_label = \
            Variable(batch_image), Variable(batch_id_label), Variable(batch_pose_label), Variable(batch_real_label), Variable(batch_sys_label)

        noise, pose_code, pose_code_label, batch_pose_code = \
            Variable(noise), Variable(pose_code), Variable(pose_code_label), Variable(batch_pose_code)
        # generator forward
        generated = G_model(batch_image, pose_code, noise) #forward

In your training loop there is still some pre-processing / creation of data going on.
Since this will be called sequentially, you’ll most likely see your GPU waiting for it.
Could you try to time the noise and pose generation?
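As a rough illustration of moving that per-batch work off the NumPy and Python-list path, the noise and the random one-hot pose code could be sampled directly as tensors on the target device. This is only a sketch: make_codes is a hypothetical helper, and the values passed for Np and Nz are placeholders, not the ones from the code above.

```python
import torch
import torch.nn.functional as F


def make_codes(batch_size, Np, Nz, device="cpu"):
    """Sample noise in [-1, 1) and a random one-hot pose code on `device`.

    Hypothetical helper: it replaces the NumPy -> list -> tensor round trip
    with tensor ops that run directly on the chosen device.
    """
    # torch.rand is uniform on [0, 1); rescale it to [-1, 1)
    noise = torch.rand(batch_size, Nz, device=device) * 2 - 1
    # One random pose class per sample, as int64 labels
    pose_code_label = torch.randint(Np, (batch_size,), device=device)
    # One-hot encode the labels and cast to float for the generator input
    pose_code = F.one_hot(pose_code_label, num_classes=Np).float()
    return noise, pose_code_label, pose_code


# Placeholder sizes chosen only for the demonstration
noise, pose_code_label, pose_code = make_codes(batch_size=8, Np=9, Nz=50)
print(noise.shape, pose_code_label.shape, pose_code.shape)
```

Passing device="cuda" would create the tensors on the GPU, so no separate .cuda() copy is needed for them.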

Also, why did you set num_workers to zero? This would mean that the main thread is now responsible to load the data which might also slow down your code.

Actually, when I increase the number of workers, nobody else can use the GPU. The GPU just gets stuck and I cannot stop my process; even the admin cannot kill it. I have to shut down the server and start it again. That is why I set the number of workers to zero.

Thank you for your attention. I also checked the time for the pose and noise generation. It is about 0.00041866302490234375 sec, so that is normal. But I cannot understand why the training loop takes so much time.

Can you suggest a solution for this bug? Thank you for your help.

I’m not sure it’s a bug.
Could you try to time all different parts of your training separately and see which one takes most of the time?
Note that CUDA calls are asynchronous, so you should synchronize before starting and stopping the timer using torch.cuda.synchronize().
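A minimal sketch of what synchronized timing could look like is given here. The helper name timed is hypothetical, and the matrix multiplication is just a stand-in for one part of the training step; the synchronize calls are skipped when no GPU is present so the snippet also runs on a CPU-only machine.

```python
import time

import torch


def timed(fn, *args, **kwargs):
    """Run fn and return (result, seconds), synchronizing CUDA around the timer."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the kernels launched by fn to finish
    return result, time.perf_counter() - start


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
out, seconds = timed(torch.matmul, x, x)
print(f"matmul took {seconds:.6f} s on {device}")
```

Wrapping each stage (batch fetch, host-to-device copy, forward, backward) in such a helper gives per-stage numbers that can be compared directly.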

    for i,[batch_image, batch_id_label, batch_pose_label] in enumerate(dataloader):
        print("time of or loop is ", start.elapsed_time(end))

For just this one line of the for loop, it gave the following time.

All the other lines take a normal amount of time. What should I do in this situation?

torch.cuda.synchronize() should be called before the timer.
Nevertheless, it seems your DataLoader or Dataset takes that much time to load and process each sample. You could try to speed it up, e.g. by moving the data from an HDD to an SSD if possible, or try to speed up the preprocessing code.
Also, I’m currently unsure why you can’t use multiple workers. Did your code just hang when you tried it?

Actually, when I increase the number of workers, the GPU utilization stays the same, but to kill the process I have to restart my GPU. This is why I use num_workers=0.

After calling torch.cuda.synchronize() before timing, the output is as follows.
Any suggestions in this regard?
for i,[batch_image, batch_id_label, batch_pose_label] in enumerate(dataloader):

        print("time of or loop is ", start.elapsed_time(end))

It looks like the time is increasing in each iteration.
Can you confirm this?
Also, do you see any GPU memory growth?

Yes, the time is clearly increasing. The GPU usage is 0% when the iteration prints the time and goes to 100% for a while until the next step. Here you can check the CPU usage.

It’s expected behavior that your GPU can’t process anything while waiting for the next batch, since your data loading runs in the main thread. Multiple workers should help in this case, but as you explained, it’s apparently not possible to use them due to some system restrictions. I’m still unsure why you would have to kill the processes manually.
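For reference, a loader set up for background loading might look roughly like the sketch here. build_loader is a hypothetical helper and the TensorDataset is random stand-in data; the demo call uses num_workers=0 so that it runs anywhere, and a value such as 4 would be the thing to try on a machine where worker processes behave.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, batch_size, num_workers):
    """Hypothetical helper: a DataLoader configured for background loading."""
    use_cuda = torch.cuda.is_available()
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,             # worker processes that prepare batches
        pin_memory=use_cuda,                 # page-locked memory for faster GPU copies
        persistent_workers=num_workers > 0,  # keep workers alive between epochs
    )


# Random stand-in data: 64 fake images with integer labels
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(10, (64,)))
loader = build_loader(dataset, batch_size=16, num_workers=0)

device = "cuda" if torch.cuda.is_available() else "cpu"
seen = 0
for images, labels in loader:
    # non_blocking lets the copy overlap with compute when the source is pinned
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    seen += images.size(0)
print("samples seen:", seen)
```

With workers enabled, the script entry point usually needs an if __name__ == "__main__": guard on platforms that start workers with spawn.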

Could you try to remove the training code and just time the data loading?
Basically just the time to get the next batch without any processing on the GPU etc.
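One way to do that, sketched under the assumption that the loader is an ordinary DataLoader: iterate it with an empty loop body and record how long each batch takes to arrive. time_batches is a hypothetical helper, and the TensorDataset stands in for the real dataset.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_batches(loader, max_batches=None):
    """Return the seconds spent waiting for each batch, with no model or GPU work."""
    timings = []
    start = time.perf_counter()
    for i, _batch in enumerate(loader):
        timings.append(time.perf_counter() - start)  # time to produce this batch
        if max_batches is not None and i + 1 >= max_batches:
            break
        start = time.perf_counter()
    return timings


# Random stand-in data: 40 samples, so batch_size=8 yields 5 batches
dataset = TensorDataset(torch.randn(40, 3, 16, 16), torch.randint(5, (40,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

timings = time_batches(loader)
for i, t in enumerate(timings):
    print(f"batch {i}: {t:.6f} s")
```

If these numbers are already close to the per-step time seen in training, the bottleneck is in the Dataset (disk reads or transforms) rather than in the model.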

For data loading:
dataset = multiPIE(args.data_path, transform_mode='train_transform')
This line took 0.45 sec. If I combine both lines,
dataset = multiPIE(args.data_path, transform_mode='train_transform')
dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, num_workers=0)
then together they took the same 0.45 sec.

dataset = multiPIE(args.data_path, transform_mode='train_transform')
dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, num_workers=0)
print("time for data loading", time.time()-start)
But this takes 2.80 sec.
I am still confused about what the problem is.

The first lines of code just initialize the Dataset and DataLoader, so no data will actually be loaded at this point, if not done in __init__.

The next(iter()) operation calls __getitem__ and loads the data, thus taking more time.
As you can see the time to load a single batch takes about 2.5 seconds.
So if you are not able to use multiple workers and can’t speed up the loading (e.g. SSD instead of HDD), your code will spend this time for loading each batch.
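The difference between constructing the loader and fetching from it can be seen with a small counting dataset. This is only an illustration: CountingDataset is a hypothetical class that returns dummy tensors and counts how often __getitem__ is called.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CountingDataset(Dataset):
    """Hypothetical dataset that counts how often __getitem__ is called."""

    def __init__(self, n):
        self.n = n
        self.calls = 0

    def __getitem__(self, index):
        self.calls += 1  # in a real dataset, the disk read and transform happen here
        return torch.zeros(3), index

    def __len__(self):
        return self.n


dataset = CountingDataset(100)
loader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=0)
calls_after_init = dataset.calls         # nothing has been loaded yet

batch = next(iter(loader))               # this is where samples are actually read
calls_after_first_batch = dataset.calls  # one call per sample in the batch

print(calls_after_init, calls_after_first_batch)
```

The counter only works like this with num_workers=0; with worker processes each worker holds its own copy of the dataset object.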

Now I am using num_workers=4. During training it processes 4 steps at once, then the GPU usage drops to zero for some time and goes back to 100% when it shows the next four steps. So using more workers could not solve the problem. The CPU usage is now low, but the GPU still stalls.

In my opinion, the possible reason is:

  1. reading many small files from disk is slower than reading one large file
    What do you think?
    If this is the problem, what is a possible solution in PyTorch?

You could try to preload the dataset, if your memory is large enough. This would slow down the initial iteration, but the following ones won’t have to wait for the IO operations.
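A minimal sketch of such preloading, assuming the whole dataset fits in RAM: a wrapper reads every sample once in __init__ and serves it from a list afterwards. Both class names are hypothetical, and SlowDataset only simulates a dataset whose __getitem__ would hit the disk.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SlowDataset(Dataset):
    """Hypothetical stand-in for a dataset whose __getitem__ reads from disk."""

    def __init__(self, n):
        self.n = n
        self.loads = 0

    def __getitem__(self, index):
        self.loads += 1  # counts the simulated disk reads
        return torch.full((3,), float(index)), index % 5

    def __len__(self):
        return self.n


class PreloadedDataset(Dataset):
    """Reads every sample once up front and keeps it in memory."""

    def __init__(self, base):
        self.samples = [base[i] for i in range(len(base))]

    def __getitem__(self, index):
        return self.samples[index]

    def __len__(self):
        return len(self.samples)


base = SlowDataset(20)
cached = PreloadedDataset(base)  # pays the loading cost once, here
loader = DataLoader(cached, batch_size=4, shuffle=True, num_workers=0)

for _epoch in range(2):
    for images, labels in loader:
        pass  # the training step would go here

print("reads from the slow dataset:", base.loads)
```

Random augmentations would still have to be applied after the cache, since anything done inside the wrapped dataset is frozen at preload time.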

The data is about 300 GB, so I cannot preload it. Do you have any other ideas for preprocessing the dataset faster? Your ideas would be a great help to me.