Data loads entirely into system memory; after increasing the number of DataLoader workers it also appears in GPU memory, but GPU utilization is still very poor

I have created a custom dataloader to load color images, but the data gets loaded into system memory instead of GPU memory. After increasing the number of DataLoader workers, the data now shows up in GPU memory as well (besides system memory) and CPU usage is at 100%, but GPU utilization is zero. Please help me improve my code.

```python
import random

import numpy as np
import PIL
import PIL.ImageOps
import torch
import torch.optim as optim
import torchvision.datasets as dset
import torchvision.transforms as transforms
from PIL import Image
from torch.autograd import Variable
from torch.utils.data import DataLoader, Dataset


class SiameseNetworkDataset(Dataset):
    def __init__(self, imageFolderDataset, transform=None, should_invert=True):
        self.imageFolderDataset = imageFolderDataset
        self.transform = transform
        self.should_invert = should_invert

    def __getitem__(self, index):
        img0_tuple = random.choice(self.imageFolderDataset.imgs)

        # make sure roughly 50% of the pairs belong to the same class
        should_get_same_class = random.randint(0, 1)
        if should_get_same_class:
            while True:
                # keep looping until an image of the same class is found
                img1_tuple = random.choice(self.imageFolderDataset.imgs)
                if img0_tuple[1] == img1_tuple[1]:
                    break
        else:
            img1_tuple = random.choice(self.imageFolderDataset.imgs)

        img0 = Image.open(img0_tuple[0]).convert('RGB')
        img1 = Image.open(img1_tuple[0]).convert('RGB')

        if self.should_invert:
            img0 = PIL.ImageOps.invert(img0)
            img1 = PIL.ImageOps.invert(img1)

        if self.transform is not None:
            img0 = self.transform(img0)
            img1 = self.transform(img1)

        # label is 1 if the two images come from different classes, else 0
        return img0, img1, torch.from_numpy(
            np.array([int(img1_tuple[1] != img0_tuple[1])], dtype=np.float32))

    def __len__(self):
        return len(self.imageFolderDataset.imgs)


folder_dataset = dset.ImageFolder(root=Config.training_dir)
siamese_dataset = SiameseNetworkDataset(
    imageFolderDataset=folder_dataset,
    transform=transforms.Compose([transforms.Resize((100, 100)),
                                  transforms.ToTensor()]),
    should_invert=False)

train_dataloader = DataLoader(siamese_dataset,
                              shuffle=True,
                              num_workers=16,
                              batch_size=Config.train_batch_size)
# cudnn.benchmark = True
dtype = torch.cuda.FloatTensor

net = SiameseNetwork().cuda()
criterion = ContrastiveLoss()
optimizer = optim.Adam(net.parameters(), lr=0.0005)

counter = []
loss_history = []
iteration_number = 0

for epoch in range(0, Config.train_number_epochs):
    for i, data in enumerate(train_dataloader, 0):
        img0, img1, label = data
        img0 = Variable(img0).type(dtype).cuda()
        img1 = Variable(img1).type(dtype).cuda()
        label = Variable(label).type(dtype).cuda()
        output1, output2 = net(img0, img1)
        optimizer.zero_grad()
        loss_contrastive = criterion(output1, output2, label)
        loss_contrastive.backward()
        optimizer.step()
        if i % 10 == 0:
            print("Epoch num {}\n Current loss {}\n".format(epoch, loss_contrastive.item()))
            iteration_number += 10
            counter.append(iteration_number)
            loss_history.append(loss_contrastive.item())

show_plot(counter, loss_history)
```

Your IO might be the bottleneck.
Are you storing the data on a local SSD or on an HDD / network drive?
You could try to time the data loading using the ImageNet example.
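For reference, a minimal timing sketch along the lines of the ImageNet example (the `AverageMeter` class and the variable names here are just illustrative, not part of your code):

```python
import time

class AverageMeter:
    """Tracks the most recent value and the running average."""
    def __init__(self):
        self.val = 0.0
        self.avg = 0.0
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

data_time = AverageMeter()   # time spent waiting for the DataLoader
batch_time = AverageMeter()  # time spent per full iteration

end = time.time()
for i, (img0, img1, label) in enumerate(train_dataloader):
    data_time.update(time.time() - end)   # how long this batch took to arrive

    # ... forward / backward / optimizer step would go here ...

    batch_time.update(time.time() - end)  # total time for this iteration
    end = time.time()
    if i % 10 == 0:
        print(f"data {data_time.val:.3f} ({data_time.avg:.3f}) "
              f"batch {batch_time.val:.3f} ({batch_time.avg:.3f})")
```

If `data_time` dominates `batch_time`, the GPU is mostly waiting for data rather than computing.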

As a small side note: Variables are deprecated since PyTorch 0.4.0, so if you are using a newer version (which I would highly recommend :wink: ), you can just use torch.tensor instead. Also, you don’t need to convert your data using type(dtype) and then .cuda(). In the latest version, just push your data onto the GPU using data = data.to('cuda:0') (or any other GPUid).
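For example, the inner loop could be written without `Variable` and the `dtype` cast, roughly like this (the device name `'cuda:0'` is just an assumption, use whichever GPU id you need):

```python
device = torch.device('cuda:0')

for i, (img0, img1, label) in enumerate(train_dataloader):
    # push the already-created tensors onto the GPU; no Variable wrapper needed
    img0, img1, label = img0.to(device), img1.to(device), label.to(device)
    output1, output2 = net(img0, img1)
    optimizer.zero_grad()
    loss_contrastive = criterion(output1, output2, label)
    loss_contrastive.backward()
    optimizer.step()
```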

Thanks for your reply.
I am using a local HDD.
After timing the data loading I found that, on average, it looks like this:

```
# data loading time
data_time.val = 0.000
data_time.avg = 0.117

# measure elapsed time
batch_time.val = 0.075
batch_time.avg = 0.233
```

I also made the changes you suggested, but couldn't get any improvement in GPU utilization (still 0%).

The reading speed of your HDD might be the bottleneck. Do you have a small SSD you could copy some of the data onto, to compare the loading speed?
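A quick way to compare would be to point the dataset at the SSD copy and time one pass over the DataLoader without any training. A minimal sketch, with placeholder paths you would replace with your own:

```python
import time

def time_one_pass(root):
    # build the same pipeline as before, just rooted at a different folder
    dataset = SiameseNetworkDataset(
        imageFolderDataset=dset.ImageFolder(root=root),
        transform=transforms.Compose([transforms.Resize((100, 100)),
                                      transforms.ToTensor()]),
        should_invert=False)
    loader = DataLoader(dataset, shuffle=True, num_workers=16,
                        batch_size=Config.train_batch_size)
    start = time.time()
    for _ in loader:
        pass  # only measure loading, no GPU work
    return time.time() - start

print("HDD:", time_one_pass("/path/to/hdd/data"))  # placeholder path
print("SSD:", time_one_pass("/path/to/ssd/data"))  # placeholder path
```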

Yeah, I resolved it by running it on a faster machine. IO was the bottleneck, as you said.