ZERO GPU utilization

SU801T · May 17, 2020, 12:42am

Hi,

So I have investigated my dataloader and it appears to be very slow. This causes the GPU utilization to be zero, as not enough samples are fed into the GPU (I’m assuming). I have around 150k instances to process from an HDF5 dataset. Here is my dataset class:

from torch.utils import data
import h5py

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase

    def __getitem__(self, index):
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays'][index]
            label = archive[str(self.phase) + '_labels'][index]
            path = archive[str(self.phase) +  '_img_paths'][index]
            return datum, label, path

    def __len__(self):
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays']
            return len(datum)


if __name__ == '__main__':
    train_dataset = Features_Dataset(archive= "featuresdata/train.hdf5", phase= 'train')
    trainloader = data.DataLoader(train_dataset, num_workers=1, batch_size=4)
    print(len(trainloader))
    for i, (datum, label, path) in enumerate(trainloader):  # avoid shadowing the `data` module
        print(path)

Is there a way of speeding this up? I’m not really sure what else there is to do…

ptrblck · May 17, 2020, 6:59am

This post gives a good overview for potential data loading bottlenecks.

rwightman · May 18, 2020, 3:38am

@SU801T … while I’m not a big h5 person, I’m pretty sure the way you’re using it will be very slow by default. I assume h5py.File does an open and likely some sort of scan/index of the data file (which contains many entries and may be quite slow). You should likely be doing an operation like that once in the __init__ method and then just doing the index lookup in __getitem__/__len__.

Depending on whether the h5py file object is safe to copy and access from multiple processes, you may want to do a lazy load of the file: defer it until the first __getitem__ call, but still only do it once per loader worker.

SU801T · May 18, 2020, 4:16pm

Hi,

You’re right, I changed my dataset class to open the file in __init__ and it opens much faster and retrieves data quickly. I tried to use more than one worker, but I still get an OS b-tree error. Upon investigation, if I just print the paths, I also get labels and sometimes empty arrays before the b-tree error. It works with num_workers set to 0; however, I’m assuming that will make things incredibly slow. Nevertheless, here is the updated class:

import torch.multiprocessing as mp
mp.set_start_method('fork')

from torch.utils import data
import h5py

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = h5py.File(archive, 'r', libver='latest', swmr=True)
        assert self.archive.swmr_mode
        self.labels = self.archive[str(phase) + '_labels']
        self.data = self.archive[str(phase) + '_all_arrays']
        self.img_paths = self.archive[str(phase) + '_img_paths']

    def __getitem__(self, index):
        datum = self.data[index]
        label = self.labels[index]
        path = self.img_paths[index]
        return datum, label, path

    def __len__(self):
        return len(self.data)

    def close(self):
        self.archive.close()

if __name__ == '__main__':
    train_dataset = Features_Dataset(archive= "featuresdata/train.hdf5", phase= 'train')
    trainloader = data.DataLoader(train_dataset, num_workers=0, batch_size=1)
    print(len(trainloader))
    for i, (datum, label, path) in enumerate(trainloader):  # avoid shadowing the `data` module
        print(path)

I still get 0% utilization from the GPUs. Is it a problem with HDF5? Are there alternatives to use instead of hdf5, for example, loading into numpy arrays or would that be just as slow? I will eventually have hdf5 datasets that will contain 2,000,000 instances. This is merely a pilot test!

rwightman · May 18, 2020, 7:04pm

HDF5 usually does a decent job of caching and handling IO. But if each item you’re fetching and feeding to your NN is quite small, and the net itself is also on the small side, you can still wind up in a scenario where most of the work is IO and Python bookkeeping/data munging and your GPU doesn’t have much to do.
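One way to check whether the loop is in that scenario is to split the wall-clock time into "waiting for the next batch" and "running the training step". The helper below is only a minimal sketch: `split_wait_and_compute` and `step_fn` are hypothetical names, and it works on any iterable of batches, so a DataLoader could be passed in directly.

```python
import time

def split_wait_and_compute(batches, step_fn):
    """Return (seconds waiting for batches, seconds spent inside step_fn)."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading / IO
        except StopIteration:
            break
        t1 = time.perf_counter()
        # CUDA kernels launch asynchronously, so on a GPU step_fn would need
        # to end with torch.cuda.synchronize() for this figure to be meaningful.
        step_fn(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute
```

If the first number dominates, the loader is the bottleneck and the GPU is mostly idle; if both are tiny per batch, the per-batch Python overhead is likely what limits throughput.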

As per my comment about lazy init, you should be able to get multiple workers running with

def __init__(self, archive_file):
  self.archive_file = archive_file
  self.archive = None  # opened lazily, once per worker process

def _get_archive(self):
  if self.archive is None:
    self.archive = h5py.File(self.archive_file, 'r', libver='latest', swmr=True)
  return self.archive

def __getitem__(self, index):
  archive = self._get_archive()
  ...

Just make sure you don’t call any method on your dataset between creation and passing it to the multi-worker loader init… for len, you may need to open the file once, compute the length, cache it, let that instance of the h5py file expire, and then continue…

SU801T · May 20, 2020, 12:03am

Hi, Thanks for the reply.

I had a go at the suggestions you made. For starters, in __init__ , I’m unsure of whether I can open the hdf5 file and calculate the length.

In __getitem__, I reference the archive file and return the indexed values. I don’t get an OS B-tree error anymore; however, I’m not sure if this is the most optimal way of retrieving the data. Here is my newer attempt:

import torch.multiprocessing as mp
mp.set_start_method('fork') 

from torch.utils import data
import h5py

class Features_Dataset(data.Dataset):
    def __init__(self, file_path, phase):
        self.file_path = file_path
        self.archive = None
        self.phase = phase 
        with h5py.File(file_path, 'r', libver='latest', swmr=True) as f:
            self.length = len(f[str(self.phase) + '_labels'])

    def _get_archive(self):
        if self.archive is None:
            self.archive = h5py.File(self.file_path, 'r', libver='latest', swmr=True)
            assert self.archive.swmr_mode
        return self.archive


    def __getitem__(self, index):
        archive = self._get_archive()
        label = archive[str(self.phase) + '_labels']
        datum = archive[str(self.phase) + '_all_arrays']
        path = archive[str(self.phase) + '_img_paths']

        return datum[index], label[index], path[index]

    def __len__(self):
        return self.length

    def close(self):
        # The file is opened lazily, so it may never have been opened.
        if self.archive is not None:
            self.archive.close()

if __name__ == '__main__':
    train_dataset = Features_Dataset(file_path= "featuresdata/train.hdf5", phase= 'train')
    trainloader = data.DataLoader(train_dataset, num_workers=8, batch_size=1)
    print(len(trainloader))
    for i, (datum, label, path) in enumerate(trainloader):  # avoid shadowing the `data` module
        print(path)

Additionally, you are right about the small network. I am using a small autoencoder, where I am sending values from the hdf5 file to this network:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_embedded):
        super(AutoEncoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(6144, n_embedded))
        self.decoder = nn.Sequential(nn.Linear(n_embedded, 6144))
       
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded
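For a rough sense of how little arithmetic this network does per sample, the forward pass can be counted by hand. This is only a back-of-the-envelope sketch: the helper names are made up for illustration, biases are ignored, and n_embedded = 128 is an assumed value, since the thread does not say what was used.

```python
def linear_flops(in_features, out_features):
    # One multiply and one add per weight; the bias is ignored.
    return 2 * in_features * out_features

def autoencoder_forward_flops(n_features, n_embedded):
    # Encoder (n_features -> n_embedded) plus decoder (n_embedded -> n_features).
    return (linear_flops(n_features, n_embedded)
            + linear_flops(n_embedded, n_features))

n_embedded = 128  # hypothetical; not stated in the thread
flops = autoencoder_forward_flops(6144, n_embedded)
print(f"{flops:,} FLOPs per sample")  # 3,145,728 for n_embedded = 128
```

At about 3 million floating-point operations per sample, against GPUs that perform on the order of trillions per second, a batch of 1 gives the GPU almost nothing to do between loader calls.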

I also have tensorboard set up, where I am supposed to see my training losses every 100 mini-batches. It appears that nothing is recorded. Furthermore, when I set pin_memory to True in my dataloader, nothing is written at all. I am using 4 CPUs and 2 GPUs to load and train my networks, if that matters…

SU801T · May 20, 2020, 1:49pm

So I have timed the printing of paths from the script above. On a tiny dataset of 51 images on my macbook, the time for each sample is printed to the right of its file path:

('mults/train/0/10001.ndpi/40x/40x-236247-10154-18944-5376.png',) 0.0
('mults/train/0/10001.ndpi/40x/40x-236247-10152-18432-5376.png',) 0.0
('mults/train/0/10001.ndpi/40x/40x-236247-10155-19200-5376.png',) 9.5367431640625e-07
('mults/train/0/10001.ndpi/40x/40x-236247-10151-18176-5376.png',) 1.1920928955078125e-06
('mults/train/0/10001.ndpi/40x/40x-236247-10153-18688-5376.png',) 1.1920928955078125e-06
('mults/train/0/1234.ndpi/40x/40x-236247-16658-86528-8704.png',) 9.5367431640625e-07
('mults/train/0/1234.ndpi/40x/40x-236247-16656-86016-8704.png',) 9.5367431640625e-07
('mults/train/0/1234.ndpi/40x/40x-236247-16655-85760-8704.png',) 9.5367431640625e-07
('mults/train/0/1234.ndpi/40x/40x-236247-16657-86272-8704.png',) 1.1920928955078125e-06
('mults/train/0/1234.ndpi/40x/40x-236247-16654-85504-8704.png',) 1.9073486328125e-06
('mults/train/1/5678.ndpi/40x/40x-236247-16635-80640-8704.png',) 9.5367431640625e-07
('mults/train/1/5678.ndpi/40x/40x-236247-16637-81152-8704.png',) 9.5367431640625e-07
('mults/train/1/5678.ndpi/40x/40x-236247-16638-81408-8704.png',) 9.5367431640625e-07
('mults/train/1/5678.ndpi/40x/40x-236247-16634-80384-8704.png',) 0.0
('mults/train/1/5678.ndpi/40x/40x-236247-16636-80896-8704.png',) 0.0
('mults/train/1/10001.ndpi/40x/40x-236247-10142-15872-5376.png',) 9.5367431640625e-07
('mults/train/1/10001.ndpi/40x/40x-236247-10150-17920-5376.png',) 9.5367431640625e-07
('mults/train/1/10001.ndpi/40x/40x-236247-10154-18944-5376.png',) 0.0
('mults/train/1/10001.ndpi/40x/40x-236247-10152-18432-5376.png',) 9.5367431640625e-07
('mults/train/1/10001.ndpi/40x/40x-236247-10155-19200-5376.png',) 9.5367431640625e-07
('mults/train/1/10001.ndpi/40x/40x-236247-10151-18176-5376.png',) 9.5367431640625e-07
('mults/train/1/10001.ndpi/40x/40x-236247-10153-18688-5376.png',) 1.1920928955078125e-06
('mults/train/1/1234.ndpi/40x/40x-236247-16658-86528-8704.png',) 9.5367431640625e-07
('mults/train/1/1234.ndpi/40x/40x-236247-16656-86016-8704.png',) 0.0

The much larger dataset of 150k samples on a separate linux server:

('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-170637-97024-63232.png',) 7.152557373046875e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-133769-50944-51200.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-82090-80896-33536.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-74575-19712-30976.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-226275-81408-81664.png',) 2.384185791015625e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-172632-51712-64000.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-82388-23552-33792.png',) 7.152557373046875e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-150970-73984-56832.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-188390-69632-69120.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-216258-61440-78336.png',) 7.152557373046875e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-226105-57088-81664.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-17811-39424-9216.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-221670-72448-80128.png',) 9.5367431640625e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-77732-62464-32000.png',) 7.152557373046875e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-258005-84480-92672.png',) 7.152557373046875e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-223683-36864-80896.png',) 7.152557373046875e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-176327-25600-65280.png',) 1.1920928955078125e-06
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-100073-33536-39936.png',) 9.5367431640625e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-163497-84992-60928.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-127859-80896-49152.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-167504-97792-62208.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-263262-45568-94976.png',) 4.76837158203125e-07
('/vol/vssp/cvpnobackup/scratch_4weeks/taran/sample/train/1/F17-013461/40x/40x-F17-013461-205135-9984-74752.png',) 4.76837158203125e-07

They appear to be similar. I’m wondering if the speed of retrieving the batches is the issue.
Again, no multiprocessing error anymore… which is good…
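Values such as 9.5367431640625e-07 (exactly 2^-20 s) and 2.384185791015625e-07 (2^-22 s) sit at the resolution of a floating-point timestamp, which may mean the timer only wrapped the print call and not the fetch itself. A minimal sketch for timing the raw reads instead, assuming only that the dataset supports indexing (`time_fetches` and `summarize` are hypothetical helper names):

```python
import statistics
import time

def time_fetches(dataset, indices):
    """Time each dataset[i] call on its own, bypassing the DataLoader."""
    durations = []
    for i in indices:
        start = time.perf_counter()  # monotonic, high-resolution clock
        _ = dataset[i]               # the read being measured
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Collapse per-sample timings into a few headline numbers (seconds)."""
    return {
        "mean": statistics.mean(durations),
        "max": max(durations),
        "total": sum(durations),
    }
```

Calling something like summarize(time_fetches(train_dataset, range(1000))) on both machines would show whether a single HDF5 read is slow by itself, separately from any DataLoader or multiprocessing overhead.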