GPUs not being utilised - dataset size 150k instances

Hi,

I am training a small autoencoder on 4 GPUs, but it appears that the GPUs aren’t being used properly:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp COLLEC...  Off  | 00000000:02:00.0 Off |                  N/A |
| 25%   40C    P8    12W / 250W |   1031MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 26%   41C    P8    10W / 250W |    749MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:83:00.0 Off |                  N/A |
| 32%   50C    P8    10W / 250W |    743MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp COLLEC...  Off  | 00000000:84:00.0 Off |                  N/A |
| 28%   45C    P8    13W / 250W |    749MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25246      C   python                                      1021MiB |
|    1     25246      C   python                                       739MiB |
|    2     25246      C   python                                       733MiB |
|    3     25246      C   python                                       739MiB |
+-----------------------------------------------------------------------------+

My autoencoder class looks like this:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_embedded):
        super(AutoEncoder, self).__init__()
        self.encoder = nn.Sequential(nn.Linear(6144, n_embedded))
        self.decoder = nn.Sequential(nn.Linear(n_embedded, 6144))

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded

My dataset is stored in HDF5 format, so I have a custom dataset class:

import h5py
from torch.utils import data

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = h5py.File(archive, 'r')
        self.labels = self.archive[str(phase) + '_labels']
        self.data = self.archive[str(phase) + '_all_arrays']
        self.img_paths = self.archive[str(phase) + '_img_paths']

    def __getitem__(self, index):
        datum = self.data[index]
        label = self.labels[index]
        path = self.img_paths[index]
        return datum, label, path

    def __len__(self):
        return len(self.data)

    def close(self):
        self.archive.close()

I initialize and train/evaluate my model like this:

device = torch.device("cuda") 
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")

model = AutoEncoder(2048)
model = nn.DataParallel(model)
model.to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), weight_decay=1e-5)

for epoch in range(args.start_epoch, args.num_epochs+1):

        train_loss = 0
        model.train()

        for i, (inputs, labels, paths) in enumerate(dataloaders_dict['train']):
            inputs = inputs.to(device)
            inputs = inputs.view(-1, 6144)
            optimizer.zero_grad()
            # ===================forward=====================
            encoded, decoded = model(inputs)
            loss = criterion(decoded, inputs)
            # ===================backward====================
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        model.eval()
        with torch.no_grad():
            val_loss = 0
            for i, (inputs, labels, paths) in enumerate(dataloaders_dict['val']):
                inputs = inputs.to(device)
                inputs = inputs.view(-1, 6144)
                encoded, decoded = model(inputs)
                val_loss += criterion(decoded, inputs).item()

I’m not sure if it’s my dataset class that’s the issue, whether I have even got the GPUs being used properly, or whether I have placed my model evaluation in an inconvenient place…

I should also mention that the number of workers is 0.
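
For reference, the loaders are built roughly like this (the paths and batch size below are placeholders rather than my exact values):

from torch.utils import data

# placeholder paths and batch size -- the real values differ, but the structure is the same
train_dataset = Features_Dataset(archive="featuresdata/train.hdf5", phase="train")
val_dataset = Features_Dataset(archive="featuresdata/val.hdf5", phase="val")

dataloaders_dict = {
    'train': data.DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=0),
    'val': data.DataLoader(val_dataset, batch_size=128, shuffle=False, num_workers=0),
}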

Cheers,

Taran

So I’ve increased the number of workers and now get a new error:

OSError: Can't read data (wrong B-tree signature)

I’ve noticed this is a known issue with PyTorch and HDF5 datasets… Has there been any real solution to this?

This issue might be related to the HDF5 error.

Adding

import torch.multiprocessing as mp
mp.set_start_method('spawn')

at the top of your script might solve the issue.
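
One thing to watch out for: with 'spawn' the worker processes re-import your script, so the dataset/DataLoader/model setup and the training loop need to sit behind an if __name__ == '__main__': guard (and set_start_method is usually called there as well). A minimal sketch, where main() is just a placeholder for your existing training code:

import torch.multiprocessing as mp

def main():
    # placeholder: build Features_Dataset, the DataLoaders and the model,
    # then run the training loop here
    ...

if __name__ == '__main__':
    mp.set_start_method('spawn')
    main()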

I got a new error:

TypeError: h5py objects cannot be pickled

Seems like this is endless, and HDF5 is a bad format for PyTorch dataloaders, especially with regard to large datasets.

Hi,

So I added:

import torch.multiprocessing as mp
mp.set_start_method('fork')

since I am using Linux. I no longer get the error about not being able to pickle the file; however, when I increase the number of workers, I still get OSError: Can't read data (wrong B-tree signature).

I investigated, and it does appear that the worker processes mix up some of the values: for example, empty arrays being returned, or even labels coming back when I only ask for paths to be printed.

The main issue is that my GPUs don’t seem to get utilised, even though memory is allocated on them. I’m not really sure what else can be done. I have tried a number of fixes; none of them appear to help, and some make everything much worse. I have even set pin_memory to True.
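
For example, the training loader currently looks something like this (the batch size and worker count here are just from my latest attempt, not fixed values):

from torch.utils import data

# placeholder batch size / worker count from my latest attempt
trainloader = data.DataLoader(train_dataset, batch_size=128, shuffle=True,
                              num_workers=4, pin_memory=True)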

This is the updated dataset class, which appears to retrieve data fairly quickly when I run it on its own, without the GPUs involved:

import torch.multiprocessing as mp
mp.set_start_method('fork')

from torch.utils import data
import h5py

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = h5py.File(archive, 'r', libver='latest', swmr=True)
        assert self.archive.swmr_mode
        self.labels = self.archive[str(phase) + '_labels']
        self.data = self.archive[str(phase) + '_all_arrays']
        self.img_paths = self.archive[str(phase) + '_img_paths']

    def __getitem__(self, index):
        datum = self.data[index]
        label = self.labels[index]
        path = self.img_paths[index]
        return datum, label, path

    def __len__(self):
        return len(self.data)

    def close(self):
        self.archive.close()

if __name__ == '__main__':
    train_dataset = Features_Dataset(archive= "featuresdata/train.hdf5", phase= 'train')
    trainloader = data.DataLoader(train_dataset, num_workers=2, batch_size=2)
    print(len(trainloader))
    # unpack into `features` rather than `data` to avoid shadowing the imported module
    for i, (features, label, path) in enumerate(trainloader):
        print(path)


Baffling. I’m sure people must be feeding large HDF5 datasets into their networks!

I’m not an expert on HDF5 and have seen a lot of issues using multiprocessing with it.
However, this topic, where we debugged some similar issues, might be useful.
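
The workaround that usually comes up is to avoid opening the HDF5 file in __init__ and instead open it lazily inside each worker, e.g. on the first __getitem__ call. A rough, untested sketch based on your Features_Dataset:

import h5py
from torch.utils import data

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive_path = archive
        self.phase = phase
        self.archive = None  # opened lazily, once per worker process
        # read only the length up front, from a short-lived handle
        with h5py.File(archive, 'r') as f:
            self.length = len(f[str(phase) + '_all_arrays'])

    def _open(self):
        # each worker gets its own h5py.File handle instead of sharing one
        self.archive = h5py.File(self.archive_path, 'r', libver='latest', swmr=True)
        self.labels = self.archive[self.phase + '_labels']
        self.data = self.archive[self.phase + '_all_arrays']
        self.img_paths = self.archive[self.phase + '_img_paths']

    def __getitem__(self, index):
        if self.archive is None:
            self._open()
        return self.data[index], self.labels[index], self.img_paths[index]

    def __len__(self):
        return self.length

Since nothing stays open in __init__, the dataset object can be pickled for 'spawn', and with 'fork' each worker opens its own handle after the fork instead of inheriting a shared one, which is what usually causes the wrong B-tree signature / mixed-up reads.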