Load entire dataset on GPU

I am working on a project that involves implementing RNNs from scratch. Training large RNNs is already a time-consuming task, but I am also facing a problem with optimizing the dataset import.

For my project, I have subclassed torch.utils.data.Dataset as follows:

from torch.utils.data import Dataset, DataLoader

class MakeDataset(Dataset):
    def __init__(self, data):
        # data is a pandas DataFrame with 'string' and 'valid' columns
        self.strings = list(data['string'])
        self.valid = list(data['valid'])
        self.len = len(self.valid)
        self.valid_list = [0, 1]

    def __getitem__(self, index):
        return self.strings[index], self.valid[index]

    def __len__(self):
        return self.len

Here is how I generate the DataLoaders used for training:

train = MakeDataset(train_data)
test = MakeDataset(test_data)
train_loader = DataLoader(dataset=train, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test, batch_size=batch_size, shuffle=True)

As of now, these methods are working perfectly without any errors.

The problem arises when I try to train my model on CUDA. My GPU utilization is around 15% while the CPU is at maximum, and I believe this is affecting my training speed.

I have read various answers on this forum about loading the dataset onto the GPU, but none of them worked for me. It would be a great help if someone could point out a better way to do this.

Which approaches have you tried, and what is not working for you?
Also, how large is your dataset, and how much memory is currently used during training?
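
For reference, a quick way to check this from inside the training loop would be something like the minimal sketch below (psutil is an extra dependency, not part of PyTorch):

import torch
import psutil

# Rough memory check: GPU memory currently allocated by tensors, and the process's resident CPU memory
print(f"GPU allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
print(f"CPU RSS:       {psutil.Process().memory_info().rss / 1024**2:.1f} MB")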

I tried the following approaches:

  1. using pin_memory=True
  2. loading the entire dataset in the Dataset's __init__, as:
import pandas as pd
import torch
from torch.autograd import Variable
from torch.utils.data import Dataset

class MakeDataset(Dataset):
    def __init__(self, key):
        csv_path = '../input/reberseq/train_data.csv' if key == 'train' else '../input/reberseq/test_data.csv'
        data = pd.read_csv(csv_path)

        self.strings = list(data['string'])
        self.valid = list(data['valid'])

        # Vectorize and pad all sequences up front
        self.strings, self.valid = self.make_variables(self.strings, self.valid)

        self.len = len(self.valid)
        self.valid_list = [0, 1]

    def str2ascii(self, string):
        # Convert each character to its ASCII code
        ascii_arr = [ord(s) for s in string]
        return ascii_arr, len(ascii_arr)

    def pad_seq(self, vect_seqs, seq_lens, valid):
        # Zero-pad every sequence to the length of the longest one
        seq_tensor = torch.zeros((len(vect_seqs), seq_lens.max())).long()

        for index, (seq, seq_len) in enumerate(zip(vect_seqs, seq_lens)):
            seq_tensor[index, :seq_len] = torch.LongTensor(seq)

        return seq_tensor, valid

    def make_variables(self, strings, valid):
        seqs_and_lens = [self.str2ascii(string) for string in strings]
        vect_seqs = [s[0] for s in seqs_and_lens]
        seq_lens = torch.LongTensor([s[1] for s in seqs_and_lens])
        valid = torch.LongTensor(valid)
        return self.pad_seq(vect_seqs, seq_lens, valid)

    def __getitem__(self, index):
        # Each sample is moved to the GPU individually here
        return Variable(self.strings[index].cuda()), Variable(self.valid[index].cuda())

    def __len__(self):
        return self.len

Even after trying all of this, my GPU utilization is still around 40% while the CPU stays at 100%.

My dataset consists of 25,000 Reber sequences. The entire CSV is no more than 700 kB, yet during training about 3 GB of CPU memory and 1.5 GB of GPU memory are occupied.

  1. pin_memory=True uses page-locked memory to speed up transfers between the host and the device. You could also call tensor.to(device, non_blocking=True) to transfer the data asynchronously.

  2. Your MakeDataset transfers each sample to the GPU separately in __getitem__, which can be slower than transferring the complete batch to the device in the training loop. If you have enough device memory, you could push the complete data to the device once in __init__ instead. A sketch of both options follows below.
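
A minimal sketch of both options, assuming __getitem__ returns CPU tensors for option 1 and that the padded tensors fit into GPU memory for option 2 (device, batch_size, train, and the tensor names are placeholders):

import torch
from torch.utils.data import Dataset, DataLoader

device = torch.device('cuda')

# Option 1: keep the data on the CPU, use pinned memory and asynchronous copies per batch
train_loader = DataLoader(dataset=train, batch_size=batch_size, shuffle=True, pin_memory=True)

for strings, valid in train_loader:
    strings = strings.to(device, non_blocking=True)  # async copy from pinned memory
    valid = valid.to(device, non_blocking=True)
    # forward / backward pass ...

# Option 2: if everything fits into device memory, push the full tensors there once
class GPUDataset(Dataset):
    def __init__(self, strings, valid):
        # strings and valid would be the padded LongTensors built by make_variables
        self.strings = strings.to(device)
        self.valid = valid.to(device)

    def __getitem__(self, index):
        # Indexing a GPU tensor returns a GPU tensor, so no per-sample transfer happens
        return self.strings[index], self.valid[index]

    def __len__(self):
        return len(self.valid)

With GPU-resident tensors, keep num_workers=0 and pin_memory=False in the DataLoader, since pinning only applies to CPU tensors.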
