Load entire dataset on GPU

I am working on a project that involves implementing RNNs from scratch. Training large RNNs is already time-consuming, but I am also facing a problem optimizing how the dataset is loaded.

For my project, I have subclassed torch.utils.data.Dataset as:

class MakeDataset(Dataset):
    def __init__(self, data):
        self.strings = list(data['string'])
        self.valid = list(data['valid'])
        self.len = len(self.valid)
        self.valid_list = [0, 1]

    def __getitem__(self, index):
        return self.strings[index], self.valid[index]

    def __len__(self):
        return self.len

Following is how I generate dataloaders for use in training:

train = MakeDataset(train_data)
test = MakeDataset(test_data)
train_loader = DataLoader(dataset=train, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test, batch_size=batch_size, shuffle=True)

As of now, these methods are working perfectly without any errors.

The problem arises when training my model on CUDA: GPU utilization sits around 15% while the CPU is at maximum, which I believe is slowing down my training.

I read various answers on the forum about loading the dataset onto the GPU, but none of them worked for me. It would be a great help if someone could point out a better way to do this.

Which approaches have you tried, and what exactly is not working for you?
Also, how large is your dataset, and how much memory is currently used during training?

I tried the following approaches:

  1. using pin_memory=True
  2. loading the entire dataset in the __init__ of the Dataset as:
class MakeDataset(Dataset):
    def __init__(self, key):
        if key == 'train':
            data = pd.read_csv('../input/reberseq/train_data.csv')
        else:
            data = pd.read_csv('../input/reberseq/test_data.csv')
        self.strings = list(data['string'])
        self.valid = list(data['valid'])
        self.strings, self.valid = self.make_variables(self.strings, self.valid)
        self.len = len(self.valid)
        self.valid_list = [0, 1]

    def str2ascii(self, string):
        ascii_arr = [ord(s) for s in string]
        return ascii_arr, len(ascii_arr)

    def pad_seq(self, vect_seqs, seq_lens, valid):
        seq_tensor = torch.zeros((len(vect_seqs), seq_lens.max())).long()

        for index, (seq, seq_len) in enumerate(zip(vect_seqs, seq_lens)):
            seq_tensor[index, :seq_len] = torch.LongTensor(seq)

        return seq_tensor, valid

    def make_variables(self, strings, valid):
        seqs_and_lens = [self.str2ascii(string) for string in strings]
        vect_seqs = [s[0] for s in seqs_and_lens]
        seq_lens = torch.LongTensor([s[1] for s in seqs_and_lens])
        valid = torch.LongTensor(valid)
        return self.pad_seq(vect_seqs, seq_lens, valid)

    def __getitem__(self, index):
        # Variable is deprecated since PyTorch 0.4; tensors work directly
        return self.strings[index].cuda(), self.valid[index].cuda()

    def __len__(self):
        return self.len

Even after trying all this, my GPU utilization is still around 40% while the CPU is always at 100%.

My dataset consists of 25,000 Reber sequences. The entire CSV is no more than 700 kB, yet during training about 3 GB of CPU RAM and 1.5 GB of GPU memory are occupied.

  1. pin_memory=True uses page-locked memory to speed up transfers between host and device. You could also call tensor.to(device, non_blocking=True) to transfer the data asynchronously.

  2. Your MakeDataset transfers each sample separately in __getitem__, which might be slower than transferring the complete batch to the device in the training loop. If you have enough device memory, you should push the complete data to the device in __init__ instead.
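A minimal sketch of point 2: move the whole dataset to the device once in __init__, so __getitem__ only indexes device-resident tensors. GPUDataset and the toy tensors below are illustrative placeholders (not from your code), and this assumes the full dataset fits in GPU memory; it falls back to CPU when CUDA is unavailable.

```python
import torch
from torch.utils.data import Dataset, DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class GPUDataset(Dataset):
    def __init__(self, seqs, labels):
        # One bulk transfer up front instead of one copy per sample.
        self.seqs = seqs.to(device)
        self.labels = labels.to(device)

    def __getitem__(self, index):
        # Indexing a device tensor stays on the device: no host<->device copy here.
        return self.seqs[index], self.labels[index]

    def __len__(self):
        return len(self.labels)

# Toy stand-ins for the padded Reber sequences and validity labels.
seqs = torch.randint(0, 128, (25000, 20))
labels = torch.randint(0, 2, (25000,))

dataset = GPUDataset(seqs, labels)
# num_workers must stay 0: worker processes cannot share CUDA tensors.
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=0)

batch_seqs, batch_labels = next(iter(loader))
# Batches arrive already on `device`, so the training loop needs no .cuda() calls.
```

If the data does not fit on the device, the alternative is to keep the Dataset on the CPU, create the DataLoader with pin_memory=True, and transfer each batch in the training loop with batch.to(device, non_blocking=True).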
