I have 1.5M samples, but my GPU utilization barely increases at all. It stays low, which suggests I can probably load more samples per batch. Currently I use a batch_size of 8192, but if I increase it to 16384 I get CUDA errors. So I’m wondering what I can do to speed training up.
Your data loader looks OK; do you suspect that it is slow at loading the data?
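One quick way to check is to time iterating over the loader alone, with no GPU work at all. Here's a minimal sketch; the `TensorDataset` is just a synthetic stand-in, so swap in your real dataset:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; replace with your real one.
dataset = TensorDataset(
    torch.randn(100_000, 64),
    torch.randint(0, 10, (100_000,)),
)

loader = DataLoader(
    dataset,
    batch_size=8192,
    num_workers=4,    # try 0 vs. >0 and compare
    pin_memory=True,  # speeds up host-to-GPU copies later
)

n_batches = 0
start = time.time()
for x, y in loader:  # no model, no GPU: pure data-loading time
    n_batches += 1
elapsed = time.time() - start
print(f"{elapsed / n_batches:.4f} s per batch (loading only)")
```

If this per-batch time is close to the time of a full training step, the loader rather than the GPU is your bottleneck, and tuning `num_workers` will buy you more than a bigger batch.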
Increasing the batch size might raise GPU utilization, but it also affects the learning process: mini-batches play an important part in training, since the gradient noise from smaller batches tends to help generalization (some would even argue that batch_size=1 is best, but there is no need to go to that extreme).
So keep an eye on the accuracy
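For example, something like this minimal accuracy check, run after each epoch while you experiment with batch sizes (it assumes a classification model returning logits and a validation loader; both names are placeholders here):

```python
import torch

@torch.no_grad()
def val_accuracy(model, loader, device="cuda"):
    # Fraction of correctly classified samples over the whole loader.
    model.eval()
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        preds = model(x).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```

If accuracy drops at the larger batch size, the throughput gain may not be worth it.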
I’m only using 2313MiB of GPU RAM (out of 24GB) with a batch_size of 8192. If I increase my batch_size to 16384, it sometimes crashes randomly.
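For the random crashes at 16384: CUDA errors are reported asynchronously, so the traceback often points at the wrong line. A sketch of how you could localize the failure and check whether memory is actually the problem (assuming you run one training step at the larger batch size where the comment indicates):

```python
import os
# Must be set before CUDA is initialized; forces synchronous kernel
# launches so the Python traceback points at the real failing op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# ... run one training step at batch_size=16384 here ...

# Peak memory allocated by tensors vs. reserved by the caching
# allocator; nvidia-smi only shows the reserved pool.
print(torch.cuda.max_memory_allocated() / 2**20, "MiB allocated (peak)")
print(torch.cuda.max_memory_reserved() / 2**20, "MiB reserved (peak)")
```

If the peak allocated memory at 16384 is nowhere near 24GB, the crash is probably not a plain OOM, and the synchronous traceback should tell you which op is really failing.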