Extremely slow CUDA tensor allocation


I’m working with AffectNet, which is about 450K images totalling about 56 GB.
I’ve got a Titan Xp, and I’ve successfully created a custom Dataset loader.
However, my training appears to be very slow because the input/output tensors are copied from host to GPU again for each mini-batch.
I’d like to try uploading some of the dataset images to the GPU, since it has about 12 GB of memory. The problem is that this is extremely slow. I’ve tested with CUDA 9.0 and now with CUDA 9.1.
The loop is something very simple such as:

        # Preload samples onto the GPU until the memory budget is used up.
        for idx in range(len(self.labels.rows)):
            if torch.cuda.memory_allocated() >= MAX_GPU_MEM:
                print("GPU nearly maxed out")
                break
            pair = self[idx]  # runs the per-image loading and transforms
            in_tensor  = pair[0].cuda(non_blocking=True).half()
            out_tensor = pair[1].cuda(non_blocking=True).half()
            self.data.append([in_tensor, out_tensor])
        print("in GPU RAM: ", len(self.data))

I can’t find a way to pre-allocate more GPU memory. I admit that some of the time is spent pre-processing and transforming those images, but the rate at which GPU memory usage grows is phenomenally slow.
Is there a way to upload them all together in a batch?
EDIT: maybe my issue is the pre-processing after all… I’ll try and profile the code

You should just use the stock DataLoader with pin_memory=True.

@SimonW I thought pin_memory only supports CPU tensors?

It is for CPU tensors, so that they can be transferred to CUDA quickly, which should be exactly what you need.
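For reference, here is a minimal sketch of that setup, not the poster's actual code. `RandomImages` is a hypothetical stand-in for the real AffectNet Dataset, and the batch size and image shape are arbitrary. With `pin_memory=True` the DataLoader returns batches in page-locked host memory, which is what allows `.to(device, non_blocking=True)` to copy asynchronously.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomImages(Dataset):
    """Hypothetical stand-in for the real dataset: random images and labels."""

    def __init__(self, n=64):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Real code would load and transform one image here, on the CPU.
        return torch.rand(3, 32, 32), idx % 8


def make_loader(dataset, batch_size=16, num_workers=0):
    # Pinning only helps (and only applies) when a CUDA device is present.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )


def run_epoch(loader, device):
    """Move every batch to `device` and return the number of samples seen."""
    seen = 0
    for images, labels in loader:
        # non_blocking=True lets the copy overlap with other work,
        # provided the source batch sits in pinned memory.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        seen += images.shape[0]  # the training step would go here
    return seen


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(run_epoch(make_loader(RandomImages()), device))
```

Raising `num_workers` above 0 moves the per-image loading and transforms into worker processes, which is usually where the time goes when the transfer itself is already fast.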

Thanks Simon, I gave it a try and it seems to be working. It’s still somewhat slow even though I multi-threaded the pre-processing, but now I know it’s not the GPU pipeline.
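One rough way to check where the time goes is to time the wait for each batch separately from the host-to-device copy. This is only a sketch under assumptions: `time_loading` is a hypothetical helper, and the `TensorDataset` of random data stands in for the real dataset.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_loading(loader, device):
    """Return (seconds spent waiting for batches, seconds spent copying them)."""
    load_s = 0.0
    copy_s = 0.0
    t0 = time.perf_counter()
    for images, labels in loader:
        t1 = time.perf_counter()  # batch is ready on the host
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        if device.type == "cuda":
            # Copies may be asynchronous; wait for them before reading the clock.
            torch.cuda.synchronize()
        t2 = time.perf_counter()
        load_s += t1 - t0
        copy_s += t2 - t1
        t0 = t2
    return load_s, copy_s


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    data = TensorDataset(torch.rand(256, 3, 32, 32), torch.randint(0, 8, (256,)))
    loader = DataLoader(data, batch_size=32, pin_memory=torch.cuda.is_available())
    load_s, copy_s = time_loading(loader, device)
    print(f"waiting on data: {load_s:.4f}s, copying to device: {copy_s:.4f}s")
```

If the first number dominates, the bottleneck is the loading and pre-processing rather than the GPU transfer.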
