Extremely slow CUDA tensor allocation

Hi,

I’m working with AffectNet, which is about 450K images totalling about 56 GB.
I’ve got a Titan Xp, and I’ve successfully created a custom Dataset loader.
However, my training appears to be very slow, because for each mini-batch the input/output tensors are copied from host to GPU.
I’d like to try uploading some of the dataset images to the GPU, since it has about 12 GB of memory. The problem is that this is extremely slow. I’ve tested with CUDA 9.0 and now with CUDA 9.1.
The loop is something very simple such as:

        # cache as many samples as fit on the GPU, in half precision
        for idx in range(len(self.labels.rows)):
            if torch.cuda.memory_allocated() < MAX_GPU_MEM:
                # __getitem__ loads, pre-processes and transforms one sample
                pair = self.__getitem__(idx)
                in_tensor  = pair[0].cuda(non_blocking=True).half()
                out_tensor = pair[1].cuda(non_blocking=True).half()
                self.data.append([in_tensor, out_tensor])
            else:
                print("GPU nearly maxed out")
                break
        print("in GPU RAM: ", len(self.data))

I can’t find a way to pre-allocate a larger chunk of GPU memory up front. I admit that some of the time is spent pre-processing and transforming those images, but the rate at which the GPU memory fills up is phenomenally slow.
Is there a way to upload them all together in a batch?
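
i.e. something roughly like this, stacking a chunk of pre-processed samples on the CPU and doing one big copy per chunk (just a sketch: the chunk size is arbitrary and it assumes all samples have the same shape after the transforms):

    # build one large CPU batch, then do a single host-to-device copy
    # instead of 1024 small ones
    chunk = [self.__getitem__(i) for i in range(1024)]
    chunk_in  = torch.stack([p[0] for p in chunk]).cuda(non_blocking=True).half()
    chunk_out = torch.stack([p[1] for p in chunk]).cuda(non_blocking=True).half()
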
EDIT: maybe my issue is the pre-processing after all… I’ll try and profile the code
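
Something along these lines should separate the pre-processing time from the host-to-GPU copy time (torch.cuda.synchronize() is there because the copies are asynchronous; the sample count is arbitrary):

    import time

    prep_time, copy_time = 0.0, 0.0
    for idx in range(1000):                       # profile a small subset
        t0 = time.time()
        pair = self.__getitem__(idx)              # decode + pre-process one sample
        t1 = time.time()
        in_tensor  = pair[0].cuda(non_blocking=True).half()
        out_tensor = pair[1].cuda(non_blocking=True).half()
        torch.cuda.synchronize()                  # wait for the async copies to finish
        t2 = time.time()
        prep_time += t1 - t0
        copy_time += t2 - t1
    print("pre-processing: %.1f s, host->GPU copies: %.1f s" % (prep_time, copy_time))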

You should just use the stock DataLoader with pin_memory=True.

@SimonW I thought pin_memory only supports CPU tensors?

It is for CPU tensors, so that they can be transferred quickly to CUDA, which should be exactly what you need.
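
For reference, a minimal sketch of what that looks like, assuming your custom Dataset returns CPU tensors (the batch size and worker count are placeholders):

    from torch.utils.data import DataLoader

    loader = DataLoader(
        dataset,              # your custom Dataset
        batch_size=64,
        shuffle=True,
        num_workers=4,        # pre-process samples in background worker processes
        pin_memory=True,      # page-locked host memory -> faster, async-capable copies
    )

    for inputs, targets in loader:
        # non_blocking=True only overlaps the copy with compute when the memory is pinned
        inputs  = inputs.cuda(non_blocking=True).half()
        targets = targets.cuda(non_blocking=True).half()
        # ... forward / backward ...

With num_workers > 0 the pre-processing also runs in parallel with the training loop instead of on the main thread.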

Thanks Simon, I gave it a try and it seems to be working. It’s still somewhat slow even though I multi-threaded the pre-processing, but now I know the bottleneck isn’t the GPU pipeline.
