No reallocation of batch labels on CUDA

I am training on the GPU, so when I need to compute my loss I transfer the label data over to the GPU with .cuda(). However, I thought it would be faster to preallocate a tensor of the maximum batch size and just copy the data over to the GPU without reallocating it; if the batch is smaller, I simply index into the GPU labels matrix.
What is the recommended way of doing this? I experimented a bit but could not come up with a good solution.

labels = torch.zeros((maxbatch, 100)).float().cuda()  # preallocated at max batch size
for e in range(nepochs):
    labels.zero_()  # clear any stale data from the previous batch

    # Attempt 1: won't work -- the in-place assignment is not tracked
    # because `labels` is not a Variable yet.
    labels[:batchlabels.shape[0]] = batchlabels
    loss(pred_labels, Variable(labels[:batchlabels.shape[0]]))

    # Attempt 2: won't work either unless batchlabels is already a
    # Variable(torch.cuda.FloatTensor), which defeats the purpose.
    sublabels = Variable(labels[:batchlabels.shape[0]])
    sublabels[:batchlabels.shape[0]] = batchlabels
    loss(pred_labels, sublabels)


PyTorch uses a custom GPU allocator that caches old allocations.
This means that if you perform the same allocation/deallocation at every iteration of your loop, the allocator will actually be very efficient at handing you this "new" memory.
I would expect no runtime difference between the code that allocates every time and the one that pre-allocates the tensor, so I would say it is not worth the extra code complexity.


Ah, interesting. What counts as "the same" allocation? An assignment to the same variable with the same shape/dtype?

I did this because my profiler was showing the .cuda() calls as the bottleneck, but I think there is an issue with the profiler, so I'll update to 0.4 and try the PyTorch profiler.

For the profiler, keep in mind that the CUDA API is asynchronous: the time to execute a Python line does not reflect how long the work took to run on the GPU. Also, if a call is a synchronization point, it will wait for all jobs currently running on the GPU to finish, so the runtime of that Python line might be much higher than the time to run the actual command.

The definition of "same" operates at the level of low-level 1D arrays: a new allocation can reuse a cached memory block if it fits in it, independently of the Python variable or the tensor's shape.
