Processing large batches on GPU by splitting them into smaller ones

I am trying to process a large batch on GPU by splitting it into smaller ones like this:

import torch
from torch.autograd import Variable

X = Variable(torch.from_numpy(X), requires_grad=False)
n_samples, batch_size, n_features = X.shape

fX = []
for i in range(0, batch_size, 32):

    # split large batch into smaller ones
    x = X[:, i:i+32]
    # shape is (n_samples, 32, n_features)

    # send small batch to GPU
    x = x.cuda()

    # process on GPU
    fx = recurrent_net(x)
    # shape is (32, n_dimensions)

    # send back to CPU
    fx = fx.cpu()

    # keep track of results for later stacking
    fX.append(fx)

fX = torch.cat(fX, dim=0)
# shape is (batch_size, n_dimensions)
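In case it helps reproduce the issue, here is a minimal CPU-only sketch of the same chunk-and-concatenate pattern with made-up shapes and a plain `Linear` layer standing in for `recurrent_net` (the shapes, `net`, and `process` are my placeholders, not the real model):

```python
import torch

# dummy data: 7 time steps, 100 samples in the batch dimension, 5 features
n_samples, batch_size, n_features = 7, 100, 5
X = torch.randn(n_samples, batch_size, n_features)

# stand-in for recurrent_net: maps a chunk to (chunk_size, n_dimensions)
n_dimensions = 3
net = torch.nn.Linear(n_features, n_dimensions)

def process(x):
    # x: (n_samples, chunk_size, n_features) -> (chunk_size, n_dimensions)
    return net(x).mean(dim=0)

fX = []
for i in range(0, batch_size, 32):
    x = X[:, i:i+32]       # (n_samples, <=32, n_features)
    fx = process(x)        # (<=32, n_dimensions)
    fX.append(fx)

fX = torch.cat(fX, dim=0)  # (batch_size, n_dimensions)
print(fX.shape)            # torch.Size([100, 3])
```

Note that the last chunk is smaller than 32 whenever batch_size is not a multiple of 32; torch.cat handles the uneven chunk sizes along dim=0.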

I thought (naively, I guess :)) that this would solve my “out of memory” issue, since the small batches are sent to the GPU one at a time and the results are sent back to the CPU (hopefully avoiding filling up GPU memory…).

However, my GPU still quickly runs out of memory.
What is the best way to achieve this?

For completeness’ sake: I do want to backprop later, so my understanding is that using volatile=True is not an option. Correct me if I am wrong.


Hervé, one month in PyTorch, and loving it!