ConvNet+LSTM video classification: use of GPUs

I’m training a classifier for video data (2 classes, n videos, each represented by a sequence of K frames). I create a DataLoader:

train_data = torch.utils.data.DataLoader(dataset, **params)

where the dataset instance has a __getitem__ that returns idx, X, y: idx is the index of the video, X has dimensions (n, K, 3, H, W), and y has dimensions (n, 1).
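
For context, the dataset looks roughly like this (a minimal sketch; VideoDataset and the attribute names are placeholders, not my exact code — the frames and labels are assumed to be preloaded CPU tensors):

import torch
from torch.utils.data import Dataset

class VideoDataset(Dataset):
    def __init__(self, frames, labels):
        # frames: CPU float tensor of shape (n, K, 3, H, W)
        # labels: CPU float tensor of shape (n, 1)
        self.frames = frames
        self.labels = labels

    def __len__(self):
        return self.frames.shape[0]

    def __getitem__(self, idx):
        # return the video index together with its frames and label
        return idx, self.frames[idx], self.labels[idx]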

I don’t want to put the whole dataset on the GPU, as that constrains the number of frames and the frame size. So I want to loop through each video, put it on the GPU, run it through ResNet+LSTM, get the loss, and proceed to the next video. After the batch finishes, I sum the losses and backprop:

for e in range(epochs):
    total_loss = 0
    optimizer.zero_grad()
    for batch_idx, X, y in train_data:
        for num, id in enumerate(batch_idx):
            # move one video and its label to the GPU at a time
            X_data = X[num].to("cuda").unsqueeze_(0)
            y_label = y[num, :].to("cuda").unsqueeze_(0)
            output = lstm(convnet(X_data))
            loss = binary_loss(sigmoid(output), y_label)
            total_loss += loss
        # backprop the summed loss once the batch is done
        total_loss.backward()
        optimizer.step()

And yet every time I put a datapoint on CUDA, an additional 1.8 GB of VRAM gets used, and after a few forward passes the GPU runs out of memory. The models are parallelized (I wrapped them in nn.DataParallel).
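
For completeness, this is roughly how the models are wrapped and how I’m watching the memory usage (variable names are illustrative, not my exact code):

# wrap both models for multi-GPU execution and move them to the default device
convnet = torch.nn.DataParallel(convnet).to("cuda")
lstm = torch.nn.DataParallel(lstm).to("cuda")

# after each transfer / forward pass I check usage with:
print(torch.cuda.memory_allocated() / 1024**3, "GB currently allocated")
print(torch.cuda.max_memory_allocated() / 1024**3, "GB peak")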

Any suggestions on how to optimize GPU usage are very welcome. Most of all, I don’t understand why putting the whole dataset on the GPU takes less VRAM than moving subsets of it one at a time.