I’m training a classifier for video data (2 classes, n videos, each represented as a sequence of K frames). I create a dataset and wrap it in a DataLoader:
train_data = torch.utils.data.DataLoader(dataset, **params)
The dataset instance has a __getitem__ that returns idx, X, y: idx is the index of the video, X has dimensions (n, K, 3, H, W), and y is the corresponding binary label.
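Roughly, the dataset class looks like this (a simplified sketch; VideoDataset and the in-memory video/label tensors are illustrative stand-ins, the real loading is more involved):

```python
import torch
from torch.utils.data import Dataset

class VideoDataset(Dataset):
    """Illustrative: one item = one video of K frames plus its binary label."""
    def __init__(self, videos, labels):
        # videos: list of (K, 3, H, W) float tensors, one per video
        # labels: (n, 1) float tensor with the 0/1 class per video
        self.videos = videos
        self.labels = labels

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        # return the video index along with the frames and the label
        return idx, self.videos[idx], self.labels[idx]
```

With the DataLoader on top, each batch then gives batch_idx of shape (batch,), X of shape (batch, K, 3, H, W) and y of shape (batch, 1).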
I don’t want to put the whole dataset on the GPU, since that constrains the number of frames and the frame size. So instead I loop through each video, move it to the GPU, run it through ResNet+LSTM, and compute its loss before moving on to the next video. Once all the samples have been processed, I sum the losses and backprop:
```python
for e in range(epochs):
    total_loss = 0
    optimizer.zero_grad()
    for batch_idx, X, y in train_data:
        for num, id in enumerate(batch_idx):
            X_data = X[num, :, :, :, :].to("cuda").unsqueeze_(0)
            y_label = y[num, :].to("cuda").unsqueeze_(0)
            output = lstm(convnet(X_data))
            loss = binary_loss(sigmoid(output), y_label)
            total_loss += loss
    total_loss.backward()
    optimizer.step()
```
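For context, convnet and lstm are composed roughly like this (a simplified sketch, not my exact code; resnet18, the 512/256 sizes and using the last LSTM step are illustrative choices):

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameEncoder(nn.Module):
    """Runs a ResNet backbone over every frame of a (b, K, 3, H, W) clip."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(pretrained=True)
        # drop the classification head, keep the 512-d pooled features
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, x):
        b, k, c, h, w = x.shape
        feats = self.backbone(x.view(b * k, c, h, w))  # (b*K, 512, 1, 1)
        return feats.view(b, k, -1)                    # (b, K, 512)

class ClipClassifier(nn.Module):
    """LSTM over per-frame features, one logit per clip."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats):
        out, _ = self.lstm(feats)         # (b, K, hidden)
        return self.head(out[:, -1, :])   # logit from the last time step
```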
And yet every time I move a datapoint to CUDA, an additional 1.8 GB of VRAM gets used, and after a few forward passes the GPU runs out of memory. The models are parallelized across GPUs (I used nn.DataParallel).
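By that I mean the usual wrapping, something along the lines of:

```python
import torch.nn as nn

# replicate both modules across the available GPUs
convnet = nn.DataParallel(convnet).to("cuda")
lstm = nn.DataParallel(lstm).to("cuda")
```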
Any suggestions on how to optimize GPU usage are very welcome. Most of all, I don’t understand why putting the whole dataset on the GPU takes less VRAM than moving subsets of it over one at a time.