I’m training a classifier for video data (2 classes, n videos, each represented as a sequence of K frames). I create a DataLoader:
```python
train_data = torch.utils.data.DataLoader(dataset, **params)
```
where the dataset instance has a `__getitem__` that returns `(idx, X, y)`: `idx` is the index of the video, `X` has dimensions `(n, K, 3, H, W)`, and `y` has dimensions `(n, 1)`.
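For context, the dataset looks roughly like this (a minimal sketch; `VideoDataset` and the preloading details are my placeholders, not the exact code):

```python
import torch
from torch.utils.data import Dataset

class VideoDataset(Dataset):
    """Sketch: holds preloaded frame tensors and binary labels on the CPU."""
    def __init__(self, videos, labels):
        self.videos = videos  # list of (K, 3, H, W) float tensors
        self.labels = labels  # (n, 1) float tensor of 0/1 labels

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        # the video index is returned alongside the data and the label
        return idx, self.videos[idx], self.labels[idx]
```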
I don’t want to put the whole dataset on the GPU, as that constrains the number of frames and the frame size. So I want to loop over the videos, put each one on the GPU, run it through ResNet+LSTM, compute the loss, and move on to the next video. Once all samples have been processed, I sum the losses and backprop:
```python
for e in range(epochs):
    total_loss = 0
    optimizer.zero_grad()
    for batch_idx, X, y in train_data:
        for num, vid_id in enumerate(batch_idx):
            # move a single video and its label to the GPU
            X_data = X[num].unsqueeze(0).to("cuda")
            y_label = y[num].unsqueeze(0).to("cuda")
            # per-frame ResNet features, aggregated by the LSTM
            output = lstm(convnet(X_data))
            loss = binary_loss(sigmoid(output), y_label)
            total_loss += loss  # accumulate the per-video losses
    # backprop once through the summed loss, then update
    total_loss.backward()
    optimizer.step()
```
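Here `binary_loss` and `sigmoid` are assumed to be the standard PyTorch modules, roughly:

```python
import torch.nn as nn

sigmoid = nn.Sigmoid()      # maps the model's logits to probabilities
binary_loss = nn.BCELoss()  # binary cross-entropy against the 0/1 label
```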
And yet every time I move a datapoint to CUDA, an additional 1.8 GB of VRAM is used, and after a few forward passes the GPU runs out of memory. The models are parallelized with `nn.DataParallel`.
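The wrapping is, schematically (exact constructor arguments omitted):

```python
import torch.nn as nn

# both models are wrapped for multi-GPU execution and moved to the GPU
convnet = nn.DataParallel(convnet).to("cuda")
lstm = nn.DataParallel(lstm).to("cuda")
```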
Any suggestions on how to optimize GPU use are very welcome. Most of all, I don’t understand why putting the whole dataset on the GPU takes less VRAM than moving it over in subsets.