[Require Help] PyTorch: Saving Tensors to a List Results in CUDA Out of Memory
I'm currently experimenting with a cascaded model: a video denoiser followed by a video action classifier. Along the way I encountered a CUDA runtime out-of-memory error, which I have investigated; it turns out I need a better way to optimise my code, but I don't know how. I need help with this. Let me walk through my code below, and please see if you guys have any idea how to fix it.
Note: my models are rigid in their input shapes, for both sequence/clip length (time) and image size.
First, some terminology I use:

- `[N, T, C, H, W]` - in this order, is the input shape of the data coming from `DALIVideoLoader`, which is my data loader. `N` is `batch_size`, `T` is the clip length per iteration, and `[C, H, W]` is the RGB frame shape.
- I have a sliding window over the time dimension, denoted as small `t`. Each window of time contains `[N, t, C, H, W]` of data, where `t < T` (see the quick sanity check after this list).
- `FastDVDnet.NUM_INPUT_FRAMES` is the number of frames the denoiser model takes as input; I will explain below.
- `sequence_length` - the clip length, denoted by `T` above.
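To make the window count concrete, here is a quick sanity check with the numbers from my setup (`T`, `t`, and `stride` are just local names for this snippet):

```python
T = 20       # sequence_length, clip length from the loader
t = 5        # FastDVDnet.NUM_INPUT_FRAMES, the sliding window size
stride = 1
num_windows = (T - t + stride) // stride
assert num_windows == 16   # matches the 16 clips described in the Flow below
```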
Flow (part of the question):
- STEP 1: Suppose I receive a `[8, 20, 3, 224, 224]` (5-dimensional) raw batch tensor from my data loader. I need to perform a sliding window with window size `t=5` over the time dimension. On each slide I extract a clip of shape `[8, 5, 3, 224, 224]`, so in total I get 16 clips of that shape.
- STEP 2: I feed each clip from above to the denoiser model. The denoiser receives `[8, 5, 3, 224, 224]` and maps it to an output of shape `[8, 3, 224, 224]`. I collect those 16 x `[8, 3, 224, 224]` denoised tensors in a Python list.
- STEP 3: I finally stack the tensors from that list into a single 5-dimensional tensor of shape `[8, 16, 3, 224, 224]`, feed it to the classifier model, and it outputs a prediction. (A shape-only sketch of these three steps follows below.)
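To illustrate just the shapes, here is a minimal sketch; `fake_denoiser` is a hypothetical stand-in, not my real model:

```python
import torch

def fake_denoiser(clip):                 # stand-in: [N, t, C, H, W] -> [N, C, H, W]
    return clip.mean(dim=1)

data = torch.randn(8, 20, 3, 224, 224)            # STEP 1 input: [N, T, C, H, W]
clips = [data[:, m:m+5] for m in range(16)]       # 16 windows of [8, 5, 3, 224, 224]
denoised = [fake_denoiser(c) for c in clips]      # STEP 2: 16 x [8, 3, 224, 224]
stacked = torch.stack(denoised, dim=1)            # STEP 3: [8, 16, 3, 224, 224]
print(stacked.shape)                              # torch.Size([8, 16, 3, 224, 224])
```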
Problem and Question:
As mentioned, I get a CUDA out-of-memory error. I know I could add more GPU RAM, but I wonder if there is anything I can fix in my code to reduce the overhead. Currently I observe that collecting the tensors in a list is the cause of the exploding memory usage (up to 10 GB, even after I reduced the batch size `N` to 1). How else can I perform the Flow above (see the section above)? Please kindly advise.
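For reference, this is how I watched the usage grow per window (a minimal check; `torch.cuda.memory_allocated` reports the memory currently occupied by tensors):

```python
import torch

# printed once per window, right after appending to the list
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
```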
Code:
Here is sample code, if you want to take a look, for your reference.
```python
import time

import torch

# FastDVDnet, MultiFiberNet3d and DaliVideoLoader come from my own modules


def main(**args):
    """ Performs the main training loop
    """
    start_time = time.time()
    # Load dataset
    print('> Loading datasets ...')
    train_loader = DaliVideoLoader(file_list=args['trainset_list'],
                                   batch_size=args['batch_size'],
                                   sequence_length=args['temp_patch_size'],
                                   step=1)

    # STEP 0: Precalculate input-output sizes
    sequence_length = args['temp_patch_size']
    t = window_size = FastDVDnet.NUM_INPUT_FRAMES  # here, t=5
    den_stride = 1
    den_frame_length = (sequence_length - window_size + den_stride) // den_stride

    # Model definition
    den_model = FastDVDnet().cuda()
    classifier_model = MultiFiberNet3d(num_classes=51).cuda()

    for data, labels in train_loader:  # CUDA torch.Tensor loader
        # shape of data: [N, T, C, H, W], e.g. T = 20
        # shape of labels: [N, ]
        N, _, _, H, W = data.shape  # N, H, W are needed for the noise map below

        # STEP 1: Partition one video clip into segments of t=5 frames.
        # Sliding window to get a tuple of tensors: [N, t=5, C, H, W] x den_frame_length
        frame_patches = tuple(data[:, m:m+t, :, :, :] for m in range(0, den_frame_length))
        print("FRAME", len(frame_patches), frame_patches[-1].shape)

        den_frames_stack = []
        # STEP 2: Sample noise and denoise frames with the video denoiser model
        for frames in frame_patches:
            # frames is of shape [N, t=5, C, H, W]
            # std dev of each sequence
            stdn = torch.rand(N, 1, 1, 1, 1).cuda()
            eps = torch.randn_like(stdn)
            # draw noise samples from the std dev tensor
            noise = eps.mul(stdn)
            frames = frames + noise
            frames = frames.cuda(non_blocking=True)
            noise = noise.cuda(non_blocking=True)
            noise_map = stdn.expand((N, 1, 1, H, W)).cuda(non_blocking=True)  # one channel per image
            print("NOISE MAP", noise_map.shape)

            # Evaluate the model (the optimiser step is omitted from this sample)
            den_frames = den_model(frames, noise_map)
            # Then put it in the list and BOOM, it OOMs after 10-20 batches!
            den_frames_stack.append(den_frames)
            print("DEN FRAME", den_frames.shape)

        # STEP 3: Combine all denoised frames into one tensor
        # frames_train is of shape: [N, F, C, H, W]
        # where F is `den_frame_length` calculated above
        frames_train = torch.stack(den_frames_stack, dim=1)
        output = classifier_model(frames_train)

    # Print elapsed time
    elapsed_time = time.time() - start_time
    print('Elapsed time {}'.format(time.strftime("%H:%M:%S", time.gmtime(elapsed_time))))
```
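And for what it's worth, here is a stripped-down sketch of the same pattern without DALI or my real models; the `nn.Conv2d` is just a hypothetical stand-in for the denoiser:

```python
import torch
import torch.nn as nn

den = nn.Conv2d(15, 3, 3, padding=1).cuda()         # stand-in for FastDVDnet
data = torch.randn(1, 20, 3, 224, 224, device='cuda')

den_frames_stack = []
for m in range(16):
    clip = data[:, m:m+5].reshape(1, 15, 224, 224)  # flatten t into channels
    den_frames_stack.append(den(clip))              # each entry stays attached to its graph
    print(m, torch.cuda.memory_allocated() / 2**20, "MiB")
```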