[Require Help] Pytorch: Saving Tensor to List result in CUDA Out of Memory

I’m currently experimenting with a cascaded model: a video denoiser followed by a video action classifier. I ran into a CUDA runtime out-of-memory error, and after investigating it turns out I need a better way to optimise my code, but I don’t know how. I need help with this. Let me walk through my code below; please let me know if you have any ideas on how to fix it.

Note: My models are rigid in the input shape for sequence/clip length (time) and image size.

First, some terminology I use:

  • [N, T, C, H, W] - in this order, the shape of a batch coming from DaliVideoLoader, my data loader. N is the batch size, T is the clip length per iteration, and [C, H, W] is the RGB frame shape
  • I have a sliding window over the time dimension, denoted by small t. Each window contains [N, t, C, H, W] of data, where t < T.
  • FastDVDnet.NUM_INPUT_FRAMES - the number of frames the denoiser model takes per forward pass (i.e. the window size t); I will explain below
  • sequence_length - the clip length, denoted by T above

Flow (part of the question)

  1. STEP 1: Suppose I receive a raw batch of shape [8, 20, 3, 224, 224] (5 dimensions) from my data loader. I perform a sliding window over the time dimension with window size t=5, so each slide extracts a clip of shape [8, 5, 3, 224, 224]; in total I get 16 such sequences (see the shape sketch right after this list).
  2. STEP 2: I feed each sequence to the denoiser model, which maps an input of shape [8, 5, 3, 224, 224] to an output of shape [8, 3, 224, 224]. I collect these 16 denoised [8, 3, 224, 224] tensors in a Python list.
  3. STEP 3: I stack the tensors in that list into a single 5-dimensional tensor of shape [8, 16, 3, 224, 224], feed it to the classifier model, and it outputs a prediction.
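
For concreteness, here is a minimal, self-contained sketch of the shape arithmetic (CPU-only, with the sizes from the question; this toy snippet is not part of my training code). The window count follows (T - t) // stride + 1:

import torch

N, T, C, H, W = 8, 20, 3, 224, 224
t, stride = 5, 1

data = torch.randn(N, T, C, H, W)

# number of windows: (T - t) // stride + 1 = (20 - 5) // 1 + 1 = 16
num_windows = (T - t) // stride + 1

windows = [data[:, m:m + t] for m in range(num_windows)]
print(len(windows), windows[0].shape)  # 16 torch.Size([8, 5, 3, 224, 224])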

Problem and Question:

As mentioned, I get a CUDA out-of-memory error. I know I can add more GPU RAM, but I wonder if there is anything I can fix in my code to reduce the overhead. Currently I observe that collecting the tensors in a list is the cause of the exploding memory usage (up to 10 GB, even after I reduced the batch size N to 1). But how else can I perform the Flow described in the section above? Please kindly advise.

Code:

Here is sample code for your reference.

import time

import torch

# FastDVDnet, MultiFiberNet3d and DaliVideoLoader are imported from my project modules

def main(**args):
	""" Performs the main training loop
	"""
	# Load dataset
	print('> Loading datasets ...')

	train_loader = DaliVideoLoader(file_list=args['trainset_list'],
								   batch_size=args['batch_size'],
								   sequence_length=args['temp_patch_size'],
								   step=1)

	# STEP 0: Precalculate input-output sizes
	sequence_length = args['temp_patch_size']
	t = window_size = FastDVDnet.NUM_INPUT_FRAMES  # here, t=5
	den_stride = 1
	den_frame_length = (sequence_length - window_size + den_stride) // den_stride  # number of sliding windows, e.g. (20 - 5 + 1) // 1 = 16

	# Model definition
	den_model = FastDVDnet().cuda()
	classifier_model = MultiFiberNet3d(num_classes=51).cuda()  # the classifier also needs to be on the GPU

	start_time = time.time()  # referenced when printing the elapsed time below

	for data, labels in train_loader:  # yields CUDA torch.Tensors
		# shape of data: [N, T, C, H, W], e.g. T = 20
		# shape of labels: [N, ]
		N, T, C, H, W = data.shape  # N, H and W are used below for the noise tensors

		# STEP 1: Partition one video clip into segments of t=5 frames.
		# sliding window to get tuple of tensor: [N, t=5, C, H, W] x den_frame_length
		# e.g. t = 5
		frame_patches = tuple(data[:, m:m+t, :, :, :] for m in range(0, den_frame_length))

		print("FRAME", len(frame_patches), frame_patches[-1].shape)

		den_frames_stack = []

		# STEP 2: Sample noise and Denoise Frame with video denoiser model 
		for frames in frame_patches:
			# frames is of shape [N, t=5, C, H, W]

			# std dev of each sequence
			stdn = torch.rand(N, 1, 1, 1, 1).cuda()
			eps = torch.randn_like(stdn)
			# draw noise samples from std dev tensor
			noise = eps.mul(stdn)

			frames = frames.cuda(non_blocking=True)  # no-op here, since the loader already yields CUDA tensors
			frames = frames + noise  # noise is already on the GPU (stdn was created there)

			noise_map = stdn.expand((N, 1, 1, H, W)).cuda(non_blocking=True) # one channel per image

			print("NOISE MAP", noise_map.shape)

			# Evaluate model and optimize it
			den_frames = den_model(frames, noise_map)

			# Then put to list and BOOM, it OOM after 10-20 batches!
			den_frames_stack.append(den_frames)
			print("DEN FRAME", den_frames.shape)

		# STEP 3: Combine all denoised frames into one tensor 
		# frames_train is of shape: [N, F, C, H, W]
		# where F is calculated as `den_frame_length` above
		frames_train = torch.stack(den_frames_stack, dim=1)

		output = classifier_model(frames_train)

	# Print elapsed time
	elapsed_time = time.time() - start_time
	print('Elapsed time {}'.format(time.strftime("%H:%M:%S", time.gmtime(elapsed_time))))

Reply:

I don’t see any obvious way to lower the memory usage.
If you’ve already tried a batch size of 1, you might need to use the checkpoint utility (torch.utils.checkpoint) to trade compute for memory.
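
As a sketch of what that could look like (a toy conv layer stands in for the denoiser; in the question's code the call would be something like checkpoint(den_model, frames, noise_map); use_reentrant=False assumes a reasonably recent PyTorch and can be dropped on older versions):

import torch
from torch.utils.checkpoint import checkpoint

den_model = torch.nn.Conv2d(3, 3, 3, padding=1).cuda()  # placeholder for the real denoiser

outputs = []
for _ in range(16):
    frames = torch.randn(1, 3, 224, 224, device='cuda', requires_grad=True)
    # checkpoint() frees the intermediate activations after the forward pass
    # and recomputes them during backward, trading compute for memory, so the
    # list of outputs no longer pins every window's activations on the GPU.
    out = checkpoint(den_model, frames, use_reentrant=False)
    outputs.append(out)

loss = torch.stack(outputs, dim=1).mean()  # stand-in for the classifier + loss
loss.backward()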