Mitigating memory overhead by redundantly indexing a tensor

Hello everybody!

I’m currently dealing with an undesirably high memory overhead in the context of Deep Reinforcement Learning and TransformerXL.

Selecting indices from a tensor creates a new tensor and hence allocates additional memory.
In my case the indices are heavily redundant. What alternative approaches could reduce this memory overhead? More context can be found below.

Example Code

import torch

# Print CUDA memory usage: 991.75 MB
mem = torch.cuda.mem_get_info(device=None)
print("0 mem usage: {} MB".format((mem[1] - mem[0]) / 1024 / 1024))

# Build episode tensor
num_episodes = 120
num_blocks = 4
dim = 512
max_episode_steps = 1024
episodes = torch.randn((num_episodes, max_episode_steps, num_blocks, dim), device="cuda")

# Setup episode indices per batch item
batch_size = 512
episode_indices = torch.randint(0, num_episodes, (batch_size,), device="cuda")

# Print CUDA memory usage: 1955.75 MB
mem = torch.cuda.mem_get_info(device=None)
print("1 mem usage: {} MB".format((mem[1] - mem[0]) / 1024 / 1024))

# Select episodes
# Indexing a tensor (NOT slicing) creates a new tensor
episodes = episodes[episode_indices]
# equivalent to:
# episodes = torch.index_select(episodes, 0, episode_indices)
print(episodes.untyped_storage().data_ptr()) # new pointer!

# Print CUDA memory usage: 6061.75 MB
mem = torch.cuda.mem_get_info(device=None)
print("2 mem usage: {} MB".format((mem[1] - mem[0]) / 1024 / 1024))
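Since the indices are heavily redundant, one idea is to materialize only the *unique* episodes and keep an inverse map back to batch positions, instead of expanding the full batch. A minimal sketch (toy sizes, CPU, not my actual training code):

```python
import torch

# Toy stand-ins for the episode store and a heavily redundant index tensor
episodes = torch.randn(120, 64, 4, 32)         # (episodes, steps, blocks, dim)
episode_indices = torch.randint(0, 8, (512,))  # only 8 distinct values

# Deduplicate: materialize at most 8 episodes instead of 512
unique_ids, inverse = torch.unique(episode_indices, return_inverse=True)
compact = episodes[unique_ids]                 # small copy

# Per-batch-item lookup goes through the inverse map
item = compact[inverse[0]]
assert torch.equal(item, episodes[episode_indices[0]])
```

Because `unique_ids[inverse]` reproduces `episode_indices`, `compact[inverse]` is equivalent to `episodes[episode_indices]`, but the materialized copy is bounded by the number of distinct episodes rather than the batch size.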

Deep Reinforcement Learning + TransformerXL Use-Case

I added an episodic TransformerXL memory to Proximal Policy Optimization. The episodic memory stores previous inputs (activations) of TransformerXL blocks of shape (num episodes, max episode steps, num blocks, block dim). During optimization, every batch item needs its suitable episodic memory as it was presented at its particular timestep. To achieve this, I store indices for every timestep indicating which episodic memory to use. Those indices are used to select the episode memories for mini batch optimization. This causes a severe memory overhead because indexing a tensor creates a new tensor.

My only idea so far is to collect and sample the episode memories on the CPU. A further step would be to extract the sliding memory windows there as well. Only once these steps are completed would I send the result to the GPU for optimization. I estimate this I/O at 1-2 seconds per optimization cycle, which in my current experiments would add up to roughly 16 hours per run.
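If the CPU path is unavoidable, part of the transfer cost can potentially be hidden by pinning the sampled mini-batch and copying it on a side stream, so the host-to-device copy overlaps with computation. A rough sketch with toy sizes (the names and shapes here are illustrative, not from my actual code):

```python
import torch

# Hypothetical CPU-side episode store and sampled indices (toy sizes)
cpu_memories = torch.randn(8, 64, 4, 32)
indices = torch.randint(0, 8, (16,))

batch = cpu_memories[indices]  # gather on CPU; GPU memory stays untouched
if torch.cuda.is_available():
    batch = batch.pin_memory()  # pinned pages are required for async copies
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        gpu_batch = batch.to("cuda", non_blocking=True)
    # Make the default stream wait before consuming gpu_batch
    torch.cuda.current_stream().wait_stream(copy_stream)
else:
    gpu_batch = batch
```

Note that `non_blocking=True` only actually overlaps when the source tensor is pinned; otherwise the copy falls back to a synchronous transfer.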

Because the reinforcement learning system is highly dynamic, I cannot tell in advance how many episode memories will be sampled before optimization. The batch size is fixed. Padding is also an issue, since I rely on max episode steps. The episode memory data is highly redundant.
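Regarding the sliding-window extraction step mentioned above: `Tensor.unfold` may help, since it produces all windows as views over the same storage instead of copies. A small sketch with assumed toy dimensions:

```python
import torch

# One episode memory: (max_episode_steps, num_blocks, dim), toy sizes
episode = torch.randn(1024, 4, 32)
window = 16

# unfold along the step dimension returns a view, shape (1009, 4, 32, 16);
# no window data is copied
windows = episode.unfold(0, window, 1)
assert windows.untyped_storage().data_ptr() == episode.untyped_storage().data_ptr()
```

The copy then only happens once, when the selected windows are actually gathered for a mini-batch, rather than once per window.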