Excessive memory allocation during pack_sequence / pack_padded_sequence

Hi,
I noticed some odd (and crippling) behaviour in torch.nn.utils.rnn.pack_sequence. See my code:

import torch
from torch.nn.utils.rnn import pack_sequence

inputs = list(self.buffer.values())
print(torch.cuda.memory_allocated() / 1e9)      # currently allocated (GB)
print(torch.cuda.max_memory_allocated() / 1e9)  # peak allocated so far (GB)
inputs = [torch.cat(seq, dim=0) for seq in inputs]
print(torch.cuda.memory_allocated() / 1e9)
print(torch.cuda.max_memory_allocated() / 1e9)
inputs, batch_sizes, sorted_indices, unsorted_indices = pack_sequence(inputs, enforce_sorted=False)
print(torch.cuda.memory_allocated() / 1e9)
print(torch.cuda.max_memory_allocated() / 1e9)

The code is structured this way to demonstrate my problem. Below are four examples of what this snippet prints. In each one, the memory allocated by the list of tensors I pass into pack_sequence is dramatically smaller than the maximum allocated memory, and the peak clearly occurs during the packing itself (before the pack_sequence call the peak is still tiny). The four outputs come from different data points, obtained by setting different random seeds (I am using shuffled data loaders). A sketch for isolating the per-step peak follows the outputs.

Output 1:

0.05142272
0.05142272
0.101368832
0.101368832
0.10063104
3.67196672

Output 2:

0.055465984
0.055465984
0.105938432
0.105938432
0.104708096
0.995354112

Output 3:

0.061822976
0.061822976
0.111620608
0.111620608
0.111021056
1.934670848

Output 4:

0.052079616
0.052079616
0.101535232
0.101535232
0.101254144
0.52041728
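For completeness, this is roughly how I isolate the peak of each step rather than the running maximum. I am assuming torch.cuda.reset_peak_memory_stats() is available in your PyTorch version (it replaced the older reset_max_memory_allocated()):

import torch
from torch.nn.utils.rnn import pack_sequence

def report(label):
    # print current and peak CUDA memory (GB), then reset the peak counter
    print(label,
          torch.cuda.memory_allocated() / 1e9,
          torch.cuda.max_memory_allocated() / 1e9)
    torch.cuda.reset_peak_memory_stats()

inputs = list(self.buffer.values())  # self.buffer as in the snippet above
report("before cat")
inputs = [torch.cat(seq, dim=0) for seq in inputs]
report("after cat")
packed = pack_sequence(inputs, enforce_sorted=False)
report("after pack_sequence")

Measured this way, the spike shows up only in the "after pack_sequence" line, which is why I attribute it to the packing step.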

Additionally, I would like to note that all tensors inside inputs are already stored on my single CUDA device.

Could anybody explain what is going on and how I can avoid it? I suspect it is because the sequences get padded to the maximum length during packing, so a lot of extra (mostly padding) data is materialised. If that is the case, is there no way to run this function in a sparser fashion? I would happily trade speed for memory here (without splitting the procedure into smaller batches).
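To make concrete what I mean by trading speed for memory, here is a rough sketch of building the PackedSequence time step by time step instead of materialising the full padded tensor first. This relies on my assumption about the packed layout (per time step, the elements of all still-active sequences, sorted by descending length); I have not verified it is exactly equivalent to what pack_sequence produces:

import torch
from torch.nn.utils.rnn import PackedSequence

def pack_without_full_padding(sequences):
    # sequences: list of (L_i, *) tensors, all on the same device
    lengths = torch.tensor([s.size(0) for s in sequences])
    sorted_lengths, sorted_indices = lengths.sort(descending=True)
    unsorted_indices = sorted_indices.argsort()
    ordered = [sequences[i] for i in sorted_indices.tolist()]
    batch_sizes, chunks = [], []
    for t in range(int(sorted_lengths[0])):
        # number of sequences that still have an element at time step t
        bs = int((sorted_lengths > t).sum())
        batch_sizes.append(bs)
        chunks.append(torch.stack([ordered[b][t] for b in range(bs)]))
    # peak here is roughly 2x the packed data, not max_len * batch_size
    data = torch.cat(chunks, dim=0)
    return PackedSequence(data, torch.tensor(batch_sizes),
                          sorted_indices.to(data.device),
                          unsorted_indices.to(data.device))

Obviously the Python loop over time steps is slow, which is exactly the trade-off I am willing to make, but I would much rather use something built in if it exists.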

Thanks!