Hi,

I noticed some odd (and crippling) behaviour in the function torch.nn.utils.rnn.pack_sequence. See my code:
```python
inputs = list(self.buffer.values())
print(torch.cuda.memory_allocated() / 1e9)
print(torch.cuda.max_memory_allocated() / 1e9)
inputs = [torch.cat(seq, dim=0) for seq in inputs]
print(torch.cuda.memory_allocated() / 1e9)
print(torch.cuda.max_memory_allocated() / 1e9)
inputs, batch_sizes, sorted_indices, unsorted_indices = pack_sequence(inputs, enforce_sorted=False)
print(torch.cuda.memory_allocated() / 1e9)
print(torch.cuda.max_memory_allocated() / 1e9)
```
The code is structured this way to demonstrate my problem. Below are four examples of what this sequence of code prints. You will see that the memory allocated by the list of tensors I pass into pack_sequence is vastly smaller than the maximum allocated memory, which peaks during the packing itself (logically so, since before packing the maximum is much smaller). The four outputs come from different data points (obtained by setting different random seeds; I am using shuffled data loaders).
Output 1:
0.05142272
0.05142272
0.101368832
0.101368832
0.10063104
3.67196672
Output 2:
0.055465984
0.055465984
0.105938432
0.105938432
0.104708096
0.995354112
Output 3:
0.061822976
0.061822976
0.111620608
0.111620608
0.111021056
1.934670848
Output 4:
0.052079616
0.052079616
0.101535232
0.101535232
0.101254144
0.52041728
Additionally, I would like to note that all tensors inside inputs are already stored on my single CUDA device.
Could anybody explain what is going on and how I can avoid it? I suspect this happens because the sequences are padded during packing, so a lot more memory is allocated transiently. If that is the case, is there no way to run this function sparsely instead? I would happily trade speed for lower memory usage here (without having to batch this procedure myself).
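For what it's worth, my working assumption is that pack_sequence pads the sequences internally before packing them, so the transient allocation scales with max_len × batch_size rather than with the total number of elements; one long outlier sequence would then blow up the peak. A small CPU-only sketch (with made-up lengths and feature size) comparing the packed data against the padded intermediate:

```python
import torch
from torch.nn.utils.rnn import pack_sequence, pad_sequence

# Hypothetical lengths: one long outlier dominates the padded shape.
lengths = [5000, 10, 10, 10]
seqs = [torch.randn(n, 64) for n in lengths]

# The packed result only stores sum(lengths) rows.
packed = pack_sequence(seqs, enforce_sorted=False)
print(packed.data.shape)  # torch.Size([5030, 64])

# But padding the same sequences materialises a (max_len, batch, feat) tensor,
# which here is roughly 4x larger than the packed data.
padded = pad_sequence(seqs)
print(padded.shape)  # torch.Size([5000, 4, 64])
print(padded.numel() / packed.data.numel())
```

If that assumption holds, the peak in my outputs would be driven by the longest sequence in each shuffled batch, which would also explain why the maximum varies so much between seeds.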
Thanks!