Seq2seq RNN decoder leaks GPU memory at every time step

Below is my code and output:

# -------------------------------------
# Forward decoder
# -------------------------------------
# Initialize decoder's hidden state as encoder's last hidden state.
decoder_hidden = encoder_hidden

# Run through decoder one time step at a time.
for t in range(max_tgt_len):

    # decoder returns:
    # - decoder_output   : (batch_size, vocab_size)
    # - decoder_hidden   : (num_layers, batch_size, hidden_size)
    # - attention_weights: (batch_size, max_src_len)
    decoder_output, decoder_hidden, attention_weights = decoder(input_seq, decoder_hidden,
                                                                encoder_outputs, src_lens)

    # Store decoder outputs.
    decoder_outputs[t] = decoder_output

    # Next input is the current target (teacher forcing)
    input_seq = tgt_seqs[t]

    # Detach hidden state:
    detach_hidden(encoder_outputs) # <--- I'm not sure whether to detach or not
    detach_hidden(decoder_hidden)
    
    curr_gpu_memory_usage = get_gpu_memory_usage(device_id=torch.cuda.current_device())
    diff_gpu_memory_usage = curr_gpu_memory_usage - prev_gpu_memory_usage
    prev_gpu_memory_usage = curr_gpu_memory_usage
    print('- {}: Diff GPU memory usage: {}'.format(t, diff_gpu_memory_usage))
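
For context, detach_hidden and get_gpu_memory_usage are small helpers that are not shown above. A minimal sketch of what they might look like, assuming the hidden state is a plain tensor or a tuple of tensors and that memory is reported via torch.cuda.memory_allocated (the actual helpers may differ):

import torch

def detach_hidden(hidden):
    # Detach the hidden state from the current graph in place,
    # handling both a plain tensor (GRU) and a tuple (LSTM (h, c)).
    if isinstance(hidden, torch.Tensor):
        hidden.detach_()
    else:
        for h in hidden:
            h.detach_()

def get_gpu_memory_usage(device_id):
    # Memory currently allocated by tensors on this device, in MB.
    # This counts only PyTorch tensor allocations, not the caching
    # allocator's reserved pool or memory held by other processes.
    return torch.cuda.memory_allocated(device_id) // (1024 ** 2)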

This is the output (time steps = 78):

- 0: Diff GPU memory usage: 2
- 1: Diff GPU memory usage: 20
- 2: Diff GPU memory usage: 2
- 3: Diff GPU memory usage: 22
- 4: Diff GPU memory usage: 0
- 5: Diff GPU memory usage: 22
- 6: Diff GPU memory usage: 22
- 7: Diff GPU memory usage: 0
- 8: Diff GPU memory usage: 22
- 9: Diff GPU memory usage: 0
- 10: Diff GPU memory usage: 22
- 11: Diff GPU memory usage: 2
- 12: Diff GPU memory usage: 20
- 13: Diff GPU memory usage: 22
- 14: Diff GPU memory usage: 2
- 15: Diff GPU memory usage: 20
- 16: Diff GPU memory usage: 2
- 17: Diff GPU memory usage: 20
- 18: Diff GPU memory usage: 2
- 19: Diff GPU memory usage: 22
- 20: Diff GPU memory usage: 20
- 21: Diff GPU memory usage: 2
- 22: Diff GPU memory usage: 22
- 23: Diff GPU memory usage: 0
- 24: Diff GPU memory usage: 22
- 25: Diff GPU memory usage: 0
- 26: Diff GPU memory usage: 22
- 27: Diff GPU memory usage: 22
- 28: Diff GPU memory usage: 0
- 29: Diff GPU memory usage: 22
- 30: Diff GPU memory usage: 2
- 31: Diff GPU memory usage: 20
- 32: Diff GPU memory usage: 2
- 33: Diff GPU memory usage: 20
- 34: Diff GPU memory usage: 22
- 35: Diff GPU memory usage: 2
- 36: Diff GPU memory usage: 20
- 37: Diff GPU memory usage: 2
- 38: Diff GPU memory usage: 22
- 39: Diff GPU memory usage: 0
- 40: Diff GPU memory usage: 22
- 41: Diff GPU memory usage: 20
- 42: Diff GPU memory usage: 2
- 43: Diff GPU memory usage: 22
- 44: Diff GPU memory usage: 0
- 45: Diff GPU memory usage: 22
- 46: Diff GPU memory usage: 2
- 47: Diff GPU memory usage: 20
- 48: Diff GPU memory usage: 22
- 49: Diff GPU memory usage: 0
- 50: Diff GPU memory usage: 22
- 51: Diff GPU memory usage: 2
- 52: Diff GPU memory usage: 20
- 53: Diff GPU memory usage: 2
- 54: Diff GPU memory usage: 22
- 55: Diff GPU memory usage: 20
- 56: Diff GPU memory usage: 2
- 57: Diff GPU memory usage: 20
- 58: Diff GPU memory usage: 2
- 59: Diff GPU memory usage: 22
- 60: Diff GPU memory usage: 0
- 61: Diff GPU memory usage: 22
- 62: Diff GPU memory usage: 22
- 63: Diff GPU memory usage: 0
- 64: Diff GPU memory usage: 22
- 65: Diff GPU memory usage: 0
- 66: Diff GPU memory usage: 22
- 67: Diff GPU memory usage: 2
- 68: Diff GPU memory usage: 20
- 69: Diff GPU memory usage: 22
- 70: Diff GPU memory usage: 2
- 71: Diff GPU memory usage: 20
- 72: Diff GPU memory usage: 2
- 73: Diff GPU memory usage: 20
- 74: Diff GPU memory usage: 2
- 75: Diff GPU memory usage: 22
- 76: Diff GPU memory usage: 20
- 77: Diff GPU memory usage: 2
CPU times: user 548 ms, sys: 4.53 s, total: 5.08 s
Wall time: 9.32 s

I'm confused. This is inside a training iteration, and you are storing new outputs in a list at each time step (and thus keeping their computation graphs). Wouldn't you expect the GPU memory usage to increase at every time step?
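
As a sanity check of that intuition, here is a small standalone loop (hypothetical toy sizes, using torch.cuda.memory_allocated so only tensor allocations are counted) that keeps its outputs and hidden state across steps, the same pattern as the decoder loop above:

import torch

device = torch.device('cuda')
rnn = torch.nn.GRU(input_size=32, hidden_size=64).to(device)
proj = torch.nn.Linear(64, 100).to(device)

x = torch.randn(1, 8, 32, device=device)          # (seq_len=1, batch, input_size)
hidden = torch.zeros(1, 8, 64, device=device)     # (num_layers, batch, hidden_size)

outputs = []
prev = torch.cuda.memory_allocated(device)
for t in range(20):
    out, hidden = rnn(x, hidden)                  # hidden carries the growing graph
    outputs.append(proj(out.squeeze(0)))          # stored outputs keep their graphs alive
    curr = torch.cuda.memory_allocated(device)
    print(t, curr - prev)                         # expected to be positive on most steps
    prev = curr

If the per-step differences here grow steadily but the numbers in the output above do not, the measurement in get_gpu_memory_usage (for example, one based on nvidia-smi) is probably reporting driver-level usage, which only jumps when the caching allocator requests a new block.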

The decoder_outputs tensor is already initialized; here is the complete code:

I independently tested the decoder's forward iteration with some fixed data; sometimes it adds memory per time step and sometimes it doesn't, which is weird.
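
A minimal way to narrow that down is to run the same fixed-data loop with and without gradient tracking and compare peak allocated memory; if the growth disappears under no_grad, the accumulating autograd graph is the cause. A rough sketch with hypothetical toy sizes, not the actual decoder:

import torch

device = torch.device('cuda')
decoder = torch.nn.GRU(input_size=32, hidden_size=64).to(device)
x = torch.randn(1, 8, 32, device=device)

def run(track_grad):
    hidden = torch.zeros(1, 8, 64, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.set_grad_enabled(track_grad):
        for _ in range(50):
            _, hidden = decoder(x, hidden)    # graph grows across steps only if grads are on
    return torch.cuda.max_memory_allocated(device)

print('with grad   :', run(True))
print('without grad:', run(False))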

Hi,

I'm seeing the same issue of increasing GPU memory usage while using an RNN. Is there any fix for this?

Best,
Soumyadip