I am training an encoder-decoder model implemented in Torch in stages, and at each stage I freeze one component.
The model has ~800M params in total: ~50M in the encoder and ~750M in the decoder.
I noticed that when I freeze the decoder and train only the encoder (meaning ~50M trainable params), it takes MORE memory than when I freeze the encoder and train only the decoder (~750M trainable params).
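My intuition is that in the first case the full computation graph must be kept, because gradients have to flow back through the frozen decoder to reach the encoder, as opposed to the second case, where only the deeper part of the graph (the decoder) is necessary (because I don’t need to backprop through the whole graph up to the encoder).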
I am wondering whether this makes sense, or whether I have a bug…
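
To make the intuition above concrete, here is a minimal sketch of how it could be checked, assuming PyTorch and a toy stand-in model. The `mlp` helper, the layer sizes, and the `saved_activation_bytes` and `measure` functions are all hypothetical and not my real model. It uses `torch.autograd.graph.saved_tensors_hooks` to add up the bytes autograd keeps for the backward pass in each component, skipping tensors that are just the parameters themselves. It does not count gradients or optimizer state, which grow with the number of trainable params.

```python
import torch
import torch.nn as nn
from torch.autograd.graph import saved_tensors_hooks


def mlp(sizes):
    # Hypothetical toy stand-in for an encoder or decoder: Linear + ReLU stack.
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU


def saved_activation_bytes(module, inp, param_ptrs):
    # Run a forward pass and sum the bytes autograd saves for backward,
    # ignoring tensors that share storage with a parameter and duplicates.
    seen = set()
    total = 0

    def pack(t):
        nonlocal total
        ptr = t.data_ptr()
        if ptr not in param_ptrs and ptr not in seen:
            seen.add(ptr)
            total += t.numel() * t.element_size()
        return t

    with saved_tensors_hooks(pack, lambda t: t):
        out = module(inp)
    return out, total


def measure(train_encoder):
    # Freeze one component, train the other, and report saved bytes per component.
    torch.manual_seed(0)
    encoder = mlp([32, 64, 16])          # small, like the ~50M encoder
    decoder = mlp([16, 256, 256, 32])    # larger, like the ~750M decoder
    for p in encoder.parameters():
        p.requires_grad_(train_encoder)
    for p in decoder.parameters():
        p.requires_grad_(not train_encoder)
    param_ptrs = {p.data_ptr() for m in (encoder, decoder) for p in m.parameters()}

    x = torch.randn(128, 32)
    z, enc_bytes = saved_activation_bytes(encoder, x, param_ptrs)
    y, dec_bytes = saved_activation_bytes(decoder, z, param_ptrs)
    y.sum().backward()  # backward works in both configurations
    return enc_bytes, dec_bytes


print("train encoder only (enc, dec saved bytes):", measure(train_encoder=True))
print("train decoder only (enc, dec saved bytes):", measure(train_encoder=False))
```

In this sketch, training only the decoder leaves the encoder with nothing saved for backward, while training only the encoder makes autograd keep tensors in both the encoder and the frozen decoder. Whether that outweighs the extra gradient and optimizer memory for the ~750M decoder params in my real setup presumably depends on batch size and sequence length, so comparing `torch.cuda.max_memory_allocated()` between the two stages would be the actual test.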