Good to hear a batch size of 16 works.
Yeah, the intermediate activations can be quite large, especially if you are using many kernels in a conv layer.
.buffers() is used for internal tensors that do not require gradients, e.g. the running_mean and running_var in batchnorm layers.
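Here is a minimal sketch (the model is just a placeholder) showing that batchnorm buffers appear in named_buffers() but not in parameters():

```python
import torch.nn as nn

# Batchnorm layers register running_mean, running_var (and
# num_batches_tracked) as buffers, not as parameters.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
)

for name, buf in model.named_buffers():
    print(name, buf.shape)
# 1.running_mean torch.Size([16])
# 1.running_var torch.Size([16])
# 1.num_batches_tracked torch.Size([])
```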
If you want to get the intermediate outputs, you could register forward hooks as explained in this post.
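Something along these lines should work (the layer and dict names are just examples):

```python
import torch
import torch.nn as nn

# Store the intermediate outputs here, keyed by a user-chosen name.
activations = {}

def get_activation(name):
    def hook(module, input, output):
        # Detach so the stored tensor doesn't keep the graph alive.
        activations[name] = output.detach()
    return hook

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

# Register a forward hook on the first conv layer.
model[0].register_forward_hook(get_activation('conv1'))

x = torch.randn(16, 3, 224, 224)  # batch size of 16, as discussed above
out = model(x)
print(activations['conv1'].shape)  # torch.Size([16, 16, 224, 224])
```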