Training two different CNN architectures (small vs. big) takes 20 GB of VRAM in both cases

You are not accounting for the intermediate activations, which can use significantly more memory than the parameters themselves.

This post estimates the memory usage from parameters and forward activations.
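As a rough illustration of that kind of estimate, here is a minimal sketch in plain Python. It is not taken from the linked post: the `ConvLayer` class, the `estimate_memory` helper, and the two-layer example network are all hypothetical, and it assumes float32 tensors, square inputs, and plain convolution layers only. It also ignores gradients, optimizer state, and framework workspace buffers, so real usage will be higher.

```python
from dataclasses import dataclass

BYTES_PER_FLOAT32 = 4  # assumption: everything is stored as float32


@dataclass
class ConvLayer:
    """Hypothetical description of a 2D convolution with a square kernel."""
    in_channels: int
    out_channels: int
    kernel_size: int
    stride: int = 1
    padding: int = 0


def conv_output_size(size: int, layer: ConvLayer) -> int:
    """Spatial output size of a convolution applied to a square input."""
    return (size + 2 * layer.padding - layer.kernel_size) // layer.stride + 1


def estimate_memory(layers, input_size: int, batch_size: int):
    """Return (parameter bytes, forward activation bytes) for a conv stack."""
    param_bytes = 0
    activation_bytes = 0
    size = input_size
    for layer in layers:
        # Parameters: one kernel per (in, out) channel pair, plus one bias per output channel.
        weights = layer.out_channels * layer.in_channels * layer.kernel_size ** 2
        biases = layer.out_channels
        param_bytes += (weights + biases) * BYTES_PER_FLOAT32

        # Activations: the layer output is kept for the backward pass,
        # and it scales with batch size and spatial resolution.
        size = conv_output_size(size, layer)
        activation_bytes += (
            batch_size * layer.out_channels * size * size * BYTES_PER_FLOAT32
        )
    return param_bytes, activation_bytes


# Hypothetical example: two 3x3 conv layers on 224x224 RGB images, batch size 32.
layers = [
    ConvLayer(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    ConvLayer(in_channels=64, out_channels=64, kernel_size=3, padding=1),
]
param_bytes, activation_bytes = estimate_memory(layers, input_size=224, batch_size=32)

print(f"parameters:  {param_bytes / 2**20:.2f} MiB")
print(f"activations: {activation_bytes / 2**20:.2f} MiB")
```

In this example the parameters come to well under 1 MiB while the forward activations take several hundred MiB. That is why a small and a big network can end up with similar memory footprints when the input resolution and batch size are the same.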