The memory consumption of a model during training is mostly dominated by the size and number of feature maps (activations), not by the network weights (parameters). Replacing one convolution with two may reduce the parameter count, but it roughly doubles the number of feature maps that have to be kept around for the backward pass. This is expected behavior and doesn't require a fix.
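A minimal sketch of the trade-off, assuming PyTorch and a hypothetical input of 64 channels at 56x56: swapping a single 5x5 convolution for two stacked 3x3 convolutions lowers the parameter count but produces an extra intermediate feature map, roughly doubling activation memory for that stage. The shapes and channel counts here are made up for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)  # assumed input: batch 1, 64 channels, 56x56

single = nn.Conv2d(64, 64, kernel_size=5, padding=2)
double = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

def activation_elems(m, x):
    # Count the elements of every feature map produced by a conv layer.
    elems = []
    hooks = [
        layer.register_forward_hook(lambda _m, _i, out: elems.append(out.numel()))
        for layer in m.modules() if isinstance(layer, nn.Conv2d)
    ]
    m(x)
    for h in hooks:
        h.remove()
    return sum(elems)

print("params:      single 5x5 =", n_params(single), " two 3x3 =", n_params(double))
print("activations: single 5x5 =", activation_elems(single, x),
      " two 3x3 =", activation_elems(double, x))
```

With these numbers the two 3x3 convs use about 74k parameters versus 102k for the 5x5, but store twice as many activation elements (401k vs. 201k), which is where the extra training memory comes from.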