Think about it this way. A single convolution window (kernel) has shape 3x3: that is the 3,3 part. You need one such kernel for every input channel, so with 16 input channels you need 16 kernels: 16,3,3. All of those weights together produce only one output channel: 1,16,3,3. You want 32 output channels, so you need 32 such sets, which gives a weight tensor of shape 32,16,3,3.
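The counting above can be sketched in a few lines of plain Python. The helper name conv_weight_shape is made up for illustration, the (out_channels, in_channels, kH, kW) ordering is assumed to match the layout in the question (it is the one PyTorch's Conv2d uses), and the bias is ignored:

```python
from math import prod


def conv_weight_shape(in_channels, out_channels, kernel_size):
    """Hypothetical helper: weight shape as (out_channels, in_channels, kH, kW).

    Bias terms are deliberately left out of the count.
    """
    k_h, k_w = kernel_size
    return (out_channels, in_channels, k_h, k_w)


# One output channel needs one 3x3 kernel per input channel.
one_output = conv_weight_shape(in_channels=16, out_channels=1, kernel_size=(3, 3))
print(one_output, prod(one_output))  # (1, 16, 3, 3) 144

# 32 output channels need 32 such sets.
shape = conv_weight_shape(in_channels=16, out_channels=32, kernel_size=(3, 3))
n_weights = prod(shape)
print(shape, n_weights)  # (32, 16, 3, 3) 4608
```

So each output channel costs 16 * 3 * 3 = 144 weights, and 32 of them cost 4608.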
See example: gif
To create one new output channel from three input channels (RGB), you need a weight tensor of shape (1, 3, 3, 3), i.e. 27 weights. To get two output channels, you double the number of weights: (2, 3, 3, 3).
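To make the RGB case concrete, here is a minimal sketch of a naive convolution in plain Python (stride 1, no padding, no bias). It is not how any library implements it; the function name conv2d_naive and the toy all-ones data are assumptions chosen so the result is easy to check by hand:

```python
def conv2d_naive(image, weights):
    """Naive convolution sketch.

    image:   nested lists shaped [C_in][H][W]
    weights: nested lists shaped [C_out][C_in][kH][kW]
    Stride 1, no padding, no bias.
    """
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    out = []
    for filt in weights:  # one set of C_in kernels per output channel
        assert len(filt) == c_in, "need one kernel per input channel"
        channel = []
        for y in range(h - k_h + 1):
            row = []
            for x in range(w - k_w + 1):
                total = 0.0
                # Sum over ALL input channels to get ONE output value.
                for c in range(c_in):
                    for dy in range(k_h):
                        for dx in range(k_w):
                            total += image[c][y + dy][x + dx] * filt[c][dy][dx]
                row.append(total)
            channel.append(row)
        out.append(channel)
    return out


# Toy RGB image: 3 channels, 4x4 pixels, all ones.
image = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
# Weights of shape (2, 3, 3, 3), all ones: two output channels.
weights = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]

out = conv2d_naive(image, weights)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
print(out[0][0][0])  # 27.0 (3 channels * 3 * 3 weights, each times 1.0)
```

Each output value uses all 27 weights of its own (3, 3, 3) set, and the second output channel has a completely separate set, which is why the weight count doubles.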