The input is of size (batch_size, n_channels, height, width). Batch dim is 1 in both, so that can be discarded. The second dim of the input is 4, which is the in_channels, and the second dim of the output is 32, which is the out_channels.

Now, how (84, 84) turns into (20, 20) follows the formula described in the conv2d documentation:

So, do I understand correctly that there are in total 4 * 32 * 8 * 8 kernel coefficients? I.e. one 8x8 kernel for each (input layer C_in, output layer C_out) combination?