So I was watching this video as a refresher on how the math behind convolution works, and you can see he multiplies each pixel by the corresponding weight value from the filter. Then the equation sums all the values and takes their average. In PyTorch, if you make a dummy input like:

import torch
import torch.nn as nn

input = torch.rand(1, 1, 2, 2)
filter = nn.Conv2d(1, 1, kernel_size=2)
result = filter(input)
# Double checking the process
double_check = (input * filter.weight).sum() + filter.bias
# double_check holds the same value as the result variable.
# But as you can see, we didn't do:
# (input * filter.weight).sum() / number of input and weight pairs + filter.bias

So can someone please tell me why we don’t average first then add the bias?
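To make the comparison concrete, here is a runnable sketch (variable names are my own) of the two options; it assumes the standard PyTorch behavior of summing the products and adding the bias:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(1, 1, 2, 2)
conv = nn.Conv2d(1, 1, kernel_size=2)

result = conv(x)

# What PyTorch does: sum the products, then add the bias.
no_average = (x * conv.weight).sum() + conv.bias

# The video's recipe: average the products, then add the bias.
with_average = (x * conv.weight).sum() / conv.weight.numel() + conv.bias

print(torch.allclose(result.flatten(), no_average.flatten()))    # True
print(torch.allclose(result.flatten(), with_average.flatten()))  # False
```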

Sorry, I am just unable to completely understand your point. But in a CNN, each filter has its own bias and its own kernels for the different input channels, and each filter gives you one output channel. The bias is added after summing over all the kernels of the filter.
You can get help from the following tutorial:
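For example (a sketch of my own, not from the tutorial), you can check that with several input channels the bias is added only once per filter, after the products have been summed over every channel:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(1, 3, 2, 2)             # 3 input channels
conv = nn.Conv2d(3, 1, kernel_size=2)  # one filter -> one output channel

result = conv(x)

# Manual version: multiply each channel by its kernel, sum over all
# channels and positions, then add the single bias at the end.
manual = (x * conv.weight).sum() + conv.bias

print(torch.allclose(result.flatten(), manual.flatten()))  # True
```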

The point is pretty simple actually; I am just bad at explaining things. If you compare the PyTorch convolution process to the one in the video above, it's different, and I wanted to know why it's different.

The video’s convolution process:

Dot product(input, weights) / number of dot product pairs + bias

Pytorch convolution process:

Dot product(input, weights) + bias

I just wanted to know what the difference is between these two approaches. Are there any pros and cons?

In the video, the person is using a smoothing (average) filter just to develop the concept. In a CNN we do convolution, and convolution is done in this way: product(input, weights), summed up, plus the bias. The average filter doesn't have any learned weights; its only requirement is the window size, while in a CNN we are trying to learn the weights of the kernel. Different filters have their own pros and cons, but that is another discussion.
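To illustrate (my own sketch, not from the video): an average filter is just a convolution whose weights are all fixed to 1/(k*k), so the division step can be folded into the weights, and a learned kernel could absorb that constant anyway:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(1, 1, 4, 4)
k = 2

# Video's recipe: sum of products with all-ones weights, then divide.
ones = torch.ones(1, 1, k, k)
averaged = F.conv2d(x, ones) / (k * k)

# Same thing as a plain convolution with the divisor baked into the weights.
avg_kernel = torch.full((1, 1, k, k), 1.0 / (k * k))
conved = F.conv2d(x, avg_kernel)

print(torch.allclose(averaged, conved))  # True
```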