When we do 2d convolution on RGB images we are actually doing 3d convolution, yet we still use PyTorch's Conv2d layer for this.

When we do 3d convolution of a set of RGB images, we are doing 4d convolution and can use the 3d conv layer.

My question is: what is the difference, if any, between using the 3d conv layer for a set of grayscale images, as opposed to giving the set of images to pytorch as color channels and using the 2d conv layer?

To be clear, if using 2d conv, my tensor will look like:
[batch size, set of grayscale images, n rows, n cols]

And in the case of 3d conv:
[batch size, 1 grayscale channel, set of grayscale images, n rows, n cols]

After further research I understand that although an RGB image is a 3d volume, we are still doing 2d convolution. The convolution dimensionality refers to the filter dimensions, so in doing 2d convolution on an RGB image we are actually convolving each filter with each channel separately. In the case of 3d convolution we would instead be convolving all 3 channels at the same time with each 3d filter.

Is this accurate?

In the case of 2d convolution of grayscale images, is there a difference between using a batch of say 100 images and using an image of “100 channels”? (besides losing the flexibility of mixing images between batches)

I don’t think this is completely correct.
For a “standard” 2-dimensional convolution, each filter will use all of the input channels.
The number of output channels of the convolution layer defines the number of filters.
Have a look at CS231n, which explains the applied logic pretty well.
Grouped or depthwise convolutions work a bit differently regarding the processing of the input channels, but let's stick to the vanilla use case for now.
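
To make the "each filter uses all input channels" point concrete, here is a minimal pure-Python sketch of a vanilla 2d convolution (strictly speaking a cross-correlation, which is what deep learning frameworks compute). It assumes stride 1, no padding and no bias. The function name conv2d_naive and the toy data are made up for illustration; this is not how PyTorch implements it internally.

```python
def conv2d_naive(image, filters):
    """Unpadded, stride-1 2d cross-correlation without bias.

    image:   nested lists shaped [in_channels][H][W]
    filters: nested lists shaped [out_channels][in_channels][kH][kW]
    returns: nested lists shaped [out_channels][H - kH + 1][W - kW + 1]
    """
    in_channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    output = []
    for filt in filters:                      # one output channel per filter
        assert len(filt) == in_channels       # each filter spans every input channel
        k_h, k_w = len(filt[0]), len(filt[0][0])
        out_map = []
        for i in range(height - k_h + 1):     # the filter slides along H ...
            row = []
            for j in range(width - k_w + 1):  # ... and along W, never along channels
                total = 0
                for c in range(in_channels):  # all channels collapse into one value
                    for di in range(k_h):
                        for dj in range(k_w):
                            total += image[c][i + di][j + dj] * filt[c][di][dj]
                row.append(total)
            out_map.append(row)
        output.append(out_map)
    return output


# Toy "RGB" image: 3 channels of 3x3 pixels, every pixel in channel c equals c + 1.
rgb = [[[c + 1] * 3 for _ in range(3)] for c in range(3)]
# Two hypothetical filters, each shaped [3][2][2] and filled with ones.
filters = [[[[1, 1], [1, 1]] for _ in range(3)] for _ in range(2)]

result = conv2d_naive(rgb, filters)
print(len(result), len(result[0]), len(result[0][0]))  # 2 2 2
print(result[0])  # [[24, 24], [24, 24]]
```

Each output value is 4 * (1 + 2 + 3) = 24: the 2x2 window covers four pixels in every channel, and the three channels are summed into a single number. Two filters give two output channels, regardless of how many input channels there are.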

3-dimensional convolutions also use all input channels for each kernel. However, now you are performing the convolution in a volume (depth, height, width), so the kernel moves in all three dimensions this time.

Yes, these approaches would yield different results: in the first use case (a batch of 100 images) the filters would have a single input channel, while in the latter approach (one image with 100 channels) they would have 100 input channels.
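
A rough way to see the difference is to write down the shapes involved. The helper below is a hypothetical sketch (conv2d_shapes is not a PyTorch function) that assumes a square kernel, stride 1 and no padding. The 28x28 image size, 8 filters and 3x3 kernel are arbitrary values chosen for the example.

```python
def conv2d_shapes(input_shape, out_channels, kernel_size):
    """Weight and output shapes of an unpadded, stride-1 2d convolution.

    input_shape: (batch, in_channels, H, W)
    """
    batch, in_channels, height, width = input_shape
    k = kernel_size
    # Every filter carries one k x k slice per input channel.
    weight_shape = (out_channels, in_channels, k, k)
    output_shape = (batch, out_channels, height - k + 1, width - k + 1)
    return weight_shape, output_shape


# 100 grayscale images as a batch: each image is filtered independently.
as_batch = conv2d_shapes((100, 1, 28, 28), out_channels=8, kernel_size=3)
# The same 100 images stacked as channels of a single sample.
as_channels = conv2d_shapes((1, 100, 28, 28), out_channels=8, kernel_size=3)

print(as_batch)     # ((8, 1, 3, 3), (100, 8, 26, 26))
print(as_channels)  # ((8, 100, 3, 3), (1, 8, 26, 26))
```

With the batch layout, the same small filters are applied to every image separately and you get one result per image. With the channel layout, each filter has 100 times as many weights and mixes all 100 images into a single set of feature maps.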

When doing 2d convolution, the filter moves in only 2 directions (H & W) of the image. That’s why the operation is called 2d convolution although a 3d filter is used to process 3d volumetric data.

In 3d convolution we also use a 3d filter to process 3d volumetric data, but we move in 3 directions (H, W & D).

It was confusing to me to use a 3d filter in 2d convolution, but the aforementioned explanation should clarify the difference.

In a 3-dimensional convolution, you would use a 4-dimensional filter, which still uses all input channels, but moves in all 3 volumetric dimensions.
The method is very similar to a 2-dimensional convolution with an additional depth dimension the filter moves along.
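
As an illustration of the extra depth dimension, here is a minimal pure-Python sketch of a 3d convolution over a stack of grayscale images. To keep it short it assumes a single input channel and a single filter, so the kernel has only three axes; with several input channels it would gain a channel axis and become the 4-dimensional filter mentioned above. It also assumes stride 1, no padding and no bias, and the name conv3d_single_channel is made up for this example.

```python
def conv3d_single_channel(volume, kernel):
    """Unpadded, stride-1 3d cross-correlation, one input channel, one filter.

    volume: nested lists shaped [D][H][W]
    kernel: nested lists shaped [kD][kH][kW]
    returns: nested lists shaped [D - kD + 1][H - kH + 1][W - kW + 1]
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    k_d, k_h, k_w = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for d in range(depth - k_d + 1):            # the kernel also slides along depth
        plane = []
        for i in range(height - k_h + 1):       # ... along height
            row = []
            for j in range(width - k_w + 1):    # ... and along width
                total = 0
                for dd in range(k_d):
                    for di in range(k_h):
                        for dj in range(k_w):
                            total += volume[d + dd][i + di][j + dj] * kernel[dd][di][dj]
                row.append(total)
            plane.append(row)
        output.append(plane)
    return output


# Stack of 4 grayscale 3x3 images; every pixel of image d equals d.
stack = [[[d] * 3 for _ in range(3)] for d in range(4)]
# One 2x2x2 kernel filled with ones.
kernel = [[[1, 1], [1, 1]] for _ in range(2)]

out = conv3d_single_channel(stack, kernel)
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 2
print([plane[0][0] for plane in out])         # [4, 12, 20]
```

The output keeps a depth axis of size 3 because the kernel only sees two neighbouring images at a time. Feeding the same stack to a 2d convolution as 4 channels would instead collapse all images into each output value.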