Feeding 3D volumes to Conv3D

It depends.

In the Conv2D case, the expected input is [batch_size, in_channels, height, width]. When we perform a 2D convolution, each filter acts upon all the channels at once; we assume there isn't an ordering or "depth" aspect to the channels. Consider RGB: the channels don't have to be in that order, we just keep them that way by convention.
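A minimal sketch of the 2D case (all sizes here are made up):

```python
import torch
import torch.nn as nn

# A toy batch of 8 RGB images, 32x32: [batch_size, in_channels, height, width]
x = torch.randn(8, 3, 32, 32)

# Each of the 16 filters spans all 3 input channels at once
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv(x).shape)  # torch.Size([8, 16, 32, 32])
```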

In the Conv3D case, the expected input is [batch_size, in_channels, depth, height, width]. An example of this would be video. Here the order does matter: we can't shuffle the frames without losing meaning. We use a 3D convolution to embed that information, where each filter strides across the frames (rather than acting on all of them at once).
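The 3D analogue, again with made-up sizes (4 RGB clips of 16 frames each):

```python
import torch
import torch.nn as nn

# A toy batch of 4 RGB clips, 16 frames of 32x32:
# [batch_size, in_channels, depth, height, width]
clips = torch.randn(4, 3, 16, 32, 32)

# kernel_size=3 means (3, 3, 3): each filter also strides across the frame axis
conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv(clips).shape)  # torch.Size([4, 16, 16, 32, 32])
```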

Now for your specific case, it doesn't sound like you have a "depth" aspect to your data. The embedding vector seems important as a whole, so it doesn't make sense to me to stride over your embedding dimension. If I'm understanding correctly, I think you want a Conv2d with kernel_size=1 and stride=1. Your in_channels would be embedding_dim, and your out_channels would be whatever feature size you want. That'll result in an output of shape [batch_size, out_channels, height, width]. Keep in mind the order of the dimensions: your embedding dimension has to sit in the channel position.
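Something like this is what I have in mind (embedding_dim=128 and the feature size of 64 are placeholders for whatever you're actually using):

```python
import torch
import torch.nn as nn

# Placeholder sizes; the embedding dimension sits in the channel position
batch_size, embedding_dim, height, width = 8, 128, 10, 10
x = torch.randn(batch_size, embedding_dim, height, width)

# A 1x1 convolution mixes the whole embedding vector at each spatial
# location without striding over the embedding dimension itself
conv = nn.Conv2d(in_channels=embedding_dim, out_channels=64, kernel_size=1, stride=1)
print(conv(x).shape)  # torch.Size([8, 64, 10, 10])
```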