I have a 3-dimensional input tensor of size (1, 128, 100) when the agent selects an action and (batch_size, 128, 100) when the agent trains. The input is a sequence of words that is tokenized; each token is mapped to a vector by a Word2Vec model, and the vectors are concatenated into a tensor. So 128 is the number of tokens and 100 is the W2V vector size. In this convolutional network:
nn.Conv2d layers expect a 4-dimensional input tensor in the shape [batch_size, channels, height, width]. Based on your error and description I guess the channel dimension is missing, so you could add it via x = x.unsqueeze(1) before passing the tensor to the model.
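To make the suggestion concrete, here is a minimal sketch of adding the channel dimension for both cases from the question. The conv layer and the batch size of 4 are assumptions for illustration only, since the original model is not shown in the thread.

```python
import torch
import torch.nn as nn

# Hypothetical first layer; the original model is not shown in the thread.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

x_act = torch.randn(1, 128, 100)    # action selection: [1, tokens, w2v_dim]
x_train = torch.randn(4, 128, 100)  # training: [batch_size, tokens, w2v_dim]

# Insert a channel dimension of size 1 at position 1 -> [B, 1, 128, 100]
out_act = conv(x_act.unsqueeze(1))
out_train = conv(x_train.unsqueeze(1))

print(out_act.shape, out_train.shape)
```

The same unsqueeze(1) call works for both the single-sample and the batched input, so it can live at the top of the model's forward method.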
@ptrblck, the shape of the tensor that I want to pass to the CNN is [32, 3, 512, 512]. It holds 32 slices of one image, each of which has three channels. However, the CNN expects a 4-dimensional input tensor [B,C,H,W]. How can I change my tensor [32, 3, 512, 512] so that it can be passed as a 4D tensor following the expected input order [B,C,H,W]?
No, the slices should not be treated as separate samples; all of them have to be considered as one sample. That’s why I want to squish the first two dimensions to make it compliant with the expected input order [B,C,H,W].
Since you’ve mentioned “slices” I would guess you want to treat this dimension as the “depth” then?
If so, you should use a 3D model and pass the input as [batch_size, channels, depth, height, width] via:
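import torch
x = torch.randn(32, 3, 512, 512)  # [slices, channels, height, width]
# swap slices and channels, then add a batch dimension
x = x.permute(1, 0, 2, 3).contiguous().unsqueeze(0)
print(x.shape)
# > torch.Size([1, 3, 32, 512, 512])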
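As a rough sketch of how such a 5D tensor would be consumed, the reshaped volume can be fed to an nn.Conv3d layer. The layer settings here are hypothetical, and the spatial size is reduced from 512x512 to 64x64 only to keep the example light.

```python
import torch
import torch.nn as nn

# Smaller spatial size than the 512x512 in the thread, to keep the demo light.
slices = torch.randn(32, 3, 64, 64)  # [depth, channels, height, width]

# [D, C, H, W] -> [C, D, H, W] -> [1, C, D, H, W]
volume = slices.permute(1, 0, 2, 3).contiguous().unsqueeze(0)

# Hypothetical first layer of a 3D model (not taken from the thread)
conv3d = nn.Conv3d(in_channels=3, out_channels=4, kernel_size=3, padding=1)
out3d = conv3d(volume)

print(volume.shape, out3d.shape)
```

With this layout the 32 slices are convolved along the depth axis, so all of them belong to a single sample.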
Yes, I will be using a 3D CNN later, but at the moment I want to run ResNet as a baseline with 32 slices per sample. The tensor shape is [32, 3, 512, 512], and I want to squish the first two dimensions to make it a 3D tensor, so that I can pass it as [B,C,H,W] to the network.
I’m not sure how the description fits the shape, but assuming you want to move the slices into the channel dimension, you could use x = x.view(-1, 512, 512).unsqueeze(0) to get a tensor of shape [1, 96, 512, 512]. This would of course no longer work in a standard ResNet model, since 3 input channels are expected. If this is your use case, you could replace the first conv layer with a new one accepting 96 channels.
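A minimal sketch of this approach, using plain PyTorch only: the slices are flattened into the channel dimension and passed through a replacement first conv layer that accepts 96 channels. The kernel size, stride, and padding below mirror the first conv of the torchvision ResNet models; the spatial size is reduced to 128x128 purely to keep the example fast, and new_conv1 is a name chosen here for illustration.

```python
import torch
import torch.nn as nn

# Smaller spatial size than the 512x512 in the thread, to keep the demo fast.
x = torch.randn(32, 3, 128, 128)  # [slices, channels, height, width]

# Merge slices and channels: [32, 3, H, W] -> [96, H, W] -> [1, 96, H, W]
flat = x.view(-1, 128, 128).unsqueeze(0)

# Replacement stem conv: same settings as the torchvision ResNet stem,
# but with 96 input channels instead of 3.
new_conv1 = nn.Conv2d(96, 64, kernel_size=7, stride=2, padding=3, bias=False)
out = new_conv1(flat)

print(flat.shape, out.shape)
```

With a torchvision ResNet this layer would be assigned as model.conv1 = new_conv1. Note that the new layer is randomly initialized, so any pretrained weights for the first conv are not reused. Channel index 3 * s + c of the flattened tensor corresponds to channel c of slice s.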