Stack tensor onto itself along a dimension


I have a tensor of shape (batch_size, bag_size, num_channels, height, width) for multiple instance learning: each batch consists of a certain number of bags (batch_size), and each bag consists of a certain number of images (bag_size).

I would like to take that tensor and stack all of the images from every bag along one dimension, resulting in a tensor of shape (batch_size * bag_size, num_channels, height, width), without mixing the data. It seems that view is not the right tool here?

H = cnn_model(x.view(-1, num_channels, height, width))  # x of shape (batch_size, bag_size, num_channels, height, width)

When I compute the loss as a simple weighted sum of the outputs, giving weight only to one bag in the batch and zero weight to the rest, and then call backward(), the resulting gradients are still with respect to every input. What would be the solution? How could I otherwise restack (batch_size, bag_size, num_channels, height, width) into (batch_size * bag_size, num_channels, height, width)?

weights = torch.zeros_like(H)
weights[1, :] = 1  # nonzero weight for one output only
loss = torch.dot(weights.view(-1), H.view(-1))
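Note that backward() will populate a gradient for every input; with zero weights, those gradients come out as exactly zero rather than absent (assuming the model has no cross-sample layers such as batch norm). A minimal sketch, using a hypothetical one-conv stand-in for cnn_model:

```python
import torch
import torch.nn as nn

batch_size, bag_size, num_channels, height, width = 2, 3, 2, 4, 4

# Hypothetical stand-in for cnn_model: one output per image,
# no cross-sample layers, so samples stay independent.
cnn_model = nn.Sequential(
    nn.Conv2d(num_channels, 1, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x = torch.randn(batch_size, bag_size, num_channels, height, width,
                requires_grad=True)
H = cnn_model(x.view(-1, num_channels, height, width))  # (batch*bag, 1)

# Weight only the first bag's images (rows 0..bag_size-1), zero the rest.
weights = torch.zeros_like(H)
weights[:bag_size] = 1
loss = torch.dot(weights.view(-1), H.view(-1))
loss.backward()

# The zero-weighted bag receives an all-zero gradient.
print(bool((x.grad[1] == 0).all()))  # → True
```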

Since the flattened dimensions are consecutive, the elements won’t be mixed.
Here is a small example to show what the operation is doing:

import torch

batch_size, bag_size, num_channels, height, width = 2, 3, 2, 2, 2
x = torch.arange(batch_size*bag_size*num_channels*height*width).view(
    batch_size, bag_size, num_channels, height, width)
x = x.view(-1, num_channels, height, width)
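To make the "no mixing" claim concrete, here is a quick check (with the same toy shapes) that image i of bag b simply ends up at row b * bag_size + i of the flattened tensor, unchanged:

```python
import torch

batch_size, bag_size, num_channels, height, width = 2, 3, 2, 2, 2
x = torch.arange(batch_size * bag_size * num_channels * height * width).view(
    batch_size, bag_size, num_channels, height, width)
flat = x.view(-1, num_channels, height, width)

# Each image keeps its contents; only the two leading dims are merged.
ok = all(
    torch.equal(flat[b * bag_size + i], x[b, i])
    for b in range(batch_size)
    for i in range(bag_size)
)
print(ok)  # → True
```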