Concatenating the output of two linear layers with different "batch_size"

Hi,
Hope you are fine!

So I want to concatenate the output of two linear layers with a dynamic batch size. I have a Graph Convolutional Network that I am combining with a CNN. The graph network has a linear layer at the end and so does the CNN. I want to concatenate the outputs of these two layers and pass the result to the next layer. (The batch size of the graph network is not static, as the number of nodes changes with each graph.)

Here is the code:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class customTwoMLPHead(nn.Module):
    def __init__(self, data, in_channels, representation_size):
        super(customTwoMLPHead, self).__init__()
        self.data = data[0]
        self.conv1 = GCNConv(data[0].x.shape[1], 128)
        self.conv2 = GCNConv(128, 256)
        self.conv3 = GCNConv(256, 512)
        self.fc1 = torch.nn.Linear(512, 1024, bias=True)
        self.fc_cnn_1 = nn.Linear(in_channels, representation_size, bias=True)
        self.fc_cnn_2 = nn.Linear(representation_size + 1024, representation_size, bias=True)

    def forward(self, x, data):
        data = data[0]
        y, edge_index, edge_weight = data.x, data.edge_index, data.edge_attr
        y = F.relu(self.conv1(y, edge_index))
        y = F.relu(self.conv2(y, edge_index))
        y = F.dropout(y, p=0.5, training=self.training)  # disable dropout during eval
        y = F.relu(self.conv3(y, edge_index))
        y = F.relu(self.fc1(y))        # [batch_size, 1024], batch_size is dynamic (number of nodes)
        x = x.flatten(start_dim=1)

        x = F.relu(self.fc_cnn_1(x))   # [batch_size, 1024], batch_size is static
        x = torch.cat((x, y), 1)
        x = F.relu(self.fc_cnn_2(x))
        return x

I get errors like

RuntimeError: Sizes of tensors must match except in dimension 1. Got 1024 and 678 (graph network batch_size) in dimension 0 (The offending index is 1)

Please help!

Thanks :slight_smile:

I’m not deeply familiar with graph networks, so could you explain the use case a bit more? In particular, how does the number of nodes relate to the batch size, i.e. the number of samples?
Is each sample using a different (set of) node(s)?
If so, you won’t be able to concatenate two tensors that differ in more than a single dimension.
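As a small sketch of the constraint (shapes made up to roughly match your error): torch.cat along dim=1 only works if all other dimensions match.

import torch

a = torch.randn(512, 1024)
b = torch.randn(512, 1024)
print(torch.cat((a, b), dim=1).shape)  # torch.Size([512, 2048]) -- dim 0 matches, so this works

c = torch.randn(678, 1024)
torch.cat((a, c), dim=1)               # RuntimeError: sizes must match except in dimension 1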

Thanks for your reply peter. So consider two inputs to this network. One is the Image and the other is a graph, the graph is basically the linkage and features of text on the image. I am using the pytorch faster rcnn as my cnn for object detection on that image. I modify the roi heads in there (specifically the two MLP Heads ) the batch size that comes out from that FasterRcnn is 512 (We use batch size of 1), and the batch size of the graph network comes out to be different on every pass (depending on the nodes we have in the graph).

Thanks for the follow-up. I’m still unable to understand why the batch size changes if you are passing a single input. In “standard” use cases the batch size of the output would match the batch size of the input and would not change, i.e. it indicates how many samples are processed in the current batch by the model.
E.g. if you are passing 5 samples to the model, you would expect to get 5 “results”.
Your current use case would reduce or increase the number of samples, so is this expected?
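A quick sketch of what I mean (made-up layer sizes): the batch dimension goes through a layer unchanged.

import torch
import torch.nn as nn

layer = nn.Linear(10, 3)
out = layer(torch.randn(5, 10))  # 5 samples in ...
print(out.shape)                 # torch.Size([5, 3]) ... 5 results out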

OK, so from my understanding, in x = F.relu(self.fc_cnn_1(x)) # [batch_size, 1024], batch_size is static the output is (512, 1024), where 512 comes from flattening out the input. The same goes for the graph, but there the number of nodes changes, so the dimensions of the graph output are not always the same. In one forward pass it may be 200, in another 400, and so on. The output of the graph network is (batch_size, 1024) (the linear layer representation size); for example it comes out as (200, 1024), in the next pass it is (400, 1024), and maybe in the pass after that it becomes (350, 1024), and so on. I want to combine the output of the graph network with the output of the Faster R-CNN, so we have two parallel networks that at one point combine their outputs and continue as one network. Basically, I want the Faster R-CNN to also have the textual features available when deciding what the object is.
I hope it makes sense :stuck_out_tongue:

P.S. I am also not sure about a lot of these things, so please help me out here.

I want to point out again that I’m not familiar with GNNs, so I might be completely wrong, but in e.g. CNNs this would not be the case. You are flattening the feature dimensions but keeping the batch size static.
I.e. the output of the last conv layer could be [batch_size, a, b, c] and you would flatten it to [batch_size, a*b*c]. The assumption is that each input sample returns an output.
If you pass an input with a batch size of 5 (5 samples), you would expect to get 5 results. The same applies for 1, 10, 1000, … samples.
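E.g. as a small sketch (made-up shapes):

import torch

x = torch.randn(5, 256, 7, 7)   # last conv output: [batch_size, a, b, c]
flat = x.flatten(start_dim=1)   # -> [batch_size, a*b*c]
print(flat.shape)               # torch.Size([5, 12544]) -- batch size is still 5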

In your current model, the output shape depends on some internal conditions (or in general how many nodes are used). E.g. I could pass a single sample to the GNN and would get an output of [400, 1024] and an input with 10 samples could output [200, 1024]?
If that’s the case, how would you be able to map this output to the input?

Thanks a lot for your reply, Peter. You are absolutely right, that totally makes sense and is very strange. OK, so let's just stick to the CNN. Why am I getting an output batch size of 512 when I explicitly set the batch size to 1? When I change this batch size to 2, it becomes 1024.

This shouldn’t be the case. Which CNN are you using? Could you post the code for it here?

I am using the PyTorch Faster R-CNN model with a ResNet-50-FPN backbone. I followed this tutorial: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

The output at the “TwoMLPHead” comes out to be (512, 1024).

Thanks for the update! Now I can see where the variable output shape is coming from.
Detection models work on candidates for each input sample and will filter out all detection results with a low score (or apply a hard candidate threshold). This would explain why the internal batch dimension is a multiple of the input batch size. It also explains why e.g. random noise inputs return empty predictions, where no segmentation/detection results could be found.
You would have to check the reference paper to verify this claim, but that’s at least my understanding of detection models.
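As a hedged sketch of where the 512 could come from: if I read the torchvision implementation correctly, the RoI heads sample a fixed number of proposals per image during training (the box_batch_size_per_image argument, 512 by default), so the TwoMLPHead would see [batch_size * 512, representation_size] regardless of the image content. You could check this yourself with a forward hook:

import torch
import torchvision

# Older torchvision API as in the linked tutorial; newer releases use weights=None instead of pretrained=False.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False, pretrained_backbone=False,
    box_batch_size_per_image=512,  # default value, shown here for clarity
)
model.train()

# Print the shape of the box head (TwoMLPHead) output on each forward pass.
model.roi_heads.box_head.register_forward_hook(
    lambda module, inp, out: print("box_head output:", out.shape)
)

images = [torch.rand(3, 300, 400), torch.rand(3, 300, 400)]  # batch of 2 images
targets = [
    {"boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]]), "labels": torch.tensor([1])}
    for _ in images
]
loss_dict = model(images, targets)
# If the assumption holds, this prints: box_head output: torch.Size([1024, 1024]),
# i.e. 2 images * 512 sampled proposals per image, each with a 1024-dim representation.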