Different results caused by CatBackward

I convolved my input twice, then concatenated the input with the two convolved tensors (all three-dimensional), sent the result to the model, and passed it through an LSTM. But I found that when the order of the concatenation is different, then as training progresses and the loss decreases, my model ends up with completely different results.
For example:

    for idx, (inputs, tlabels) in enumerate(train_loader):
        optimizer.zero_grad()

        inputs = inputs.float().permute(1, 0, 2, 3).to(device)
        hinputs = model.conv1(inputs)  # 2D-Conv, padding=same
        hinputs2 = model.conv2(hinputs)  # 2D-Conv, padding=same

        tlabels = tlabels.float().to(device)
        hlabels = model.conv1(tlabels)
        hlabels2 = model.conv2(hlabels)

        linputs = torch.cat((hinputs, hinputs2, inputs), 1)  # order: firstConv, secondConv, raw
        llabels = torch.cat((hlabels, hlabels2, tlabels), 1)

        # linputs = torch.cat((inputs, hinputs, hinputs2), 1)  # order: raw, firstConv, secondConv
        # llabels = torch.cat((tlabels, hlabels, hlabels2), 1)

        lpreds = model.low_level(linputs)  # LSTM

        loss = criterion(lpreds, llabels.reshape(1, 3, -1).permute(1, 0, 2))

        loss.backward()
        optimizer.step()

In the code, I ‘cat’ the three tensors, and the alternative ‘cat’ order is commented out. These two orderings lead to completely different prediction results for the model. In general, the result is closer to what I get when training on the first tensor alone.
The only difference between them is the order of ‘next_functions’ in ‘grad_fn’. At first I thought the computation order in the graph would cause the gradients of the leaf nodes to differ, but after testing with a few custom tensors I found that it does not.
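Here is a minimal sketch of the kind of check I ran (toy tensors and an order-invariant reduction instead of my real loss): the gradients reaching each input through torch.cat are identical regardless of the concatenation order.

    import torch

    a = torch.randn(2, 3, requires_grad=True)
    b = torch.randn(2, 3, requires_grad=True)

    # Order 1: cat (a, b), then an order-invariant reduction
    torch.cat((a, b), dim=1).pow(2).sum().backward()
    ga1, gb1 = a.grad.clone(), b.grad.clone()
    a.grad, b.grad = None, None

    # Order 2: cat (b, a), same reduction
    torch.cat((b, a), dim=1).pow(2).sum().backward()
    print(torch.equal(ga1, a.grad), torch.equal(gb1, b.grad))  # True True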
How can this be explained? I would be grateful if anyone could answer me.

Based on the code snippet it seems that the model architecture is the same, but the torch.cat call uses a different order of inputs. Could you explain how dim1 is treated in model.low_level? Is it the temporal dimension, so that the LSTM would see a different sequence?

Thanks for your reply.

    class LSTMNet(nn.Module):
        def __init__(self):
            super(LSTMNet, self).__init__()
            self.lstm = nn.LSTM(n_steps, 64, 2)  # input_size=n_steps, hidden_size=64, 2 layers
            self.out = nn.Sequential(
                nn.Linear(64, 256),
                nn.Linear(256, 16),
                nn.Linear(16, 1)
            )

        def forward(self, x):
            x, _ = self.lstm(x)               # (L, N, 64) with the default batch_first=False
            x = self.out(x).permute(0, 2, 1)  # (L, N, 1) -> (L, 1, N)
            return x

This is my LSTM architecture, and n_steps is the temporal dimension.
This is the low_level function:

    def low_level(self, x):
        # (T, 3, H, W) -> (3, T, H, W) -> (3, T, H*W) -> (3, H*W, T)
        x = x.permute(1, 0, 2, 3).reshape(3, n_steps, -1).permute(0, 2, 1)
        x = self.lstmnet(x)  # LSTM + FC head
        return x

For example, the shape of my inputs is (1, T, H, W), and I permute it to (T, 1, H, W), which lets T act as the batch and the variate as the channel, so I can apply the Conv2D (padding=same) twice. Then I torch.cat(dim=1) the three tensors to get the shape (T, 3, H, W) and run model.low_level(). Before the LSTM, some permutes and a reshape are applied: (T, 3, H, W) → (3, T, H, W) → (3, T, H*W) → (3, H*W, T), and my LSTM with the FC layers changes the shape to (3, 1, H*W) (the 1 also represents T), which is my output.
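To make the shape bookkeeping concrete, here is a minimal standalone trace of these transformations (the sizes T=4, H=W=5 and the 1→1-channel convs are just toy placeholders, not my real layers):

    import torch
    import torch.nn as nn

    T, H, W = 4, 5, 5                        # toy sizes
    x = torch.randn(1, T, H, W)              # raw input: (1, T, H, W)
    x = x.permute(1, 0, 2, 3)                # (T, 1, H, W): T as batch, variate as channel

    conv1 = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # stand-in for padding=same conv
    conv2 = nn.Conv2d(1, 1, kernel_size=3, padding=1)
    h1 = conv1(x)                            # (T, 1, H, W)
    h2 = conv2(h1)                           # (T, 1, H, W)

    cat = torch.cat((h1, h2, x), dim=1)      # (T, 3, H, W)
    out = cat.permute(1, 0, 2, 3).reshape(3, T, -1).permute(0, 2, 1)
    print(out.shape)                         # torch.Size([3, 25, 4]) = (3, H*W, T)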

In the default setup nn.LSTM expects an input in the shape [L, N, H_in] as described in the docs.
Based on your description you are feeding inputs as [3, H*W, T]. Would this correspond to [N=batch_size, H_in, L]? If so, you would be mixing the dimensions and might want to check the inputs again.
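For example, a quick illustration of the default layout (toy sizes, batch_first=False as per the docs):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=8, hidden_size=64, num_layers=2)  # batch_first=False by default
    x = torch.randn(3, 25, 8)      # interpreted as (L=3, N=25, H_in=8)
    out, (h, c) = lstm(x)
    print(out.shape)               # torch.Size([3, 25, 64]) = (L, N, hidden_size)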

Thanks for your reply!
Is L (the sequence length) a concept from NLP? The task of my LSTM is time-series prediction. ‘3’ is the three tensors I torch.cat(); ‘H*W’ is all the points in my dataset; ‘T’ is the number of timesteps in the training window. I think H*W corresponds to N because the prediction is made for every single point, and ‘T’ corresponds to ‘H_in’ because I want to map it to the hidden state. How should I understand L? 3 is just the dimension produced by torch.cat() over three tensors; it could be any number if I add more Conv2D layers (I also plan to try that in the future).
In fact, my code has achieved the best results I have ever gotten, and it already meets my expectations.

        linputs = torch.cat((hinputs, hinputs2, inputs), 1)
        llabels = torch.cat((hlabels, hlabels2, tlabels), 1)

is better than

        linputs = torch.cat((hinputs2, hinputs, inputs), 1)
        llabels = torch.cat((hlabels2, hlabels, tlabels), 1)

and

        linputs = torch.cat((inputs, hinputs, hinputs2), 1)
        llabels = torch.cat((tlabels, hlabels, hlabels2), 1)
        # or this:
        # linputs = torch.cat((inputs, hinputs2, hinputs), 1)
        # llabels = torch.cat((tlabels, hlabels2, hlabels), 1)

With no changes to any other code, these orderings produced poor results that were completely out of line with expectations.
Therefore, the likely explanation is that the dimension built by torch.cat() is used as L, so the three tensors play different roles (different timesteps) in the LSTM computation, which leads to the different predictions, right?
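A quick toy check (randomly initialized LSTM, hypothetical sizes) seems to support this: an LSTM is not permutation-invariant along the sequence dimension, so reordering the concatenated tensors changes the output.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    lstm = nn.LSTM(input_size=8, hidden_size=16)  # expects (L, N, H_in)
    a = torch.randn(1, 4, 8)
    b = torch.randn(1, 4, 8)
    c = torch.randn(1, 4, 8)

    out1, _ = lstm(torch.cat((a, b, c), dim=0))   # order a, b, c along L
    out2, _ = lstm(torch.cat((c, b, a), dim=0))   # order c, b, a along L
    print(torch.allclose(out1, out2))             # False: the LSTM is order-sensitive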

Yes, a sequence length is often used in NLP (e.g. you can see words as a sequence forming a sentence), but it is not limited to NLP (e.g. EEG readings can be processed in a similar way).

Yes, that could be the case. This post explains it in more detail. While the processing would be different, I wouldn’t claim that only a specific sequence input would work, but you would need to run some experiments.