Moving TF code to PyTorch

Hi! I am completely new to PyTorch. I would like to move my TF code to PyTorch, and I think I am missing something.

I have X as input and Y as output. X is time series data on which I would like to run a 1D convolution. Y is a single number per sample.

X has a shape of (1050589, 81, 21): I have 1050589 experiments, each with 81 timestamps, and each timestamp has 21 data points. This is the layout TF requires, but as far as I could work out, PyTorch expects the time dimension to be the last one.

My data is in a NumPy array, so I first transposed each sample to fit PyTorch and collected the samples in a list.

from torch.utils.data import DataLoader

# Pair each transposed sample, shape (21, 81), with its target
a = []
for x, y in zip(X, Y):
    a.append([x.T, y])

train_data = DataLoader(a, batch_size=128)
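As a possible alternative to the per-sample loop, the whole array can be transposed at once and wrapped in a `TensorDataset`. This is only a sketch: the small random arrays stand in for the real X and Y, and the conversion to float32 is an assumption (NumPy defaults to float64, while PyTorch layers default to float32 parameters).

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Small random stand-ins for the real arrays (hypothetical sizes).
X = np.random.rand(1000, 81, 21)   # (samples, time, features), the TF layout
Y = np.random.rand(1000)

# nn.Conv1d expects (batch, channels, length), so swap the last two axes.
X_t = torch.from_numpy(X).float().permute(0, 2, 1).contiguous()
Y_t = torch.from_numpy(Y).float()

train_data = DataLoader(TensorDataset(X_t, Y_t), batch_size=128)
xb, yb = next(iter(train_data))
print(xb.shape, yb.shape)  # torch.Size([128, 21, 81]) torch.Size([128])
```

With float32 inputs and targets, a cast of the model output to double should no longer be needed.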

My model looks like this:

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear_relu_stack = nn.Sequential(
            nn.Conv1d(EMBED_SIZE, 32, 7, padding='same'),
            nn.ReLU(),
            nn.Flatten(),  # (N, 32, 81) -> (N, 32*81) to match the Linear input
            nn.Linear(81*32, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits.double()

The architecture is simple, as I want to keep it the same as the one I have in TensorFlow: one convolution with a kernel size of 7 and 32 channels, followed by a dense layer and a single output layer.
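A quick way to confirm that the layers line up is to push a dummy batch through the stack and print the shape after each layer. The sketch below assumes 21 input channels, ReLU activations, and a `Flatten` between the convolution and the dense layer; those details are assumptions drawn from the description, not confirmed settings.

```python
import torch
from torch import nn

EMBED_SIZE = 21  # assumed: number of features per timestamp

layers = nn.Sequential(
    nn.Conv1d(EMBED_SIZE, 32, 7, padding="same"),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(81 * 32, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

x = torch.randn(4, EMBED_SIZE, 81)  # dummy batch: (batch, channels, time)
shapes = []
for layer in layers:
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(f"{layer.__class__.__name__:<8} -> {tuple(x.shape)}")
```

The final shape is (4, 1), which is why the prediction has to be squeezed (or the target unsqueezed) before computing the loss.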

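Same network in TensorFlow:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Dense, Flatten

def conv_1d_model():
    model = Sequential(name="model_conv1D")
    model.add(Conv1D(filters=32, kernel_size=7, activation='relu', input_shape=(81, 21), padding="same"))
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1))
    return model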

Now when I try to optimize this network in PyTorch my losses are all over the place, not decreasing at all, while in TensorFlow it runs perfectly well.

I am sure I am missing something, can anyone point me in the right direction?

My optimization function in PyTorch:

model = NeuralNetwork()

loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = torch.squeeze(model(X))  # I was getting a warning about the pred being in different shape than y, so I squeezed it
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 10 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

Optimization in TensorFlow:

model = conv_1d_model()
opt = Adam(learning_rate=learning_rate)
model.compile(loss='mse', optimizer=opt, metrics=['mae'])

model_history = model.fit(X, Y, validation_split=0.2, epochs=epochs, batch_size=batch_size, verbose=1)

At first glance, it looks like your samples are not being shuffled. Is there a reason for that?

I don't think it would matter; these are actually embedded protein sequences. I also think the TF code lacks shuffling, so that should not produce these strange results.
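One difference that may be worth ruling out here: Keras `model.fit` has `shuffle=True` as its default and reshuffles the training samples every epoch, whereas a PyTorch `DataLoader` iterates in dataset order unless `shuffle=True` is passed. A minimal sketch on a toy dataset (the data is made up purely to make the ordering visible):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ordered toy dataset so that shuffling is easy to see.
features = torch.arange(10).float().unsqueeze(1)
targets = torch.arange(10).float()
data = TensorDataset(features, targets)

ordered = DataLoader(data, batch_size=5)                 # default: shuffle=False
shuffled = DataLoader(data, batch_size=5, shuffle=True)  # reshuffled every epoch

first_ordered = next(iter(ordered))[1]
print(first_ordered)   # the first five targets, in dataset order

epoch_targets = torch.cat([y for _, y in shuffled])
print(epoch_targets)   # some permutation of 0..9
```

If the real data is sorted in any way (for example by target value or by protein family), unshuffled batches could plausibly produce a very noisy loss curve.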

# I was getting a warning about the pred being in different shape than y, so I squeezed it

Could you check the shape of the model output and the targets?
I would assume nn.MSELoss was warning about a potentially unwanted broadcasting, so could you calculate the loss for one batch manually and compare it to the output of MSELoss?

While the optimization is running, the shape of y is [256], while the returned shape of pred is [256, 1]. After I squeezed it, the warning was gone.

How can I check the loss on one batch manually?

You could write the loss function e.g. via ((x - y)**2).mean() and check whether the intermediates are broadcast. However, since you’ve squeezed one tensor, I would assume the shapes are equal (you could instead unsqueeze the other tensor in dim1, which should also work and might be more explicit).
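The manual check can be sketched as follows. The tiny tensors are made up so the numbers are easy to verify by hand: the prediction equals the target exactly, so the correct loss is 0, and any other value indicates broadcasting.

```python
import torch
from torch import nn

y = torch.arange(4.0)            # targets, shape [4]
pred = y.clone().unsqueeze(1)    # "perfect" predictions, shape [4, 1]

# Mismatched shapes: [4, 1] - [4] broadcasts to [4, 4].
diff = pred - y
print(diff.shape)                # torch.Size([4, 4])
broadcasted = (diff ** 2).mean()

# Matching shapes: squeeze the prediction or unsqueeze the target.
manual = ((pred.squeeze(1) - y) ** 2).mean()
builtin = nn.MSELoss()(pred.squeeze(1), y)
builtin_unsqueezed = nn.MSELoss()(pred, y.unsqueeze(1))

print(manual.item(), builtin.item(), builtin_unsqueezed.item())  # 0.0 0.0 0.0
print(broadcasted.item())                                        # 2.5
```

The broadcast version compares every prediction with every target, so it reports a nonzero loss even for a perfect model.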
Assuming the loss calculation is correct, I would recommend trying to overfit a small dataset (e.g. just 10 samples) to see whether your model is able to do so.
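A minimal sketch of such an overfitting test is shown below. The 10 random samples, the learning rate of 1e-3, and the 200 full-batch steps are all assumptions chosen for illustration; with the real data, the first 10 rows of X and Y would take their place.

```python
import torch
from torch import nn

torch.manual_seed(0)
X_small = torch.randn(10, 21, 81)   # 10 random stand-in samples, (batch, channels, time)
y_small = torch.randn(10)

model = nn.Sequential(
    nn.Conv1d(21, 32, 7, padding="same"), nn.ReLU(), nn.Flatten(),
    nn.Linear(81 * 32, 32), nn.ReLU(), nn.Linear(32, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed learning rate

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_small).squeeze(1), y_small)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

If the loss does not fall sharply on such a tiny set, the problem is more likely in the model or the training step than in the data.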

Dear ptrblck!

Thanks a lot for helping me! I tried what you said, and to my surprise with only 10 elements in X and Y the model converges perfectly well, the loss goes to 1e-2 in 100 epochs.

I thought it would fail. Now I have absolutely no idea what causes the difference between the TF run and the PyTorch one.

Are you seeing the same final loss of ~1e-2 in TF for these 10 samples as well?
If so, could you try to scale up the training use case in PyTorch again and see when it breaks?
If the initial training looks good, I would start checking the parameter initialization next.
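For the initialization check, one option is to make the PyTorch layers mimic the Keras defaults. Keras `Conv1D` and `Dense` use Glorot (Xavier) uniform kernels and zero biases by default, while PyTorch's `nn.Conv1d` and `nn.Linear` use a Kaiming-uniform-based scheme with nonzero biases. The helper name `keras_like_init` below is hypothetical, and the layer stack is an assumed reconstruction of the model under discussion.

```python
import torch
from torch import nn

def keras_like_init(module: nn.Module) -> None:
    """Glorot-uniform weights and zero biases for conv and linear layers."""
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Conv1d(21, 32, 7, padding="same"), nn.ReLU(), nn.Flatten(),
    nn.Linear(81 * 32, 32), nn.ReLU(), nn.Linear(32, 1),
)
model.apply(keras_like_init)  # apply() visits every submodule recursively

# Inspect the resulting parameter ranges.
for name, p in model.named_parameters():
    print(f"{name:<10} shape={tuple(p.shape)} absmax={p.abs().max().item():.4f}")
```

Whether this closes the gap between the two frameworks is an open question; it only removes initialization as one possible source of the difference.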