Converting my model from Keras to PyTorch doesn't work: the PyTorch version produces unchanging outputs

Hey everyone!

I’ve been trying to convert a set of 12 concurrent signals of length 512 each into a single signal of length 100. I initially wanted to start with a simple Keras implementation as follows:

from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Conv1D(filters=16, kernel_size=5, strides=2, input_shape=(512, 12)))
model.add(Conv1D(filters=32, kernel_size=5, strides=2))
model.add(Conv1D(filters=64, kernel_size=5, strides=2))
model.add(Flatten())  # collapse (timesteps, channels) so the Dense layer yields one length-100 signal
model.add(Dense(100, activation='tanh'))

model.compile(loss='mean_squared_error', optimizer=Adam(lr=1e-6))
history =, y, epochs=100, batch_size=32, verbose=1, validation_data=(x_val, y_val))

Now this model performs quite poorly*, so I wanted to attempt a pix2pix-style approach in PyTorch instead.

For some reason, my PyTorch generator keeps producing extremely similar outputs:

y_ = netG(x_val)
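To make "extremely similar outputs" concrete, here's a minimal sketch of the kind of check I'm running, with a toy stand-in for netG (the names and architecture here are illustrative, not my real generator):

```python
import torch
import torch.nn as nn

# Toy stand-in for netG; my real generator is pix2pix-style
netG = nn.Sequential(
    nn.Conv1d(12, 4, kernel_size=5, padding=2),
    nn.Flatten(),
    nn.Linear(4 * 512, 100),
    nn.Tanh(),
)

x_val = torch.randn(8, 12, 512)   # (batch, channels, length), PyTorch layout

netG.eval()                        # use running stats for any BatchNorm layers
with torch.no_grad():
    y_ = netG(x_val)               # shape (8, 100)

# "Unchanging outputs" shows up as a near-zero spread across the batch:
spread = y_.std(dim=0).mean().item()
print(y_.shape, spread)
```

With my trained generator, the spread across distinct validation inputs is close to zero, which is the symptom I'm describing.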

Obviously, the PyTorch pix2pix implementation differs from the simple Keras implementation. But the issue persists even when I switch criterion_reconstruction to MSE, disable the GAN loss entirely (using only the raw L1 or L2 reconstruction loss), and use the same optimizer, the same batch size, and the same dataset: PyTorch has the issue while Keras does not.

Have I made some glaring error somewhere?

Any help would be greatly appreciated.

*It still generates varying outputs that at least slightly resemble the expected output more often than not. This differs from my flawed PyTorch implementation, which produces a single, constant output bearing less resemblance to the expected output than anything the Keras model yields.

Two things I notice:

  1. Why are you adding in a reconstruction_lambda scalar to the loss? Try removing that.
  2. Your Discriminator uses a Sigmoid layer and you’re using BCELoss. Try removing the Sigmoid and using nn.BCEWithLogitsLoss. This takes advantage of the log-sum-exp trick for better numerical stability.
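Concretely, the swap for point 2 looks something like this (the layer shape here is just illustrative):

```python
import torch
import torch.nn as nn

# Before: discriminator ends in nn.Sigmoid() and you use nn.BCELoss().
# After: drop the Sigmoid and feed raw logits to BCEWithLogitsLoss,
# which fuses sigmoid + BCE via the log-sum-exp trick for stability.
disc_head = nn.Linear(64, 1)               # illustrative final layer, no Sigmoid
criterion = nn.BCEWithLogitsLoss()

logits = disc_head(torch.randn(32, 64))    # raw scores, any real value
real_labels = torch.ones(32, 1)
loss = criterion(logits, real_labels)
print(loss.item())
```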

Thanks for taking the time to reply!

The “reconstruction lambda” is taken from the pix2pix paper (see bottom-left of p. 6), which I’m adapting to my signal generation problem. They use a lambda of 100; I found that 200 seemed to work better (but it’s all mostly garbage anyway).
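For reference, my generator objective follows the paper's combined loss; in code it's roughly this (the tensors below are placeholders standing in for D(G(x)) logits, G(x), and the target):

```python
import torch
import torch.nn as nn

reconstruction_lambda = 200.0              # paper uses 100; 200 worked slightly better for me

criterion_adversarial = nn.BCEWithLogitsLoss()
criterion_reconstruction = nn.L1Loss()     # pix2pix uses L1 for the reconstruction term

# Placeholder tensors for illustration
fake_logits = torch.randn(32, 1)           # D(G(x)) raw scores
y_fake = torch.randn(32, 100)              # G(x)
y = torch.randn(32, 100)                   # target signal

loss_adversarial = criterion_adversarial(fake_logits, torch.ones(32, 1))
loss_reconstruction = criterion_reconstruction(y_fake, y)
lossG = loss_adversarial + reconstruction_lambda * loss_reconstruction
print(lossG.item())
```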

When I attempt to directly replicate the Keras model, I essentially replace that line with just the reconstruction loss, so, like the Keras model, neither the reconstruction lambda nor the discriminator-based loss is used.

As for the second suggestion, thanks! I am familiar with that trick but for some reason neglected to use it here.

In any case though, I’m still stumped as to why simply training the generator on the L1 or MSE loss, without any discriminator involvement, fails to perform similarly to the Keras model. For context, I replicate the Keras model by replacing lossG = loss_adversarial + reconstruction_lambda * loss_reconstruction with lossG = loss_reconstruction, setting loss_reconstruction to MSE, and matching hyperparameters like batch size, number of epochs, and learning rate.
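In sketch form, the reduced training step looks like this (with a toy generator and random data standing in for my real model and dataset):

```python
import torch
import torch.nn as nn

# Toy stand-in; the real netG is the pix2pix-style generator
netG = nn.Sequential(nn.Flatten(), nn.Linear(12 * 512, 100), nn.Tanh())
optimizerG = torch.optim.Adam(netG.parameters(), lr=1e-6)   # matching the Keras run
criterion_reconstruction = nn.MSELoss()

x = torch.randn(32, 12, 512)   # one batch of 12-channel length-512 signals
y = torch.randn(32, 100)       # target length-100 signals

netG.train()
optimizerG.zero_grad()
y_fake = netG(x)
# Discriminator path disabled entirely; the full objective would be
# lossG = loss_adversarial + reconstruction_lambda * loss_reconstruction
lossG = criterion_reconstruction(y_fake, y)   # reconstruction term only
lossG.backward()
optimizerG.step()
print(lossG.item())
```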

I feel like something is majorly wrong somewhere and I just can’t sniff it out.

For some reason, choosing a reconstruction lambda of ~50 (lower performs worse; higher is also quite poor, though somewhat okay again around ~200), raising the learning rate substantially, and using large batch sizes (in the hundreds) seems to fix the issue. I also suspect some mode collapse is involved, since the generator isn’t producing very diverse outputs; I’ll address that shortly to see whether it helps.

In any case though, I still find it very weird that the PyTorch implementation (even removing the GAN/discriminator component) operates so differently from Keras.