Simple XOR network sometimes fails to converge

Hello, I’m new to the C++ API of PyTorch (libtorch). I implemented a simple XOR net for learning purposes and ran into the following behavior; I don’t know what I did wrong.

Basically, on some runs the XOR net does not converge and gets stuck (the loss stops going down at an acceptable rate), but most of the time it works and converges fast.

My hunch is that it’s caused by the random initialization of the network’s weights, i.e. sometimes the initial weights are just bad, but I don’t really know if this is the case.
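If it helps to reproduce this, pinning the global RNG seed before constructing the model makes the random weight initialization deterministic, so a non-converging run can be replayed and inspected (the seed value 42 is arbitrary):

#include <torch/torch.h>

int main() {
  torch::manual_seed(42);  // same seed => same initial weights on every run
  // ... construct the XorNet and run the training loop from the listing
  // below; every run now starts from identical weights ...
  return 0;
}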

Here is my implementation:

#include <iostream>
#include <memory>
#include <torch/torch.h>

// Simple 2-2-1 MLP: two inputs, one hidden layer of two sigmoid units,
// and one sigmoid output.
struct XorNet : torch::nn::Module {
  XorNet() {
    linear_stack = register_module("linear_stack",
      torch::nn::Sequential(
        torch::nn::Linear(torch::nn::LinearOptions(2, 2).bias(true)),
        torch::nn::Sigmoid(),
        torch::nn::Linear(torch::nn::LinearOptions(2, 1).bias(true)),
        torch::nn::Sigmoid()
      )
    );
  }

  torch::Tensor forward(torch::Tensor x) {
    return linear_stack->forward(x);
  }

  torch::nn::Sequential linear_stack{nullptr};
};

int main() {
  torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);

  XorNet model;
  model.to(device);
  
  std::cout << "Weights BEFORE training:\n";
  for (const auto& pair : model.named_parameters()) {
    std::cout << pair.key() << ":\n" << pair.value() << "\n\n";
  }

  torch::optim::Adam optimizer(model.parameters(), torch::optim::AdamOptions(0.01f));
  torch::nn::BCELoss loss_fn;
  
  // The four XOR input patterns; the labels below are the expected outputs.
  torch::Tensor inputs = torch::tensor({
    {0.f, 0.f}, {0.f, 1.f}, {1.f, 0.f}, {1.f, 1.f}
  });

  torch::Tensor labels = torch::tensor({{0.f}, {1.f}, {1.f}, {0.f}});

  inputs = inputs.to(device).to(torch::kFloat32);
  labels = labels.to(device).to(torch::kFloat32);

  std::cout << "inputs  : \n" << inputs << "\n";
  std::cout << "labels  : \n" << labels << "\n";

  auto initial_output = model.forward(inputs);
  auto initial_loss = loss_fn(initial_output, labels);

  std::cout << "\nBefore Training:\n";
  std::cout << "output  : \n" << initial_output << "\n";
  std::cout << "loss    : \n" << initial_loss   << "\n";

  model.train();

  const size_t MAX_EPOCH = 1'000'000;
  size_t epoch = 0;

  while (epoch < MAX_EPOCH) {
    optimizer.zero_grad();

    auto training_output = model.forward(inputs);
    auto training_loss = loss_fn(training_output, labels);

    training_loss.backward();
    optimizer.step();

    // log loss every 500 epochs
    if ((epoch + 1) % 500 == 0) {
      std::cout << "training loss (epoch[" << epoch + 1 << "]) : " << training_loss.item().toFloat() << "\n";
    }

    if (training_loss.item().toFloat() < 0.005f) {
      std::cout << "TRAINING DONE LOSS BELOW < 0.005f | epochs: " << epoch + 1 << "\n";
      break;
    }

    epoch++;
  }

  if (epoch == MAX_EPOCH) {
    std::cout << "MAX EPOCH ACHIEVED\n";
  }

  model.eval();

  std::cout << "\n\nWeights AFTER training:\n";
  for (const auto& pair : model.named_parameters()) {
    std::cout << pair.key() << ":\n" << pair.value() << "\n\n";
  }

  auto trained_output = model.forward(inputs);
  auto trained_loss = loss_fn(trained_output, labels);

  std::cout << "\nAfter Training:\n";
  std::cout << "output    : \n" << trained_output << "\n";
  std::cout << "loss start: " << initial_loss.item().toFloat() << "\n";
  std::cout << "loss end  : " << trained_loss.item().toFloat() << "\n";

  std::cout << "Total Training Epochs Done : " << epoch + 1 << "\n";

  return 0;
} 

Example where the loss seems to be stuck:

training loss (epoch[604000]) : 0.346574
training loss (epoch[604500]) : 0.346574
...
training loss (epoch[610500]) : 0.346574
training loss (epoch[611000]) : 0.346574
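Interestingly, the stuck value 0.346574 is exactly ln(2)/2. That is what mean BCE gives when two of the four points are fit almost perfectly (per-sample loss near 0) while the other two sit at an output of 0.5 (per-sample loss -log(0.5) = ln 2 each), so this looks like a genuine local minimum rather than a bug. I haven’t verified that the outputs actually sit there, but the number matches; quick check of the arithmetic:

#include <cmath>
#include <cstdio>

int main() {
  // (0 + 0 + ln 2 + ln 2) / 4 = ln(2) / 2
  std::printf("%f\n", std::log(2.0) / 2.0);  // prints 0.346574
  return 0;
}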

Example where it converges fast:

Weights BEFORE training:
linear_stack.0.weight:
 0.6285  0.4231
-0.5949 -0.6533
[ CPUFloatType{2,2} ]

linear_stack.0.bias:
-0.1745
-0.5666
[ CPUFloatType{2} ]

linear_stack.2.weight:
 0.4732  0.4593
[ CPUFloatType{1,2} ]

linear_stack.2.bias:
 0.2365
[ CPUFloatType{1} ]

inputs  : 
 0  0
 0  1
 1  0
 1  1
[ CPUFloatType{4,2} ]
labels  : 
 0
 1
 1
 0
[ CPUFloatType{4,1} ]

Before Training:
output  : 
 0.6500
 0.6473
 0.6537
 0.6536
[ CPUFloatType{4,1} ]
loss    : 
0.742499
[ CPUFloatType{} ]
training loss (epoch[500]) : 0.313808
training loss (epoch[1000]) : 0.0393235
training loss (epoch[1500]) : 0.0170335
training loss (epoch[2000]) : 0.00958167
training loss (epoch[2500]) : 0.0060585
TRAINING DONE LOSS BELOW < 0.005f | epochs: 2737


Weights AFTER training:
linear_stack.0.weight:
 9.1454 -9.0918
 9.2082 -9.3059
[ CPUFloatType{2,2} ]

linear_stack.0.bias:
 4.5912
-4.9715
[ CPUFloatType{2} ]

linear_stack.2.weight:
-10.6332  11.3042
[ CPUFloatType{1,2} ]

linear_stack.2.bias:
 5.0559
[ CPUFloatType{1} ]


After Training:
output    : 
 0.0045
 0.9929
 0.9962
 0.0045
[ CPUFloatType{4,1} ]
loss start: 0.742499
loss end  : 0.00499383
Total Training Epochs Done : 2737
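
If bad initialization does turn out to be the cause, the workaround I have in mind is to detect a stall and re-draw the weights. Below is a rough, untested sketch: the stall window, the improvement tolerance, the helper name train_with_restarts, and the decision to rebuild the whole model and optimizer are all my own arbitrary choices, not anything from the code above.

#include <limits>
#include <memory>
#include <torch/torch.h>

// (XorNet is the struct from the listing above.)
// If the loss has not improved by at least `tol` within `window` consecutive
// epochs, rebuild the model and optimizer so fresh random weights are drawn.
// Returns 0 on convergence, 1 if the epoch budget runs out.
int train_with_restarts(torch::Tensor inputs, torch::Tensor labels) {
  auto model = std::make_shared<XorNet>();
  auto optimizer = std::make_unique<torch::optim::Adam>(
      model->parameters(), torch::optim::AdamOptions(0.01));
  torch::nn::BCELoss loss_fn;

  float best = std::numeric_limits<float>::max();
  size_t since_best = 0;
  const size_t window = 2000;  // arbitrary stall window
  const float tol = 1e-5f;     // arbitrary improvement tolerance

  for (size_t epoch = 0; epoch < 1'000'000; ++epoch) {
    optimizer->zero_grad();
    auto loss = loss_fn(model->forward(inputs), labels);
    loss.backward();
    optimizer->step();

    const float l = loss.item().toFloat();
    if (l < 0.005f) return 0;  // converged
    if (best - l > tol) {
      best = l;
      since_best = 0;
    } else if (++since_best >= window) {
      // Stalled: re-draw the weights by rebuilding model and optimizer.
      model = std::make_shared<XorNet>();
      optimizer = std::make_unique<torch::optim::Adam>(
          model->parameters(), torch::optim::AdamOptions(0.01));
      best = std::numeric_limits<float>::max();
      since_best = 0;
    }
  }
  return 1;  // never converged within the epoch budget
}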