Training Models in Sequence Changes Output Results

Hi Everyone,

I am working on a project that receives data and divides it into 3 subsamples. For each subsample I run a for loop that trains an MLP model, predicts future values, and generates metrics. The issue is that I realized the results are not the same depending on which subsample the model starts training on. For example:

1st Run:
Subsample 1 - Accuracy: 0.54
Subsample 2 - Accuracy: 0.65
Subsample 3 - Accuracy: 0.55

2nd Run, excluding Subsample 1 from for loop:
Subsample 2 - Accuracy: 0.47
Subsample 3 - Accuracy: 0.56

Please note that the subsample sizes and contents, and the MLP architecture and hyperparameters, do not change at any point.

My expectation was that the results would be the same (sub2: 0.65, sub3: 0.55), independently of any previous model training. Please note that I'm using Microsoft DirectML for the AMD GPU implementation. I have already tried a number of possible fixes, such as:

  1. Setting a seed with torch.manual_seed(0)
  2. Dropping the GPU entirely and training only on the CPU
  3. Resetting the model parameters with module.reset_parameters() for each iteration
  4. Recreating the optimizer object for each iteration
  5. del-ing the model and calling gc.collect() for each iteration
  6. Since I'm not using an NVIDIA GPU, torch.cuda.empty_cache() doesn't apply
None of those changes helped me get consistent results, and I really don't know what else I can change; below is a simplified sketch of the per-iteration reset I'm describing.
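
It's a toy, self-contained version — a single Linear layer and random tensors stand in for my real model and data:

import gc
import torch
import torch.nn as nn
from torch.optim import Adam

torch.manual_seed(0)                                   # attempt 1: fix the seed once
# Toy placeholder data, three "subsamples" of (inputs, targets)
subsamples = [(torch.randn(32, 4), torch.randn(32, 1)) for _ in range(3)]

for x, y in subsamples:
    model = nn.Linear(4, 1)                            # fresh model per iteration
    for m in model.modules():
        if hasattr(m, 'reset_parameters'):
            m.reset_parameters()                       # attempt 3: reset parameters
    optimizer = Adam(model.parameters(), lr=0.001)     # attempt 4: recreate the optimizer

    loss = nn.MSELoss()(model(x), y)                   # stand-in for one training step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    del model, optimizer                               # attempt 5: free the old objects
    gc.collect()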

Here’s the code for the MLP:

import torch.nn as nn


class MLP(nn.Module):

    def __init__(self, input_size, dropout):
        super(MLP, self).__init__()
        self.linear1 = nn.Linear(input_size, 18)
        self.linear2 = nn.Linear(18, 18)
        self.linear3 = nn.Linear(18, 1)
        self.F = nn.ReLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.linear1(x)
        x = self.F(x)
        x = self.dropout(x)

        x = self.linear2(x)
        x = self.F(x)
        x = self.dropout(x)

        x = self.linear3(x)

        # Input is reshaped to (batch, seq, features) before the forward pass,
        # so only the last position along the sequence dimension is returned.
        return x[:, -1, :]

The model training step:

import numpy as np
import torch
from torch.nn import MSELoss
from torch.optim import Adam
from tqdm import tqdm


class Optimization:

    def __init__(self, model, filepath, outputpath, lr=0.001):
        self.model = model
        self.filepath = filepath
        self.outputpath = outputpath
        self.lr = lr
        self.train_losses = []

    def train_network(self, train_pct=0.9, n_epochs=50, batch_size=64, rolling_window=False, device='cpu'):

        self.optimizer = Adam(self.model.parameters(), lr=self.lr)
        self.loss_fn = MSELoss(reduction='mean')

        ## gets data from dataframe (x_train, y_train......) ####

        train_loader, test_loader, test_loader_one = self.toDataloader(x_train, y_train, x_test, y_test, batch_size)

        # Model fit
        for epoch in tqdm(range(1, n_epochs + 1)):

            batch_losses = []
            for x_batch, y_batch in train_loader:
                x_batch = x_batch.view([batch_size, -1, x_batch.shape[1]]).to(device)
                y_batch = y_batch.to(device)
                loss = self.train_step(x_batch, y_batch)
                batch_losses.append(loss)
            training_loss = np.mean(batch_losses)
            self.train_losses.append(training_loss)

        # Model prediction
        predictions = []
        self.model.eval()
        with torch.no_grad():

            for x_test, y_test in test_loader_one:
                x_test = x_test.view([1, -1, x_test.shape[1]]).to(device)
                y_test = y_test.to(device)
                yhat = self.model(x_test)
                # scaler comes from the (elided) preprocessing step above
                yhat = scaler.inverse_transform(yhat.cpu().numpy().reshape(-1, 1))
                predictions.append(yhat)

        predictions = np.concatenate(predictions, axis=0).ravel()
        ### Sends results to dataframe

    def train_step(self, x, y):

        # Sets the model to train mode
        self.model.train()
        # Zeroes the gradients
        self.optimizer.zero_grad()
        # Makes predictions
        yhat = self.model(x)
        # Computes the loss
        loss = self.loss_fn(yhat, y)
        # Computes gradients
        loss.backward()
        # Updates parameters
        self.optimizer.step()
        # Returns the loss value
        return loss.item()

The Runner function:

list_ds = os.listdir('processed')
for ds_name in tqdm(list_ds):

    # Instantiate the model and optimization objects ----------------------
    mlp = MLP(input_size, dropout).to(device)
    opt = Optimization(model=mlp, filepath=filepath, outputpath=outputpath, lr=lr)
    df_test, fig_prediction, fig_loss = opt.train_network(train_pct=train_pct, n_epochs=n_epochs, device=device)

I'm not entirely sure what SubSample X represents, but I assume it's a subset of the entire dataset.
If so, then I don't think your assumption is valid, and I would not expect to achieve the same results using different data splits in different training sessions.

Hey @ptrblck, you are correct, the subsamples are a subdivision of the entire dataset. I'm curious as to why the same subsample wouldn't get the same results if it has the same values as before. I assume it has something to do with the way the weights are initialized?

Please note that there are no new data splits or any new preprocessing on the second run; it's exactly the same sample. The only difference is that instead of running the for loop over all 3 of them, the first one is excluded. Shouldn't they be independent of each other?

I might have misunderstood your use case as it seems you are not sequentially training the model but are resetting the training?
If so, I would guess the calls into the pseudorandom number generator might differ between the runs somehow. E.g. are you able to get the same initial results by just changing the order of training instead of removing a split?

@ptrblck I think this is the main problem (random initialization). Yes, the training is reset after every iteration, sorry if I wasn't clear. When the order of subsamples is shuffled, the results are also different. In this context, I was able to track down the random initialization and noticed that the weights, for example, are initialized at random but in an order that is consistent across runs.

What I mean is that for the first iteration the initial weights are always the same, and likewise for the second and so on. Because of this, changing the order of the samples means each sample is trained from different initial weights, which produces a different result every time the sample order changes.

Not sure if I was clear in my explanation; see below what I'm trying to describe, followed by a small code illustration of the effect:

In the first run:
Sample 1 - weight 1
Sample 2 - weight 2
Sample 3 - weight 3

In the second run:
Sample 2 - weight 1
Sample 3 - weight 2
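
Here is a minimal, standalone illustration of the effect (plain nn.Linear layers, not my actual model): every layer construction consumes global PRNG state, so which initial weights a sample gets depends on how many models were built before it.

import torch
import torch.nn as nn

torch.manual_seed(0)
# The second layer built after a single manual_seed call gets different
# initial weights than the first, even though nothing else changed.
a = nn.Linear(4, 4)
b = nn.Linear(4, 4)
print(torch.equal(a.weight, b.weight))  # False: the PRNG state advanced

# Re-seeding immediately before each construction makes the draws identical.
torch.manual_seed(0)
c = nn.Linear(4, 4)
torch.manual_seed(0)
d = nn.Linear(4, 4)
print(torch.equal(c.weight, d.weight))  # True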

To ensure reproducibility, I changed the model to initialize the same weights and biases every time:

import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, input_size, dropout):

        super(MLP, self).__init__()
        self.sequence = nn.Sequential()
        for i in range(3):
            self.sequence.add_module(f'Linear{i}', nn.Linear(input_size, 36))
            self.sequence.add_module(f'Activation{i}', nn.ReLU())
            self.sequence.add_module(f'Dropout{i}', nn.Dropout(dropout))
            input_size = 36
        self.sequence.add_module('Linear-Out', nn.Linear(36, 1))

        # Deterministic results for every iteration: re-seed and re-initialize
        # all linear layers so every instance starts from the same parameters.
        torch.manual_seed(0)
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                nn.init.zeros_(m.bias)

I'm not sure if this is the best way to fix the results, or if it is good practice to use the same weights and biases for every iteration, but with this change I get the same results regardless of the order of the samples.
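
As a small sanity check of that change, two fresh instances can be compared directly (input_size=10 and dropout=0.1 are just arbitrary example values):

import torch

# Two fresh instances should now start from identical parameters,
# because __init__ re-seeds before initializing the linear layers.
m1 = MLP(input_size=10, dropout=0.1)
m2 = MLP(input_size=10, dropout=0.1)
print(all(torch.equal(p1, p2)
          for p1, p2 in zip(m1.state_dict().values(), m2.state_dict().values())))
# prints True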

[EDIT] - Thanks in advance for the help!

I think depending on the seed and thus the PRNG could be a valid approach, but might also be really tricky, since you would need to guarantee that all calls into the PRNG are equal.
Your code to set the seed before initializing the parameters looks valid and I assume you have confirmed that indeed exactly the same parameters were sampled between different runs (e.g. Run1-Sample1 uses exactly the same parameters as Run2-Sample2).
If this is guaranteed I would then check the data loading and in particular the sampling to make sure the same samples are drawn during the training.
As a quick test you could create a fake dataset using static values (e.g. torch.ones as the inputs and just zero targets) and make sure these runs actually yield the same results.
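
Something along these lines — a minimal, self-contained sketch where the shapes, layer sizes, and hyperparameters are just example values, not your actual pipeline:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Static fake data: constant inputs and zero targets, so any run-to-run
# difference has to come from initialization or other PRNG usage, not the data.
x = torch.ones(128, 8)
y = torch.zeros(128, 1)
# shuffle=False removes the sampler as a source of randomness; with shuffle=True
# you would also need to pass a seeded generator to the DataLoader.
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=False)

model = nn.Sequential(nn.Linear(8, 18), nn.ReLU(), nn.Linear(18, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

print(loss.item())  # should print the same value every time this script is run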