I'm going insane trying to figure out why this block of code is affecting my model accuracy

Sorry for the title, but I didn’t know how else to say this. I am using PyTorch Lightning modules and building a classifier model using Transformers on tabular data. Here is the main training loop:

    def training_step(self, batch, batch_idx):
        tokens = batch['tokens']
        y = batch['label']
        mask = batch['mask']

        # x = self.base_model(tokens, mask)
        x = self.features_embed(tokens)                 # (batch, seq_len, dim_model)
        x = self.encoder(x, src_key_padding_mask=mask)  # (batch, seq_len, dim_model)
        x = self.linear(x).mean(axis=1).squeeze(1)      # (batch, seq_len, 1) -> (batch,) logits

        loss = F.binary_cross_entropy_with_logits(input=x,
                                                  target=y)
        self.log('train_loss', loss)
        return loss

The base model commented out here is defined in the init:

    class LitModelWithCategoryEmbeddings(pl.LightningModule):
        def __init__(self,
                     num_tokens: int,
                     num_categories: int,
                     dim_model: int = 96,
                     dim_h: int = 128,
                     n_head: int = 1,
                     dropout: float = 0.1,
                     activation: str = 'relu',
                     num_layers: int = 2,
                     lr: float = 1e-3):
            """
            Transformer encoder classifier for tabular data.

            :param num_tokens: number of distinct feature tokens (embedding vocabulary size)
            :param num_categories: number of distinct categories for the category embedding
            :param dim_model: embedding / model dimension
            :param dim_h: dimension of the encoder feed-forward layer
            :param n_head: number of attention heads
            :param dropout: dropout probability
            :param activation: activation of the encoder feed-forward layer
            :param num_layers: number of encoder layers
            :param lr: learning rate
            """
            super().__init__()
            self.base_model = LitModel(
                num_tokens=num_tokens,
                dim_model=dim_model,
                dim_h=dim_h,
                num_layers=num_layers,
                n_head=n_head
            )
            summary(self.base_model)

            self.features_embed = torch.nn.Embedding(num_embeddings=num_tokens,
                                                     embedding_dim=dim_model)
            self.categories_embed = torch.nn.Embedding(num_embeddings=num_categories,
                                                       embedding_dim=dim_model)
            encoder_layer = torch.nn.TransformerEncoderLayer(d_model=dim_model,
                                                             nhead=n_head,
                                                             dim_feedforward=dim_h,
                                                             dropout=dropout,
                                                             activation=activation,
                                                             batch_first=True)
            self.encoder = torch.nn.TransformerEncoder(encoder_layer=encoder_layer,
                                                       num_layers=num_layers)
            self.linear = torch.nn.Linear(in_features=dim_model, out_features=1)
            self.lr = lr
            self.valid_auc = AUROC(dist_sync_on_step=True)
            self.test_auc = AUROC(dist_sync_on_step=True)
            self.save_hyperparameters()

The base model is another class that I was using for pretraining, but I'm trying to get similar functionality without any pretraining, so I created a new class. When I run this code with the base model defined in the init, I get the validation AUC I am expecting.

The part that is driving me insane is that if I simply comment out the definition of self.base_model in the init, the AUC drops significantly:

        super().__init__()
        # self.base_model = LitModel(
        #     num_tokens=num_tokens,
        #     dim_model=dim_model,
        #     dim_h=dim_h,
        #     num_layers=num_layers,
        #     n_head=n_head
        # )
        # summary(self.base_model)

It makes no sense to me why this snippet of code is affecting the model performance in any way because it is not being used at all in either the training loop or the validation loop. What on earth could be going on here?

What do you mean by “significantly”?
Maybe not instantiating LitModel changes how the model parameters defined after it get initialized.
Try setting the random seed to a fixed number right before building your actual model (after the part you commented out), and see if there is still a difference.

    import random
    import numpy as np
    import torch
    from torch.backends import cudnn

    random.seed(args.seed)
    torch.manual_seed(args.seed)
    np.random.seed(args.seed)
    cudnn.deterministic = True
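
The reason an "unused" line can still matter: constructing a module consumes draws from the global random number generator, so every layer built after it starts from different initial weights. A minimal sketch of the effect (plain `nn.Linear` layers stand in for your actual modules):

    import torch

    torch.manual_seed(42)
    _unused = torch.nn.Linear(8, 8)   # stands in for the commented-out base_model
    layer_a = torch.nn.Linear(8, 8)   # stands in for the layers built after it

    torch.manual_seed(42)
    layer_b = torch.nn.Linear(8, 8)   # same layer, but no extra module built first

    # Prints False: layer_a and layer_b start from different weights,
    # because _unused consumed random numbers before layer_a was created.
    print(torch.allclose(layer_a.weight, layer_b.weight))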

Thanks @mMagmer. I mean the AUC drops from about 0.71 to 0.62, which is a massive difference in my application. I will try your solution. I am not currently using cuDNN anywhere. Is that necessary?

If you're not using a GPU, it is not necessary.
Also take a look at this:
https://pytorch.org/docs/stable/notes/randomness.html

I am using a GPU, but I don't currently have cuDNN installed. I guess I can try that.

I think if you install the PyTorch GPU build and cudatoolkit from the default source, it installs cuDNN too, but I'm not sure.

Ok, this is quite nuts. When I add this code, both the commented and uncommented versions of defining self.base_model have the lower/worse performance. What in the heck am I supposed to do with this now??? How do I get the better performing model without including this extraneous code as a hack? I’m more perplexed now than I was before haha.

        random.seed(42)
        torch.manual_seed(42)
        np.random.seed(42)
        torch.backends.cudnn.deterministic = True
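
(Side note: I believe Lightning has a one-call helper that does the same thing, roughly:)

    import pytorch_lightning as pl

    # Seeds Python's random, NumPy, and torch (CPU and CUDA) in one call;
    # workers=True also seeds DataLoader worker processes.
    pl.seed_everything(42, workers=True)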

First, it is not normal to see that much difference based purely on seed selection.
Maybe your validation set is small and you have a large variance in your metric estimate.
If you want to publish a scientific paper, you can run your code with 10 different seeds, report the best, the average, and the standard deviation across all 10 runs, and compare that to other methods.
Or, if you want, you can search for the best seed!!
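
Something like this, as a rough sketch (`build_model` and `train_and_eval` are placeholders for however you construct, fit, and score your LightningModule):

    import numpy as np
    import pytorch_lightning as pl

    aucs = []
    for seed in range(10):
        pl.seed_everything(seed, workers=True)
        model = build_model()               # placeholder: construct the LightningModule
        aucs.append(train_and_eval(model))  # placeholder: fit, then return validation AUC

    aucs = np.array(aucs)
    print(f"best={aucs.max():.3f}  mean={aucs.mean():.3f}  std={aucs.std():.3f}")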

Different seeds may significantly affect the initial weights, but not so much the fully trained model. Yes, you may get stuck in some local minimum, but still, the results should not be that different.

Since you are using transformers, have you tried experimenting with learning rate (value, warmups, annealing)?
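
For example, a rough sketch of linear warmup followed by cosine annealing inside `configure_optimizers` (the step counts are made up, and `LinearLR`/`SequentialLR` need a reasonably recent PyTorch):

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
        warmup = torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=0.1, total_iters=500)  # 500 warmup steps (made up)
        cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=10_000)                        # total training steps (made up)
        scheduler = torch.optim.lr_scheduler.SequentialLR(
            optimizer, schedulers=[warmup, cosine], milestones=[500])
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }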
