How to ensure deterministic batch data when using RandomSampler and DataLoader

The code below shows how I define the DataLoader:

from torch.utils.data import DataLoader, RandomSampler

train_dataset = TweetDataset(
    tweet=train_df.text.values,
    sentiment=train_df.sentiment.values,
    selected_text=train_df.selected_text.values
)
train_sampler = RandomSampler(train_dataset)
train_loader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE, pin_memory=True)

For certain reasons, I have to define two models, like this:

model = TweetModel()
swa_model = TweetModel()

But I found that each time I instantiate a model, the dataloader defined above is affected; for example, the first batch will be different.

As a result, even though I set every random seed I can think of, the behavior of the dataloader is not deterministic. I can confirm that if I always use one model (or always use two models), the data of every batch is deterministic and my whole experiment is reproducible.

But here is the problem: I have to compare the performance between 1) using one model and 2) using two models. Because of the phenomenon above, the two setups get different batch data, which leads to different performance (a difference of ~0.003, which is not very big but essential in a competition).

I know the impact happens in the __init__ method of TweetModel; here is its definition:

from transformers import BertPreTrainedModel, RobertaModel

class TweetModel(BertPreTrainedModel):
    def __init__(self):
        # config and model_dir are defined elsewhere in the script
        super(TweetModel, self).__init__(config)
        self.bert = RobertaModel.from_pretrained(model_dir, config=config)
        # do something

    def forward(self, ids, mask, token_type_ids):
        # do something
        pass

After debugging for a while, I noticed that after the line RobertaModel.from_pretrained(model_dir, config=config) runs, the data of the first batch of train_loader is different. I check this by adding the following snippet before and after the line self.bert = ... and comparing the first batch data:

tmp = None
for d in train_loader:
    tmp = d
    break
# stop here and check the data of the first batch
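
For completeness, here is a condensed version of the comparison I describe above (a sketch; it assumes the train_loader and TweetModel definitions from this post, that SEED holds the seed I set everywhere, and that each batch is a dict with an ids key, as suggested by the forward signature):

import torch

def first_batch(loader):
    # grab only the first batch the sampler produces
    return next(iter(loader))

torch.manual_seed(SEED)
batch_without_model = first_batch(train_loader)

torch.manual_seed(SEED)                 # identical seeding as above
model = TweetModel()                    # from_pretrained() advances the global PRNG here
batch_with_model = first_batch(train_loader)

# prints False: the only difference between the two draws is the model creation
print(torch.equal(batch_without_model["ids"], batch_with_model["ids"]))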

I don't know why loading a pretrained model affects train_loader; this is unexpected behavior. Why does this happen and how can I avoid it? Am I missing something? Any help would be highly appreciated.

PS: I noticed this issue in the PyTorch repo, [Feature Request][PyTorch] Deterministic and predictable behaviour of batch sampler, but it seems the problem mentioned there has already been fixed, and I don't know whether it is related to my question.

Each call into the pseudo-random number generator advances its state.
Your model creation is most likely calling into the pseudo-random number generator at some point, which makes all following (random) calls return different values.

Here is a simple code snippet:

import torch
import torch.nn as nn

torch.manual_seed(2809)

lin1 = nn.Linear(100, 100)
# Comment out lin2 and compare the printed values
lin2 = nn.Linear(100, 100)  # another call to the PRNG
for _ in range(10):
    print(torch.randn(1))

Each run yields deterministic results on its own, but if you comment out lin2, the PRNG is not called for that second module's instantiation, so the following torch.randn calls still yield deterministic results, just with different values compared to the first run.

Have a look at the Wikipedia article about PRNG for more information.
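
As a small illustration of the state idea (just a sketch using the same seed as above): you can also snapshot the global PRNG state and restore it later, which makes the torch.randn calls below behave as if the layer had never been created:

import torch
import torch.nn as nn

torch.manual_seed(2809)

state = torch.get_rng_state()   # snapshot the global PRNG state
lin = nn.Linear(100, 100)       # consumes random numbers for its weight init
torch.set_rng_state(state)      # restore, as if the layer was never created

for _ in range(10):
    print(torch.randn(1))       # same values with or without the lin creation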

If you need deterministic results for your workflow, I would recommend initializing both models once and storing their state_dicts.
In the test scripts, create the models, load their state_dicts, and set the manual seeds afterwards to reset the PRNG for the training.
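
A minimal sketch of that workflow, assuming SEED, BATCH_SIZE, train_dataset, and TweetModel are defined as in your script (the file names are just placeholders):

import torch
from torch.utils.data import DataLoader, RandomSampler

# one-off script: create both models once and store their initial parameters
torch.manual_seed(SEED)
model = TweetModel()
swa_model = TweetModel()
torch.save(model.state_dict(), "model_init.pt")
torch.save(swa_model.state_dict(), "swa_model_init.pt")

# experiment script: rebuild the models, load the stored parameters,
# then reset the PRNG so the DataLoader draws the same batches in every run
model = TweetModel()
swa_model = TweetModel()
model.load_state_dict(torch.load("model_init.pt"))
swa_model.load_state_dict(torch.load("swa_model_init.pt"))

torch.manual_seed(SEED)  # reset the PRNG *after* all model creation
train_sampler = RandomSampler(train_dataset)
train_loader = DataLoader(train_dataset, sampler=train_sampler,
                          batch_size=BATCH_SIZE, pin_memory=True)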

Note that, since the forward methods of both models might also contain calls to the PRNG, the next iteration might again not yield the same batch.
If you see this behavior, you would have to make sure to use exactly the same call sequence and simply discard one model's output if it is not needed for the run.
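
As an additional option (a sketch, assuming a PyTorch version in which RandomSampler accepts a generator argument), you could also give the sampler its own torch.Generator, so the batch order no longer depends on the global PRNG at all:

import torch
from torch.utils.data import DataLoader, RandomSampler

g = torch.Generator()
g.manual_seed(2809)   # seed only the sampler's private generator

train_sampler = RandomSampler(train_dataset, generator=g)
train_loader = DataLoader(train_dataset, sampler=train_sampler,
                          batch_size=BATCH_SIZE, pin_memory=True)
# model creation and forward passes no longer influence the shuffling order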


@ptrblck Thanks for your reply!