Here is how I define the dataloader:
from torch.utils.data import DataLoader, RandomSampler

train_dataset = TweetDataset(
    tweet=train_df.text.values,
    sentiment=train_df.sentiment.values,
    selected_text=train_df.selected_text.values,
)
train_sampler = RandomSampler(train_dataset)
train_loader = DataLoader(train_dataset, sampler=train_sampler,
                          batch_size=BATCH_SIZE, pin_memory=True)
For certain reasons, I have to define two models, like this:
model = TweetModel()
swa_model = TweetModel()
But I found that each time I instantiate a model, the dataloader defined above is affected; for example, the first batch is different.
As a result, even though I set every random seed I can think of, the behavior of the dataloader is “not deterministic”. I can confirm that if I always use one model (or always use two models), the data of every batch is deterministic and my whole experiment is reproducible.
But here is the problem: I have to compare the performance between 1) using one model and 2) using two models. Because of the phenomenon above, the two settings get different batch data, which leads to different performance (a difference of ~0.003: not very big, but essential in a competition).
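For reference, my seeding looks roughly like this (the helper name and the seed value are placeholders; this is a sketch of what I actually run):

import os
import random
import numpy as np
import torch

def seed_everything(seed=42):
    # seed every RNG I am aware of
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)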
I know the impact happens in the __init__ method of TweetModel; here is its definition:
from transformers import BertPreTrainedModel, RobertaModel

class TweetModel(BertPreTrainedModel):
    def __init__(self):
        # config and model_dir are defined earlier in my script
        super(TweetModel, self).__init__(config)
        self.bert = RobertaModel.from_pretrained(model_dir, config=config)
        # do something

    def forward(self, ids, mask, token_type_ids):
        # do something
        pass
After debugging for a while, I noticed that after the line RobertaModel.from_pretrained(model_dir, config=config) runs, the data of the first batch of train_loader is different. I checked this by adding the following snippet before and after the line self.bert = ... and comparing the first batch data:
tmp = None
for d in train_loader:
    tmp = d
    break
# stop here and check the data of the first batch
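If my guess is right that it is the global RNG state that changes, a more direct check (a sketch I haven't fully verified) would be to compare the RNG state around the from_pretrained call:

import torch

state_before = torch.get_rng_state()
bert = RobertaModel.from_pretrained(model_dir, config=config)
state_after = torch.get_rng_state()

# False here would mean from_pretrained consumed numbers from the
# global RNG, which would also change what RandomSampler draws next.
print(torch.equal(state_before, state_after))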
I don’t know why loading a pretrained model affects train_loader; this is unexpected behavior. Why does this happen, and how can I avoid it? Am I missing something? Any help would be highly appreciated.
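One workaround I am considering, assuming the cause really is that from_pretrained draws from the global RNG to randomly initialize some weights, is to give the sampler its own torch.Generator so the sampling order no longer depends on the global RNG state (the generator argument exists in recent PyTorch versions, as far as I can tell):

import torch
from torch.utils.data import DataLoader, RandomSampler

# dedicated RNG for sampling, isolated from the global RNG
g = torch.Generator()
g.manual_seed(42)

train_sampler = RandomSampler(train_dataset, generator=g)
train_loader = DataLoader(train_dataset, sampler=train_sampler,
                          batch_size=BATCH_SIZE, pin_memory=True)

I would still like to understand why this happens, though.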
PS: I noticed this issue in the PyTorch repo, [Feature Request][PyTorch] Deterministic and predictable behaviour of batch sampler, but it seems the problem mentioned there has been fixed, and I don’t know whether it is related to my question.