I am using the Hugging Face libraries and PyTorch; hopefully this question is best suited here.
I have tokenized a dataset in two different ways resulting in two different tokenized datasets. I want to iterate over the datasets with shuffling enabled, such that the batches correspond to the same examples. I created a small example of the problem I am facing:
import torch
from transformers import AutoTokenizer
from datasets.load import load_dataset
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding
data = load_dataset("glue", "mnli", split='train[:4]')
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Tokenize data
tokenized_data = data.map(lambda example: tokenizer(
    example["premise"], example["hypothesis"], truncation=True))
tokenized_data = tokenized_data.remove_columns(["premise", "hypothesis"])
tokenized_data = tokenized_data.rename_column("label", "labels")
tokenized_data.set_format("torch")
h_tokenized_data = data.map(lambda example: tokenizer(
    example["hypothesis"], truncation=True))
h_tokenized_data = h_tokenized_data.remove_columns(["premise", "hypothesis"])
h_tokenized_data = h_tokenized_data.rename_column("label", "labels")
h_tokenized_data.set_format("torch")
# Create loaders
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_loader = DataLoader(tokenized_data, batch_size=2, shuffle=True, collate_fn=data_collator)
h_train_loader = DataLoader(h_tokenized_data, batch_size=2, shuffle=True, collate_fn=data_collator)
for batch in zip(train_loader, h_train_loader):
    break
# I want the output to be the same:
print(batch[0]["idx"]) # out: tensor([2, 1])
print(batch[1]["idx"]) # out: tensor([3, 1])
I guess the most interesting solution would allow for different tokenizers and collators to be used, but anything that helps with the particular example I provided is appreciated.
I’m not sure if I understood everything, but if you create 2 dataloaders with shuffle=True, they’re both going to get shuffled randomly and independently of each other.
What you could try is setting the random seed to the same fixed value for both dataloaders using worker_init_fn as described here.
Also check this whole thread: Fix seed for data loader
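One way to sketch the seed-fixing idea: when shuffle=True, the DataLoader draws its permutation from a torch.Generator in the main process, so passing an identically seeded generator to each loader should make both iterate in the same order (assuming both datasets have the same length). A minimal, self-contained sketch with stand-in TensorDatasets in place of the two tokenized datasets (the seed 1234 and the tensor shapes are arbitrary choices for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Two stand-in datasets of equal length; each item carries its index
# so we can check that the batches line up example-for-example.
idx = torch.arange(8)
ds_a = TensorDataset(idx, torch.randn(8, 4))  # e.g. premise+hypothesis features
ds_b = TensorDataset(idx, torch.randn(8, 6))  # e.g. hypothesis-only features

# Seed a separate generator identically for each loader, so both
# RandomSamplers draw the same permutation.
g_a = torch.Generator()
g_a.manual_seed(1234)
g_b = torch.Generator()
g_b.manual_seed(1234)

loader_a = DataLoader(ds_a, batch_size=2, shuffle=True, generator=g_a)
loader_b = DataLoader(ds_b, batch_size=2, shuffle=True, generator=g_b)

for (ia, _), (ib, _) in zip(loader_a, loader_b):
    assert torch.equal(ia, ib)  # same examples in each batch
```

In the original example, the same generator arguments could be added to train_loader and h_train_loader (different collators are fine, since the permutation is decided before collation). Another option would be to shuffle the two datasets themselves with the same seed via datasets.Dataset.shuffle(seed=...) and create both loaders with shuffle=False.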