Implementing Dynamic Data Sampling for BERT Language Model Training with PyTorch DataLoader

Ali_H_Ahmad · November 18, 2023, 6:52pm

I’m building a BERT language model from scratch for educational purposes. Building the model itself went smoothly, but I ran into trouble with the data processing pipeline, and one issue in particular has me stuck.

Overview:

I am working with the IMDB dataset, treating each review as a document. Each document can be segmented into several sentences using punctuation marks (. ! ?). Each data sample consists of a sentence A, a sentence B, and an is_next label indicating whether the two sentences are consecutive. This implies that from each document (review), I can generate multiple training samples.
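To make the sample format concrete, here is a minimal sketch of how one review could be turned into (sentence A, sentence B, is_next) samples. The function names `split_sentences` and `make_nsp_samples` are hypothetical, and the 50/50 split between true and random next sentences is an assumption borrowed from the usual BERT recipe, not something stated in this post.

```python
import random
import re


def split_sentences(review):
    """Split a review into sentences on '.', '!' or '?' (a deliberately naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]


def make_nsp_samples(doc_idx, documents, rng):
    """Build (sentence_a, sentence_b, is_next) tuples from one document.

    Assumes every document contains at least one sentence.
    For each adjacent pair, keep the true next sentence about half of the time
    (assumed ratio); otherwise pick a random sentence from another document.
    """
    sentences = split_sentences(documents[doc_idx])
    samples = []
    for i in range(len(sentences) - 1):
        if len(documents) < 2 or rng.random() < 0.5:
            samples.append((sentences[i], sentences[i + 1], 1))
        else:
            other = rng.choice([j for j in range(len(documents)) if j != doc_idx])
            negative = rng.choice(split_sentences(documents[other]))
            samples.append((sentences[i], negative, 0))
    return samples


documents = [
    "Great movie. Loved it! Would watch again?",
    "Terrible plot. Bad acting.",
]
samples = make_nsp_samples(0, documents, random.Random(0))
```

A review with n sentences produces n - 1 samples here, which is why the number of samples per index varies from document to document.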

I am using PyTorch and want the DataLoader to handle multiprocessing and parallelism.

The Problem:

The __getitem__ method in the Dataset class is designed to return a single training sample for each index. However, in my scenario, each index references a document (review), and an undefined number of training samples may be generated for each index.

The Question:

Is there a recommended way to handle such a situation? Alternatively, I am considering the following approach:

For each index, an undefined number of samples is returned to the DataLoader. The DataLoader would then check whether the number of samples is sufficient to form a batch. Here are the possible cases:

The number of samples returned for an index is less than the batch size. In this case, the DataLoader fetches additional samples from the next index (next document), and any excess is retained to form the next batch.
The number of samples returned for an index equals the batch size. In this case, the DataLoader passes them to the model as one batch.
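One possible way to get this "fill the batch from the next document, carry over the excess" behaviour is a sketch based on `IterableDataset`: the dataset yields individual samples document by document, and the DataLoader's normal batching regroups them. The class name `SampleStream` and the `make_samples` callable are hypothetical, and the toy integer "documents" only stand in for real encoded sentence pairs.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class SampleStream(IterableDataset):
    """Yields individual samples, one document after another."""

    def __init__(self, documents, make_samples):
        self.documents = documents
        self.make_samples = make_samples  # hypothetical: document -> list of samples

    def __iter__(self):
        # With several workers, give each worker every num_workers-th document
        # so that samples are not duplicated across workers.
        info = get_worker_info()
        start, step = (0, 1) if info is None else (info.id, info.num_workers)
        for doc_idx in range(start, len(self.documents), step):
            yield from self.make_samples(self.documents[doc_idx])


# Toy stand-in: each "document" is a list of ints and each int is one sample.
documents = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
loader = DataLoader(SampleStream(documents, make_samples=list), batch_size=4)

batches = [batch.tolist() for batch in loader]
# batches == [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Batches span document boundaries without any manual bookkeeping. Two caveats of this sketch: an `IterableDataset` cannot be shuffled by the DataLoader, and with `num_workers > 0` each worker forms batches from its own share of the documents.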

I appreciate any guidance or insights into implementing this dynamic data sampling approach with PyTorch DataLoader.

JuanFMontesinos · November 18, 2023, 8:53pm

You can return a whole batch directly from __getitem__ if you configure the DataLoader accordingly.
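A minimal sketch of what such a configuration might look like, assuming it refers to disabling automatic batching with `batch_size=None`. The class name `DocumentBatches` and the pre-encoded tensors are hypothetical placeholders.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DocumentBatches(Dataset):
    """Each index is one document; __getitem__ returns all its samples as one batch."""

    def __init__(self, encoded_docs):
        # Hypothetical input: one tensor of shape (num_samples_i, seq_len) per document.
        self.encoded_docs = encoded_docs

    def __len__(self):
        return len(self.encoded_docs)

    def __getitem__(self, idx):
        return self.encoded_docs[idx]


docs = [torch.zeros(3, 8), torch.zeros(1, 8), torch.zeros(5, 8)]

# batch_size=None disables automatic batching: the loader passes through
# whatever __getitem__ returns, without stacking items together.
loader = DataLoader(DocumentBatches(docs), batch_size=None, shuffle=False)

shapes = [tuple(batch.shape) for batch in loader]
# shapes == [(3, 8), (1, 8), (5, 8)]
```

In this form the batch size varies with the document, so it only helps if the training loop tolerates variable-sized batches or regroups them afterwards.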

Ali_H_Ahmad · November 19, 2023, 10:03pm

Thank you very much for your interest,

Are you suggesting that I disable automatic batching and take control of batch formation in my training loop, something like the following?

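from torch.utils.data import DataLoader

my_dataset = MyDataset(data)
# batch_size=None disables automatic batching, so each iteration
# yields exactly what __getitem__ returned for one index.
data_loader = DataLoader(my_dataset, batch_size=None)

batch_size = 32
current_batch = []

# The data_loader may return one or more samples per index.
for samples in data_loader:
    samples = process_samples(samples)

    # Accumulate samples until reaching the desired batch size
    # ...
    current_batch.extend(samples)  # extend, so len() counts samples, not documents
    # ...

    # A single document can push the buffer past batch_size, hence >= and a loop.
    while len(current_batch) >= batch_size:
        processed_batch = process_batch(current_batch[:batch_size])

        # Forward pass, backward pass, and optimization steps ...

        # Keep the excess samples for the next batch.
        current_batch = current_batch[batch_size:]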

Thanks again

JuanFMontesinos · November 20, 2023, 6:58pm

Hi,
when you disable automatic batching, you can either return a batch directly from __getitem__ or, as you show in the script, decide yourself when the batch is large enough.
The idea is that you form your batch manually:

"and an undefined number of training samples may be generated for each index."

I think it would be more efficient if you could do everything in __getitem__, but I imagine it’s harder that way.
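One way to keep the logic inside __getitem__, which is not necessarily what is meant above, is to precompute how many samples each document yields and map a flat sample index back to a (document, position) pair. The sketch below assumes each document is already split into sentences; the class name `FlatPairDataset` is hypothetical, and is_next labelling is left out for brevity.

```python
import bisect
from itertools import accumulate

from torch.utils.data import Dataset


class FlatPairDataset(Dataset):
    """Maps a flat sample index to (document, position), one sample per index."""

    def __init__(self, docs_sentences):
        # Hypothetical input: each document is a list of sentences.
        self.docs = docs_sentences
        # A document with n sentences yields n - 1 adjacent pairs.
        counts = [max(len(doc) - 1, 0) for doc in self.docs]
        self.offsets = list(accumulate(counts))  # cumulative pair counts

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        # Find the first document whose cumulative count exceeds idx.
        doc_idx = bisect.bisect_right(self.offsets, idx)
        start = self.offsets[doc_idx - 1] if doc_idx > 0 else 0
        pos = idx - start
        sentences = self.docs[doc_idx]
        return sentences[pos], sentences[pos + 1]


dataset = FlatPairDataset([["a", "b", "c"], ["d"], ["e", "f"]])
pairs = [dataset[i] for i in range(len(dataset))]
# pairs == [("a", "b"), ("b", "c"), ("e", "f")]
```

Because every index now refers to exactly one sample, such a dataset could be handed to a regular DataLoader with a fixed `batch_size` and `shuffle=True`, at the cost of one pass over the corpus to count sentences up front.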

Ali_H_Ahmad · November 24, 2023, 4:03pm

I’ve thought about it, but it’s really difficult, so I guess I’ll just stick to the method I showed you. I think it should work fine as long as the batch size is small (such as 32).