Slower iterations with larger datasets

Hello everyone,

I am working with the GLUE datasets and am using the run_glue_no_trainer.py script from the transformers GitHub page (LINK).

I am using BERT-base with a batch size of 32 on a single GPU (using the standard fine-tuning settings from the BERT paper). When working with the CoLA dataset (~8.5k train samples) I get around 10 iterations per second. When switching to the QNLI dataset (~100k train samples) I only get around 5 iterations per second.

This slowdown is significant and even more noticeable with larger models. I would expect the iterations per second to stay approximately the same as long as only the dataset is swapped out.

My hardware setup is as follows:
CPU: Intel(R) Xeon(R) Silver 4310 @ 2.10GHz
GPU: NVIDIA A40
RAM: 64GB

Not necessarily, as it depends on the data processing (which could create a bottleneck) as well as the sample loading, which could also become a bottleneck, e.g. if you have more random accesses from a spinning HDD.
You could profile the data loading and check if it is indeed creating the slowdown or if it’s coming from another part of your training pipeline.
Here is a simple example showing how the DataLoader is timed in the ImageNet example.
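
The basic idea looks roughly like this (a minimal sketch, not the exact ImageNet code; `train_dataloader` and `train_step` are placeholders for your own loop):

```python
import time

data_time_total = 0.0
end = time.time()
for step, batch in enumerate(train_dataloader):
    # time spent waiting for the next batch (loading + collation)
    data_time_total += time.time() - end

    # placeholder for the actual forward / backward / optimizer step
    train_step(batch)

    end = time.time()

print(f"total time spent on data loading: {data_time_total:.3f}s")
```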

I did some timing with the PyTorch profiler (schedule with skip=10, warmup=2, active=3) to get insight into the training steps. Unfortunately, I am not deep enough into the fundamentals to draw definite conclusions myself.
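
For reference, this is roughly how the profiler was set up (the output directory and the loop body are just illustrative; the actual run uses the training loop from the script):

```python
import torch

prof = torch.profiler.profile(
    schedule=torch.profiler.schedule(skip_first=10, wait=0, warmup=2, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    with_stack=True,
)

prof.start()
for step, batch in enumerate(train_dataloader):
    # ... forward / backward / optimizer step from run_glue_no_trainer.py ...
    prof.step()  # advance the profiler schedule after every training step
prof.stop()
```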

The following is one step in the training loop using the CoLA dataset (~small).

The following is one step in the training loop using the QNLI dataset (~large).

The step is significantly slower for QNLI than for CoLA (~240ms vs ~100ms). Notably, the aten::to operation is slower, as is the autograd::engine::... EmbeddingBackward0. The enumerate(DataLoader)#_SingleProcessDataLoaderIter.__next__ call takes roughly the same amount of time in both cases.

I can provide the trace files for both runs.

Another thing I noticed is that all steps with CoLA (~small) take roughly the same time, but for QNLI some are shorter and some are longer:

Hopefully this helps to analyze what's happening.

PS: The hard disk is an SSD.
PPS: If you need the trace files, I can send them to you.

Bump. Think this could be interesting to many.

I don’t believe that’s the case as both ops wait for a synchronization, so you would need to check what exactly is synchronizing the code.

Any clue on how to find out what's synchronizing?

As it's single-GPU, I am not quite sure what kinds of sources of synchronization there are.

You could profile your code with Nsight Systems and enable CPU stack traces. Hovering over the synchronizing call would then show which operation is calling into it. Adding NVTX markers will also help in narrowing down which part of the code the sync is coming from.
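
A minimal sketch of such markers (the range names and the loop body are just examples, adapt them to your script):

```python
import torch

for step, batch in enumerate(train_dataloader):
    torch.cuda.nvtx.range_push("data_to_device")
    batch = {k: v.to(device) for k, v in batch.items()}
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward")
    outputs = model(**batch)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    outputs.loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer_step")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
```

On newer PyTorch versions you could also try torch.cuda.set_sync_debug_mode("warn"), which prints a warning whenever an operation forces a synchronization.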