Is it possible to make the first batch as fast as the subsequent ones?

Hi!

I have found two discussions about an interesting behavior where the very first batch takes more time than the subsequent ones:

I did an experiment:

from transformers import AutoModel, AutoTokenizer
import time
import torch


model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# create dummy input
input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt")


for i in range(5):
    for j in range(5):
        torch.cuda.synchronize()  # make sure no GPU work is still pending
        start = time.time()
        output = model(**input_ids.to("cuda"))
        torch.cuda.synchronize()  # wait for the forward pass to finish
        took = time.time() - start
        print(j, ":", took)
    time.sleep(40)  # pause between groups of batches
    print("==============")

and I got the following output:

0 : 0.26381897926330566
1 : 0.005286693572998047
2 : 0.0050601959228515625
3 : 0.005187034606933594
4 : 0.005078554153442383
==============
0 : 0.027104854583740234
1 : 0.007577419281005859
2 : 0.005268096923828125
3 : 0.005065202713012695
4 : 0.005182743072509766
==============
0 : 0.0278928279876709
1 : 0.007410287857055664
2 : 0.0052073001861572266
3 : 0.005073070526123047
4 : 0.0050258636474609375
==============
0 : 0.017473459243774414
1 : 0.005105018615722656
2 : 0.005366325378417969
3 : 0.0053863525390625
4 : 0.005162954330444336
==============
0 : 0.0260617733001709
1 : 0.008086204528808594
2 : 0.006853342056274414
3 : 0.0065691471099853516
4 : 0.006029367446899414
==============

As you can see, I was able to reproduce this behavior. What is very interesting is that after sleeping, some cache seems to be reset.

My question is: why does this happen, and is it possible to remove this behavior?

And yes, if I do the same on the CPU, all batches take approximately the same time to compute.

Likely it's because the CUDA caching allocator needs to allocate fresh memory for the first batch: A guide to PyTorch's CUDA Caching Allocator | Zach's Blog

Also, to get more accurate results, you'd want to call torch.cuda.synchronize() before measuring the time. Operations on the GPU are performed asynchronously, and the CPU runs ahead.

Your profiling is invalid, as CUDA operations are executed asynchronously. If you are using host timers, you would need to synchronize the code before starting and stopping the host timers, as @soulitzer explained.
Warmup iterations might also be needed, not only to perform the expensive memory allocations but also to make sure the GPU is not in an IDLE state.
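To make the two points above concrete, here is a minimal timing sketch (not from the original posts) that does warmup iterations first and then uses torch.cuda.Event for device-side timing when a GPU is available, falling back to time.perf_counter on CPU-only machines:

```python
import time
import torch


def timed(fn, warmup=3, iters=10):
    """Return the average time per call of fn, in seconds.

    Runs `warmup` untimed iterations first (to trigger one-time memory
    allocations and spin the GPU up from idle), then times `iters` calls.
    Illustrative helper, not part of the original discussion.
    """
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain any pending GPU work
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait for the timed kernels to finish
        return start.elapsed_time(end) / iters / 1000.0  # ms -> s
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters
```

With this helper, the first measured batch already includes the warmup, so the one-time allocation cost no longer shows up in the reported numbers.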

Yeah, my bad, @ptrblck @soulitzer. I have added torch.cuda.synchronize() and updated the output, but the behavior is still present.

I read the article that @soulitzer suggested, and regarding the memory allocation: I understand it for the very first batch (i==0 and j==0), but why would CUDA free the memory while we sleep? As I understood from the article, PyTorch takes some memory and keeps it for as long as the process is running.

But the IDLE state that @ptrblck mentioned is a possible reason.
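The article's claim about the caching allocator can be verified directly: freed tensors go back to the allocator's cache, not to the driver, so torch.cuda.memory_reserved() should not drop when a tensor is deleted. A small sketch (my own, guarded so it is a no-op without a GPU):

```python
import torch


def allocator_keeps_freed_blocks() -> bool:
    """Check that the caching allocator keeps freed blocks reserved
    instead of returning them to the driver.

    Returns False on a CPU-only machine, where the check cannot run.
    Illustrative sketch, not part of the original discussion.
    """
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    x = torch.empty(1024, 1024, device="cuda")  # first touch: a real cudaMalloc
    reserved = torch.cuda.memory_reserved()
    del x  # the block returns to the allocator's cache, not to the driver
    return torch.cuda.memory_reserved() == reserved
```

If this returns True, the reserved memory survives the deletion, which supports the point that sleeping alone should not cause re-allocation.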

Moreover, check this experiment, where I make the input sequences longer for the subsequent batches so that they require more memory:

from transformers import AutoModel, AutoTokenizer
import time
import torch


model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


for i in range(5):
    for j in range(5):

        # longer sequences for subsequent batches
        input_ids = tokenizer("Hello, my dog is cute " * (80 if j > 0 else 1), return_tensors="pt")

        torch.cuda.synchronize()
        start = time.time()
        output = model(**input_ids.to("cuda"))
        torch.cuda.synchronize()
        took = time.time() - start
        print(j, ":", took)
    time.sleep(40)
    print("==============")

Output:

0 : 0.22526955604553223
1 : 0.009119033813476562
2 : 0.008877277374267578
3 : 0.008785724639892578
4 : 0.008809089660644531
==============
0 : 0.037750244140625
1 : 0.01584768295288086
2 : 0.011595487594604492
3 : 0.010252952575683594
4 : 0.009835004806518555
==============
0 : 0.02537083625793457
1 : 0.028826475143432617
2 : 0.013889312744140625
3 : 0.011127948760986328
4 : 0.01016688346862793
==============
0 : 0.01690077781677246
1 : 0.008993387222290039
2 : 0.009119749069213867
3 : 0.008840322494506836
4 : 0.008825540542602539
==============
0 : 0.02893829345703125
1 : 0.015749216079711914
2 : 0.011925935745239258
3 : 0.010996818542480469
4 : 0.009769678115844727
==============

I don’t know why you are adding sleeps to your code, but it seems you have ignored my last statement as well:

Yes, I think the GPU dropping into an IDLE state is a very probable reason. However, I do not think the problem is the expensive memory allocations: while we sleep, we do nothing, so the memory shouldn't be released during this time. That is exactly why I added the sleeps: to show that each pause brings the behavior back, with the very first batch after the pause being much slower.
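One way to test the idle-clock hypothesis would be to keep the GPU busy during the pause instead of calling time.sleep(), and see whether the first batch afterwards is still slow. A hypothetical helper (my own sketch, falling back to a plain sleep on CPU-only machines):

```python
import time

import torch


def keep_gpu_busy(seconds: float) -> None:
    """Launch tiny dummy kernels for `seconds` so the GPU clocks do not
    drop back to idle between batches.

    Hypothetical helper for testing the idle-state explanation; falls
    back to a plain sleep when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        time.sleep(seconds)
        return
    x = torch.ones(256, 256, device="cuda")
    deadline = time.time() + seconds
    while time.time() < deadline:
        x = x * 1.0               # trivial kernel keeps the SMs active
        torch.cuda.synchronize()  # make sure the kernel actually ran
        time.sleep(0.01)          # avoid burning a full CPU core
```

If replacing time.sleep(40) with keep_gpu_busy(40) makes the first batch of each group as fast as the rest, that would point at idle clocks rather than the allocator.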