from transformers import AutoModel, AutoTokenizer
import time
import torch

model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# create dummy input
input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt")

for i in range(5):
    for j in range(5):
        torch.cuda.synchronize()
        start = time.time()
        output = model(**input_ids.to("cuda"))
        torch.cuda.synchronize()
        took = time.time() - start
        print(j, ":", took)
    time.sleep(40)
    print("==============")
Also, to get more accurate results you'd want to call torch.cuda.synchronize() before measuring the time. Operations on the GPU are performed asynchronously, and the CPU runs ahead.
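For reference, the same measurement can also be done with CUDA events instead of host timers; a minimal sketch, reusing model and input_ids from the snippet above (the event variable names are my own):

import torch

# Time the forward pass with CUDA events instead of host timers.
# The events are recorded on the GPU stream, so they measure device execution
# directly; one synchronize is still needed before reading the result.
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
output = model(**input_ids.to("cuda"))
end_event.record()

torch.cuda.synchronize()
print("forward pass took", start_event.elapsed_time(end_event), "ms")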
Your profiling is invalid, as CUDA operations are executed asynchronously. If you are using host timers, you would need to synchronize the code before starting and stopping the host timers, as @soulitzer explained.
Warmup iterations might also be needed, not only to perform the expensive memory allocations but also to make sure the GPU is not in an idle state.
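A minimal sketch of such a warmup, again reusing model and input_ids from the snippet above (the number of warmup iterations is an arbitrary choice):

# Run a few untimed warmup iterations so one-time costs (allocations,
# kernel loading, clock ramp-up) are paid before measurement starts.
warmup_iters = 10  # arbitrary choice
for _ in range(warmup_iters):
    model(**input_ids.to("cuda"))
torch.cuda.synchronize()  # wait for the warmup work to finish before timing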
Yeah, my bad. @ptrblck @soulitzer I have added torch.cuda.synchronize() and updated the output, but the behavior is still present.
I read the article that @soulitzer suggested, and regarding the memory allocation: I understand it for the very first batch (i==0 and j==0), but why would CUDA free the memory while we sleep? As I understood from the article, PyTorch grabs some memory and keeps reusing it for as long as the process is running.
But the idle state that @ptrblck mentioned is a possible reason worth checking.
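One way to check the allocator side of this would be to print the caching-allocator statistics right before and after the pause; a rough sketch, inserted around the time.sleep(40) in the loop above:

# Inspect the caching allocator around the pause to see whether PyTorch
# actually gives memory back while the process is idle.
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MiB")
print("reserved: ", torch.cuda.memory_reserved() / 1024**2, "MiB")
time.sleep(40)
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MiB")
print("reserved: ", torch.cuda.memory_reserved() / 1024**2, "MiB")

If the reserved memory stays constant across the sleep, the slowdown is unlikely to come from re-allocation.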
Moreover, check this experiment where I make the input sequences longer for the subsequent batches, so that they require more memory:
from transformers import AutoModel, AutoTokenizer
import time
import torch

model = AutoModel.from_pretrained("bert-base-uncased").cuda()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for i in range(5):
    for j in range(5):
        # longer sequences for subsequent batches
        input_ids = tokenizer("Hello, my dog is cute " * (80 if j > 0 else 1), return_tensors="pt")
        torch.cuda.synchronize()
        start = time.time()
        output = model(**input_ids.to("cuda"))
        torch.cuda.synchronize()
        took = time.time() - start
        print(j, ":", took)
    time.sleep(40)
    print("==============")
Yes, I think the GPU being in an idle state is a very probable reason. However, I do not think the problem is the expensive memory allocations: while we sleep we do nothing, so the memory shouldn't be released during that time. That is also why I use the sleep in the first place - to show that every pause triggers this behavior, where the very first batch afterwards is much slower.
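To check the idle-state hypothesis, one could log the GPU clocks right after the sleep and again after the first batch, e.g. via nvidia-smi; a rough sketch (the log_gpu_clocks helper is hypothetical, not from this thread):

import subprocess

# Query the current SM and memory clocks via nvidia-smi. After a long pause
# the clocks typically drop to idle values and ramp back up on the first
# kernel launches, which would explain a slow first batch after every sleep.
def log_gpu_clocks():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=clocks.sm,clocks.mem", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print("GPU clocks (sm, mem):", out.stdout.strip())

Calling log_gpu_clocks() right after time.sleep(40) and again after the first forward pass should show whether the clocks dropped during the pause.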