I’m comparing the GPU memory usage of two models: one trained in half precision, the other in full precision. The half-precision model is roughly half the size on disk, yet during inference it uses more GPU memory than the full-precision one.
Size of the models on disk (bytes):
Half: 221743899
Full: 442221694
Batch size: 1536
Half: max memory used: 11096.01171875 MB
Full: max memory used: 6763.52685546875 MB
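(For reference, the on-disk sizes above were read with something like the following; the file names are placeholders for my actual checkpoints.)

import os

# Hypothetical checkpoint paths; the real file names are not shown here.
for name, path in [("Half", "model_half.pt"), ("Full", "model_full.pt")]:
    print(f"{name}: {os.path.getsize(path)} bytes")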
Code to log:
import gc
import time

import torch

total_time = 0.0

def start_timer():
    """Clear caches, reset the peak-memory counter, and start the wall-clock timer."""
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()  # resets the peak-memory counter
    torch.cuda.synchronize()
    start_time = time.time()

def end_timer():
    """Stop the timer and report the peak GPU memory allocated since start_timer()."""
    global start_time, total_time
    torch.cuda.synchronize()
    end_time = time.time()
    total_time += (end_time - start_time)
    print(f"max memory used: {torch.cuda.max_memory_allocated() / (1024 ** 2)} MB")
The only difference in the inference code is:

if model_type == 'half':
    with torch.autocast('cuda'):
        output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
else:
    output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
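To be explicit about what "half" means here: the half-precision model's weights are stored in fp16, which I verify roughly like this (a sketch; model is the loaded half-precision model, and I expect the set to contain only torch.float16):

# Rough sanity check of parameter dtypes for the half-precision model.
print({p.dtype for p in model.parameters()})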
Is this expected behavior? If so, which operation inside autocast is using the extra memory?