I’m comparing the GPU memory usage of two models: one trained in half precision, the other in full precision. The half-precision model is roughly half the size on disk, yet during inference it uses more GPU memory than the full-precision one.
Size of the models on disk (bytes):
Half: 221743899
Full: 442221694
Batch size: 1536
Half: max memory used: 11096.01171875 MB
Full: max memory used: 6763.52685546875 MB
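(For reference, the on-disk sizes above were read with something like the following; the file names are placeholders for my actual checkpoints.)

import os

# Hypothetical checkpoint paths; the real file names are not shown here.
for name, path in [("Half", "model_half.pt"), ("Full", "model_full.pt")]:
    print(f"{name}: {os.path.getsize(path)} bytes")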
Code to log:
import gc
import time

import torch

total_time = 0.0

def start_timer():
    """Clear caches, reset the peak-memory counter, and start the wall-clock timer."""
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()  # resets the peak-memory counter
    torch.cuda.synchronize()
    start_time = time.time()

def end_timer():
    """Stop the timer and report the peak GPU memory allocated since start_timer()."""
    global start_time, total_time
    torch.cuda.synchronize()
    end_time = time.time()
    total_time += (end_time - start_time)
    print(f"max memory used: {torch.cuda.max_memory_allocated() / (1024 ** 2)} MB")
The only difference in the inference code is:

if model_type == 'half':
    with torch.autocast('cuda'):
        output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
else:
    output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
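To be explicit about what "half" means here: the half-precision model's weights are stored in fp16, which I verify roughly like this (a sketch; model is the loaded half-precision model, and I expect the set to contain only torch.float16):

# Rough sanity check of parameter dtypes for the half-precision model.
print({p.dtype for p in model.parameters()})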
Is this expected behavior? If so, which operation inside autocast is using the extra memory?