Mixed-precision model using more memory in inference (didn't compare in fine-tuning)

I’m comparing the GPU memory usage of two models. One of them was trained in half precision while the other is full precision. Even though the size on disk is close to half for the half-precision model, it uses more GPU memory.
Size of the models on disk (bytes):
Half: 221743899
Full: 442221694
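
For reference, the roughly 2x difference is consistent with FP16 vs FP32 weights on disk. A quick sanity check along these lines (the file names are hypothetical, and this assumes the checkpoints are plain state_dicts) shows which dtype each checkpoint actually stores:

import torch

for name in ("model_half.pt", "model_full.pt"):  # hypothetical checkpoint paths
    state_dict = torch.load(name, map_location="cpu")
    tensors = [t for t in state_dict.values() if torch.is_tensor(t)]
    n_bytes = sum(t.numel() * t.element_size() for t in tensors)
    dtypes = {t.dtype for t in tensors}
    print(f"{name}: {n_bytes} bytes of tensors, dtypes={dtypes}")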

Batch size: 1536
Half: max memory used: 11096.01171875
Full: max memory used: 6763.52685546875

Code used to log memory usage:

import gc
import time

import torch

start_time = None
total_time = 0.0

def start_timer():
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    # reset the peak counter so max_memory_allocated() reflects this run only
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.synchronize()
    start_time = time.time()

def end_timer():
    global start_time, total_time
    torch.cuda.synchronize()
    end_time = time.time()
    total_time += (end_time - start_time)
    # max_memory_allocated() returns bytes, so this prints MiB
    print(f"max memory used: {torch.cuda.max_memory_allocated()/(1024**2)}")
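
For context, these helpers wrap the inference pass roughly like this (a sketch; model and the input tensors come from the rest of the script):

start_timer()
with torch.no_grad():
    output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
end_timer()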

The only difference in the inference code is:

if model_type == 'half':
    with torch.autocast('cuda'):
        output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
else:
    output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
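
For comparison, a sketch of the other way the half checkpoint could be run, converting the weights themselves instead of autocasting each op (this assumes the model currently holds FP32 parameters; the integer inputs don't need casting):

model = model.half().eval()  # store parameters in FP16 instead of casting per op under autocast
with torch.no_grad():
    output = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)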

Is this expected behavior? If so, which operation inside autocast is using this extra memory?

Could you explain the issue a bit more, please?
Your report:

BatchSize: 1536
Half: max memory used: 6763.52685546875
Full: max memory used: 11096.01171875

doesn’t fit the title, which claims that mixed precision is using more memory.

Aaah, my bad!
I pasted it with the wrong labels; I’ve edited it now. It’s still the half-precision model that’s using more memory.