BFloat16 training - explicit cast vs autocast

In PyTorch, there seem to be two ways to train a model in the bf16 dtype.

One is to explicitly call input_data = input_data.to(torch.bfloat16) and model = model.to(torch.bfloat16) to cast both the input data and the model to the bfloat16 format.
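Concretely, this is what I mean by the explicit cast (a toy nn.Linear with arbitrary sizes, assuming a CUDA device):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
input_data = torch.randn(8, 1024, device="cuda")

# Option 1: explicit cast - both the parameters and the inputs are bfloat16
model = model.to(torch.bfloat16)
input_data = input_data.to(torch.bfloat16)

output = model(input_data)
print(output.dtype)  # torch.bfloat16
```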

Another is to use the torch.autocast(device_type=device, dtype=torch.bfloat16) context manager, where you don’t need to explicitly cast the input data or the model to bfloat16.
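And this is the autocast version I am comparing against (same toy setup):

```python
import torch
import torch.nn as nn

# Option 2: autocast - model and inputs are left in the default float32 dtype
model = nn.Linear(1024, 1024).cuda()
input_data = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(input_data)

print(output.dtype)  # torch.bfloat16 for ops autocast runs in the lower precision
```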

Are there any differences between these two options? If so, which one is recommended?

Mixed-precision training via amp keeps the parameters in float32 and downcasts the activations to the desired lower dtype for operations that are considered safe in that precision. Calling .to() on the model directly transforms all parameters and buffers to this dtype, and it’s then the user’s responsibility to avoid under- and overflows.
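For example, the difference is directly visible in the dtypes (a minimal sketch with a toy nn.Linear on a CUDA device):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda")

# autocast: the parameters stay in float32, the matmul itself runs in bfloat16
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
print(next(model.parameters()).dtype, out.dtype)  # torch.float32 torch.bfloat16

# explicit cast: all parameters and buffers are transformed to bfloat16
model.to(torch.bfloat16)
print(next(model.parameters()).dtype)  # torch.bfloat16
```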

Thanks @ptrblck. I have a follow-up question about the memory-usage implications of these two options.

According to some articles, like this one (Performance and Scalability: How To Fit a Bigger Model and Train It Faster — transformers 4.7.0 documentation), training in mixed precision with fp16 actually increases the memory footprint of the model weights, as both fp32 and fp16 copies of the weights are kept in GPU RAM. Is this also the case for mixed precision with bf16?

That’s not the case for float16 mixed-precision training via torch.cuda.amp; the weight copies were used in the O2 training recipe in apex before amp was implemented directly in PyTorch.
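You could verify it with a rough sketch like this one (arbitrary layer size, CUDA device assumed); it only checks the parameter dtypes and footprint across a few float16 autocast steps:

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

def param_bytes(m):
    return sum(p.numel() * p.element_size() for p in m.parameters())

before = param_bytes(model)  # float32 footprint

for _ in range(3):
    x = torch.randn(32, 4096, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).float().sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# parameters are still float32 and their footprint is unchanged:
# amp does not keep a separate fp16 copy of the weights
print({p.dtype for p in model.parameters()})  # {torch.float32}
print(param_bytes(model) == before)           # True
```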

Thanks @ptrblck. What you said is both interesting and somewhat surprising, because even the latest version of the HuggingFace performance tuning guide (Methods and tools for efficient training on a single GPU) still says that training in fp16-based mixed precision results in a 1.5x memory footprint for the model weights.

Given what you said, I was wondering whether there is any benefit at all to doing explicit-cast-to-bf16 instead of using autocast-to-bf16 :thinking:. I have this confusion because I noticed that the torchtune package announced by PyTorch last month uses the explicit-cast-to-bf16 option. See torchtune/recipes/full_finetune_single_device.py at f3611e505dfb3a7fd1bf81181ea250e63d854e0e · pytorch/torchtune · GitHub, where they basically use the torch.set_default_dtype(torch.bfloat16) context manager to create the model and data directly in bf16, and also see line 47, where they mention that they do not use mixed-precision training. Any idea why they would choose this option?
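For context, as far as I understand that recipe, the pattern boils down to something roughly like the following (my simplified sketch, not the actual torchtune code; I use a plain torch.set_default_dtype call instead of their context-manager wrapper, plus a toy model):

```python
import torch
import torch.nn as nn

# Full bf16 training without mixed precision: parameters are created
# directly in bfloat16 and no autocast context is used.
torch.set_default_dtype(torch.bfloat16)

model = nn.Linear(1024, 1024).cuda()      # weights are bfloat16 from the start
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")   # also bfloat16 (the new default dtype)
loss = model(x).float().sum()             # upcast the reduction for stability
loss.backward()
optimizer.step()
```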

Explicitly casting the model to a lower dtype will reduce the model size, since all parameters are transformed. Keep in mind that amp keeps all parameters in float32 and applies casts to the activations (no parameter copies are created unless you are using the deprecated apex.amp module). So if the training recipe and model are stable enough, explicit casts can work.
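E.g. the raw parameter footprint halves after the cast (minimal sketch, arbitrary layer size):

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda()

def param_mb(m):
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1024**2

print(param_mb(model))                     # ~64 MB in float32
print(param_mb(model.to(torch.bfloat16)))  # ~32 MB after the explicit cast
```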

@ptrblck Aha, I think I finally understand your point. To recap,

  • old apex-based autocast to fp16/bf16 (the O2 recipe) will use 1.5x memory for parameters (fp32 master weights plus lower-precision copies)
  • new amp-based autocast to fp16/bf16 will use 1x memory for parameters (weights stay in fp32; only activations are downcast)
  • explicit cast to fp16/bf16 will use 0.5x memory for parameters (all weights are stored in the lower dtype)