AMP on CPU: no GradScaler necessary / available?

I want to implement automatic mixed precision in my training framework. According to the documentation, autocasting is available for CUDA and CPU.

I noticed that in their CPU training example,

# Creates model and optimizer in default precision
model = Net()
optimizer = optim.SGD(model.parameters(), ...)

for epoch in epochs:
    for input, target in data:

        # Runs the forward pass with autocasting.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            output = model(input)
            loss = loss_fn(output, target)

        # Backward and optimizer step run outside the autocast context,
        # directly on the unscaled loss.
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()


they do not use a GradScaler, which, however, turned out to be crucial for my training on GPU. It seems that GradScaler is only available for CUDA (torch.cuda.amp.GradScaler), and it throws errors when I try to use it with tensors on the CPU.
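For contrast, here is a minimal sketch of the usual CUDA AMP loop with a GradScaler. The model, loss, and data are placeholders I made up for illustration; the sketch falls back to a disabled scaler and bfloat16 autocast when no GPU is present, so it also runs on CPU-only builds.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Placeholder model, loss, and optimizer for illustration only.
model = nn.Linear(8, 2)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model = model.to(device)

# GradScaler becomes a no-op with enabled=False, so this sketch still
# runs on machines without CUDA.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

for _ in range(2):  # dummy "epochs" over random data
    inp = torch.randn(4, 8, device=device)
    target = torch.randn(4, 2, device=device)

    optimizer.zero_grad()
    with torch.autocast(device_type=device,
                        dtype=torch.float16 if use_cuda else torch.bfloat16):
        output = model(inp)
        loss = loss_fn(output, target)

    # Scales the loss so float16 gradients do not underflow, unscales
    # them before the optimizer step, and updates the scale factor.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

On the GPU the scaler multiplies the loss before backward and divides the gradients back before the step; skipping it with float16 often produces zero gradients from underflow, which matches my experience that it was crucial on GPU.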

Thus my question(s):

  1. Is there any reason why one would, in contrast to CUDA, not need a GradScaler on the CPU?
  2. If not, is there an implementation of GradScaler for running AMP on the CPU?

AMP on the CPU uses bfloat16, which does not need gradient scaling. Gradient scaling exists to keep small float16 gradients from underflowing to zero; bfloat16 has the same 8-bit exponent as float32, so its dynamic range is essentially that of float32 and gradients do not underflow the way they can in float16.
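The range difference is easy to see from the bit layouts alone. This is a rough, self-contained sketch (simplified IEEE arithmetic, no torch needed) that computes the smallest normal and largest finite value for each format:

```python
# Rough dynamic-range comparison of float16, bfloat16, and float32,
# derived from exponent/fraction bit counts (illustrative sketch).

def approx_range(exp_bits, frac_bits):
    """Return (smallest normal, largest finite) for a binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    smallest_normal = 2.0 ** (1 - bias)
    largest = (2 - 2.0 ** -frac_bits) * 2.0 ** bias
    return smallest_normal, largest

fp16 = approx_range(exp_bits=5, frac_bits=10)   # IEEE half precision
bf16 = approx_range(exp_bits=8, frac_bits=7)    # bfloat16
fp32 = approx_range(exp_bits=8, frac_bits=23)   # IEEE single precision

print(f"float16  range: {fp16[0]:.3e} .. {fp16[1]:.3e}")
print(f"bfloat16 range: {bf16[0]:.3e} .. {bf16[1]:.3e}")
print(f"float32  range: {fp32[0]:.3e} .. {fp32[1]:.3e}")
```

float16 tops out at 65504 and underflows below roughly 6e-5, while bfloat16 covers about 1e-38 to 3e38, the same span as float32 (it just has fewer mantissa bits). That is why the loss-scaling trick is unnecessary in the CPU bfloat16 path.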