Deploy mixed precision model in libtorch

For non-nested autocast usage, it’s not too complex:

    #include <torch/torch.h>
    #include <ATen/autocast_mode.h>

    torch::optim::Adam optim(model->parameters(), torch::optim::AdamOptions(1e-3));
    for(int b=0; b<num_batch; b++)
    {
        optim.zero_grad();

        // Run the forward pass with autocast enabled, so eligible ops execute
        // in reduced precision (the equivalent of `with torch.cuda.amp.autocast():`).
        at::autocast::set_enabled(true);
        torch::Tensor result = model(source[b]);
        // Clear the cast cache and disable autocast on exit from the region,
        // as the Python context manager does.
        at::autocast::clear_cache();
        at::autocast::set_enabled(false);

        // Loss, backward, and optimizer step run outside the autocast region.
        torch::Tensor loss = torch::l1_loss(result, target);
        loss.backward();
        optim.step();
    }

For nested usage, i.e. nesting autocast-disabled regions inside autocast-enabled regions, manually imitating the behavior of the Python context manager is harder, but the planned RAII guard can make this easy on the user side.

I’m not sure when I’ll PR the RAII guard; I have a backlog of other bugs to work through (mostly adding coverage for ops that currently don’t work with autocast).
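In the meantime, a minimal sketch of such a guard, assuming only the at::autocast::is_enabled, set_enabled, and clear_cache calls shown above, might look like the following (an illustration of the idea, not the eventual libtorch API):

    #include <ATen/autocast_mode.h>

    // Hypothetical RAII guard: saves the current autocast state on entry and
    // restores it on scope exit, imitating the Python context manager.
    struct AutocastGuard {
        explicit AutocastGuard(bool enabled) : prev_(at::autocast::is_enabled()) {
            at::autocast::set_enabled(enabled);
        }
        ~AutocastGuard() {
            // Clearing the cast cache on every exit is the simple, conservative
            // choice; the Python context manager only clears it when the
            // outermost autocast region exits.
            at::autocast::clear_cache();
            at::autocast::set_enabled(prev_);
        }
        AutocastGuard(const AutocastGuard&) = delete;
        AutocastGuard& operator=(const AutocastGuard&) = delete;
    private:
        bool prev_;
    };

    // Usage: nest an autocast-disabled region inside an enabled one.
    // {
    //     AutocastGuard enable(true);
    //     // ... autocasted ops ...
    //     {
    //         AutocastGuard disable(false);
    //         // ... ops that must run in full precision ...
    //     }
    //     // ... autocasted ops again ...
    // }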

However, there’s a bigger problem here: your code is a training loop. When training with autocast, you should also use gradient scaling. If your network happens to converge without gradient scaling, great, but I expect you will need it.

Imitating torch.cuda.amp.GradScaler on the C++ side is not as easy as imitating torch.cuda.amp.autocast. autocast’s Python source is (mostly) a thin wrapper around the calls above, but GradScaler’s Python source contains a fair amount of important logic. I do plan to create an equivalent C++ GradScaler, but that’s a substantial effort. In the meantime, you can try to manually duplicate GradScaler’s logic. Much of GradScaler’s complexity comes from handling the possibility of different layers on different devices. If you can rule that out in your case, you can write the logic more simply, but it’s still not trivial.
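For the single-device case, a rough sketch of that manual loss-scaling logic might look like the following. The initial scale, growth/backoff factors, and growth interval are GradScaler’s defaults, but the rest is an illustration under those assumptions, not GradScaler’s actual implementation:

    double scale = 65536.0;            // initial loss scale (2**16)
    const double growth_factor = 2.0;
    const double backoff_factor = 0.5;
    const int growth_interval = 2000;
    int good_steps = 0;

    for(int b=0; b<num_batch; b++)
    {
        optim.zero_grad();

        at::autocast::set_enabled(true);
        torch::Tensor result = model(source[b]);
        at::autocast::clear_cache();
        at::autocast::set_enabled(false);

        torch::Tensor loss = torch::l1_loss(result, target);

        // Scale the loss so small fp16 gradients don't flush to zero.
        (loss * scale).backward();

        // Unscale the gradients in place and check for infs/NaNs.
        bool found_inf = false;
        for (auto& p : model->parameters()) {
            if (p.grad().defined()) {
                p.grad().div_(scale);
                if (!torch::isfinite(p.grad()).all().item<bool>()) {
                    found_inf = true;
                }
            }
        }

        if (found_inf) {
            // Skip this step and back off the scale, as GradScaler does.
            scale *= backoff_factor;
            good_steps = 0;
        } else {
            optim.step();
            if (++good_steps >= growth_interval) {
                scale *= growth_factor;
                good_steps = 0;
            }
        }
    }

The key pieces are scaling the loss before backward, unscaling the gradients before the optimizer step, skipping the step when infs/NaNs appear, and growing or backing off the scale over time.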

Using autocast as shown above for inference (not training) should be fine without GradScaler.
