Deploy mixed precision model in libtorch

Hi,

I tried torch.cuda.amp.autocast in PyTorch and it works well for my model. Now I want to deploy my trained model in C++ with a nightly build of libtorch (version 1.7.0). However, I cannot find a corresponding function for autocast in the libtorch API.

May I ask what is the proper way to deploy a mixed precision model in libtorch?

Thanks,

Rui

I’m not sure if libtorch fully supports amp, but @mcarilli would know.

libtorch doesn’t officially support autocast in C++ yet. You can try to rig it by imitating the C++ calls made by torch.cuda.amp.autocast during __enter__ at the beginning of your C++ autocasted region, and the calls made by __exit__ at the end of that region. The mapping between torch._C functions and C++-side function calls can be seen here.

Right now doing the above is definitely “off menu.” I plan to support autocast in libtorch by adding an RAII guard that does all those things in an exception-safe way.

Hi @mcarilli, could you provide a very simple example of how to do it “as is” with libtorch?

Here’s my training loop right now:

    torch::optim::Adam optim(model->parameters(), torch::optim::AdamOptions(1e-3));
    for(int b=0; b<num_batch; b++)
    {
        optim.zero_grad();
        torch::Tensor result = model(source[b]);
        torch::Tensor loss = torch::l1_loss(result, target);
        loss.backward();
        optim.step();
    }

I’m not sure where to insert the (unofficial) amp calls, or how. Thanks!

BTW, when can we expect a proper libtorch implementation (the RAII guard you mentioned)?

For non-nested autocast usage it’s not too complex:

    torch::optim::Adam optim(model->parameters(), torch::optim::AdamOptions(1e-3));
    for(int b=0; b<num_batch; b++)
    {
        optim.zero_grad();

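        // Run only the forward pass under autocast.
        at::autocast::set_enabled(true);
        torch::Tensor result = model(source[b]);
        // Drop the cached casted weights, then leave the autocast region.
        at::autocast::clear_cache();
        at::autocast::set_enabled(false);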

        torch::Tensor loss = torch::l1_loss(result, target);
        loss.backward();
        optim.step();
    }

For nested usage, i.e., nesting autocast-disabled regions in autocast-enabled regions, manually imitating the behavior of the Python context manager would be harder, but the planned RAII guard can make this easy on the user side.
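
To make the nested case concrete, here is a rough sketch of what such a guard could look like. It is not the planned libtorch guard. The `autocast_stub` namespace is a stand-in so the example compiles without libtorch; in a real build its functions are assumed to correspond to `at::autocast::is_enabled()`, `set_enabled()`, `increment_nesting()`, `decrement_nesting()` and `clear_cache()`, which you should check against your libtorch headers. The class name `AutocastGuard` and the `cache_clears` counter are hypothetical.

```cpp
#include <cassert>

// Stand-in for libtorch's internal autocast state (assumption: the real
// calls live in at::autocast and behave like these stubs).
namespace autocast_stub {
bool enabled = false;
int nesting = 0;
int cache_clears = 0;  // counts clear_cache() calls so the sketch can be checked

bool is_enabled() { return enabled; }
void set_enabled(bool e) { enabled = e; }
int increment_nesting() { return ++nesting; }
int decrement_nesting() { return --nesting; }
void clear_cache() { ++cache_clears; }
}  // namespace autocast_stub

// Hypothetical RAII guard: remembers the previous autocast state on entry and
// restores it on exit, including when an exception unwinds the stack.
class AutocastGuard {
public:
    explicit AutocastGuard(bool enable)
        : prev_(autocast_stub::is_enabled()) {
        autocast_stub::set_enabled(enable);
        autocast_stub::increment_nesting();
    }

    ~AutocastGuard() {
        // Clear the cast cache only when leaving the outermost region.
        if (autocast_stub::decrement_nesting() == 0) {
            autocast_stub::clear_cache();
        }
        autocast_stub::set_enabled(prev_);
    }

    AutocastGuard(const AutocastGuard&) = delete;
    AutocastGuard& operator=(const AutocastGuard&) = delete;

private:
    bool prev_;  // autocast state before this guard was constructed
};
```

With a guard like this, an autocast-disabled region inside an enabled one is just an inner scope holding `AutocastGuard inner(false);`, and the enabled state comes back when that scope ends.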

I’m not sure when I’ll PR the RAII guard. I have a backlog of other bugs to work through (mostly adding coverage for ops that currently don’t work with autocast).

However, there’s a bigger problem here: your code is a training loop. When training with autocast, you should also use gradient scaling. If your network happens to converge without gradient scaling, great, but I expect you will need it.

Imitating torch.cuda.amp.GradScaler on the C++ side is not as easy as imitating torch.cuda.amp.autocast. autocast's Python source is (mostly) a thin wrapper around the calls above, but GradScaler's Python source contains a fair amount of important logic. I do plan to create an equivalent C++ GradScaler, but that’s a substantial effort. In the meantime, you can try to manually duplicate GradScaler’s logic. Much of GradScaler’s complexity comes from handling the possibility of different layers on different devices. If you can rule that out in your case you can write the logic more simply, but it’s still not trivial.
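
For the single-device case, the core bookkeeping can be sketched roughly as below. This is a simplified illustration, not GradScaler itself: `SimpleGradScaler` is a hypothetical name, gradients are modelled as a plain `std::vector<float>` instead of tensors, and the multi-device handling is left out entirely. The default values mirror the Python GradScaler defaults (init_scale 2^16, growth_factor 2.0, backoff_factor 0.5, growth_interval 2000).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical single-device loss scaler; not part of libtorch.
struct SimpleGradScaler {
    float scale = 65536.0f;      // 2^16
    float growth_factor = 2.0f;
    float backoff_factor = 0.5f;
    int growth_interval = 2000;
    int good_steps = 0;          // consecutive iterations with finite gradients

    // Multiply the loss by the current scale before calling backward().
    float scale_loss(float loss) const { return loss * scale; }

    // Divide the gradients by the scale in place.
    // Returns true only if every gradient was finite (no inf/NaN).
    bool unscale(std::vector<float>& grads) const {
        bool all_finite = true;
        const float inv_scale = 1.0f / scale;
        for (float& g : grads) {
            if (!std::isfinite(g)) all_finite = false;
            g *= inv_scale;
        }
        return all_finite;
    }

    // Adjust the scale once per iteration: back off after an overflow,
    // grow after growth_interval clean iterations in a row.
    void update(bool grads_were_finite) {
        if (!grads_were_finite) {
            scale *= backoff_factor;
            good_steps = 0;
        } else if (++good_steps == growth_interval) {
            scale *= growth_factor;
            good_steps = 0;
        }
    }
};
```

In a training loop the idea would be: scale the loss, call backward, unscale the gradients, call `optim.step()` only if `unscale` returned true, then call `update`. With real tensors the finite check becomes a reduction over each parameter's gradient.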

Using autocast as shown above for inference (not training) should be fine without GradScaler.

Thanks for the autocast code.

Indeed, the scaler doesn’t seem trivial. I guess I’ll have to wait for a C++ implementation of it, as the tensor core speedup is especially interesting for training, more so than for inference. Hopefully in PyTorch 1.7?

Not 1.7. The branch cut + code freeze for each release precedes the actual release by a month or more (as it must, to iron out bugs) and it’s coming up too soon. For 1.7 my priority is autocast coverage for all high-demand ops that 1.6 missed (RNNs most notably).

1.8 is plausible though. File an issue on the PyTorch GitHub and tag @mcarilli @gchanan @ngimel so we know people want mixed precision training with the C++ API.

@mcarilli with the upcoming RTX 3080/3090 introducing BF16 support (correct me if I’m wrong), maybe I can skip gradient scaling and use your code as-is?
However, how do I tell AMP that I want to use BF16 and not FP16?

Hardware support for bfloat16 is there, but software support in PyTorch’s CUDA backend and the CUDA libraries is WIP. Once it’s present, I’ll add a bfloat16 option to autocast.
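
As background on why bfloat16 comes up in connection with gradient scaling: bfloat16 keeps float32's 8 exponent bits and drops mantissa bits, so very small gradient values that underflow in FP16 stay representable. The sketch below only illustrates that number format in plain C++; it uses no PyTorch API, `bf16_round_trip` is a hypothetical helper, and it truncates where hardware typically rounds to nearest. Whether the extra range is enough to drop gradient scaling for a given model is not something this thread settles.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Round-trip a float through bfloat16 by keeping only the top 16 bits of its
// IEEE-754 binary32 encoding (sign, 8 exponent bits, 7 mantissa bits).
// Assumption for simplicity: truncation instead of round-to-nearest.
float bf16_round_trip(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFF0000u;  // drop the low 16 mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}

// FP16 limits, for comparison.
const float kFp16Max = 65504.0f;                // largest finite FP16 value
const float kFp16MinSubnormal = 5.9604645e-8f;  // 2^-24, smallest positive FP16 value
```

For instance, `bf16_round_trip(1e-8f)` is still positive even though `1e-8` is below `kFp16MinSubnormal` and would flush to zero in FP16. The price is precision: only about 2 to 3 significant decimal digits survive.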

Why is it essential that you train in C++? Python training usually isn’t a problem, and there’s a lot of near-term work underway (and done) to reduce eager mode CPU overheads and make the difference between C++ and Python less important.

Simply because I’m much more used to C++ than Python.
And PyTorch is the only big framework that offers this possibility (TensorFlow doesn’t offer C++ model generation and training, for instance).

Why full C++ support is relevant: Python is the de facto language for data scientists, but more and more desktop app developers are coming to deep learning, including for training models, and those developers are usually more familiar with C++ than Python (desktop software is rarely written in Python).
For instance, my desktop app is entirely in C++ and I know exactly how to format data in C++, but I have no clue how to do it in Python, and I’m not a big fan of dynamically typed languages either. I’m not judging Python here, but it’s really not my cup of tea, and I know I’m not the only one.

Another argument for that: I’m also generating and managing my training data in C++. So it’s a consistent pipeline from end to end, from data generation and management, to training, to implementation in the end-user software.

Hi guys, and more specifically @mcarilli
What is the current status of AMP using libtorch for inference?

I tried a similar approach to your suggestion in this thread:

    at::autocast::set_enabled(true);
    torch::Tensor result = model(source[b]);
    at::autocast::clear_cache();
    at::autocast::set_enabled(false);

This gets me no speed difference at all, so I’m wondering whether it’s actually being used. Any other pointers here?

Cheers,
David

Just to follow up on my own post from yesterday: I posted a bit too early, and a classic coding mistake on my side kept the provided code from running. I’ve got it running correctly now, and I do see around a 15% increase in performance compared to running my model in full float (on an RTX 2080 Ti card). Thanks all!
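
For anyone wanting to compare the two modes themselves, a small timing helper along these lines may be useful. It is a generic sketch, not the code used for the number above: `average_ms` is a hypothetical helper using only the C++ standard library. On a GPU, the callable should finish with a device synchronization before returning (for example `torch::cuda::synchronize()`, assuming your libtorch version provides it), because CUDA kernels launch asynchronously and the clock would otherwise stop too early.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock milliseconds per call of `fn`.
// The warm-up calls are not timed, so one-time costs (allocations, kernel
// selection) do not distort the comparison.
template <typename Fn>
double average_ms(Fn&& fn, int warmup_iters, int timed_iters) {
    assert(timed_iters > 0);
    for (int i = 0; i < warmup_iters; ++i) fn();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_iters; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timed_iters;
}
```

Calling it once with a lambda that runs the forward pass inside the autocast region and once with a plain float32 forward pass gives two averages to compare.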

Have you tried this with TorchScript models as well?