Can't run inference on FP16-trained model

I am using a transformer model that relies on MultiScaleDeformableAttention (MSDA). Training the model with mixed precision was easy using the trick described here. Inference, however, was not as straightforward. When I tried autocasting with torch.autocast, the MSDA module raised 'expected Half but got Float'. When I instead forced the inputs to fp16 with .half(), I got 'expected Long but got Half' for one of the inputs (attention_weights). When I then converted it with .long(), I got 'expected Half but got Long'.
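Roughly what I tried, as a minimal sketch (the model here is just a placeholder; the real one wraps MSDeformAttn):

import torch

# Hypothetical placeholder for a Deformable-DETR-style model wrapping MSDeformAttn.
model = torch.nn.Linear(256, 256).cuda().eval()
x = torch.randn(1, 256, device="cuda")

# Attempt 1: autocast inference -> 'expected Half but got Float' inside MSDA.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)

# Attempt 2: convert weights and inputs to fp16 by hand
# -> the dtype errors around attention_weights in the real model.
model_fp16 = model.half()
with torch.inference_mode():
    out = model_fp16(x.half())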

So my question is: is it even possible to run fp16 inference with an MSDA-based model, where both the inputs and the weights are fp16?


Could you describe your use case in more detail? It seems you have written a custom kernel using raw float16 data instead of the native mixed-precision utilities.

torch.autocast does not work automatically with MSDA (it raises an error like 'half not implemented for MSDA' or something along those lines), but training ran successfully under torch.autocast after applying the FLOATING_TYPES_AND_HALF change to that specific module. However, even inside the autocast context it still complains about the inputs being float instead of half. This happens in the forward of MSDeformAttn:

output = MSDeformAttnFunction.apply(
    value, input_spatial_shapes, input_level_start_index,
    sampling_locations, attention_weights, self.im2col_step,
)

How can there still be float tensors inside that function, and why does forcing them to half still raise errors ('expected Long but got Half' for attention_weights, 'expected Half but got Long' when leaving it unchanged)?
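To narrow this down, here is a small debug sketch (hypothetical; the names match the call above) that could be dropped in just before the apply call to print which tensor still has which dtype:

# Hypothetical debug snippet, placed just before MSDeformAttnFunction.apply.
for name, t in [
    ("value", value),
    ("sampling_locations", sampling_locations),
    ("attention_weights", attention_weights),
]:
    print(f"{name}: {t.dtype}")  # floating-point tensors
# The two index tensors are expected to stay torch.int64 (long); casting
# them to half would plausibly trigger a 'expected Long but got Half' error.
print("input_spatial_shapes:", input_spatial_shapes.dtype)
print("input_level_start_index:", input_level_start_index.dtype)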

Since you are using custom autograd.Functions, check this example, which explains how to add the @custom_fwd and @custom_bwd decorators.
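A minimal sketch of the pattern, with a toy float32-only matmul standing in for the MSDA kernel (the names here are made up for illustration):

import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class MyFloat32Func(torch.autograd.Function):
    # Toy stand-in for MSDeformAttnFunction: wraps an op that only
    # supports float32, so autocast inputs must be cast before the call.

    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)
    def forward(ctx, x, weight):
        # Under autocast, x and weight are cast to float32 before arriving here.
        ctx.save_for_backward(x, weight)
        return x.mm(weight)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        # custom_bwd runs backward with autocast disabled, matching forward's dtypes.
        x, weight = ctx.saved_tensors
        return grad.mm(weight.t()), x.t().mm(grad)

x = torch.randn(4, 8, device="cuda", requires_grad=True)
w = torch.randn(8, 8, device="cuda", requires_grad=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = MyFloat32Func.apply(x, w)  # inputs arrive as float32 inside forward
out.sum().backward()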


I have experimented with those decorators; I did not use them during training, and there were no errors then. Now, however, adding them at inference time to avoid the dtype errors gives noticeably worse results from my model.

Is there any way to run inference so that the numerical context is identical to the one used during training?
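For context, what I mean by identical is something like the sketch below: keep the fp32 weights and rerun inference under the same autocast region as in training (model and inputs are placeholders):

# Sketch: fp32 weights + the training-time autocast context, instead of .half().
model.eval()
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    preds = model(inputs)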