Confused about FUSE_RELU flag

When I read the source code of the quantization operations, I found that there is an output-processing step called requantization behind every supported operation. I guess the purpose of this step is to map the activation values into the 0-255 range (maybe my understanding is wrong). The source code uses a FUSE_RELU flag to raise the minimum clamp value to the zero point, while the maximum value must not exceed 255.
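To make the question concrete, here is a minimal plain-Python sketch of that output-processing step (not the actual FBGEMM code; the multiplier and zero point are made-up values). The only thing FUSE_RELU changes is the lower clamp bound:

```python
def requantize(acc_i32, multiplier, out_zero_point, fuse_relu=False):
    """Requantize an int32 accumulator to a uint8 value.

    With fuse_relu=True the lower clamp bound is raised from 0 to the
    output zero point, so any value that would dequantize to a negative
    real number gets clamped to (real) zero -- i.e. the ReLU is free.
    """
    q = round(acc_i32 * multiplier) + out_zero_point
    lo = out_zero_point if fuse_relu else 0
    return max(lo, min(255, q))

# an accumulator that dequantizes to a negative real value
print(requantize(-100, 0.05, 10))                  # -> 5 (below the zero point)
print(requantize(-100, 0.05, 10, fuse_relu=True))  # -> 10 (clamped to the zero point)
```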
This is the FBGEMM output-processing source code.
Finally, my question is: is this FUSE_RELU flag redundant when the zero point and scale of the activation are calculated from the ReLU output instead of the preceding conv layer?

The requantization exists because the FBGEMM kernels for quantized ops accumulate into an int32 value, while the output is supposed to be qint8, so it needs to be requantized to that dtype.
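A rough sketch of that pipeline in plain Python (the scales and zero point here are hypothetical, just for illustration): the int32 accumulator carries the product scale `s_x * s_w`, and requantization rescales it to the output scale:

```python
# hypothetical quantization parameters
s_x, s_w, s_out = 0.1, 0.05, 0.2   # input, weight, output scales
zp_out = 128                        # output zero point

# int32 accumulator from a quantized matmul; it carries scale s_x * s_w,
# so its real value is 464 * 0.1 * 0.05 = 2.32
acc = 464

multiplier = (s_x * s_w) / s_out    # requantization multiplier
q_out = max(0, min(255, round(acc * multiplier) + zp_out))
print(q_out)  # -> 140, which dequantizes to (140 - 128) * 0.2 = 2.4
```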

I’m not sure what the FUSE_RELU flag does, i think you’d need to ask the FBGEMM team about it.

However, the quantization APIs use module fusion to exploit the fact that, for example, if you have a conv followed by a ReLU, it makes no sense to requantize values that are below 0, since they will be set to 0 anyway. You can get better fidelity by letting the quantization scheme map only from 0 to max_val_of_conv, rather than from min_val_of_conv to max_val_of_conv as would normally happen. See the fuse_modules step in Quantization — PyTorch 1.11.0 documentation.
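The fidelity argument can be shown with a quick calculation (the conv output range here is made up). Spending the 8-bit range only on 0..max halves the quantization step size compared to covering the full signed range:

```python
# Hypothetical conv output range observed during calibration
min_val, max_val = -6.0, 6.0

# real-value size of one quantized step for an 8-bit (256-level) mapping
scale_full = (max_val - min_val) / 255  # map min..max (no fusion)
scale_relu = (max_val - 0.0) / 255      # map 0..max (ReLU follows anyway)

# the 0..max mapping has half the step size, i.e. twice the resolution
# for the values that actually survive the ReLU
print(scale_full, scale_relu)
```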

Thanks for the reply.
I found that the PyTorch quantization code converts the fused conv-relu module into nniq.ConvReLU2d, which uses ops.quantized.conv_relu in its forward, so I want to find out the difference between ops.quantized.conv_relu and ops.quantized.conv.
In fact, I think ops.quantized.conv_relu is redundant, since the scale and zero point are collected from the ReLU layer.

They both go to the same place; the only difference is whether the flag is set or unset:

You are right, though: mathematically it does seem to be redundant. Theoretically, fusing conv-relu into a convrelu is redundant if the qparams are set so that the range is 0 -> max_range.

@kimishpatel is there any reason not to do things the way @WANGSSSSSSS suggests? I.e., for any fused op like Conv3dReLU, since the requant with the correct scale and zero point would effectively perform the ReLU for free, wouldn't calling torch.ops.quantized.conv3d be the same as (or faster than) calling torch.ops.quantized.conv3d_relu?