Quantization as TorchScript graph manipulation

Gentle Introduction
Quantization is often tackled as a rewrite of the original model. We can overload convolution (i.e., the convolution module) and add a quantization layer before and after it (Glow-style, as above); but if we use convolution as a functional, we may want to apply different quantization to the different slots (input, weights, and bias).

When I talk to HW people, they would like to break a convolution into a correlation and a bias addition; that is, to reorganize the convolution into two distinct operations. Quantization can then differ for the weights, the correlation, the bias, and their sum.

Quantization then affects the forward computation. It also affects the backward pass: the quantization range can be used as a parameter, and the gradient computation could/should use it during training.
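As a sketch of how the backward can be made quantization-aware, one can wrap fake quantization in a custom `torch.autograd.Function` with a straight-through estimator that zeroes gradients outside the representable range (a hand-rolled toy, not PyTorch's built-in fake-quant):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Fake-quantize in forward; straight-through estimator in backward:
    gradients pass unchanged inside [qmin*scale, qmax*scale] and are
    zeroed where the input saturates."""

    @staticmethod
    def forward(ctx, x, scale, qmin, qmax):
        q = torch.clamp(torch.round(x / scale), qmin, qmax)
        ctx.save_for_backward(x)
        ctx.bounds = (qmin * scale, qmax * scale)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        lo, hi = ctx.bounds
        mask = (x >= lo) & (x <= hi)
        # gradients only for x; scale/qmin/qmax get None here
        return grad_out * mask.to(grad_out.dtype), None, None, None

x = torch.tensor([0.05, 2.0, -3.0], requires_grad=True)
y = FakeQuant.apply(x, 0.01, -128, 127)  # representable range [-1.28, 1.27]
y.sum().backward()
# x.grad is 1 where x fits the range and 0 where it saturates
```

Here the range is fixed; making `scale` itself a trained parameter (and returning a real gradient for it) is exactly the kind of choice the backward-affecting pass would have to encode.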

Quantization as pass
In this forum there is a nice tutorial on how to introduce an optimization pass. That pass uses CustomFuseGraph. The good: it boils down to an import. The bad: FuseGraph optimizations are based on same-input/same-output operations, and convolution does not belong to that class [PLEASE CORRECT ME]. Such a pass changes the forward computation, so it should run before any autograd. With this example, we do not have much control over when the optimization runs, and it seems to happen too late.

Trainable and Automatic quantization for TF/Caffe
What the automatic tools in TF and Caffe do is modify the computation graph based on pattern recognition and HW requirements, train the network, and then remove those layers for inference. After that, a dedicated compiler takes the computation graph and emits code for a specific architecture.

Quantization as jit pass
The way I see it, it would be nice to register a jit pass. This pass must run before gradient computation. It would basically be an IR graph manipulation: each targeted operation first becomes a sub-graph whose inputs are completely qualified, so that the "rewrite" of the graph can be local, complete, and free of side effects (nicely functional).
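As a first experiment in that direction, one can already inspect the IR of a scripted model from Python and locate the targeted operations. This is only a read-only sketch (the `Tiny` module is made up), but each node's fully qualified input Values are exactly what a local rewrite would work from:

```python
import torch

class Tiny(torch.nn.Module):
    def forward(self, x, w):
        # builtin ops script directly to aten::conv2d / aten::relu nodes
        return torch.relu(torch.conv2d(x, w))

m = torch.jit.script(Tiny())

# each Node's inputs are fully qualified Values, so a rewrite around
# the node can stay local, complete, and side-effect free
conv_nodes = [n for n in m.graph.nodes() if n.kind() == "aten::conv2d"]
```

Printing `m.graph` shows the full IR; a pass would then build a replacement sub-graph around each matched node and splice it in.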

Question for the masters
Would you let me know how to get started in a practical way?

Please, hit me with questions, pointers, GitHub links ... whatever you consider important.

Check out: https://github.com/pytorch/pytorch/issues?utf8=✓&q=is%3Aissue+is%3Aopen+quantization. We are actively working on something very similar to your proposal, soon to be released!

Let me know what I can do to contribute (time, resources, coding).

Just trying to throw more context here, this is the earlier issue tracking the proposal: https://github.com/pytorch/pytorch/issues/18318 and the proposal itself: https://github.com/pytorch/pytorch/wiki/torch_quantization_design_proposal


From the proposals and implementations, I will try to learn a simple way to add a pass that targets a given layer and creates a sub-graph (at the graph level) before gradient computation.

Such an addition will help third parties describe the computation differently, closer to a dedicated engine (we are interested in FPGA kernels): the computation is similar to the CPU's, possibly using the same basic operations in a different order.

The introduction of fake quantization nodes is one way we are pursuing to modify the subgraph. But our fake nodes will affect both the forward and the backward.

Hi Paolo! I’m currently working on graph mode quantization (along with other people, of course), so I’ll try to shed some light on what we have in our plans.

As you noticed, quantization can be implemented as a jit pass. But we consider this as one of two modes of how quantization could be done.

The first mode is called eager mode quantization, where the user is expected to structure their model into submodules and insert quantization stubs manually. This way they completely control the quantization process and can fine-tune the results (and also it can be applied to non-scriptable models). The drawback of this approach is that the user needs to edit their model.
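A minimal sketch of what that restructuring looks like, using the `QuantStub`/`DeQuantStub` markers from `torch.quantization` (the module itself is made up; before the prepare/convert steps run, the stubs act as identities):

```python
import torch

class QuantReady(torch.nn.Module):
    """Eager-mode style: the user restructures the model and inserts
    quant/dequant stubs by hand around the region to be quantized."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(3, 8, 3)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)     # observers/quantize ops get swapped in here
        x = self.conv(x)
        x = self.dequant(x)
        return x

m = QuantReady()
y = m(torch.randn(1, 3, 8, 8))  # stubs are identities until conversion
```

The drawback mentioned above is visible here: the stubs and the submodule boundaries have to be edited into the user's own model code.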

The second mode is called graph mode and it is based on jit. It’s more than one pass, but the idea is the same as you described: a user scripts/traces their model, passes it to some black-box quantizer and gets a quantized version of their model as a result.

In graph mode we roughly expect the following passes:

  1. inserting instrumentation nodes for collecting runtime statistics (aka observers)
  2. inserting quantization/dequantization nodes into the model
  3. fusing dequantize-some_op-quantize patterns into quantize_op
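Pass (1) can be illustrated with a toy observer, written here as a plain Python class rather than as an actual IR node (all names hypothetical): it is an identity in the forward direction and only records the running min/max, from which a scale and zero-point are derived later.

```python
class MinMaxObserver:
    """Toy instrumentation 'node': records the running min/max of the
    values flowing through it; quantization parameters are computed
    afterwards from the collected statistics."""
    def __init__(self):
        self.min, self.max = float("inf"), float("-inf")

    def __call__(self, values):
        self.min = min(self.min, min(values))
        self.max = max(self.max, max(values))
        return values  # observers are identities in the forward pass

    def qparams(self, qmin=-128, qmax=127):
        # affine mapping of [min, max] onto the integer grid [qmin, qmax]
        scale = (self.max - self.min) / (qmax - qmin)
        zero_point = round(qmin - self.min / scale)
        return scale, zero_point

obs = MinMaxObserver()
obs([0.0, 1.0, -0.5])
obs([2.0, 0.5])
scale, zp = obs.qparams()
```

Pass (2) would then replace each observer with quantize/dequantize nodes parameterized by `scale` and `zp`, and pass (3) fuses them into quantized ops.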

There are actually more, and more specific, passes than these three; however, these should give you an idea of what functionality will be there. As we work on those passes, we're trying to build them on top of generic features that could be useful for other purposes as well (for instance, several months back we implemented a subgraph-matcher and a subgraph-rewriter to facilitate fusion of arbitrary graph patterns). We will probably also add some features to facilitate transformations at the module level (rather than at the graph, or function, level), but the specific details are still TBD.

I think you could help us by trying out the API as soon as it's ready and letting us know whether the workflow makes sense for you and covers your use cases.

For now please stay tuned, we’re actively working on it and expect to show something soon!

We = I + Xilinx (HW+SW developers)

We are interested in the graph mode.

The most common scenario we can imagine is a model that has been designed and trained, but part of which will be executed on a fixed-point precision device. This opens a lot of interesting questions.

Please, consider using us not only as guinea pigs: we can help by giving you test cases and suggestions. Also, "me", the worst coder in the western world, will need to figure this out in order to then plug in a specialized partitioner and compiler.

Please, consider coming down (if you are in the Bay Area) to visit Xilinx and see what we can do for you as well.

thank you

Hi Paolo!

Sorry for the delayed reply. Right now we’re on the final stretch before we should have graph mode working end-to-end (see e.g. this stack of PRs: https://github.com/pytorch/pytorch/pull/24426), so I don’t see an easy way to offload some of the current work to you right now. This is a critical path, so it can’t be parallelized that much I think. However, once it lands I expect we would discover many places where it can be improved or bugs which need to be fixed - and at that point your help in both finding these spots and helping with fixes would be much appreciated!

Until then, I suggest you get familiar with the JIT IR in case you haven't worked with it much before. A good overview can be found here: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/docs/OVERVIEW.md. I could try to provide more pointers if you need them - let me know if that's the case.

Thank you Michael. This week I will try to interact with the IR.
Let me know if you guys are willing to come down and introduce your work when you are ready.

Let me address a few questions so you can help me and I can help you.

Q: Can we (or will we be able to) register a JIT pass that may/will affect the "gradient computation"?

Registering a pass, as for code optimization, is elegant, and you and I can work independently. The HW people can change their minds to suit different HW configurations; I can create versions; and our packages are imported afterwards. The main concern is that you will need to give us partial control over when to call the pass. This means that I could break a perfectly fine JIT.

In your proposal, you give us the opportunity to activate "the pass", which is great. Next question.

Q: Can we target layers that affect "the gradient computation"?

I know quantization of convolutions is HW dependent. This means that I will need room to wiggle in choosing the convolution layer and what to do with it. In practice, the HW will dictate the computation shape and format, and in turn the computation of the convolution. The computation is still based on basic operations such as aten::conv and the addition of the bias; I know the HW people want them separated.

Q: Is it possible to have a tutorial I can build upon, so I can practice on the master code for:

I work at Xilinx as well. We released Graffitist earlier this year, a TensorFlow package for quantization-aware training. I’d like to point out that there is a critical issue in the way FakeQuant’s gradients (wrt min/max vars) are implemented in TensorFlow. We shed more light on it in our paper, but it has to do with the correct straight-through-estimator (STE) for training thresholds or scale-factors. It’d be nice to address this in PyTorch early on (I’m happy to be of help and can contribute as well). This is essential for quantizing traditionally difficult networks such as mobilenets, with almost no loss in accuracy (refer Table 4 in paper).
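To make the point concrete, here is a hand-derived sketch of the gradient of a fake-quantization output with respect to its scale under a straight-through estimator on `round`, in the spirit of the threshold-training argument above (my own toy function, not TensorFlow's or PyTorch's implementation): in the saturated regions the gradient is the clamp bound, and inside the range it is `round(x/s) - x/s`.

```python
def fake_quant_scale_grad(x, s, qmin=-128, qmax=127):
    """d/ds of y = s * clamp(round(x/s), qmin, qmax), with round treated
    as straight-through (round' := 1). Scalar toy version."""
    q = x / s
    if q > qmax:          # saturated high: y = s*qmax, so dy/ds = qmax
        return float(qmax)
    if q < qmin:          # saturated low:  y = s*qmin, so dy/ds = qmin
        return float(qmin)
    # in range: dy/ds = round(x/s) + s * 1 * (-x/s^2) = round(x/s) - x/s
    return round(q) - q
```

If the gradient inside the range were instead set to zero (as a naive STE might do), the scale would only be pushed by saturated values, which is one way the choice of estimator materially changes threshold training.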

I agree with Paolo, in that having control over the JIT IR pass is essential for target-specific transforms on the graph. I did go over the documentation (JIT README), but some things are not clear to me yet. For instance, how to (i) pattern-match and manipulate sub-graphs, (ii) insert custom ops, (iii) invoke custom passes on the IR.

Please let us know
(i) if you can visit Xilinx (San Jose) for a JIT deep-dive, and
(ii) how we can be of help/contribute.

Please, let me know what we can do to help.
Would you like to come down to Xilinx and give a talk?