Support of Dynamic and Static Shape in One Module

I’m trying to use ExecuTorch to deploy an LLM with the QNN backend. Unfortunately, QNN doesn’t support dynamic shapes, so some trick is needed to make use of the NPU. A simple idea is shown below:

Divide the prompt into fixed-size chunks. Each chunk first goes through the static-shape QKV proj (run on the QNN NPU), then proceeds to the dynamic-shape attention (run on the CPU). Because of the chunking, the computation for adjacent tokens can be pipelined.
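A minimal sketch of the chunking step (the chunk size and pad id here are illustrative assumptions, not values from any ExecuTorch API; in practice the chunk size would match the static shape the QKV proj was exported with):

import torch

CHUNK_SIZE = 32   # illustrative static chunk length
PAD_ID = 0        # illustrative pad token id

def chunk_prompt(token_ids: torch.Tensor) -> list[torch.Tensor]:
    """Split a 1-D tensor of prompt token ids into fixed-size chunks,
    padding the last chunk so every chunk has a static shape."""
    chunks = list(torch.split(token_ids, CHUNK_SIZE))
    last = chunks[-1]
    if last.numel() < CHUNK_SIZE:
        pad = torch.full((CHUNK_SIZE - last.numel(),), PAD_ID, dtype=last.dtype)
        chunks[-1] = torch.cat([last, pad])
    return chunks

prompt = torch.arange(100)                        # stand-in for a tokenized prompt
print([c.shape for c in chunk_prompt(prompt)])    # every chunk is [32]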

I am not familiar with ExecuTorch. In the dynamic-shape examples provided by ExecuTorch, dynamic shapes seem to be specified in the export function, such as:

aten_dialect: ExportedProgram = export(
    Basic(), example_args, dynamic_shapes=dynamic_shapes
)

However, it seems that I can only specify dynamic shapes for the top-level module, not for the nested modules within it. For example:

import torch
import torch.nn as nn
from torch.export import Dim, ExportedProgram, export

class MatMul(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return a @ b

class Basic(nn.Module):
    def __init__(self):
        super().__init__()
        self.MatMul = MatMul()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out1 = self.MatMul(x, y)     # (3x6) @ (6x3) -> (3x3)
        out2 = self.MatMul(out1, x)  # (3x3) @ (3x6) -> (3x6)
        return out2

example_args = (torch.randn(3, 6), torch.randn(6, 3))
dim1_x = Dim("dim1_x", min=1, max=10)
# "a" and "b" are the nested MatMul's argument names; export() only accepts
# dynamic_shapes keyed by the top-level forward's arguments ("x" and "y").
dynamic_shapes = {"a": {0: dim1_x, 1: dim1_x}, "b": {0: dim1_x, 1: dim1_x}}
aten_dialect: ExportedProgram = export(
    Basic(), example_args, dynamic_shapes=dynamic_shapes
)
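For comparison, here is roughly what a working top-level export of the same toy model looks like, as far as I understand torch.export: the dynamic_shapes keys must match Basic.forward's argument names (x, y), and because the two matmuls tie the shapes of x and y together, two separate Dims are needed rather than one shared symbol (this is only a sketch continuing the example above):

dim_m = Dim("dim_m", min=1, max=10)
dim_k = Dim("dim_k", min=1, max=10)
# Keys match Basic.forward's signature; x is (m, k) and y is (k, m), so the
# symbols are swapped between the two inputs to satisfy both matmuls.
dynamic_shapes = {"x": {0: dim_m, 1: dim_k}, "y": {0: dim_k, 1: dim_m}}
aten_dialect: ExportedProgram = export(
    Basic(), example_args, dynamic_shapes=dynamic_shapes
)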

A feasible approach is to compile and call each module separately; for instance, compile the QKV-proj and O-proj modules with static shapes and the attention module with dynamic shapes. However, this method is quite cumbersome because it requires disassembling and compiling each Transformer block within the large language model (LLM) individually.

  • So my first question is, how can I make the nested Modules support dynamic shapes?

After the model is compiled, ExecuTorch can dispatch different subgraphs to different hardware (as I mentioned, QKV/O proj to the NPU and attention to the CPU/GPU), which is very convenient. But how can I achieve the pipeline described above? For example, once the QKV proj computation for the first token is completed, the QKV proj kernel for the second token can be launched.

  • My second question is: can I call a compiled module individually, rather than starting from the top-level module? That way, I can drive the different modules separately to build the pipeline (a rough sketch of what I mean follows below).
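Something like the following is what I have in mind. The two stage functions are plain PyTorch stand-ins for the compiled QKV-proj (NPU) and attention (CPU) modules, so the names, shapes, and sizes here are purely illustrative:

import queue
import threading
import torch

HIDDEN = 64

def qkv_proj_stage(chunk: torch.Tensor) -> torch.Tensor:
    # Stand-in for the static-shape QKV proj module compiled for the NPU.
    return chunk @ torch.eye(HIDDEN)

def attention_stage(qkv: torch.Tensor) -> torch.Tensor:
    # Stand-in for the dynamic-shape attention module running on the CPU.
    return torch.softmax(qkv, dim=-1)

def run_pipeline(chunks: list[torch.Tensor]) -> list[torch.Tensor]:
    """Overlap chunk i's attention with chunk i+1's QKV proj."""
    q: queue.Queue = queue.Queue(maxsize=2)
    outputs: list[torch.Tensor] = []

    def producer():
        for chunk in chunks:
            q.put(qkv_proj_stage(chunk))
        q.put(None)  # sentinel: no more chunks

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        outputs.append(attention_stage(item))
    t.join()
    return outputs

chunks = [torch.randn(32, HIDDEN) for _ in range(4)]
print(len(run_pipeline(chunks)))  # 4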

Dynamic KV caches also seem to be unsolved? :thinking: See Support for dynamic caches · Issue #4740 · pytorch/executorch · GitHub


Hi @chenghua_wang,

Thanks for providing this diagram! First, I would like to offer an update on our progress on dynamic KV caches here.

Next: do you need a dynamic cache? If not, then we can keep the entire model static, in which case you can go back to using the QNN backend. Here is an example.

If you do need a dynamic cache, and thus dynamic shapes, then the entire model would need to support dynamic shapes, including your Q/K/V and O proj layers.


In case you are interested in calling different modules within your program, though, this is how you would do it. Note that you can only export the forward() method of a module, so you will need to wrap each individual submodule in an nn.Module with a forward() method.
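A minimal sketch of that wrapping, reusing the MatMul toy submodule from above (the wrapper name and shapes are illustrative, and the lowering/compilation step for a specific backend is omitted):

import torch
import torch.nn as nn
from torch.export import Dim, export

class MatMul(nn.Module):
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return a @ b

# Hypothetical wrapper: gives the nested submodule its own forward() so it
# can be exported (and later lowered/compiled) on its own.
class MatMulWrapper(nn.Module):
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return self.inner(a, b)

inner = MatMul()

# Static-shape export (the kind of graph a static-shape backend could take).
static_prog = export(MatMulWrapper(inner), (torch.randn(3, 6), torch.randn(6, 3)))

# Dynamic-shape export of the same submodule for the CPU path.
dim_m = Dim("dim_m", min=1, max=10)
dim_k = Dim("dim_k", min=1, max=10)
dynamic_prog = export(
    MatMulWrapper(inner),
    (torch.randn(3, 6), torch.randn(6, 3)),
    dynamic_shapes={"a": {0: dim_m, 1: dim_k}, "b": {0: dim_k, 1: dim_m}},
)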

Hi @dvorjackz,

Thank you for your helpful reply.

This example solves my problem; I will give it a try.

Yes, I need a dynamic cache, but during the prefill stage the QKV/O proj layers are static thanks to the fixed chunk size.


Now I have another question. Because I need to call multiple compiled modules, for example:

out_1 = module_1.forward(...)
out_2 = module_2.forward(out_1, ...)
  • Is the memory of the tensor out_1 managed by module_2, or is it managed by the user?

It should be managed by the user, shouldn’t it? During execution, out_1 does not know how many successor nodes it has.

Calling part of the graph on the CPU and part of the graph on the NPU might increase latency. There are two ways to use QNN. One way is to prefill token by token; you can check out the instructions here: executorch/examples/models/llama2 at main · pytorch/executorch · GitHub

For QNN, the command line for running floating point is

python -m examples.models.llama2.export_llama --disable_dynamic_shape -kv --qnn -c stories110M.pt -p params.json

and for 4-bit quantization, it’s still a work in progress. The corresponding command line is

python -m examples.models.llama2.export_llama --disable_dynamic_shape -kv --qnn --pt2e_quantize qnn_16a4w  

Thank you for your answer; this is a very good example, and I will try it later.

The original intention of dividing the prompt into chunks and executing them separately on the CPU and NPU was to make use of a dynamic cache. Indeed, the NPU is much faster, so even with the pipeline it may not be possible to completely overlap the CPU’s computation time.

It seems that QNN 2.21.0+ supports dynamic shapes? See the document: Qualcomm Documentation