Small depthwise Conv1d: maximum perf on CPU?

We have extremely small depthwise convolutions: input of shape [B=1, C=64, T=96] with a [1x3] kernel.

All shapes are fixed, and inference runs on CPU in a tight loop (the goal of this setup is to serve as a baseline). How can I figure out which conv implementation is used under the hood? The profiler shows some thnn_conv2d/_slow_conv2d_forward.
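For reference, a minimal sketch of checking this with the profiler (shapes below match the question; which op names actually appear depends on your PyTorch build):

```python
import torch

# Depthwise Conv1d matching the shapes above: [B=1, C=64, T=96], kernel size 3.
# groups == channels makes it depthwise.
conv = torch.nn.Conv1d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
x = torch.randn(1, 64, 96)

# Profile one forward pass; the recorded op names reveal which conv
# implementation the dispatcher picked (e.g. aten::_slow_conv2d_forward).
with torch.no_grad(), torch.profiler.profile() as prof:
    conv(x)

for evt in prof.key_averages():
    print(evt.key)
```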

  • I would imagine that either the Winograd algorithm or unroll+GEMM could be used. How can I tell which one is actually chosen?
  • Is it possible to influence which one is used?
  • Should I use the channels_last memory format?
  • How do I get the best depthwise conv performance on CPU in eager PyTorch?

Thanks :slight_smile:

@albanD worth creating a separate perf category on this forum?


We actually have a private API, which we use for testing, to check which conv implementation will be used:

That should tell you which implementation is used under the hood.
Unfortunately, each of these implementations is free to use whatever algorithm it wants internally, so you will have to check the particular implementation. On CPU, our own implementation uses Winograd AFAIK, but that is not guaranteed.

Is it possible to influence which one is used?

It depends on the backend. Backends like mkldnn or cudnn can be disabled via torch.backends.*, but not everything can be controlled there.
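As a sketch of those toggles (assuming an mkldnn-enabled build; the flag only matters if oneDNN would otherwise be picked for this op):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 96)
w = torch.randn(64, 1, 3)  # depthwise weights: [C_out, C_in/groups, k]

print("mkldnn available:", torch.backends.mkldnn.is_available())

# Temporarily disable oneDNN (mkldnn) kernels; inside this block PyTorch
# falls back to another CPU implementation for ops that would use it.
with torch.backends.mkldnn.flags(enabled=False):
    y = F.conv1d(x, w, padding=1, groups=64)
```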

Should I use channels_last memory format?

Hard to say, tbh; I would benchmark both for your particular sizes and use the fastest.
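A sketch of such a comparison (channels_last is defined for 4-D tensors, so the conv1d is expressed as a [1x3] conv2d here; sizes are from the question):

```python
import copy
import torch
from torch.utils import benchmark

# Express the depthwise conv1d as a [1x3] depthwise conv2d so that the
# channels_last (NHWC) memory format applies.
conv = torch.nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1),
                       groups=64, bias=False)
x = torch.randn(1, 64, 1, 96)

# A separate copy of the module, converted to channels_last.
conv_cl = copy.deepcopy(conv).to(memory_format=torch.channels_last)
x_cl = x.to(memory_format=torch.channels_last)

for label, c, inp in [("contiguous", conv, x), ("channels_last", conv_cl, x_cl)]:
    t = benchmark.Timer(stmt="c(inp)", globals={"c": c, "inp": inp})
    print(label, t.timeit(100))
```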


I guess it might be worth having some sort of public page with micro-benchmark numbers for different algorithms/backends - maybe similar in spirit to MLCommons, but specific to the PyTorch ecosystem (it is also important to control for intra-op multi-threading, which can kill perf on CPU for small problem sizes). Given a fixed, good micro-benchmark harness, people could add their own numbers and compare against up-to-date fwd/bwd numbers for depthwise conv on CPU/GPU (and for other ops).
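For example, with torch.utils.benchmark the thread count can be pinned per measurement (a sketch with the shapes from the question):

```python
import torch
from torch.utils import benchmark

conv = torch.nn.Conv1d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
x = torch.randn(1, 64, 96)

# For tiny problem sizes, intra-op threading overhead can dominate, so any
# harness should measure at 1 thread as well as at the default thread count.
for n in (1, torch.get_num_threads()):
    t = benchmark.Timer(stmt="conv(x)", globals={"conv": conv, "x": x},
                        num_threads=n)
    print(f"num_threads={n}:", t.timeit(200))
```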

If you have a small depthwise conv1d, then IMO it’s better to just hand-roll something and tune it with Inductor / TVM – or write a pass for Inductor CPU. I think you’ll get a lot more mileage out of that, because these are basically bandwidth-bound computations.
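For illustration, one way to hand-roll this (a sketch, not a tuned kernel: a size-3 depthwise conv1d written as shifted elementwise ops, which Inductor can fuse into one loop; `torch.compile` is lazy, so actual compilation happens on the first call):

```python
import torch
import torch.nn.functional as F

def depthwise3(x, w):
    # x: [B, C, T], w: [C, 3]; 'same' zero padding.
    xp = F.pad(x, (1, 1))  # [B, C, T+2]
    return (xp[:, :, :-2] * w[:, 0:1]
            + xp[:, :, 1:-1] * w[:, 1:2]
            + xp[:, :, 2:] * w[:, 2:3])

# Hand the whole expression to Inductor so it can fuse the three
# multiply-adds into a single pass over memory.
compiled = torch.compile(depthwise3)

x = torch.randn(1, 64, 96)
w = torch.randn(64, 3)
# Sanity check the eager version against the builtin depthwise conv:
ref = F.conv1d(x, w.unsqueeze(1), padding=1, groups=64)
print(torch.allclose(depthwise3(x, w), ref, atol=1e-5))
```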


Thank you, Soumith, for the ideas! By tuning something hand-rolled with Inductor, do you mean writing a loop that applies a standalone conv to each channel, concatenates the results, and writes them out? In the hope that Inductor / TVM could parallelize it better?

Regarding benchmarks, I found that Inductor / AITemplate / OneFlow / kernl (GitHub - ELS-RD/kernl: Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.) / TensorRT / FasterTransformer are all running their own benchmarks :slight_smile:

I think there would be a lot of value in creating some sort of common/shared test harness (or at least sample problem sizes/configurations) for both micro-benchmarks and for integrated tests of whole models (with fixed “reference” model code) - maybe under the PyTorch Foundation? Or by contributing micro-benchmarks to, and democratizing, GitHub - mlcommons/inference: Reference implementations of MLPerf™ inference benchmarks?

The goal is to have robust samples and a very simple harness that yields reliable, up-to-date numbers on known, fixed hardware. I guess inference benchmarks are not that expensive to run on a cluster (especially single-GPU setups).


What you describe is a lot like GitHub - pytorch/benchmark: TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance (with some handlers at …). No?

Maybe! But I have never seen the results of these benchmarks published to an HTML page (it would be good to auto-push them to some public page).

It would also help if it somehow included ways to benchmark alternative kernels (like AITemplate / TVM / OneFlow / kernl). I checked out mlcommons/inference as well, but it seems to cover only a few, not very modern, architectures and to be mostly vendor-driven (so it somehow does not tap into the community effort the way the timm community does).

You can check out the PyTorch CI HUD, which has end-to-end model speedups - it’s not too focused on micro-benchmarks right now, though.


I would say these benchmarks currently target PyTorch’s own development, and the CI is oriented towards that. Publishing user-facing perf numbers is a slightly different story.

It would be nice to have something for perf-minded users. E.g. FlashAttention publishes a plot that makes PyTorch look extra bad: GitHub - Dao-AILab/flash-attention: Fast and memory-efficient exact attention - probably it somehow makes sure not to use PyTorch’s own integration of FlashAttention :slight_smile: It would be much better if PyTorch accepted contributions of such benchmarks and ran them itself. Same for bitsandbytes. People often struggle to reproduce these benchmarks, and small package authors don’t have the resources to invest in regular benchmarking, so PyTorch core could probably help them (and users) out by providing concise, reliable harnesses for both macro- and micro-benchmarks.

I think this is even more important now, given that there are 5-10 different compiler stacks and the situation changes every day. For honest results, users currently need to invest in studying each of these compilers and preparing apples-to-apples benchmarks themselves (e.g. there does not even exist a faithful ORT vs TRT benchmarking harness). If PyTorch invested in this (e.g. a more user-friendly, more concise, broader alternative to mlcommons/inference, including micro-benchmarks), the users of all compilers would benefit a lot - especially given that PyTorch’s own Inductor is improving rapidly, so having automatic comparisons of Inductor to other backends, with contributions from other backend authors, would be nice!