Why does the Winograd algorithm speed up convolution, given that MUL and ADD cost the same number of clock cycles on a GPU?

Note: this is a general (algorithmic & CUDA) question, not specific to PyTorch.

The Winograd algorithm transforms the normal convolution computation by reducing the number of MUL operations while increasing the number of ADD operations, and the number of MULs saved is smaller than the number of additional ADDs. However, the CUDA programming guide says that floating-point MUL and ADD both cost 4 clock cycles.

Question: how does Winograd speed up, rather than slow down, the convolution computation?
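
For concreteness, here is a small 1-D sketch of the kind of transform I mean, the F(2,3) minimal filtering case (my own toy code, the names are only illustrative): two outputs of a 3-tap filter take 4 MULs instead of 6, at the cost of extra ADDs.

```python
import numpy as np

def direct_f23(d, g):
    # Direct 3-tap correlation, two outputs: 6 MUL + 4 ADD.
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

def winograd_f23(d, g):
    # Winograd F(2,3): the same two outputs with only 4 MUL, but more ADDs.
    # Filter transform (can be precomputed once per filter).
    G = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # Input transform: 4 ADDs.
    D = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # Element-wise stage: 4 MULs.
    m = [Di * Gi for Di, Gi in zip(D, G)]
    # Output transform: 4 more ADDs.
    return np.array([m[0] + m[1] + m[2], m[1] - m[2] - m[3]])

d, g = np.random.randn(4), np.random.randn(3)
assert np.allclose(direct_f23(d, g), winograd_f23(d, g))
```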

Hello zfzhang!

The Winograd matrix-multiplication algorithm does not speed up
convolution. I am not aware that the Winograd algorithm is ever
used in practice.

Note that the Winograd algorithm does reduce the total number of additions needed. The confusion might be that the algorithm is recursive: at a given step in the recursion, the number of matrix additions is, indeed, larger than in the straightforward algorithm, but the number of matrix multiplications is reduced, which more than compensates as you recurse down to the bottom.

(Some notes: The straightforward algorithm for multiplying n x n matrices has time complexity O(n^3). The Winograd algorithm is one of a series of sub-O(n^3) matrix-multiplication algorithms, but they are all primarily of theoretical interest and are used rarely (if ever) in practice. They only become competitive with the “naive” algorithm when n is quite large because, even counting only floating-point operations, they have much larger pre-factors. Furthermore, they break the benefits of floating-point pipelines, so in practice they sustain significantly fewer floating-point operations per unit of time. Nonetheless, this whole story is quite interesting (if you care about such things). I believe it all started with Strassen’s algorithm.)
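
To illustrate what “fewer multiplications per recursion level, at the cost of more additions” looks like, here is a minimal, untuned sketch of Strassen’s scheme (my own toy code, not something you would use in practice); the cutoff for switching to the naive product is arbitrary:

```python
import numpy as np

def strassen(A, B):
    # Strassen-style recursive multiply for square matrices whose size is a
    # power of two: 7 half-size products per level (the straightforward block
    # algorithm needs 8), plus 18 half-size additions/subtractions.
    n = A.shape[0]
    if n <= 64:                       # arbitrary cutoff: fall back to naive
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 recursive products.
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    # Extra additions to reassemble the result.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.randn(256, 256), np.random.randn(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```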

Best.

K. Frank

Hi KFrank, thanks for the prompt reply :slight_smile:

Current research does indicate that Winograd speeds up convolution, see:

https://arxiv.org/abs/1509.09308

http://cs231n.stanford.edu/reports/2016/pdfs/117_Report.pdf

Hi zfzhang!

These two papers you cite:

https://arxiv.org/abs/1509.09308
and
http://cs231n.stanford.edu/reports/2016/pdfs/117_Report.pdf

refer to a “minimal filtering algorithm” published by Winograd in 1980.

The Wikipedia link you give for the “Winograd algorithm” in your original post is for the Strassen-like Coppersmith-Winograd matrix-multiplication algorithm published by them in 1990. Same Winograd, two different algorithms.

I’m willing to believe that Winograd’s minimal-filtering algorithm can
be used in practice to speed up CNNs.

Best.

K. Frank

For the sake of completeness: Winograd algorithms can be picked by cudnnFind (via torch.backends.cudnn.benchmark = True) if they are the fastest algorithm for the particular workload.
E.g. the cuDNN docs mention it here.
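
Something like the following (the layer and input shapes are just placeholders) lets cudnnFind profile the available conv algorithms and cache the fastest one, which may be a Winograd kernel, for each input shape:

```python
import torch
import torch.nn as nn

# Let cudnnFind profile the available convolution algorithms and cache the
# fastest one (which may be a Winograd kernel) per input shape.
torch.backends.cudnn.benchmark = True

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 64, 56, 56, device="cuda")

out = conv(x)  # the first call with this shape triggers the benchmarking
```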

Hi ptrblck,

my point is: the algorithm assumes that MUL is more expensive than ADD, but this assumption doesn’t seem to hold in the CUDA computation model, where both cost 4 clock cycles. (Side note: on a CPU the cost ratio is typically 4:1.)
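
Putting rough numbers on the 1-D F(2,3) case from my first post (with the filter transform precomputed): the direct method needs 6 MUL + 4 ADD = 10 operations for two outputs, while Winograd needs 4 MUL + 8 ADD = 12 operations, so under an equal-cost model the raw operation count even goes up.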

That’s a valid concern from a theoretical point of view.
Note, however, that a lot of operations are not compute-bound but memory-bandwidth-bound.
I.e. the GPU computation would be “fast enough”, but the overall speed is limited by the reads and writes of the data.
Even if MUL and ADD cost the same on GPUs, which could theoretically mean that there wouldn’t be any expected speedup from Winograd, in reality the Winograd kernel’s performance might still be better than that of other (less compute-intensive) kernels.

cuDNN in particular uses the benchmark mode to profile different algorithms and should only select a Winograd kernel if it’s the fastest one (and also meets other requirements specified by the user, such as determinism).
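
For example, the determinism requirement can be expressed like this, which restricts the algorithm selection to deterministic kernels:

```python
import torch

# Profile the available conv algorithms and pick the fastest one per shape ...
torch.backends.cudnn.benchmark = True
# ... but only among deterministic kernels, which can exclude some candidates.
torch.backends.cudnn.deterministic = True
```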