Note: this is a general (algorithmic & CUDA) question, not related to Pytorch.

The Winograd algorithm transforms normal convolution computation by reducing MUL and at the same time increasing ADD operations. The number of MUL reduced is less than the number of ADD increased. However, CUDA programming guide says both floating point MUL & ADDequally costs 4 clock cycles.

Question: How does Winograd speedup rather than slowdown convolution computation?

The Winograd matrix-multiplication algorithm does not speed up
convolution. I am not aware that the Winograd algorithm is ever
used in practice.

Note, the Winograd algorithm does reduce the total number of
additions needed. The confusion might be that the algorithm is
recursive, and, at a given step in the recursion, the number of matrix additions is, indeed, larger than in the straightforward
algorithm, but the number of matrix multiplications is reduced,
more than compensating as you recurse down to the bottom.

(Some notes: The straightforward algorithm for multiplying n x n
matrices has time-complexity O(n^3). The Winograd algorithm is
one of a series of sub-O(n^3) matrix-multiplication algorithms, but
they are all primarily of purely theoretical interest, and are used
rarely (if ever) in practice. They only become competitive with the
“naive” algorithm when n is quite large, because, even when
only counting floating-point operations, they have much larger
pre-factors. Furthermore, they break the benefits of floating-point
pipelines, so, in practice, they make use of significantly fewer
floating-point operations per unit of time. Nonetheless, this whole
story is quite interesting (if you care about such things). I believe
it all started with Strassen’s algorithm.)

For the sake of completeness: winograd algos can be picked by cudnnFind (via torch.backends.cudnn.benchmark = True), if they are the fastest algo for this particular workload.
E.g. the cudnn docs mention it here.

my point is: the algorithm makes the assumption that MUL is more expensive than ADD, but this assumption doesn’t seem to hold in the CUDA computation model, where they both cost 4 clock cycles. (side note: on CPU the cost ratio is typically 4:1)

That’s a valid concern from a theoretical point of view.
Note however, that a lot of operations are not compute bound, but often memory bandwidth bound.
I.e. the GPU computation would be “fast enough”, but the overall speed is limited by the reads and writes of the data.
Even if the MUL and ADD are more expensive on GPUs, which could theoretically mean that there wouldn’t be any expected speedup using Winograd, in reality the kernel performance might still be better than other (less compute-intensive) kernels.

cudnn in particular uses the benchmark mode to profile different algorithms and should only select the Winograd kernel, if it’s the fastest one (and also meets other requirements specified by the user such as determinism).