Let’s say that I have a linear layer = nn.Linear(1024,4096).
If the input data for this layer is skinny, for example a torch.rand((1,1024)) tensor,
one may predict that the operation would be latency-bound rather than memory-bound,
as the operation time itself will be short, and the runtime will mainly be used on reading
data from memory rather than computation.
In this case, what sets apart the memory-boundness and latency-boundness is the
length of total runtime excluding the head/tail memory read/write times.
But I am curious whether there is a quantitative way of demonstrating the latency-boundness
of PyTorch operations. Any suggestions?
In general, this is hard because measuring these things can be very delicate, and the performance characteristics of operations can be quite intricate (see e.g. this account of the impact of memory alignment by Andrej Karpathy (@karpathy), via nitter: "the latency of the entire training loop, the whole network. yes it's that bad.").
That said, if we imagine the latency (kernel launch overhead etc.) to be fixed (independent of size) and the memory traffic time to be (at least) proportional to input size, you would expect that, in the latency-bound regime, doubling the size does not double the runtime. The other part is that you essentially know (or can measure) how much time it takes to launch a kernel and sync, so that might be the latency(?).
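A rough sketch of that experiment, in case it helps. Everything here is my own assumption rather than an established recipe: the helper name `time_op`, the choice of a tiny in-place add as a stand-in for "launch a kernel and sync", and the decision to scale `out_features` (so that the weight bytes read grow with the size) while keeping the input at batch size 1. It uses wall-clock timing around a synchronize, so each number includes launch and sync overhead, and it falls back to CPU if no GPU is available.

```python
import time

import torch
import torch.nn as nn


def time_op(fn, device, warmup=10, iters=50):
    """Median wall-clock time of fn() in microseconds, syncing after each call."""
    def sync():
        if device.type == "cuda":
            torch.cuda.synchronize()

    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # wait for the GPU so the kernel itself is included
        samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    return samples[len(samples) // 2]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumed baseline: a one-element op, i.e. roughly "launch + sync" and little else.
tiny = torch.zeros(1, device=device)
baseline_us = time_op(lambda: tiny.add_(1), device)

# Double the layer size each step; the fp32 weight bytes double with it.
results = {}
x = torch.rand((1, 1024), device=device)
with torch.no_grad():
    for out_features in (256, 512, 1024, 2048, 4096):
        layer = nn.Linear(1024, out_features).to(device)
        results[out_features] = time_op(lambda: layer(x), device)

print(f"baseline (tiny op + sync): {baseline_us:.1f} us")
prev = None
for out_features, us in results.items():
    weight_mib = out_features * 1024 * 4 / 2**20  # fp32 weights only
    ratio = "" if prev is None else f"  (x{us / prev:.2f} vs previous size)"
    print(f"out_features={out_features:5d}  weights={weight_mib:5.1f} MiB  "
          f"time={us:8.1f} us{ratio}")
    prev = us
```

If the times stay close to the baseline and the ratio between consecutive sizes stays near 1, that would suggest the fixed overhead dominates; ratios approaching 2 would suggest the runtime has started to scale with the bytes moved. The usual caveats apply: the numbers are noisy, and clocks, caches and library kernel selection can all shift the picture.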