How is broadcasting implemented?

We all know that broadcasting is one of the most important tools in numpy and pytorch: when shapes differ, an array is automatically broadcast to match the shape of the other operand. But how is this actually done?
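As a quick reminder of the behavior in question, here is a minimal numpy example: a 1-D array is broadcast across each row of a 2-D array.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # shape (2, 3)
b = np.array([10, 20, 30])      # shape (3,)

# b is treated as if it had shape (1, 3) and is repeated along axis 0.
print(a + b)  # → [[10 21 32]
              #    [13 24 35]]
```

No copy of `b` is made; the repetition is purely virtual, which is exactly what the mechanism below explains.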

Essentially, a binary op walks the two inputs with two read pointers. When the shapes are identical, the pointers advance in lockstep. When one of the tensors is smaller, its read pointer is moved less often: broadcasting is only allowed where a dimension has size 1, and along such a dimension the smaller tensor's pointer stays fixed (its stride for that dimension is effectively 0) while the other pointer iterates. Once a dimension is exhausted, both pointers jump to the next memory block according to their tensors' strides. Shapes are aligned this way from right to left.
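The pointer mechanics above can be sketched in plain Python. This is not how numpy or pytorch actually implement it (their loops live in C/C++), just an illustration under the same idea: right-align the shapes, give every size-1 dimension a stride of 0 so its pointer never advances along that axis, and compute each input's flat offset from the output index. `broadcast_add` is a hypothetical helper name, and element-wise addition stands in for any binary op.

```python
import numpy as np

def broadcast_add(a, b):
    """Illustrative sketch: add two arrays by walking their flat
    buffers with per-dimension element strides, where a broadcast
    (size-1) dimension gets stride 0 and its pointer stays fixed."""
    # Right-align the shapes, padding the shorter one with leading 1s.
    ndim = max(a.ndim, b.ndim)
    shape_a = (1,) * (ndim - a.ndim) + a.shape
    shape_b = (1,) * (ndim - b.ndim) + b.shape
    out_shape = tuple(max(da, db) for da, db in zip(shape_a, shape_b))

    # C-order element strides; 0 wherever the dimension is broadcast.
    def strides(shape):
        s, acc = [], 1
        for dim in reversed(shape):
            s.append(acc if dim > 1 else 0)
            acc *= dim
        return tuple(reversed(s))

    sa, sb = strides(shape_a), strides(shape_b)
    fa, fb = a.ravel(), b.ravel()  # flat views of the raw buffers
    out = np.empty(out_shape, dtype=np.result_type(a, b))

    for idx in np.ndindex(out_shape):
        # Flat offset = sum of index * stride over all dimensions;
        # stride-0 dims contribute nothing, so that pointer is "fixed".
        ia = sum(i * s for i, s in zip(idx, sa))
        ib = sum(i * s for i, s in zip(idx, sb))
        out[idx] = fa[ia] + fb[ib]
    return out
```

You can check that real numpy does the same trick: `np.broadcast_to(np.array([10, 20, 30]), (2, 3)).strides` reports a stride of 0 bytes along the repeated axis.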