The class definition of LinearFunction here confuses me in one way:

Is the bias here a vector of size output.shape where all the entries are the same?
Wouldn't the proper definition (where all the entries of the bias are free) have grad_bias = grad_output?

Or is there a misunderstanding on my part? Thanks.

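You need to keep in mind that this module expects a batch of inputs, so the output has shape (batch, out_features) while the bias only has shape (out_features,). All entries of the bias are free; it is just smaller than the output.
In the forward pass, the bias is automatically broadcast over all the elements in the batch. In the backward pass, we need the backward of this broadcast along the 0th dimension, which is a sum over that dimension: grad_bias = grad_output.sum(0).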
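
If it helps, here is a small sketch that checks this against autograd. The shapes, the random tensors and the variable names (batch, in_features, out_features, manual_grad_bias) are made up for illustration and are not taken from the LinearFunction example itself.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not from the original example).
batch, in_features, out_features = 4, 3, 2

x = torch.randn(batch, in_features)
weight = torch.randn(out_features, in_features, requires_grad=True)
bias = torch.randn(out_features, requires_grad=True)

# Forward: the bias of shape (out_features,) is broadcast over the batch dimension.
output = x.mm(weight.t()) + bias  # shape (batch, out_features)

# An arbitrary upstream gradient with the same shape as the output.
grad_output = torch.randn_like(output)
output.backward(grad_output)

# Backward of the broadcast, written by hand: sum over the batch dimension.
manual_grad_bias = grad_output.sum(0)

bias_grad_matches = torch.allclose(bias.grad, manual_grad_bias)
print(bias.grad.shape, grad_output.shape, bias_grad_matches)
```

Note that bias.grad has the shape of the bias, not the shape of grad_output, which is why grad_bias = grad_output would not work for a batched input.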