Look, the class is defined to take inputs with 784 features, while I sent in a batch of 100 samples by reshaping with [-1, 784], and the class processed the whole batch, which is not a behavior I defined. I thought broadcasting was at work here, but in a confusing way.
It is exactly the defined behavior.
The linear layer expects inputs of shape [batch_size, in_features]. Your input has 100 samples, and flattening it to [-1, 784] produces shape [batch_size=100, in_features=784], so the layer processes the batch exactly as specified. No broadcasting is involved here: the leading batch dimension is expected by all layers.
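A minimal sketch of this behavior (the layer sizes 784 → 10 and the 100-sample batch are illustrative assumptions matching the shapes discussed above):

```python
import torch
import torch.nn as nn

# Linear layer declared for single samples of 784 features
layer = nn.Linear(in_features=784, out_features=10)

# A fake batch of 100 flattened 28x28 images
x = torch.randn(100, 1, 28, 28).view(-1, 784)  # shape [100, 784]

# The layer applies its weights to every row of the batch;
# the leading batch dimension is handled by design, not by broadcasting.
out = layer(x)
print(out.shape)  # torch.Size([100, 10])
```

The same layer also accepts extra leading dimensions (e.g. [batch, seq, 784]); only the last dimension must equal `in_features`.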