I am wondering what causes the difference between the following two way of writing the same operation? Here ‘‘s’’ is a matrix with shape (batch_size x 4), and ‘‘model’’ is a DNN.
batch size will be used by some layers (e.g. batch norm). If your network is complicated, it’s not surprised that different batch sizes have different results.
This was my initial guess as well, but it is not consistent with the following experiment. For example, let’s say all rows of ss is the same. Then the following set of operation results in exact zero difference.