Why can't we compute the mean and std used for image normalization per image and per channel?

It’s a long story.

Check this plot (which is totally off topic, from an MNIST example).


If you normalize your pixels with respect to each image, you lose the differences between images: every image is mapped onto the same standardized distribution (zero mean, unit std), like fig 2.
If you normalize with respect to the dataset, each image still has its own distribution, with its own mean and std.
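
To make the difference concrete, here is a minimal NumPy sketch. The toy data is made up for illustration (four random single-channel 8x8 "images" with different brightness and contrast), so the names `images`, `per_image` and `dataset_wise` are my own and not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: 4 single-channel 8x8 images,
# each with its own brightness (offset) and contrast (scale).
offsets = np.array([0.1, 0.3, 0.5, 0.7]).reshape(4, 1, 1)
scales = np.array([0.02, 0.05, 0.10, 0.20]).reshape(4, 1, 1)
images = offsets + scales * rng.standard_normal((4, 8, 8))

# Per-image normalization: statistics computed separately for every image.
img_mean = images.mean(axis=(1, 2), keepdims=True)
img_std = images.std(axis=(1, 2), keepdims=True)
per_image = (images - img_mean) / img_std

# Dataset-wise normalization: one mean and one std for the whole dataset.
dataset_wise = (images - images.mean()) / images.std()

# After per-image normalization every image has mean ~0 and std ~1,
# so brightness and contrast differences between images are gone.
print(per_image.mean(axis=(1, 2)), per_image.std(axis=(1, 2)))

# After dataset-wise normalization each image keeps its own mean and std.
print(dataset_wise.mean(axis=(1, 2)), dataset_wise.std(axis=(1, 2)))
```

In the first print all four images look statistically identical; in the second the dark, low-contrast image and the bright, high-contrast one are still distinguishable.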

You can probably still classify by learning the distributions over images, but there is nothing to gain from losing information about the differences between images.

In addition, per-channel estimators are already a simplification: the ideal classification setup would use per-pixel estimators rather than per-channel ones.
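
A rough sketch of what the two kinds of estimators look like in practice, assuming a dataset stored as a NumPy array in `(N, C, H, W)` layout. The random data and the variable names are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 images, 3 channels, 4x4 pixels, (N, C, H, W) layout.
data = rng.random((100, 3, 4, 4))

# Per-channel estimators: one mean/std per channel,
# pooled over all images and all pixel positions.
channel_mean = data.mean(axis=(0, 2, 3))
channel_std = data.std(axis=(0, 2, 3))

# Per-pixel estimators: one mean/std for every (channel, row, column),
# pooled over images only.
pixel_mean = data.mean(axis=0)
pixel_std = data.std(axis=0)

print(channel_mean.shape, pixel_mean.shape)  # (3,) (3, 4, 4)

# Broadcasting the per-channel statistics over height and width.
normalized = (data - channel_mean[None, :, None, None]) / channel_std[None, :, None, None]
```

The per-channel version keeps 3 means and 3 stds, while the per-pixel version keeps one pair for each of the 3x4x4 positions, which is why the per-channel choice counts as the simplified one.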

Anyway, normalization was dataset-wise because ideally we are supposed to iterate over the whole dataset instead of using a mini-batch strategy. There are arguments about using batch-wise normalization instead of dataset-wise normalization.