Batch size at inference influences accuracy

I've created a custom computer vision model which randomly samples points from a LiDAR point cloud and performs semantic segmentation. It is a fairly simple model: it embeds the raw points, applies batch normalisation, and runs them through a series of 8 transformer blocks. A set of dense layers in the head produces the output, which separates the points into two classes, trained with a cross-entropy loss.
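For reference, a minimal sketch of the architecture (simplified, with placeholder layer sizes and a generic transformer encoder, not my exact code):

```python
import torch
import torch.nn as nn

class PointSegmenter(nn.Module):
    """Sketch: per-point embedding -> batch norm -> 8 transformer blocks -> dense head."""
    def __init__(self, in_dim=3, d_model=128, n_heads=4, n_blocks=8, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)           # embed raw (x, y, z) points
        self.bn = nn.BatchNorm1d(d_model)                 # batch norm over the feature dim
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.head = nn.Sequential(                        # dense head -> 2 classes per point
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_classes))

    def forward(self, x):                                 # x: (batch, n_points, in_dim)
        h = self.embed(x)
        h = self.bn(h.transpose(1, 2)).transpose(1, 2)    # BatchNorm1d expects (B, C, N)
        h = self.blocks(h)
        return self.head(h)                               # logits: (batch, n_points, n_classes)
```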
During training the model maxes out at about 97.5% accuracy. However, my accuracy at test time depends on the batch size:
batch size = 1: accuracy = 91%
batch size = 2: accuracy = 95%
batch size = 3: accuracy = 96%
batch size = 4: accuracy = 97%
batch size = 20: accuracy = 97.4%
Why should this be the case?
I'm using model.eval() and torch.no_grad(). I've tried this with and without batch normalisation - no massive difference.
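For context, my evaluation loop looks roughly like the following (simplified sketch; evaluate, test_set and the accuracy bookkeeping are placeholders, not my exact code):

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate(model, test_set, batch_size, device="cuda"):
    """Point-wise accuracy at a given inference batch size."""
    model.eval()                                   # disable dropout, use BN running stats
    loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)
    correct, total = 0, 0
    for points, labels in loader:                  # points: (B, N, 3), labels: (B, N)
        points, labels = points.to(device), labels.to(device)
        logits = model(points)                     # (B, N, 2)
        preds = logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# The sweep that produced the numbers above:
# for bs in (1, 2, 3, 4, 20):
#     print(bs, evaluate(model, test_set, bs))
```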