Does the number of gradient accumulation steps affect the model's performance?

Your gradient accumulation approach might change the model's performance if you are using batch-size-dependent layers such as batchnorm layers.
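
To make the setup concrete, here is a minimal sketch of gradient accumulation with a batchnorm layer. All names (`model`, `criterion`, `accum_steps`, the toy data, etc.) are illustrative and not taken from your code; the point is that batchnorm only ever sees the small micro-batch, even though the optimizer step corresponds to a larger effective batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup; all names here are illustrative placeholders.
model = nn.Sequential(
    nn.Linear(10, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 2)
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4        # optimizer step every 4 micro-batches
micro_batch_size = 8   # batchnorm only ever sees 8 samples at a time

optimizer.zero_grad()
for i in range(16):  # 16 micro-batches -> 4 optimizer steps
    inputs = torch.randn(micro_batch_size, 10)
    targets = torch.randint(0, 2, (micro_batch_size,))
    # Scale the loss so the accumulated gradient matches one large batch
    loss = criterion(model(inputs), targets) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

The accumulated gradients match a single large batch of `accum_steps * micro_batch_size` samples (for batch-size-independent layers), but the batchnorm statistics are computed per micro-batch.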
Batchnorm layers use the current batch statistics to update their running stats. The smaller the batch size, the noisier these updates will be. You could try to counter this effect by lowering the momentum of these layers.
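
In PyTorch the running stats are updated as `running = (1 - momentum) * running + momentum * batch_stat` with a default `momentum` of 0.1, so a smaller value gives less weight to each noisy micro-batch estimate. A small sketch, reusing the `model` from above; the value 0.01 is just an example you would tune:

```python
import torch.nn as nn

# Lower the momentum of all batchnorm layers to smooth the running-stat updates.
for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        module.momentum = 0.01  # illustrative value; PyTorch's default is 0.1
```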

Let me know if that might be the case, or if you are not using batchnorm layers.
