I’ve already tried to max out the batch size. To actually keep the GPU busy (the utilization figure that nvidia-smi reports in its rightmost column), does increasing the number of gradient accumulation steps help?
I’m noticing low GPU utilization and am wondering about ways to maximize it without needing to increase the batch size.
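
To make the question concrete, here is a minimal sketch of what I understand gradient accumulation to do. It is plain Python with a toy one-parameter model, not my actual training code; the names `grad_mse` and `accumulated_grad` and the data are made up for illustration. As far as I can tell, accumulating over several micro-batches reproduces the gradient of one larger batch, but each forward/backward pass still runs on a micro-batch of the same size, which is why I am unsure whether it changes utilization at all.

```python
# Toy illustration of gradient accumulation (hypothetical names, not a real training loop).

def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1 / accumulation steps."""
    assert len(xs) % micro_batch_size == 0, "assumes equal-sized micro-batches"
    accum_steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(accum_steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Each pass only ever sees micro_batch_size samples at once.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad_mse(0.5, xs, ys)                              # one batch of 4
acc = accumulated_grad(0.5, xs, ys, micro_batch_size=2)   # 2 micro-batches of 2

print(full, acc)  # the two gradients should agree up to floating-point error
```

So the effective batch size grows with the number of accumulation steps, while the amount of work per pass stays the same.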