This pattern would point towards a data loading bottleneck, i.e. the workers are not fast enough in preparing the next batch relative to the model training.
This could be the case, if your model workload is small or if your data loading is the bottleneck as explained e.g. here.
1 Like