I suspect that the data augmentation is transforming the source images without applying the same transformation to the corresponding masks / labels, so the image and its target no longer line up.
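To illustrate what "the same augmentation" means, here is a minimal sketch in NumPy. The function name `augment_pair` and the choice of flips and 90-degree rotations are my own assumptions for illustration, not your actual pipeline; the point is only that the random parameters are drawn once and then applied to both arrays.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply one random geometric transform to BOTH image and mask.

    Hypothetical helper: image is (H, W, C), mask is (H, W).
    """
    # Draw the random decision once, then reuse it for both arrays.
    if rng.random() < 0.5:
        image = np.flip(image, axis=1)  # horizontal flip
        mask = np.flip(mask, axis=1)

    # Same number of 90-degree rotations for image and mask.
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))

    return image.copy(), mask.copy()

# Toy check: mark one pixel in both arrays and confirm they stay aligned.
rng = np.random.default_rng(0)
image = np.zeros((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.uint8)
image[0, 1] = 1.0
mask[0, 1] = 1

aug_image, aug_mask = augment_pair(image, mask, rng)
assert np.array_equal(aug_image[..., 0] > 0, aug_mask > 0)
```

If your framework augments images and masks through two separate random generators, the two transforms will usually differ, which would produce exactly the kind of mismatch described above.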
I would suggest training without the random data augmentation and recording how the loss value evolves across consecutive iterations. If the loss decreases normally once the augmentation is disabled, the augmentation step is the most likely cause.
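A rough sketch of that experiment, assuming your training code can be wrapped in a function that takes one batch and returns the loss. The names `run_training`, `train_step` and `make_toy_step` are hypothetical, and the toy one-parameter model only stands in for your real network so that the snippet runs on its own.

```python
import numpy as np

def run_training(train_step, batches, augment=None):
    """Run train_step over the batches and record the loss per iteration.

    Pass augment=None to disable augmentation entirely.
    """
    losses = []
    for x, y in batches:
        if augment is not None:
            x, y = augment(x, y)
        losses.append(float(train_step(x, y)))
    return losses

def make_toy_step(lr=0.1):
    """Toy stand-in for a real model: fits y = w * x by gradient descent."""
    w = np.zeros(1)

    def step(x, y):
        pred = w[0] * x
        loss = np.mean((pred - y) ** 2)
        grad = np.mean(2.0 * (pred - y) * x)
        w[0] -= lr * grad
        return loss  # loss measured before this update

    return step

# Toy data: the same batch repeated for 50 iterations.
x = np.linspace(-1.0, 1.0, 32)
y = 3.0 * x
batches = [(x, y)] * 50

losses = run_training(make_toy_step(), batches, augment=None)
print(losses[0], losses[-1])
```

Running the same loop once with `augment=None` and once with your augmentation function, then comparing the two loss curves, should show whether the augmentation is what prevents the loss from going down.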