Ground truth shape for multiclass segmentation

I am setting up a basic U-Net model to segment three classes. Each individual mask has the shape (H, W, 1), where each pixel has a value between 0 and 3. Can I use the masks as they are, or do they need to be transformed to a one-hot-encoded format?
I.e. can the target simply be (H, W, batch_size)?

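Assuming you are using nn.CrossEntropyLoss for the multi-class segmentation, you can use the masks as class indices and do not need to one-hot encode them. The model output should contain logits in the shape [batch_size, nb_classes, height, width], and the target should contain values in the range [0, nb_classes-1], have the shape [batch_size, height, width], and be a LongTensor (torch.long). The batch dimension comes first, so (H, W, batch_size) would not work, and the trailing channel dimension of size 1 has to be removed.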
Given that you are using class indices in [0, 3], you would be dealing with 4 classes (perhaps you meant 3 classes plus background), so nb_classes would be 4.
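
The shapes described above can be checked with a minimal sketch like the following. It assumes the masks are batched as [batch_size, H, W, 1] and uses random tensors as stand-ins for the real masks and for the U-Net output; the variable names and sizes are made up for illustration and are not taken from your code.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): 4 classes = 3 classes + background
batch_size, nb_classes, height, width = 2, 4, 8, 8

# Hypothetical batch of masks in the layout from the question: [batch_size, H, W, 1]
masks = torch.randint(0, nb_classes, (batch_size, height, width, 1))

# Drop the trailing channel dimension and make sure the dtype is torch.long
target = masks.squeeze(-1).long()  # [batch_size, H, W]

# Random stand-in for the model output: raw logits, no softmax applied
logits = torch.randn(batch_size, nb_classes, height, width, requires_grad=True)

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)  # scalar loss averaged over all pixels
loss.backward()

print(target.shape, logits.shape, loss.item())
```

If a single mask is stored as (H, W, 1), squeezing the last dimension inside the Dataset and letting the DataLoader stack the samples would give the same [batch_size, H, W] target.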