I have a four-GPU setup (24 GB each) where I am trying to train a DeepLabV3Plus model using the segmentation_models_pytorch library.
I am facing this error: ValueError: Caught ValueError in replica 0 on device 0. ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1])
The batch size I am using is 8.
Can you please help me resolve this issue?
Thank you!
It works with the default settings. The library offers a pool of encoders to choose from, and I am experimenting with different ones: for some it works, for some it doesn’t.
Thank you for replying so quickly!
Really appreciate it.
Right, I misread the original error. It looks like your “encoder” might be downsampling the input too much. Could you check whether downsampling is a setting that is available to you, or work around the issue by increasing the input size so that the spatial dimensions are not reduced to 1x1?
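To make the failure condition concrete: in training mode, BatchNorm needs more than one value per channel (batch × height × width > 1) to estimate statistics, which is exactly what the error message complains about for shape [1, 256, 1, 1]. A plain-Python sketch of that check (the function names here are mine, not torch’s):

```python
def values_per_channel(shape):
    """Number of values BatchNorm sees per channel for an NCHW input."""
    n, _, h, w = shape
    return n * h * w

def batchnorm_would_fail(shape):
    """True if training-mode BatchNorm would reject this input shape."""
    return values_per_channel(shape) <= 1

print(batchnorm_would_fail((1, 256, 1, 1)))  # True  -> the reported failing shape
print(batchnorm_would_fail((1, 256, 2, 2)))  # False -> larger spatial size is fine
print(batchnorm_would_fail((2, 256, 1, 1)))  # False -> a batch of 2 is also fine
```

So either a larger spatial map or more than one sample per replica avoids the error.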
Right, so you might look into:
decreasing encoder_output_stride, increasing upsampling, or increasing the resolution of the input image as possible solutions.
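To see how those suggestions interact, here is a rough sketch of the spatial size that reaches the deepest layers, assuming the encoder shrinks the input by `output_stride` in each spatial dimension (ceil division, as padded conv/pool stacks typically do; the numbers are illustrative, not taken from a specific encoder):

```python
import math

def encoder_spatial(input_size, output_stride):
    """Approximate spatial size of the encoder output for a square input."""
    return math.ceil(input_size / output_stride)

print(encoder_spatial(32, 32))   # 1 -> collapses to 1x1, error-prone
print(encoder_spatial(32, 16))   # 2 -> smaller encoder_output_stride helps
print(encoder_spatial(256, 32))  # 8 -> larger input resolution helps
```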
@eqy already explained why the error might be raised; however, it still doesn’t fit your description:
I have a four gpu setup (24 GB each) where I am trying to train a DeepLabV3Plus […] got input size torch.Size([1, 256, 1, 1])
I don’t know if you are using data parallel (I would assume so), which would yield a batch size of 2 on each of the 4 GPUs assuming the global batch size is 8. If the local batch size is set to 8, then of course each GPU should get 8 samples, while the error indicates a single sample.
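For reference, nn.DataParallel scatters the batch along dim 0 roughly like torch.chunk does: chunks of size ceil(n / num_gpus) until the batch is exhausted. A plain-Python approximation of that split (a sketch of the chunking arithmetic, not the actual scatter implementation):

```python
import math

def split_batch(batch_size, num_gpus):
    """Approximate per-replica batch sizes under torch.chunk-style splitting."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes

print(split_batch(8, 4))  # [2, 2, 2, 2] -> every replica sees 2 samples
print(split_batch(5, 4))  # [2, 2, 1]   -> a trailing replica sees 1 sample
print(split_batch(1, 4))  # [1]         -> single sample, BatchNorm will fail
```

One common way a replica ends up with a single sample is an incomplete last batch when the dataset size is not divisible by the batch size; DataLoader’s drop_last=True avoids this.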
This would mean that each of the four GPUs should process 2 samples for a global batch size of 8. Could you add print statements to the forward method and post the shape of the input as well as of all activation tensors? I guess you might either be using an invalid reshaping operation in the forward or your batch size is not 8.
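If editing the library’s forward method is inconvenient, forward hooks can record the output shape of every submodule instead. A sketch with a tiny stand-in model (attach_shape_hooks is my helper name, not a torch API; substitute your own model for the Sequential below):

```python
import torch
import torch.nn as nn

def attach_shape_hooks(model, log):
    """Register hooks that record (module_name, output_shape) on each leaf module."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            def hook(mod, inputs, output, name=name):
                log.append((name, tuple(output.shape)))
            handles.append(module.register_forward_hook(hook))
    return handles

# Tiny stand-in model: two stride-2 convs halve the spatial size twice.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
)
shapes = []
handles = attach_shape_hooks(model, shapes)
model(torch.randn(2, 3, 8, 8))
for name, shape in shapes:
    print(name, shape)  # prints each layer's output shape
for h in handles:
    h.remove()
```

Running this on the real model (with the real per-GPU input) should show exactly where the batch or spatial dimension collapses to 1.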