Number of output channels for binary segmentation

It might work if you flatten the output and targets, but I would rather stick to the explicit shape of [batch_size, 1, height, width].
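As a minimal sketch of the shape question, assuming the loss in use is nn.BCEWithLogitsLoss with its default mean reduction (the thread does not show the actual model, so the random logits and the tensor sizes below are placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, height, width = 4, 8, 8  # placeholder sizes

# Stand-in for the model output: one raw logit per pixel, single channel.
logits = torch.randn(batch_size, 1, height, width)
# Binary mask with the same shape; the loss expects float targets.
target = torch.randint(0, 2, (batch_size, 1, height, width)).float()

criterion = nn.BCEWithLogitsLoss()

# Explicit [batch_size, 1, height, width] shape.
loss_4d = criterion(logits, target)
# Flattened variant: same elements, so the mean-reduced loss is the same.
loss_flat = criterion(logits.view(-1), target.view(-1))

print(loss_4d.item(), loss_flat.item())
```

Both calls should give the same value here. Keeping the explicit shape mainly makes a mismatch between output and target easier to spot.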
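Yes, pos_weight (an argument of nn.BCEWithLogitsLoss) is a reasonable way to counteract overfitting to the majority class, since it rescales the loss contribution of the positive pixels.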
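A rough sketch of how pos_weight could be set, assuming the positive class is the minority and using the common heuristic of the negative-to-positive pixel ratio (the heuristic, the synthetic mask, and the 10% foreground rate are assumptions for illustration, not something stated in this thread):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic imbalanced mask: roughly 10% of the pixels are positive.
target = (torch.rand(4, 1, 8, 8) < 0.1).float()
logits = torch.randn(4, 1, 8, 8)  # stand-in for the model output

# Heuristic: weight positives by num_negative / num_positive.
num_pos = target.sum()
num_neg = target.numel() - num_pos
# clamp avoids a division by zero if a mask has no positive pixel
pos_weight = (num_neg / num_pos.clamp(min=1)).reshape(1)

weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, target)
unweighted = nn.BCEWithLogitsLoss()(logits, target)

print(pos_weight.item(), weighted.item(), unweighted.item())
```

In practice the ratio would usually be estimated once over the training set rather than per batch, and it is worth treating the value as a starting point to tune.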