You are explicitly pushing the tensors to GPU0 via e.g.:

```python
torch.zeros((self.channel_num, self.size, self.size), device="cuda")
```
which will raise errors with wrappers such as nn.DataParallel or DistributedDataParallel: DataParallel pushes the model replicas to the specified devices automatically, while in DDP you would use the rank to select the device.
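For reference, a minimal DDP setup sketch where the rank selects the device, assuming a launch via torchrun (the nn.Linear model is just a stand-in for your own module):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for every spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # move the model to this process' GPU via the rank, not a hardcoded device
    model = nn.Linear(10, 10).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

if __name__ == "__main__":
    main()
```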
If you need to create a new tensor in the forward method, use the .device attribute of e.g. the input or a parameter:

```python
torch.zeros((self.channel_num, self.size, self.size), device=x.device)
```
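In context, a minimal sketch of such a module (MyModule and its arguments are made up for illustration, not taken from your code):

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self, channel_num, size):
        super().__init__()
        self.channel_num = channel_num
        self.size = size
        self.conv = nn.Conv2d(channel_num, channel_num, kernel_size=3, padding=1)

    def forward(self, x):
        # Created on x's device, so each DataParallel replica / DDP process
        # allocates the tensor on its own GPU instead of hardcoded GPU0.
        buf = torch.zeros(
            (self.channel_num, self.size, self.size), device=x.device
        )
        return self.conv(x) + buf

model = MyModule(channel_num=3, size=8)
out = model(torch.randn(2, 3, 8, 8))  # works on the CPU or any GPU
```

Since the tensor follows the input's device, the same module runs unchanged on the CPU, a single GPU, or each replica's GPU under nn.DataParallel / DDP.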