Network partitioning

I tried to train Real-ESRGAN on my dataset with a single GPU, but couldn't due to GPU memory limitations.
However, the machine I am running this model on has multiple GPUs, so I considered partitioning the model and training the sub-networks on multiple GPUs. Here is the code I used:

class SRResNet(nn.Module):

    def __init__(self, img_feat = 1, n_feats = 64, kernel_size = 3, num_block = 16, act = nn.PReLU(), scale=4):
        super(SRResNet, self).__init__()

        self.conv01 = conv(in_channel = img_feat, out_channel = n_feats, kernel_size = 9, BN = False, act = act).to('cuda:0')

        resblocks = [ResBlock(channels = n_feats, kernel_size = 3, act = act) for _ in range(int(num_block/2))]
        self.body1 = nn.Sequential(*resblocks).to('cuda:0')
        self.body2 = nn.Sequential(*resblocks).to('cuda:1')

        self.conv02 = conv(in_channel = n_feats, out_channel = n_feats, kernel_size = 3, BN = True, act = None).to('cuda:1')

        self.last_conv = conv(in_channel = n_feats, out_channel = img_feat, kernel_size = 3, BN = False, act = nn.Tanh()).to('cuda:1')

    def forward(self, x):

        x = self.conv01(x)
        _skip_connection = x.to('cuda:0')

        x = self.body1(x)
        x = self.body2(x.to('cuda:1'))
        x = self.conv02(x)
        copy_connection = _skip_connection.to('cuda:1')
        feat = x + copy_connection

        x = self.last_conv(feat)

        return x

I load SRResNet like:

model = SRResNet()

I don’t move the network to any specific device since I specified each layer’s device in the network implementation.

I also load my input tensor on GPU 0 (cuda:0) and my ground truth on GPU 1 (cuda:1).

I get the following error in the forward function:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument weight in method wrapper__prelu)

Any ideas on how to train this network on multiple GPUs?

The error points to PReLU, which is the default act argument and is used in:

self.conv01 = conv(in_channel = img_feat, out_channel = n_feats, kernel_size = 9, BN = False, act = act).to('cuda:0')

as well as:

resblocks = [ResBlock(channels = n_feats, kernel_size = 3, act = act) for _ in range(int(num_block/2))]
self.body1 = nn.Sequential(*resblocks).to('cuda:0')
self.body2 = nn.Sequential(*resblocks).to('cuda:1')

Based on this code, the second part looks wrong, as you are reusing references to the same resblocks while I assume you want to create different instances:

resblocks = [nn.Linear(10, 10)]
body1 = nn.Sequential(*resblocks).to('cpu')
print(body1[0].weight.device)
# cpu
print(resblocks[0].weight.device)
# cpu

# moving body2 also moves the shared module, and with it body1's parameters
body2 = nn.Sequential(*resblocks).to('cuda')
print(body1[0].weight.device)
# cuda:0
print(body2[0].weight.device)
# cuda:0
print(resblocks[0].weight.device)
# cuda:0

Here you can see that both self.bodyX modules reuse the same resblocks, and the second to() operation will thus also move self.body1 to cuda:1.
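For your model, that would mean building a fresh list of blocks for each body. A minimal CPU-only sketch of the idea (the make_blocks helper is hypothetical, and nn.Linear stands in for your ResBlock):

```python
import torch.nn as nn

# Hypothetical helper: builds a fresh list of module instances on every
# call, so the two bodies never alias the same objects.
def make_blocks(n):
    return [nn.Linear(10, 10) for _ in range(n)]

body1 = nn.Sequential(*make_blocks(2))
body2 = nn.Sequential(*make_blocks(2))

# The bodies hold independent parameters, so body1.to('cuda:0') and
# body2.to('cuda:1') would no longer interfere with each other.
print(body1[0].weight is body2[0].weight)  # False

# For comparison, reusing one list aliases the modules:
shared = [nn.Linear(10, 10)]
a = nn.Sequential(*shared)
b = nn.Sequential(*shared)
print(a[0] is b[0])  # True
```

Note also that the same applies to the single nn.PReLU() instance passed as act: it carries a learnable weight, so sharing one instance between conv01 (on cuda:0) and the resblocks (partly moved to cuda:1) can only leave that weight on one device. Creating a new activation module per block avoids this.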