Can you help me adapt the actor-critic example for multi-GPU?

I’m struggling to adapt the actor-critic example to a single-machine, multi-GPU setup.

Here’s a gist of what I’m working with (it contains all of the code that seems relevant; note that it won’t run as-is due to private reward and action classes).

The current failure is a size mismatch on line 55: RuntimeError: size mismatch, m1: [1 x 43], m2: [128 x 256], which tells me something is being split into 3 parts (128 / 3 ≈ 43).
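For reference, here’s a stripped-down sketch of what I think the relevant part of my setup is doing (the layer sizes are guessed from the error message, so treat it as an approximation rather than my exact code):

import torch
import torch.nn as nn

# hypothetical layer sizes taken from the error: 65 -> 128 -> 256
input_layer = nn.Linear(65, 128).cuda()
hidden_1 = nn.DataParallel(nn.Linear(128, 256).cuda(), device_ids=[0, 1, 2])

x = torch.randn(65).cuda()           # single state, no batch dimension
x = torch.sigmoid(input_layer(x))    # shape [128]
x = hidden_1(x)                      # DataParallel splits dim0: 128 -> chunks of ~43 per GPU -> size mismatch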

Any feedback would be great, as I’m new to splitting work across multiple GPUs and new to PyTorch in general.

Are you using 3 GPUs for nn.DataParallel?
Could you print the shape of x in your forward method at these lines of code:

elif self.num_cuda_devices > 1:
    print(x.size())                        # shape going into the input layer
    x = self.input_layer(x.float()).cuda()
    x = torch.sigmoid(x)
    print(x.size())                        # shape going into hidden_1
    x = self.hidden_1(x).cuda()

Yup, 3 GPUs, which are recognized by nvidia-smi and PyTorch.

When I add:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if torch.cuda.device_count() > 1:
    print("data parallel")
    model = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
    model.cuda()
model.to(device)

to my code, but not the per-layer wrapping in my model declaration (self.hidden_1 = nn.DataParallel(self.hidden_1)), the code runs fine but only uses one of the 3 GPUs instead of all of them.

The sizes on those lines are:

torch.Size([65])
torch.Size([128])

Thanks for the debugging!
Since some methods are missing in your gist, you would have to help me out with some more debugging. :wink:

It looks like the batch dimension is missing for x, but I’m not sure why, or how your model seems to be working even without nn.DataParallel.
Usually your input should have the shape [batch_size, nb_features] to be a valid input for a linear layer.
If you are using a single sample, it should be [1, 65] for your model.
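If you want to keep feeding a single sample, you could add the batch dimension right before the forward pass; here is a minimal sketch with a dummy 65-feature state (since I don’t have your data):

import torch

state = torch.randn(65)       # matches your print: torch.Size([65])
state = state.unsqueeze(0)    # add the batch dimension -> torch.Size([1, 65])
print(state.size())

unsqueeze(0) just inserts a new dimension of size 1 in front, so the linear layer sees [batch_size, nb_features].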

Since nn.DataParallel splits your data in dim0, you should provide a batch size that is a multiple of the number of GPUs. With your setup, that means a batch size of 3, 6, 9, ..., so that the data can be split evenly among the GPUs.
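Here is a small sketch of how the split in dim0 works; the model is just a dummy nn.Linear, and the print inside forward shows the chunk each GPU receives:

import torch
import torch.nn as nn

class Dummy(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(65, 128)

    def forward(self, x):
        print(x.size())          # each replica prints its own chunk
        return self.fc(x)

model = nn.DataParallel(Dummy(), device_ids=[0, 1, 2]).cuda()
x = torch.randn(6, 65).cuda()    # batch of 6 -> 2 samples per GPU
out = model(x)                   # prints torch.Size([2, 65]) three times; out has shape [6, 128]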

Let me know if you need some more help with debugging.

Yeah, I think what’s going on is that in RL we look at the state at one timestep (size: [65]). The batch size could be greater than 1, but we still pass a single timestep of state to the model, so the model doesn’t see [32, 65], just [65]. But nn.DataParallel expects shape[0] to be greater than 1, and like you say, it splits in dim0.

Most examples I see declare a Variable with something like [batch_size, num_features], but mine doesn’t, so I’ll have to figure out whether or not I want to keep passing a single example (I suppose if I want multiple GPUs, I do have to change it).
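If I do change it, I guess I’d collect a few states and stack them into a batch before the forward pass, something like this (just a sketch with random dummy states):

import torch

# pretend these are states collected from several timesteps
states = [torch.randn(65) for _ in range(6)]
batch = torch.stack(states)    # shape [6, 65], which could be split across the 3 GPUs
print(batch.size())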

Thanks for the feedback!