nn.DataParallel: TypeError: expected sequence object with len >= 0 or a single integer

pytorching · September 22, 2020, 6:01am

In my forward function:

def __call__(self, train=True):
    if train:
        predicted = self.forward(...)
        loss = ....
        return loss # return a single value that's fine
        # loss.size() = the number of my GPUs.
    else:
        predicted = self.forward(...)
        return predicted # expected sequence object with len >= 0 or a single integer
        # In validation step I want to return the whole predict labels for other purpose
        # predicted with shape [16, 1] on each device and I have 4 GPU

My code worked before I added model = nn.DataParallel(model).

albanD · September 22, 2020, 3:38pm

Hi,

This is hard to say without more context.
Can you share the stack trace for your function as well as where this __call__ function is defined?

pytorching · September 23, 2020, 2:16am

The code is here, though it is not very well organized. It has other problems that prevent me from using nn.DataParallel, but this is the only one I cannot solve.

github.com

lifanchen-simm/transformerCPI/blob/f9301880740975ddc1d56ce19f9eb52a6ad75933/Kinase/model.py#L315


    # protein = torch.unsqueeze(protein, dim=0)
    # protein =[ batch size=1,protein len, protein_dim]
    enc_src = self.encoder(protein)
    # enc_src = [batch size, protein len, hid dim]

    out = self.decoder(compound, enc_src, compound_mask, protein_mask)
    # out = [batch size, 2]
    # out = torch.squeeze(out, dim=0)
    return out

def __call__(self, data, train=True):

    compound, adj, protein, correct_interaction ,atom_num,protein_num = data
    # compound = compound.to(self.device)
    # adj = adj.to(self.device)
    # protein = protein.to(self.device)
    # correct_interaction = correct_interaction.to(self.device)
    #scale = torch.tensor([1.0,4.0], device=self.device)
    Loss = nn.CrossEntropyLoss()

    if train:

albanD · September 23, 2020, 1:56pm

The first issue is that you should never redefine the __call__ method on a Module, only forward. Redefining it will prevent the module from working nicely with other parts of PyTorch.

More generally, the error most likely refers to the creation of the DataParallel where the device argument does not have the right type.
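
A minimal sketch of the advice about forward, assuming a toy model: the class name TinyPredictor, its layer sizes, and the optional targets argument are hypothetical and not taken from the linked repository. The train/validation branching lives in forward, and the module is invoked as model(...), so nn.Module.__call__ stays untouched and wrappers such as nn.DataParallel can dispatch to forward.

```python
import torch
import torch.nn as nn


class TinyPredictor(nn.Module):  # hypothetical stand-in for the real model
    def __init__(self, in_dim=8, n_classes=2):
        super().__init__()
        self.linear = nn.Linear(in_dim, n_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x, targets=None):
        # Only forward is defined; __call__ is inherited from nn.Module.
        logits = self.linear(x)
        if targets is None:
            return logits  # validation: return the predictions
        return self.loss_fn(logits, targets)  # training: return the loss


torch.manual_seed(0)
model = TinyPredictor()
x = torch.randn(4, 8)
y = torch.tensor([0, 1, 1, 0])

loss = model(x, y)  # goes through nn.Module.__call__, then forward
preds = model(x)
```

Passing the targets as an optional argument is just one way to switch behaviour; checking self.training inside forward would be another.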

pytorching · September 23, 2020, 2:17pm

I’m trying to remove the __call__(). But I don’t understand which part of the device handling is wrong. To be honest, I don’t know which device input.to(device) and model.to(device) should use with nn.DataParallel; I just use device_ids[0].
Do you mean the bug is here? Thank you very much!

github.com

lifanchen-simm/transformerCPI/blob/f9301880740975ddc1d56ce19f9eb52a6ad75933/Kinase/main.py#L68


batch = 64
lr = 1e-4
weight_decay = 1e-4
iteration = 300
kernel_size = 9

encoder = Encoder(protein_dim, hid_dim, n_layers, kernel_size, dropout, device)
decoder = Decoder(atom_dim, hid_dim, n_layers, n_heads, pf_dim, DecoderLayer, SelfAttention, PositionwiseFeedforward, dropout, device)
model = Predictor(encoder, decoder, device)
# model.load_state_dict(torch.load("output/model/lr=0.001,dropout=0.1,lr_decay=0.5"))
model.to(device)
trainer = Trainer(model, lr, weight_decay, batch)
tester = Tester(model)

"""Output files."""
file_AUCs = 'output/result/AUCs--lr=1e-4,dropout=0.1,weight_decay=1e-4,kernel=9,n_layer=3,batch=64,balance,lookaheadradam'+ '.txt'
file_model = 'output/model/' + 'lr=1e-4,dropout=0.1,weight_decay=1e-4,kernel=9,n_layer=3,batch=64,balance,lookaheadradam'
AUC = ('Epoch\tTime(sec)\tLoss_train\tAUC_dev\tPRC_dev')
with open(file_AUCs, 'w') as f:
    f.write(AUC + '\n')

albanD · September 23, 2020, 2:25pm

As mentioned in the DataParallel doc: " The parallelized module must have its parameters and buffers on device_ids[0] before running this DataParallel module."

I can’t find any reference to DataParallel in the repo, so I’m not sure where you do that. But I was talking about the place where you wrap your module in DataParallel.
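
A minimal sketch of the order described in the quoted doc sentence, assuming a toy nn.Linear in place of the real Predictor (the layer sizes and batch size are made up). The module is moved to device_ids[0] first and wrapped afterwards; on a machine without CUDA the sketch simply skips the wrapping so it still runs.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # hypothetical stand-in for the real model

if torch.cuda.is_available():
    device_ids = list(range(torch.cuda.device_count()))
    device = torch.device(f"cuda:{device_ids[0]}")
    model.to(device)  # parameters and buffers on device_ids[0] first
    model = nn.DataParallel(model, device_ids=device_ids)  # then wrap
else:
    device = torch.device("cpu")  # no GPU: run the plain module

x = torch.randn(16, 8).to(device)  # inputs go to the same device
out = model(x)
```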

pytorching · September 23, 2020, 2:30pm

Sorry, it is just before the line I referenced.

model = Predictor(encoder, decoder, device)
# model.load_state_dict(torch.load("output/model/lr=0.001,dropout=0.1,lr_decay=0.5"))
model = nn.DataParallel(model, device_ids=[0,1,2,3])  # I add the code here
model.to(device)

pytorching · September 23, 2020, 2:36pm

Another question (sorry, I’m new to PyTorch).

According to an image in a blog post about nn.DataParallel, the first step of the backward pass (computing the loss gradient on GPU-1) results in imbalanced GPU usage.
Does that mean that in DataParallel loss.backward() only happens on GPU-1 and not on the other GPUs, while optimizer.step() and optimizer.zero_grad() run in parallel (steps 2, 3, 4 of the backward pass)?
Thank you very much.

albanD · September 23, 2020, 2:46pm

What DataParallel does is closer to version 3 of this image: split the input across the GPUs, run on each of them independently, then accumulate.
Note that the backward will run on the same device as the forward, whatever the device of the Tensor on which you call .backward().
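
As a rough illustration of the splitting step, assuming the batch size of 64 and the 4 GPUs mentioned earlier in the thread: torch.chunk along dimension 0 imitates how a batch is divided into per-device pieces, and torch.cat imitates collecting the per-device outputs. This runs on CPU and only mimics the shapes; it is not the DataParallel code path itself, and the feature size is arbitrary.

```python
import torch

n_devices = 4                   # assumed number of GPUs
batch = torch.randn(64, 10)     # batch of 64; feature size 10 is arbitrary

# Split along the batch dimension, one piece per device.
chunks = torch.chunk(batch, n_devices, dim=0)
chunk_sizes = [c.shape[0] for c in chunks]

# Stand-in for each replica's forward output of shape [16, 1].
per_device_out = [c.sum(dim=1, keepdim=True) for c in chunks]

# Collect the per-device outputs back into one batch-sized tensor.
gathered = torch.cat(per_device_out, dim=0)
```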

pytorching · September 23, 2020, 3:06pm

Then where does the imbalanced GPU usage come from?
You mean loss.backward() is also parallel, right?
I’m a little confused.

albanD · September 23, 2020, 3:22pm

It depends if the loss is inside the DataParallel or not.
If it is, then there won’t be any imbalance.
If it is outside and just computed on one GPU, then this GPU will do a bit more work indeed.

pytorching · September 23, 2020, 5:17pm

It depends if the loss is inside the DataParallel or not.

By “inside the DataParallel”, do you mean inside the forward function? But most of the time the forward function won’t contain the loss computation, right?
I’m also confused about whether the imbalance comes from loss.backward() or from loss = criterion(true, pred).
Thank you for your patience!

albanD · September 24, 2020, 1:44pm

The DataParallel takes a Module as input, so it can contain anything you want.
And yes, what is executed is what is in the forward function of your Module.

The imbalance won’t come from the loss.backward() because it runs at the same place as the forward. So if the forward is balanced, the backward will be as well.
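
A minimal sketch of putting the loss inside the wrapped Module, assuming a hypothetical wrapper class ModelWithLoss around a toy nn.Linear (neither is from the linked repository). Each replica computes its own loss in forward, so only one small loss value per replica has to be collected. Without CUDA the sketch runs the plain module on CPU.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):  # hypothetical wrapper, for illustration
    def __init__(self, model, loss_fn):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn

    def forward(self, x, targets):
        logits = self.model(x)
        loss = self.loss_fn(logits, targets)
        # Shape [1] so per-replica losses can be concatenated on gather.
        return loss.unsqueeze(0), logits


torch.manual_seed(0)
use_cuda = torch.cuda.is_available()
wrapped = ModelWithLoss(nn.Linear(8, 2), nn.CrossEntropyLoss())
if use_cuda:
    wrapped = nn.DataParallel(wrapped.cuda())

x = torch.randn(16, 8)
y = torch.randint(0, 2, (16,))
if use_cuda:
    x, y = x.cuda(), y.cuda()

optimizer = torch.optim.SGD(wrapped.parameters(), lr=0.1)
loss_per_replica, logits = wrapped(x, y)  # one loss entry per replica
loss = loss_per_replica.mean()            # reduce to a scalar
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Averaging the per-replica losses is one possible reduction; the thread's own training loop uses loss.sum() instead.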

pytorching · September 25, 2020, 2:39pm

Another weird problem with nn.DataParallel:
in my main.py I move the model to the device

encoder = Encoder(protein_dim, hid_dim, n_layers, kernel_size, dropout)
decoder = Decoder(atom_dim, hid_dim, n_layers, n_heads, pf_dim, DecoderLayer, SelfAttention,
                  PositionwiseFeedforward, dropout)
model = Predictor(encoder, decoder)
# model.load_state_dict(torch.load("output/model/lr=0.001,dropout=0.1,lr_decay=0.5"))
model = nn.DataParallel(model)
model.to(device)

trainer = Trainer(model, lr, weight_decay, scaler)
tester = Tester(model)
loss_train = trainer.train(train_dl, device=device)  # This line throws the error

But I got the following error

assert all(map(lambda i: i.is_cuda, inputs))
AssertionError

I have tested all model.parameters() and inputs in train():

def train(self, dataloader, device):
    self.model.train()

    if self.scaler is None:
        for i, data_pack in enumerate(dataloader):
            data_pack = to_cuda(data_pack, device=device)

            assert (all(map(lambda i: i.is_cuda, self.model.parameters())))
            assert (all(map(lambda i: i.is_cuda, data_pack)))
            loss, _, _ = self.model(data_pack)  # This line throws the error

            self.optimizer.zero_grad()
            loss.sum().backward()
            self.optimizer.step()

The results are all True. But I still get this error in the third line loss, _, _ = self.model(data_pack)
What happened?
This is my forward function:

def forward(self, data):
    compound, adj, protein, correct_interaction, atom_num, protein_num = data
    # compound = [batch,atom_num, atom_dim]
    # adj = [batch,atom_num, atom_num]
    # protein = [batch,protein len, 100]

    compound_max_len = compound.shape[1]
    protein_max_len = protein.shape[1]
    compound_mask, protein_mask = self.make_masks(atom_num, protein_num, compound_max_len, protein_max_len)
    compound = self.gcn(compound, adj)
    # compound = torch.unsqueeze(compound, dim=0)
    # compound = [batch size=1 ,atom_num, atom_dim]

    # protein = torch.unsqueeze(protein, dim=0)
    # protein =[ batch size=1,protein len, protein_dim]
    enc_src = self.encoder(protein)
    # enc_src = [batch size, protein len, hid dim]

    predicted_interaction = self.decoder(compound, enc_src, compound_mask, protein_mask)
    # out = [batch size, 2]
    # out = torch.squeeze(out, dim=0)
    loss = self.Loss(predicted_interaction, correct_interaction.view(-1, 1))
    return torch.unsqueeze(loss, 0), predicted_interaction.cpu().detach().view(-1, 1), correct_interaction.cpu().detach().view(-1, 1)

Thank you very much !!!

albanD · September 25, 2020, 2:46pm

From the DataParallel doc, you should send your model to the device before wrapping it in DataParallel!

pytorching · September 25, 2020, 2:49pm

You mean

model = nn.DataParallel(model)  # The order is wrong?
model.to(device)

model.to(device)  # This is right?
model = nn.DataParallel(model)

But in the doc

BTW, where is the complete doc of nn.DataParallel??

albanD · September 25, 2020, 2:52pm

Here: https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel

pytorching · September 25, 2020, 2:53pm

I have tested both orders. It doesn’t help.
Can you plz look at the code?

albanD · September 25, 2020, 2:58pm

Is the error just with your own assert that checks if things are on the GPU?

pytorching · September 25, 2020, 2:58pm

No. I just use the assert to verify but all the inputs and parameters are on cuda. So I’m confused.
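
One thing that may be worth checking, although nothing in this thread confirms it: the forward shown earlier returns predicted_interaction and correct_interaction after calling .cpu() on them. DataParallel collects each replica's outputs after forward returns, and a failing is_cuda assertion would be consistent with that collection step receiving CPU tensors. The sketch below is an assumption-laden illustration with a hypothetical TinyModel, not the author's code: forward keeps its outputs on the replica's device, and the move to CPU happens only after the wrapped call returns.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):  # hypothetical stand-in for Predictor
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x, targets):
        logits = self.linear(x)
        loss = self.loss_fn(logits, targets)
        # Outputs stay on the replica's own device; no .cpu() here.
        return loss.unsqueeze(0), logits.detach(), targets.detach()


use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
model = TinyModel().to(device)
if use_cuda:
    model = nn.DataParallel(model)

x = torch.randn(16, 8, device=device)
y = torch.randint(0, 2, (16,), device=device)

loss, logits, targets = model(x, y)
# Move to CPU only after the wrapped call has returned.
logits_cpu, targets_cpu = logits.cpu(), targets.cpu()
```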