Hi there,
Problem:
Unable to wrap the model inside DDP.
Note:
I’m using the gloo backend with DDP (the machines do not have GPUs).
Spec:
- 24-core CPU
- 32 GB RAM
- 512 GB SSD
I have already run multi-node training with DDP successfully using a basic CNN model class.
Now I’m getting the error “RuntimeError: Invalid scalar type” when I pass the model (GPT-Neo-1.3B) into the DDP wrapper.
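The model is loaded on the CPU before being handed to the class, roughly like this (a condensed sketch; the exact checkpoint name and loading call here are illustrative):

```python
from transformers import AutoModelForCausalLM

# GPT-Neo 1.3B loaded on the CPU (no GPU on these machines);
# the checkpoint name and loader shown here are illustrative
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
```

And this is the class where the error occurs: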
```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from transformers import Trainer, TrainingArguments

class Agent:
    def __init__(self, model, dataset: DataLoader):
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.global_rank = int(os.environ["RANK"])
        self.dataset = dataset
        self.model = DDP(model)  # <-- this is the line where I get the error

    def train(self):
        # re-train the model
        training_args = TrainingArguments(
            output_dir="./output", num_train_epochs=1, logging_steps=10,
            save_strategy="epoch", per_device_train_batch_size=2,
            per_device_eval_batch_size=2, warmup_steps=100,
            weight_decay=0.01, logging_dir="./logs")
        Trainer(model=self.model, args=training_args, train_dataset=self.dataset,
                data_collator=lambda data: {
                    # everything stays on the CPU
                    "input_ids": torch.stack([f[0] for f in data]).to("cpu"),
                    "attention_mask": torch.stack([f[1] for f in data]).to("cpu"),
                    "labels": torch.stack([f[0] for f in data]).to("cpu"),
                }).train()
```
Apart from the class above, I have initialized the DDP process group and wrapped the dataset in a DistributedSampler that is passed via the DataLoader's sampler parameter, roughly as in the sketch below.
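A condensed sketch of that setup (the TensorDataset is just a stand-in for my real tokenized dataset, and I assume a torchrun-style launch, which is what sets LOCAL_RANK and RANK):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# CPU-only machines, so the process group uses the gloo backend
dist.init_process_group(backend="gloo")

# stand-in for my real tokenized dataset: each item is an (input_ids, attention_mask) pair
dataset = TensorDataset(
    torch.randint(0, 50257, (64, 128)),     # input_ids
    torch.ones(64, 128, dtype=torch.long),  # attention_mask
)

# DistributedSampler splits the data across ranks; it goes into the DataLoader's sampler parameter
loader = DataLoader(dataset, batch_size=2, sampler=DistributedSampler(dataset))

agent = Agent(model, loader)  # model loaded as shown above
agent.train()
```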
Kindly help me out, thanks in advance.