Hi there,
Problem:
Unable to wrap the model inside DDP.
Note:
I’m using the gloo backend with DDP (the machines do not have GPUs).
Spec:
- 24-core CPU
- 32 GB RAM
- 512 GB SSD
I have already run multi-node training with DDP successfully using a basic CNN model class.
Now I’m getting the error “RuntimeError: Invalid scalar type” when I pass the model (GPT-Neo-1.3B) into the DDP wrapper.
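The model is loaded on the CPU before being handed to the class, roughly like this (a condensed sketch; the exact checkpoint name and loading call here are illustrative):

```python
from transformers import AutoModelForCausalLM

# GPT-Neo 1.3B loaded on the CPU (no GPU on these machines);
# the checkpoint name and loader shown here are illustrative
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
```

And this is the class where the error occurs: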
```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from transformers import Trainer, TrainingArguments

class Agent:
    def __init__(self, model, dataset: DataLoader):
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.global_rank = int(os.environ["RANK"])
        self.dataset = dataset
        self.model = DDP(model)  # <-- this is the line where I get the error

    def train(self):
        # re-train the model
        training_args = TrainingArguments(
            output_dir="./output", num_train_epochs=1, logging_steps=10,
            save_strategy="epoch", per_device_train_batch_size=2,
            per_device_eval_batch_size=2, warmup_steps=100,
            weight_decay=0.01, logging_dir="./logs")
        Trainer(model=self.model, args=training_args, train_dataset=self.dataset,
                data_collator=lambda data: {
                    # everything stays on the CPU
                    "input_ids": torch.stack([f[0] for f in data]).to("cpu"),
                    "attention_mask": torch.stack([f[1] for f in data]).to("cpu"),
                    "labels": torch.stack([f[0] for f in data]).to("cpu"),
                }).train()
```
Apart from the class above, I have initialized the DDP process group and wrapped the dataset in a DistributedSampler that is passed via the DataLoader's sampler parameter, roughly as in the sketch below.
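A condensed sketch of that setup (the TensorDataset is just a stand-in for my real tokenized dataset, and I assume a torchrun-style launch, which is what sets LOCAL_RANK and RANK):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# CPU-only machines, so the process group uses the gloo backend
dist.init_process_group(backend="gloo")

# stand-in for my real tokenized dataset: each item is an (input_ids, attention_mask) pair
dataset = TensorDataset(
    torch.randint(0, 50257, (64, 128)),     # input_ids
    torch.ones(64, 128, dtype=torch.long),  # attention_mask
)

# DistributedSampler splits the data across ranks; it goes into the DataLoader's sampler parameter
loader = DataLoader(dataset, batch_size=2, sampler=DistributedSampler(dataset))

agent = Agent(model, loader)  # model loaded as shown above
agent.train()
```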
Kindly help me out, thanks in advance.