DDP implementation in LLM

Hey guys,
how can I use DDP to make my two GPUs work together in parallel to answer questions with an LLM like Mistral-7B?

This may be relevant: How to inference under DDP

I am a little new to this topic. What I want is to run my model without splitting it up between my GPUs; instead, I want them to "work together" in parallel to create the answer faster. Does this make sense? What is the correct way to achieve this?

This is my code to simply run Mistral-7B.
How can I make my two GPUs (Tesla M40 24GB) act like one large GPU so the output is generated faster?

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

local_model_path = "/media/data/AI/models/mistral-7B"
# Load the tokenizer and model from the local folder
tokenizer = AutoTokenizer.from_pretrained(local_model_path)
model = AutoModelForCausalLM.from_pretrained(local_model_path, torch_dtype=torch.float16)
# Ensure the model is in evaluation mode
model.eval()

class TransformersLLM:
    def __init__(self, model, tokenizer, device):
        self.model = model.to(device)  # Move model to the specified device
        self.tokenizer = tokenizer
        self.device = device

    def invoke(self, prompt):
        # Tokenize the input prompt and move tensors to the appropriate device
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        
        # Generate the output using the model
        with torch.no_grad():  # Disable gradient calculations to save memory
            outputs = self.model.generate(**inputs, max_length=4096)
        
        # Decode the generated output to a string
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Initialize the custom LLM class on the GPU
llm = TransformersLLM(model, tokenizer, "cuda")

print(llm.invoke("Who was Marie Curie?"))

Yes, DDP replicates your model rather than sharding it. The data is sharded instead, so the forward/backward passes can run in parallel on each replica.
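
For inference you do not even need the DistributedDataParallel wrapper (it only adds gradient synchronization for training): you can launch one process per GPU, load a full copy of the model in each, and split your prompts across the processes. Below is a minimal sketch of that data-parallel pattern, assuming you have a batch of prompts to split; the prompt list, max_new_tokens value, and script name are illustrative. Launch it with torchrun --nproc_per_node=2 ddp_inference.py. Note this speeds up a batch of prompts by handling different ones on each GPU; it does not make a single answer generate faster.

import os
import torch
import torch.distributed as dist
from transformers import AutoTokenizer, AutoModelForCausalLM

local_model_path = "/media/data/AI/models/mistral-7B"

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{int(os.environ['LOCAL_RANK'])}")
    torch.cuda.set_device(device)

    # Each process loads its own full copy of the model
    # (~14 GB in fp16, which fits on a 24 GB M40)
    tokenizer = AutoTokenizer.from_pretrained(local_model_path)
    model = AutoModelForCausalLM.from_pretrained(
        local_model_path, torch_dtype=torch.float16
    ).to(device)
    model.eval()

    # Illustrative workload: each rank takes every world_size-th prompt
    prompts = [
        "Who was Marie Curie?",
        "Who was Albert Einstein?",
        "What is the capital of Australia?",
        "Explain photosynthesis in one paragraph.",
    ]
    my_prompts = prompts[rank::world_size]

    for prompt in my_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=256)
        print(f"[rank {rank}] {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Each process prints its own answers; if you need all results gathered on one rank, dist.all_gather_object can collect the decoded strings across processes.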