DDP training fails with CUDA device-side assert

I used this command to start distributed training: torchrun --standalone --nnodes=1 --nproc-per-node=2 train.py

Here are the first few lines of the error message:

master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2023-04-15 22:17:19,983 WARNING: Cuda is available!
2023-04-15 22:17:19,985 WARNING: Cuda is available!
2023-04-15 22:17:20,059 WARNING: Found 10 GPUs!
2023-04-15 22:17:20,060 WARNING: Found 10 GPUs!
Size of vocab: 24939
Model parameters #: 1214594
Size of vocab: 24939
Model parameters #: 1214594
/opt/conda/conda-bld/pytorch_1678402374358/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [12,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1678402374358/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [12,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1678402374358/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [12,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same `srcIndex < srcSelectDimSize` assertion repeats for many more threads in blocks [12,0,0] and [87,0,0] ...]
/opt/conda/conda-bld/pytorch_1678402374358/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [87,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

And here are the last few lines:

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3295682 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3295680) of binary: /home/stu11/s11/wz1937/miniconda3/envs/learn/bin/python
Traceback (most recent call last):
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
learn_ddp.py FAILED
------------------------------------------------------------

An indexing operation failed. Try to rerun your code with CUDA_LAUNCH_BLOCKING=1 and check which operation failed in the stacktrace. Once the failing layer or operation is isolated check the indexing tensor and make sure all values are valid.
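For reference, a blocking rerun with the same launch command could look like this (only adding the environment variable mentioned above):

CUDA_LAUNCH_BLOCKING=1 torchrun --standalone --nnodes=1 --nproc-per-node=2 learn_ddp.py

With synchronous kernel launches the Python stack trace points at the operation that actually failed, rather than at a later unrelated call such as the cuDNN error above.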

I can see that the error is in the embedding, but I'm not sure how to fix it… Theoretically, DDP training shouldn't require changing the model from the single-GPU setup, right?
Here is the stack-trace:

Traceback (most recent call last):
  File "/home/stu11/s11/wz1937/torch DDP example/learn_ddp.py", line 239, in <module>
    train(local_rank, to_map_style_dataset(train_data_iter), to_map_style_dataset(eval_data_iter), model, optimizer, num_epoch=10, log_step_interval=20, save_step_interval=500, eval_step_interval=300, save_path="./logs_imdb_text_classification", resume=resume)
  File "/home/stu11/s11/wz1937/torch DDP example/learn_ddp.py", line 159, in train
    logits = model(token_index) # FIXME
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stu11/s11/wz1937/torch DDP example/learn_ddp.py", line 58, in forward
    word_embedding = self.embedding_table(word_index) # [bs, max_seq_len, embedding_dim]
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/stu11/s11/wz1937/miniconda3/envs/learn/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

This is correct, but if you search this forum for the same error you will find similar topics where something (e.g. in the data loading) changed and caused the issue without the user realizing it.

The nn.Embedding layer expects inputs containing values in [0, num_embeddings-1]. Check the word_index and its min/max values and make sure these are in the valid range.
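For example, a quick debugging check right before the embedding lookup could look like this (the names word_index and embedding_table are taken from the stack trace above; this is just a sketch):

# inside forward, just before self.embedding_table(word_index)
print("word_index min:", word_index.min().item(),
      "max:", word_index.max().item(),
      "num_embeddings:", self.embedding_table.num_embeddings)
assert word_index.max().item() < self.embedding_table.num_embeddings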

I checked the size of the embedding, and it looks fine to me: num_embeddings is well above what should be required, so that shouldn't be the issue.
Another thing to note is that I have been trying this code since yesterday afternoon, and it has run successfully twice (out of roughly ~200 attempts). Do you know why that would happen?
Lastly, is it OK if I paste my code here?
Many thanks!

# newer command: torchrun --standalone --nnodes=1 --nproc-per-node=2 learn_ddp.py

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext
import torch.distributed
import torch.utils.data
import torch.utils.data.distributed
from torchtext.datasets import IMDB
from torchtext.datasets.imdb import NUM_LINES
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset

import sys
import os
import logging
logging.basicConfig(
    level=logging.WARNING,
    stream=sys.stdout,
    format="%(asctime)s %(levelname)s: %(message)s",
)

VOCAB_SIZE = 15000

class GCNN(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, embedding_dim=64, num_class=2):
        super().__init__()

        self.embedding_table = nn.Embedding(vocab_size, embedding_dim)
        nn.init.xavier_uniform_(self.embedding_table.weight)

        self.conv_A_1 = nn.Conv1d(embedding_dim, 64, 15, stride=7)
        self.conv_B_1 = nn.Conv1d(embedding_dim, 64, 15, stride=7)

        self.conv_A_2 = nn.Conv1d(64, 64, 15, stride=7)
        self.conv_B_2 = nn.Conv1d(64, 64, 15, stride=7)

        self.output_linear1 = nn.Linear(64, 128)
        self.output_linear2 = nn.Linear(128, num_class)

    def forward(self, word_index):
        # define the GCNN forward pass: output logits based on the input word_index

        # 1. get word_embedding from word_index
        # word_index shape:[bs, max_seq_len]
        word_embedding = self.embedding_table(word_index) # [bs, max_seq_len, embedding_dim] FIXME

        # 2. first layer Conv1d
        word_embedding = word_embedding.transpose(1, 2) # [bs, embedding_dim, max_seq_len]
        A = self.conv_A_1(word_embedding)
        B = self.conv_B_1(word_embedding)
        H = A * torch.sigmoid(B) # [bs, 64, max_seq_len]

        A = self.conv_A_2(H)
        B = self.conv_B_2(H)
        H = A * torch.sigmoid(B) # [bs, 64, max_seq_len]

        # 3. pooling and linear
        pool_output = torch.mean(H, dim=-1) # average pooling over the sequence dim, gives [bs, 64]
        linear1_output = self.output_linear1(pool_output)
        logits = self.output_linear2(linear1_output) # [bs, 2]

        return logits


class TextClassificationModel(nn.Module):
    """ simple embeddingbag+DNN model """

    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=64, num_class=2):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, token_index):
        embedded = self.embedding(token_index) # shape: [bs, embedding_dim]
        return self.fc(embedded)

# step2 IMDB DataLoader

BATCH_SIZE = 64

def yield_tokens(train_data_iter, tokenizer):
    for i, sample in enumerate(train_data_iter):
        label, comment = sample
        yield tokenizer(comment)


def collate_fn(batch):
    """post processing for DataLoader minibatch"""
    target = []
    token_index = []
    max_length = 0
    for i, (label, comment) in enumerate(batch):
        tokens = tokenizer(comment)

        token_index.append(vocab(tokens))
        if len(tokens) > max_length:
            max_length = len(tokens)

        if label == "pos":
            target.append(0)
        else:
            target.append(1)

    token_index = [index + [0]*(max_length-len(index)) for index in token_index]
    return (torch.tensor(target).to(torch.int64), torch.tensor(token_index).to(torch.int32))


# step3 
def train(local_rank, train_dataset, eval_dataset, model, optimizer, num_epoch, log_step_interval, save_step_interval, eval_step_interval, save_path, resume=""):
    """ dataloader as map-style dataset """
    start_epoch = 0
    start_step = 0
    if resume != "":
        #  loading from checkpoint
        logging.warning(f"loading from {resume}")
        checkpoint = torch.load(resume, map_location=torch.device("cuda:0")) # cpu,cuda,cuda:index
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        start_epoch = checkpoint['epoch']
        start_step = checkpoint['step']

    # model = nn.parallel.DistributedDataParallel(model.cuda(local_rank), device_ids=[local_rank])
    model = model.cuda(local_rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])


    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn, sampler=train_sampler)
    eval_data_loader = torch.utils.data.DataLoader(eval_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for epoch_index in range(start_epoch, num_epoch):
        ema_loss = 0.
        num_batches = len(train_data_loader)

        train_sampler.set_epoch(epoch_index) # reshuffle the data differently on each epoch for every GPU

        for batch_index, (target, token_index) in enumerate(train_data_loader):
            optimizer.zero_grad()
            step = num_batches*(epoch_index) + batch_index + 1


            # token_index = token_index.cuda(local_rank) 
            target = target.cuda(local_rank)

            print(f"-----{token_index.shape}----")
            logits = model(token_index) # FIXME
            logging.error("passed this point")
            
            bce_loss = F.binary_cross_entropy(torch.sigmoid(logits), F.one_hot(target, num_classes=2).to(torch.float32))
            ema_loss = 0.9*ema_loss + 0.1*bce_loss
            bce_loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 0.1)
            optimizer.step()

            if step % log_step_interval == 0:
                logging.warning(f"epoch_index: {epoch_index}, batch_index: {batch_index}, ema_loss: {ema_loss.item()}")

            if step % save_step_interval == 0 and local_rank == 0:
                os.makedirs(save_path, exist_ok=True)
                save_file = os.path.join(save_path, f"step_{step}.pt")
                torch.save({
                    'epoch': epoch_index,
                    'step': step,
                    'model_state_dict': model.module.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'loss': bce_loss,
                }, save_file)
                logging.warning(f"checkpoint has been saved in {save_file}")

            if step % eval_step_interval == 0: # validation
                logging.warning("start to do evaluation...")
                model.eval()
                ema_eval_loss = 0
                total_acc_account = 0
                total_account = 0
                for eval_batch_index, (eval_target, eval_token_index) in enumerate(eval_data_loader):
                    total_account += eval_target.shape[0]
                    eval_logits = model(eval_token_index)
                    eval_target = eval_target.cuda(local_rank)
                    total_acc_account += (torch.argmax(eval_logits, dim=-1) == eval_target).sum().item()
                    eval_bce_loss = F.binary_cross_entropy(torch.sigmoid(eval_logits), F.one_hot(eval_target, num_classes=2).to(torch.float32))
                    ema_eval_loss = 0.9*ema_eval_loss + 0.1*eval_bce_loss
                acc = total_acc_account/total_account

                logging.warning(f"eval_ema_loss: {ema_eval_loss.item()}, eval_acc: {acc.item()}")
                model.train()

# step4 testing
if __name__ == "__main__":
    
    local_rank = int(os.environ['LOCAL_RANK'])
    local_rank = local_rank % torch.cuda.device_count()
    # local_rank = torch.distributed.get_rank()

    if torch.cuda.is_available():
        logging.warning("Cuda is available!")
        if torch.cuda.device_count() > 1:
            logging.warning(f"Found {torch.cuda.device_count()} GPUs!")
        else:
            logging.warning("Too few GPU!")
            exit()
    else:
        logging.warning("Cuda is not available! Exit!")
        exit()

    torch.distributed.init_process_group("nccl")

    train_data_iter = IMDB(root='../data', split='train') 
    tokenizer = get_tokenizer("basic_english")
    vocab = build_vocab_from_iterator(yield_tokens(train_data_iter, tokenizer), min_freq=20, specials=["<unk>"])
    vocab.set_default_index(0)
    print(f"Size of vocab: {len(vocab)}")
    
    model = GCNN()
    #  model = TextClassificationModel()
    print("Model parameters #:", sum(p.numel() for p in model.parameters()))
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # train_data_loader = torch.utils.data.DataLoader(to_map_style_dataset(train_data_iter), batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=False)


    eval_data_iter = IMDB(root='../data', split='test') 
    # eval_data_loader = torch.utils.data.DataLoader(to_map_style_dataset(eval_data_iter), batch_size=BATCH_SIZE, collate_fn=collate_fn)
    resume = ""

    train(local_rank, to_map_style_dataset(train_data_iter), to_map_style_dataset(eval_data_iter), model, optimizer, num_epoch=10, log_step_interval=20, save_step_interval=500, eval_step_interval=300, save_path="./logs_imdb_text_classification", resume=resume)

    torch.distributed.destroy_process_group()

Alright, I figured it out. You are right about num_embeddings: somehow the input values exceed the vocab size I had set previously for single-GPU training.
Thank you.
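For anyone who hits the same assert: the mismatch here was between the hard-coded VOCAB_SIZE = 15000 used to size nn.Embedding and the actual vocabulary of 24939 entries, so some token indices were out of range. A minimal fix, assuming the vocab is built before the model is constructed (as in the script above), is to size the embedding from the vocab itself:

model = GCNN(vocab_size=len(vocab))  # 24939 here, instead of the hard-coded VOCAB_SIZE = 15000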