hi @ptrblck
Could you help me with a similar issue?
I am trying to train a T5 encoder (fine-tuned on QA) along with a full T5 model (encoder + decoder, fine-tuned on summarization), using a mixed objective: sequence (summary) generation plus a binary prediction of whether the summary contains the answer. When I pass the 0/1 target class through the DataLoader, I think the model tries to embed it, and I see this error:
<ipython-input-27-c1d41cd4adff> in validation_step(self, batch, batch_idx)
     89             question_input_ids=question_input_ids,
     90             question_attention_mask=question_attention_mask,
---> 91             question_labels=question_labels.to(torch.float64)
     92         )
     93

RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.DoubleTensor instead (while checking arguments for embedding)
Here, question_labels are the 0/1 target labels.
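If it helps, I can reproduce the same error in isolation; a minimal sketch (separate from my model) showing that an embedding lookup only accepts Long/Int indices, which is why the float64 cast fails:

    import torch
    import torch.nn as nn

    emb = nn.Embedding(10, 4)
    idx = torch.tensor([0, 1])      # int64 indices
    emb(idx)                        # works
    emb(idx.to(torch.float64))     # RuntimeError: Expected tensor for argument #1 'indices' ...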
Model →
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration

class CrossAttentionSummarizer(pl.LightningModule):
    def __init__(self):
        super(CrossAttentionSummarizer, self).__init__()
        # MODEL_NAME is defined earlier in the notebook (a t5-base checkpoint,
        # given embed_dim=768 below)
        self.summarizer_model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)
        self.qa_encoder = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)
        self.multihead_attn = nn.MultiheadAttention(embed_dim=768, num_heads=4, batch_first=True)
        self.linear1 = nn.Linear(512 * 768, 512)
        self.linear2 = nn.Linear(512, 1, bias=False)
        self.sigmoid = nn.Sigmoid()
        self.bce_loss = nn.BCELoss()
    def forward(self, question_input_ids, question_attention_mask, question_labels,
                input_ids, attention_mask, decoder_attention_mask, labels=None):
        summarizer_output = self.summarizer_model(
            input_ids,
            attention_mask=attention_mask,
            labels=labels,
            decoder_attention_mask=decoder_attention_mask
        )
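        # NOTE: question_labels is passed as the third positional argument
        # below. In T5ForConditionalGeneration.forward the third parameter is
        # decoder_input_ids, so I suspect my float64 labels end up in the
        # decoder's embedding lookup, which is where the traceback points.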
        qa_output = self.qa_encoder(
            question_input_ids,
            question_attention_mask,
            question_labels
        )
        decoder_output = summarizer_output[3]
        encoder_output = qa_output[2]
        multi_attn_output, multi_attn_output_weights = self.multihead_attn(
            decoder_output, encoder_output, encoder_output
        )
        lin_output = self.linear1(multi_attn_output.reshape(-1, 512 * 768))
        cls_outputs = self.linear2(lin_output)
        cls_preds = self.sigmoid(cls_outputs)
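        # cls_preds is float32 with shape (batch, 1); question_labels is
        # float64 and, I believe, shape (batch,), so dtype/shape may also
        # mismatch here for BCELoss.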
        cls_pred_loss = self.bce_loss(cls_preds, question_labels)
        return summarizer_output.loss, summarizer_output.logits, cls_pred_loss, cls_preds
    def training_step(self, batch, batch_idx):
        input_ids = batch["text_input_ids"]
        attention_mask = batch["text_attention_mask"]
        labels = batch["labels"]
        labels_attention_mask = batch["labels_attention_mask"]
        question_input_ids = batch["question_input_ids"]
        question_attention_mask = batch["question_attention_mask"]
        question_labels = batch["question_labels"]
        loss, outputs, cls_pred_loss, cls_pred = self(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=labels_attention_mask,
            labels=labels,
            question_input_ids=question_input_ids,
            question_attention_mask=question_attention_mask,
            question_labels=question_labels.to(torch.float64)
        )
        self.log("train_loss", loss, prog_bar=True, logger=True)
        self.log("train_pred_loss", cls_pred_loss, prog_bar=True, logger=True)
        return loss, cls_pred_loss
    def validation_step(self, batch, batch_idx):
        input_ids = batch["text_input_ids"]
        attention_mask = batch["text_attention_mask"]
        labels = batch["labels"]
        labels_attention_mask = batch["labels_attention_mask"]
        question_input_ids = batch["question_input_ids"]
        question_attention_mask = batch["question_attention_mask"]
        question_labels = batch["question_labels"]
        loss, outputs, cls_pred_loss, cls_pred = self(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=labels_attention_mask,
            labels=labels,
            question_input_ids=question_input_ids,
            question_attention_mask=question_attention_mask,
            question_labels=question_labels.to(torch.float64)
        )
        self.log("val_loss", loss, prog_bar=True, logger=True)
        self.log("val_pred_loss", cls_pred_loss, prog_bar=True, logger=True)
        return loss, cls_pred_loss
    def test_step(self, batch, batch_idx):
        input_ids = batch["text_input_ids"]
        attention_mask = batch["text_attention_mask"]
        labels = batch["labels"]
        labels_attention_mask = batch["labels_attention_mask"]
        question_input_ids = batch["question_input_ids"]
        question_attention_mask = batch["question_attention_mask"]
        question_labels = batch["question_labels"]
        loss, outputs, cls_pred_loss, cls_pred = self(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=labels_attention_mask,
            labels=labels,
            question_input_ids=question_input_ids,
            question_attention_mask=question_attention_mask,
            question_labels=question_labels.to(torch.float64)
        )
        self.log("test_loss", loss, prog_bar=True, logger=True)
        self.log("test_pred_loss", cls_pred_loss, prog_bar=True, logger=True)
        return loss, cls_pred_loss
    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=0.0001)
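For reference, this is the change I was planning to try: calling only the QA model's encoder with keyword arguments (so the labels never go through the T5 call at all) and casting the labels to float32 just for the BCE loss. The .encoder attribute, the .float() cast, and the .unsqueeze(1) reshape are my assumptions, not something from the error message. Does this look right?

    # inside forward(), replacing the current qa_encoder call (my assumption):
    qa_output = self.qa_encoder.encoder(
        input_ids=question_input_ids,
        attention_mask=question_attention_mask
    )
    encoder_output = qa_output.last_hidden_state   # (batch, seq_len, 768)

    # classification loss with float32 targets shaped like cls_preds:
    cls_pred_loss = self.bce_loss(cls_preds, question_labels.float().unsqueeze(1))

    # and in the *_step methods, pass the labels through unchanged:
    # question_labels=question_labels   (drop the .to(torch.float64))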