My second gpu "Tesla V100-PCIE-32GB" disappears after running the code of transformers or after some time of work

My second GPU disappears without any apparent reason, sometimes right after running the code and sometimes after it has been working for a while.

My first GPU is a GP107GL [Quadro P1000] and the second one is a Tesla V100-PCIE-32GB. I am using the first one to drive my screens.

The GPU appears again after I restart the machine.

My OS is Ubuntu 20.04.5 LTS

Is there any way to solve this problem?

The code I am running is the following, in case it helps:

# %%
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# %%
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# %%
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# %%
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# %%

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

# %%
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

# %%

# for batch in train_dataloader:
#     break
batch = next(iter(train_dataloader))
{k: v.shape for k, v in batch.items()}

# %%

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# %% 

outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

# %%
from transformers import AdamW
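# (note: transformers.AdamW is deprecated in newer transformers releases;
#  torch.optim.AdamW is the usual drop-in replacement if this import fails)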

optimizer = AdamW(model.parameters(), lr=5e-5)
# %%

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
# %%

import torch

print([(i, torch.cuda.get_device_properties(i)) for i in range(torch.cuda.device_count())])
# num_of_gpus = torch.cuda.device_count()
# print("The Number of the GPUs are: ", num_of_gpus)

# print("Current GPU", torch.cuda.current_device())

# torch.cuda.device(2)
# torch.cuda.set_device(0)
# print("New Selected GPU", torch.cuda.current_device())

# %%
import os 

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# %%
torch.cuda.set_device(0)
# device = torch.device("cuda") if torch.cuda.is_available() else torch.device('cpu')
# print(device)
print(torch.cuda.get_arch_list())
print(torch.cuda.get_device_properties("cuda:0"))
# torch.cuda.get_device_properties()
print("New Selected GPU", torch.cuda.current_device())

device = "cuda:0"
# %%

model.to(device)

# %%
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    print("Epoch: " , epoch)
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

# %%
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

The issue sounds more like a system/driver/hardware problem and I doubt it’s related to PyTorch.
I would thus recommend running a few simple CUDA tests and checking if the GPU also drops.
If so, check whether any Xids are shown in dmesg, which could indicate why it’s failing.
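For example, a minimal soak test along these lines keeps the V100 busy so you can see whether it drops off the bus again (the matrix size, loop length, and device index are arbitrary placeholders; point the index at the V100 on your system):

import torch

device = torch.device("cuda:0")  # adjust the index so it targets the V100
x = torch.randn(8192, 8192, device=device)

for step in range(1000):  # run long enough to reproduce the failure
    x = x @ x
    x = x / x.norm()  # keep the values bounded so the loop can run indefinitely
    if step % 100 == 0:
        torch.cuda.synchronize(device)
        print("step", step, "still alive")

While (or right after) it runs, checking the kernel log with sudo dmesg | grep -i xid should show any Xid errors the driver reported.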

It doesn’t produce any message; the GPU just disappears.

Anyway, I ran the dmesg command; please see if you can get anything from it. Thanks a lot.

The output is here:

https://drive.google.com/file/d/1BGmCMz3oS-cM3W_kwb9KsoF9sn-U6wlI/view?usp=share_link

The error is:

[ 1361.908162] NVRM: GPU at PCI:0000:d8:00: GPU-d1a5f877-65cb-a62e-4192-ae05bb68fc48
[ 1361.908175] NVRM: GPU Board Serial Number: 1560121001476
[ 1361.908178] NVRM: Xid (PCI:0000:d8:00): 79, pid=0, GPU has fallen off the bus.
[ 1361.908186] NVRM: GPU 0000:d8:00.0: GPU has fallen off the bus.
[ 1361.908191] NVRM: GPU 0000:d8:00.0: GPU serial number is 1560121001476.
[ 1361.908210] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

Based on this table, Xid 79 could be caused by one of the following:

  • HW error
  • Driver issue
  • System Memory Corruption
  • Bus Error
  • Thermal Issue

A while ago a user was seeing the same issue and realized that the power cable wasn’t properly plugged into the GPU, which caused the same Xid, so you might want to start with this.

I will try this and get back to you. Thanks a lot.

Hi @ptrblck

I have tried the solution and made sure the power cable is plugged in properly, but the GPU is disappearing again.

Hi @ptrblck

Thanks for your help. Do you have any other tips to solve this problem? Thanks a lot.

No, unfortunately I don’t have any other advice besides trying to narrow down the potential root causes listed above.

Hi,

Thanks a lot for your help and support. I tracked the issue down and it seems to come from overheating.
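For reference, a minimal sketch of this kind of check, which polls nvidia-smi while the training loop runs (assuming nvidia-smi is on the PATH; the query fields and one-second interval are arbitrary choices):

import subprocess
import time

# print index, name, and temperature of every visible GPU once per second
while True:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,temperature.gpu",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(out.stdout.strip())
    time.sleep(1)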

How can I cool my GPU?

Your GPU should already have a fan, so make sure to leave enough space for air circulation. I’m sure you can find a lot of helpful blog posts discussing good thermal performance in workstations, which cases work best, etc.

To be honest, it does not have a fan. Anyway, I have added two fans for cooling and it works like a monster.

Thanks a lot for your help and support.

I assume you did not use a server with active cooling but plugged the GPU into your workstation without any airflow?

Yes, that was the problem. It is a workstation.

OK, good you’ve isolated the issue.
Btw. I mistakenly assumed you were using a Titan V when I claimed it has a fan, but now realize you are using a V100.

Thanks a lot for the help, it was very useful for me.