I am running into a problem when training a neural network with torch DistributedDataParallel: the system crashes and immediately reboots when I train on 4 GPUs. Interestingly, the problem does not occur when I train on 3 GPUs. The solutions offered in other posts on this forum about similar problems did not help either.
Here are some further details about my setup that make the problem even more puzzling:
- I created a fresh PyTorch environment with Python 3.11.3; the PyTorch versions I tried are 2.0.1 and 1.12.1+cu116. The same crash also occurred with older PyTorch and Python versions.
>>> print(torch.__version__)
2.0.1
>>> print(torch.__version__)
1.12.1+cu116
...
...
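In case it is relevant, this is how I can collect the remaining version details from the same environment (just a small sketch; torch.version.cuda and torch.cuda.nccl.version() report the CUDA and NCCL versions PyTorch was built with):

import torch

print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)    # CUDA toolkit version PyTorch was built against
print("NCCL:", torch.cuda.nccl.version())     # bundled NCCL version (tuple)
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))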
- I used the following torch DistributedDataParallel example code:
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(784, 10000)
        self.fc2 = nn.Linear(10000, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.fc2(x)
        return x


def train(rank, num_gpus):
    dist.init_process_group(
        backend="nccl", init_method="env://", world_size=num_gpus, rank=rank
    )
    torch.cuda.set_device(rank)
    model = SimpleNet().to(rank)
    ddp_model = DistributedDataParallel(model, device_ids=[rank])
    print("Rank ", rank, ", Model Created")
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
    )
    train_set = datasets.MNIST("./data", download=True, train=True, transform=transform)
    train_sampler = DistributedSampler(
        dataset=train_set, num_replicas=num_gpus, rank=rank
    )
    train_loader = DataLoader(
        dataset=train_set,
        batch_size=512,
        shuffle=False,
        num_workers=1,
        pin_memory=True,
        sampler=train_sampler,
    )
    criterion = nn.CrossEntropyLoss().to(rank)
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
    for epoch in range(100):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs = inputs.to(rank)
            labels = labels.to(rank)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print("Rank ", rank, ", Epoch ", epoch, ", Loss: ", running_loss)


def main():
    num_gpus = 4
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    mp.spawn(train, args=(num_gpus,), nprocs=num_gpus, join=True)


if __name__ == "__main__":
    main()
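Since the machine reboots before any error can be printed, one thing I can do is enable NCCL's own debug logging and write it to per-rank files so that at least some output may survive the reboot. This is only a sketch of what I would add to main() (NCCL_DEBUG, NCCL_DEBUG_SUBSYS and NCCL_DEBUG_FILE are standard NCCL environment variables; the log path is just an example):

def main():
    num_gpus = 4
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # Verbose NCCL logging, one file per process (%h = hostname, %p = pid),
    # so the output is on disk even if the machine reboots mid-run.
    os.environ["NCCL_DEBUG"] = "INFO"
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P,NET"
    os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl_debug.%h.%p.log"
    mp.spawn(train, args=(num_gpus,), nprocs=num_gpus, join=True)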
- The output of nvidia-smi can be seen in the attached image.
- The training runs smoothly on any three of the GPUs (the order does not matter: 0,1,2 or 1,2,3 or 0,2,3 all work). I can even fill the VRAM completely, after which, as expected, a CUDA out of memory error is thrown.
- If I train with four GPUs, everything works up to roughly 1.2 GB per GPU. Once that limit is exceeded, the computer crashes and reboots before any error message can be displayed.
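For clarity, this is the kind of device selection I mean for the three-GPU runs (just an illustration using CUDA_VISIBLE_DEVICES, set before any CUDA call, with num_gpus lowered to match; nothing else in the script changes):

# Example: expose only GPUs 1, 2 and 3; must be set before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"
num_gpus = 3  # world_size and nprocs are lowered accordingly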
- If I increase the batch size on 4 GPUs to the point where the memory should be exhausted, I can watch all four GPUs fill their VRAM up to 12 GB until the CUDA out of memory error is thrown. This leads me to believe that the issue is related to GPU-to-GPU communication and the NCCL backend rather than to memory usage itself.
- If I switch from the “nccl” backend to the “gloo” backend, the program works flawlessly on all GPUs.
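As far as I understand, gloo does not rely on direct GPU peer-to-peer transfers the way NCCL does, so I also want to check what PyTorch reports about P2P access between the four cards. A minimal probe (just a sketch, not part of the training script):

import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")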
- The result of my NCCL tests is as follows:
(base) ➜ nccl-tests git:(master) mpirun -np 1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
Invalid MIT-MAGIC-COOKIE-1 key
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
# Using devices
# Rank 0 Group 0 Pid 412103 on kong device 0 [0x21] NVIDIA GeForce RTX 3080 Ti
# Rank 1 Group 0 Pid 412103 on kong device 1 [0x22] NVIDIA GeForce RTX 3080 Ti
# Rank 2 Group 0 Pid 412103 on kong device 2 [0x41] NVIDIA GeForce RTX 3080 Ti
# Rank 3 Group 0 Pid 412103 on kong device 3 [0x43] NVIDIA GeForce RTX 3080 Ti
#
#                                        out-of-place                      in-place
#       size      count   type  redop  root    time  algbw  busbw  #wrong    time  algbw  busbw  #wrong
#        (B) (elements)                          (us) (GB/s) (GB/s)            (us) (GB/s) (GB/s)
8 2 float sum -1 4318.1 0.00 0.00 0 126.5 0.00 0.00 0
16 4 float sum -1 127.6 0.00 0.00 0 340.7 0.00 0.00 0
32 8 float sum -1 125.1 0.00 0.00 0 3120.3 0.00 0.00 0
64 16 float sum -1 129.0 0.00 0.00 0 3220.1 0.00 0.00 0
128 32 float sum -1 129.9 0.00 0.00 0 131.3 0.00 0.00 0
256 64 float sum -1 129.6 0.00 0.00 0 128.4 0.00 0.00 0
512 128 float sum -1 128.3 0.00 0.01 0 130.0 0.00 0.01 0
1024 256 float sum -1 130.1 0.01 0.01 0 129.8 0.01 0.01 0
2048 512 float sum -1 2767.1 0.00 0.00 0 126.9 0.02 0.02 0
4096 1024 float sum -1 131.0 0.03 0.05 0 20.06 0.20 0.31 0
8192 2048 float sum -1 136.5 0.06 0.09 0 133.2 0.06 0.09 0
16384 4096 float sum -1 355.7 0.05 0.07 0 134.8 0.12 0.18 0
32768 8192 float sum -1 140.5 0.23 0.35 0 143.8 0.23 0.34 0
65536 16384 float sum -1 164.0 0.40 0.60 0 159.3 0.41 0.62 0
131072 32768 float sum -1 3877.2 0.03 0.05 0 4068.1 0.03 0.05 0
262144 65536 float sum -1 357.9 0.73 1.10 0 345.7 0.76 1.14 0
524288 131072 float sum -1 596.6 0.88 1.32 0 580.4 0.90 1.35 0
1048576 262144 float sum -1 784.5 1.34 2.00 0 837.0 1.25 1.88 0
2097152 524288 float sum -1 5550.3 0.38 0.57 0 6989.3 0.30 0.45 0
4194304 1048576 float sum -1 4156.0 1.01 1.51 0 8842.0 0.47 0.71 0
8388608 2097152 float sum -1 8536.3 0.98 1.47 0 8681.9 0.97 1.45 0
16777216 4194304 float sum -1 16639 1.01 1.51 0 15646 1.07 1.61 0
33554432 8388608 float sum -1 28481 1.18 1.77 0 29903 1.12 1.68 0
67108864 16777216 float sum -1 57158 1.17 1.76 0 59731 1.12 1.69 0
134217728 33554432 float sum -1 123551 1.09 1.63 0 117790 1.14 1.71 0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.623606
- Running GPU benchmarks such as gpu-burn on all 4 GPUs, with up to 11 GB of VRAM in use and all cards at 100% power draw, works perfectly fine.
- Disabling the IOMMU, following the "PCI Access Control Services (ACS)" instructions, didn't help either.
- A colleague of mine, using the same code in a fresh PyTorch environment and the same GPUs (4x RTX 3080 Ti) but a different motherboard, does not encounter any of these problems; the code runs without issues.
- Both CPU and RAM usage stay within normal limits throughout, so they’re not the culprits.
- Changing the master port didn’t affect the problem.
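In case it helps narrow things down, this is the stripped-down, allreduce-only repro I plan to try next, so that the model and the dataloader are out of the picture. The NCCL_P2P_DISABLE toggle (commented out) is only an assumption on my part; it is a standard NCCL environment variable that forces traffic through host memory, and I have not yet verified whether it avoids the crash:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    dist.init_process_group("nccl", init_method="env://", world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)
    # Roughly 1 GB per GPU, to push past the ~1.2 GB point where the crash appears.
    x = torch.ones(256 * 1024 * 1024, device=f"cuda:{rank}")
    for step in range(100):
        dist.all_reduce(x)
        x /= world_size  # keep values bounded across iterations
        torch.cuda.synchronize()
        if rank == 0:
            print("step", step, "ok")
    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # os.environ["NCCL_P2P_DISABLE"] = "1"  # untested assumption: route traffic through host memory instead of P2P
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)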