PC crashes and turns off when using large hidden size

Hello,

I’m trying to replicate the ViT paper: [2010.11929] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

I’ve reproduced the architecture; however, something strange happens when I increase the size of the nn.Linear() layers.

For example, when trying to run an nn.Linear() layer with more than 3000 hidden units, my PC crashes and turns off immediately.

It works with smaller sizes such as:

  • 768 - works
  • 1024 - works
  • 2048 - works

But anything over 3000 seems to crash my PC without any kind of error or warning.

Sometimes it’ll work with 3001 but then won’t work with 3002 hidden units.

The code I’m using replicates the MLP block from Table 1 of the paper linked above:

import torch
from torch import nn

# Could also call this "FeedForward"
class MLPBlock(nn.Module):
    """Creates an MLPBlock of the Vision Transformer architecture."""
    def __init__(self,
                 embedding_dim, # embedding dimension (Hidden Size D in Table 1)
                 mlp_size, # MLP size in Table 1
                 dropout=0): # "Dropout... is applied to every dense layer... (Appendix B.1)"
        super().__init__()
        
        self.mlp = nn.Sequential(
            nn.Linear(in_features=embedding_dim,
                      out_features=mlp_size),
            nn.GELU(), # "The MLP contains two layers with a GELU non-linearity (section 3.1)."
            nn.Dropout(p=dropout),
            nn.Linear(in_features=mlp_size, # needs to take same in_features as out_features of layer above
                      out_features=embedding_dim), # take back to embedding_dim
            nn.Dropout(p=dropout)
        )

    def forward(self, x):
        return self.mlp(x)

# Create random tensor (same shape as paper)
z = torch.randn((1, 196, 768))
print(z.shape)

# Set MLP size
mlp_size = 1024

# No CUDA
cpu_device = "cpu"
print(f"\nUsing device: {cpu_device}")
print(f"Using MLP size: {mlp_size}")
mlp_block = MLPBlock(embedding_dim=768,
                     mlp_size=mlp_size).to(cpu_device) 
z_through_mlp_block = mlp_block(z.to(cpu_device))
print(z_through_mlp_block.shape)

# With CUDA
cuda_device = "cuda"
print(f"\nUsing device: {cuda_device}")
print(f"Using MLP size: {mlp_size}")
mlp_block = MLPBlock(embedding_dim=768,
                     mlp_size=mlp_size).to(cuda_device) 
z_through_mlp_block_cuda = mlp_block(z.to(cuda_device))
print(z_through_mlp_block_cuda.shape)

Output:

torch.Size([1, 196, 768])

Using device: cpu
Using MLP size: 1024
torch.Size([1, 196, 768])

Using device: cuda
Using MLP size: 1024
torch.Size([1, 196, 768])

If I set mlp_size to be anything over 3000 in the code above, it crashes my whole PC.

Tests I’ve done

I’m not quite sure what’s going on because I’ve tested the same code on Google Colab with a P100 GPU (~16GB memory) and it works fine at various mlp_size values, including values of 5000+.

But if I run the same code on my local machine with an NVIDIA TITAN RTX (~24GB memory), it crashes immediately.
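
For what it’s worth, a quick back-of-envelope estimate (just a rough sketch in float32, counting only the two Linear layers plus the largest activation for the (1, 196, 768) input) suggests the block needs tens of MB at most, so I don’t think I’m anywhere near the TITAN RTX’s 24GB:

def mlp_block_memory_mb(embedding_dim=768, mlp_size=3000, seq_len=196, batch_size=1, bytes_per_element=4):
    # Weights + biases of the two nn.Linear layers (float32 = 4 bytes per element)
    params = (embedding_dim * mlp_size + mlp_size) + (mlp_size * embedding_dim + embedding_dim)
    # Largest intermediate activation: (batch_size, seq_len, mlp_size)
    activations = batch_size * seq_len * mlp_size
    return (params + activations) * bytes_per_element / 1e6

print(mlp_block_memory_mb(mlp_size=3000))  # ~20.8 MB
print(mlp_block_memory_mb(mlp_size=5000))  # ~34.7 MB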

My hardware

I ran the environment collection script from the PyTorch GitHub repo (python -m torch.utils.collect_env) to show the hardware/software I’ve got:

Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-27-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA TITAN RTX
Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.3
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.2
[pip3] torch==1.11.0
[pip3] torch-tb-profiler==0.3.1
[pip3] torchaudio==0.11.0
[pip3] torchinfo==1.7.0
[pip3] torchmetrics==0.7.2
[pip3] torchvision==0.12.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] numpy                     1.21.2           py39h20f2e39_0  
[conda] numpy-base                1.21.2           py39h79a1101_0  
[conda] pytorch                   1.11.0          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch-tb-profiler         0.3.1                    pypi_0    pypi
[conda] torchaudio                0.11.0               py39_cu113    pytorch
[conda] torchinfo                 1.7.0                    pypi_0    pypi
[conda] torchmetrics              0.7.2                    pypi_0    pypi
[conda] torchvision               0.12.0               py39_cu113    pytorch

My thoughts

Is this potentially something to do with the max memory values I’ve set on my GPU?

I’m not sure where I’d look to find those.
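
For reference, this is what PyTorch itself reports about the card (a minimal sketch using only standard torch.cuda calls, nothing specific to my setup):

import torch

# What PyTorch reports about the GPU's memory
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB total")
print(f"allocated: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1e6:.1f} MB")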

And temperature-wise, this happens regardless of whether it’s a warm start or a cold start.

No, I don’t think so. Based on your description I would guess your system shuts down due to a power spike when the GPU workload exceeds your PSU’s maximum output.
You could try decreasing your GPU’s clock frequencies and see whether this throttling avoids the shutdown.
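
One rough way to check this is to sample the reported power draw while the forward pass runs. This is just a sketch (it assumes the nvidia-ml-py / pynvml package is installed and reuses the MLPBlock from your first post):

import pynvml
import torch

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000  # NVML reports milliwatts

mlp_block = MLPBlock(embedding_dim=768, mlp_size=2048).to("cuda")
z = torch.randn((1, 196, 768), device="cuda")

peak_w = 0.0
for _ in range(200):
    mlp_block(z)
    torch.cuda.synchronize()
    peak_w = max(peak_w, pynvml.nvmlDeviceGetPowerUsage(handle) / 1000)  # mW -> W

print(f"peak draw ~{peak_w:.0f} W (enforced limit {limit_w:.0f} W)")
pynvml.nvmlShutdown()

# Lowering the board power limit (e.g. `sudo nvidia-smi -pl 200`) is one way to test
# the throttling idea above.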

I think you may be right about the PSU…

It’s now happening with older, previously working scripts that use less compute-heavy networks/data.

I’m also starting to get a CUDA error about an invalid device ordinal despite only having one CUDA device:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The CUDA error: invalid device ordinal might be raised if your system “dropped” the GPU, which could also indicate a shutdown, e.g. due to a weak/faulty PSU. You could check dmesg for Xid error codes and see if any indicate a bus error etc.
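
For example, a quick sketch to filter them from Python (reading dmesg might need sudo depending on kernel.dmesg_restrict):

import subprocess

# NVIDIA Xid messages show up in the kernel log as lines containing "NVRM: Xid"
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
xid_lines = [line for line in log.splitlines() if "NVRM: Xid" in line]
print("\n".join(xid_lines) or "no Xid messages found")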


Thank you!

Turns out you were right.

It was the PSU (or at least so far so good).

My PC started switching off randomly regardless of the activity (even just browsing the web).

I replaced the PSU today and it’s now able to run PyTorch code flawlessly and handle many other tasks without shutting down.

Thank you again 🙂
