Segment Anything Model: unable to find engine to execute this computation

sameed-khan · May 4, 2023, 11:12pm

Hello -

I am trying to fine tune the “Segment Anything” model released by Facebook. Specifically, I am trying to tune a fork of the model that has some code for fine-tuning on medical images, termed MedSAM (found here). When trying to run the code, I get the following error:

Traceback (most recent call last):
  File "/mnt/beegfs/khans24/medsam_finetuning/minimal.py", line 30, in <module>
    embedding = sam_model.image_encoder(input_image)
  File "/home/khans24/beegfs/miniconda/envs/sam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/beegfs/khans24/segment-anything/segment_anything/modeling/image_encoder.py", line 107, in forward
    x = self.patch_embed(x)
  File "/home/khans24/beegfs/miniconda/envs/sam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/beegfs/khans24/segment-anything/segment_anything/modeling/image_encoder.py", line 392, in forward
    x = self.proj(x)
  File "/home/khans24/beegfs/miniconda/envs/sam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/khans24/beegfs/miniconda/envs/sam/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/khans24/beegfs/miniconda/envs/sam/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation

I’ve tried to troubleshoot this many times by switching between different CUDA and cuDNN versions, but to no avail. Here are the specs of the system I am running it on:

PyTorch 2.0 (but also tried with 1.13)
Runtime CUDA 11.0
Runtime cuDNN 8.2.1

Output of torch.backends.cudnn.version() is 8500
I get positive results from running the following code:

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.backends.cudnn.enabled)
print(torch.backends.cudnn.version())

Output is True, 4, True, 8500
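Prebuilt PyTorch binaries typically bundle their own CUDA runtime, so the toolkit loaded through modules (CUDA 11.0 here) is not necessarily the one PyTorch uses. A minimal sketch for collecting what PyTorch itself reports, using only standard torch attributes; the function name describe_torch_build is hypothetical:

```python
import torch


def describe_torch_build() -> dict:
    """Collect the versions that PyTorch itself reports."""
    info = {
        "torch": torch.__version__,
        # CUDA version PyTorch was compiled against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        # cuDNN version as an integer such as 8500; None if cuDNN is unavailable.
        "cudnn": torch.backends.cudnn.version(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

Comparing built_with_cuda against the module-loaded toolkit version shows whether the two differ.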

I am also running this on a high-performance computing cluster, the OS is RedHat Linux. I am also using the “modules” package to load in the CUDA toolkit from modulefiles. I have limited ability to install things since I do not have root access.

Here is also a minimal reproducible example:

import torch
import numpy as np
from skimage import io, transform
from segment_anything import SamPredictor, sam_model_registry
from segment_anything.utils.transforms import ResizeLongestSide

# Set up the model and device
model_type = 'vit_b'
checkpoint = 'load/medsam_20230423_vit_b_0.0.1.pth'
device = 'cuda:0'

sam_model = sam_model_registry[model_type](checkpoint=checkpoint).to(device)

# Generate a random image
image_size = 256
random_image = np.random.randint(0, 256, (image_size, image_size, 3), dtype=np.uint8)

# Resize the random image
sam_transform = ResizeLongestSide(sam_model.image_encoder.img_size)
resized_image = sam_transform.apply_image(random_image)

# Convert the resized image to a PyTorch tensor
resized_image_tensor = torch.as_tensor(resized_image.transpose(2, 0, 1)).to(device)

# Preprocess the image tensor
input_image = sam_model.preprocess(resized_image_tensor[None, :, :, :])

# Compute the image embedding using the sam_model
with torch.no_grad():
    embedding = sam_model.image_encoder(input_image)
    print(embedding.shape)

Any help is appreciated, I’ve been banging my head on this for days.

@ptrblck help!

Thanks all

eqy · May 5, 2023, 2:39am

I was not able to reproduce this on A100 with the sam_vit_b_01ec64.pth checkpoint given in the repo. Could you provide more details such as the specific GPU you are using?

sameed-khan · May 5, 2023, 3:34am

Thanks for taking the time. NVIDIA Tesla V100-SXM2

eqy · May 5, 2023, 9:17pm

I also could not reproduce this on a DGX-1V machine with V100-SXM2, although that test used a newer version of cuDNN.

Could you try e.g., updating your cuDNN to a newer version such as > 8800 or > 8900 and see if that helps?
If you cannot update cuDNN, I would also check if setting the environment variable TORCH_CUDNN_V8_API_DISABLED=1 produces different behavior.
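The simplest way to set the variable is on the command line that launches the script. Where that is inconvenient (for example inside a batch job), one option is to launch the script in a child process with the variable already in its environment, so it is set before torch is imported. A minimal sketch using only the standard library; env_with_flags and run_script are hypothetical helper names:

```python
import os
import subprocess
import sys


def env_with_flags(**flags: str) -> dict:
    """Return a copy of the current environment with extra variables set."""
    env = dict(os.environ)
    env.update(flags)
    return env


def run_script(script: str, **flags: str) -> int:
    """Run `script` in a fresh Python process with the given variables set."""
    result = subprocess.run([sys.executable, script], env=env_with_flags(**flags))
    return result.returncode
```

For the repro in this thread the call would be run_script("minimal.py", TORCH_CUDNN_V8_API_DISABLED="1"). The parent environment is left unchanged, which makes it easy to compare runs with and without the flag.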

sameed-khan · May 6, 2023, 2:57pm

That’s unfortunate. Since I don’t have root access, I “updated” my cuDNN version by placing the relevant files in my HOME folder, the path is /home/khans24/cuDNN/include and /home/khans24/cuDNN/lib64. I followed directions here for installing cuDNN but instead of installing to /usr/local/cuda I just installed it to /home/khans24/cuDNN. I updated my .bashrc script, so this is the value of LD_LIBRARY_PATH now:

/cm/shared/apps/slurm/20.02.5/lib64/slurm:/cm/shared/apps/slurm/20.02.5/lib64:/cm/local/apps/gcc/8.2.0/lib:/cm/local/apps/gcc/8.2.0/lib64:/home/khans24/cuDNN/include:/home/khans24/cuDNN/lib64:/home/khans24/cuDNN/include:/home/khans24/cuDNN/lib64
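The value above lists the cuDNN directories twice and includes the include directory, which holds headers rather than shared libraries. Neither should be harmful, but it can help to see which entries actually contain cuDNN libraries and in what order they are searched. A minimal standard-library sketch; audit_library_path is a hypothetical helper and the libcudnn*.so* pattern is an assumption about how the files are named:

```python
import glob
import os


def audit_library_path(value: str, pattern: str = "libcudnn*.so*") -> list:
    """Return (directory, matching_files) for each unique entry, in search order."""
    seen = set()
    report = []
    for entry in value.split(os.pathsep):
        if not entry or entry in seen:
            continue  # skip empty and repeated entries
        seen.add(entry)
        matches = sorted(glob.glob(os.path.join(entry, pattern)))
        report.append((entry, [os.path.basename(m) for m in matches]))
    return report


if __name__ == "__main__":
    for directory, libs in audit_library_path(os.environ.get("LD_LIBRARY_PATH", "")):
        print(directory, "->", libs or "no cuDNN libraries")
```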

I verified that the cuDNN version was updated by starting a Python shell:

>>> torch.backends.cudnn.version()
8901

So after doing all this, I reran minimal.py and got the same error as before (“GET was unable to find an engine to execute this computation”).

After doing that, I set TORCH_CUDNN_V8_API_DISABLED=1 and that gave me a different error, as below:

Traceback (most recent call last):
  File "/mnt/beegfs/khans24/medsam_finetuning/minimal.py", line 33, in <module>
    embedding = sam_model.image_encoder(input_image)
  File "/home/khans24/beegfs/miniconda/envs/medsam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/beegfs/khans24/segment-anything/segment_anything/modeling/image_encoder.py", line 107, in forward
    x = self.patch_embed(x)
  File "/home/khans24/beegfs/miniconda/envs/medsam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/beegfs/khans24/segment-anything/segment_anything/modeling/image_encoder.py", line 392, in forward
    x = self.proj(x)
  File "/home/khans24/beegfs/miniconda/envs/medsam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/khans24/beegfs/miniconda/envs/medsam/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/khans24/beegfs/miniconda/envs/medsam/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

Let me know if you have any other ideas. As always, appreciate your help here.

eqy · May 6, 2023, 8:32pm

Could you check if you can reproduce the issue without loading the checkpoint? I’m not able to download the exact checkpoint in your repro script as there seems to be some download limit quota that was exceeded. If you can’t reproduce the error this way, another way to provide a repro would be to log the shape and other parameters of the convolution (e.g., by temporarily adding some print statements to /home/khans24/beegfs/miniconda/envs/medsam/lib/python3.10/site-packages/torch/nn/modules/conv.py) and to see if the issue is reproducible when a standalone convolution is called.
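As an alternative to editing conv.py inside site-packages, the same information can be collected with forward pre-hooks, which leave the installed package untouched. This is a sketch rather than the approach used in the thread; log_conv_inputs is a hypothetical helper, and the stand-in layer only mimics a patch-embedding convolution:

```python
import torch
from torch import nn


def log_conv_inputs(model: nn.Module):
    """Attach pre-forward hooks that record each Conv2d call's input and settings.

    Returns (handles, records); `records` fills up as the model runs.
    """
    handles, records = [], []

    def hook(module, args):
        x = args[0]
        record = {
            "input_shape": tuple(x.shape),
            "dtype": x.dtype,
            "device": str(x.device),
            "contiguous": x.is_contiguous(),
            "weight_shape": tuple(module.weight.shape),
            "stride": module.stride,
            "padding": module.padding,
            "dilation": module.dilation,
            "groups": module.groups,
        }
        records.append(record)
        print(record)

    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            handles.append(module.register_forward_pre_hook(hook))
    return handles, records


# Usage with a stand-in layer (the real target would be sam_model.image_encoder).
conv = nn.Conv2d(3, 768, kernel_size=16, stride=16)
handles, records = log_conv_inputs(conv)
conv(torch.zeros(1, 3, 64, 64))
for handle in handles:
    handle.remove()  # detach the hooks once the parameters are logged
```

Recording dtype, device, and contiguity alongside the shapes makes it easier to build a standalone convolution that matches the failing call exactly.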

As some background, the errors you are seeing (both RuntimeError: Unable to find a valid cuDNN algorithm to run convolution and RuntimeError: GET was unable to find an engine to execute this computation) typically occur when the cuDNN heuristics return kernels that crash/fail to execute or return no kernels (most likely it would be the former case).

As a sanity check, have you been able to verify that other models/workloads that have conv2d operations are able to run successfully in your setup?

sameed-khan · May 6, 2023, 10:52pm

On the same hardware/server, I’ve successfully trained U-Net models using TensorFlow in a different conda environment, but this is my first time using PyTorch on this setup. If I try running the below script:

import torch
import torch.nn.functional as F

# Create a random input tensor with shape (batch_size, in_channels, height, width)
batch_size = 1
in_channels = 3
height = 1024
width = 1024
input_tensor = torch.randn(batch_size, in_channels, height, width).cuda()

# Create a random weight tensor with shape (out_channels, in_channels, kernel_size, kernel_size)
out_channels = 768
kernel_size = 16
weight_tensor = torch.randn(out_channels, in_channels, kernel_size, kernel_size).cuda()

# Create a random bias tensor with shape (out_channels)
bias_tensor = torch.randn(out_channels).cuda()

# Convolution parameters
stride = (16, 16)
padding = (0, 0)
dilation = (1, 1)
groups = 1

# Perform the 2D convolution
output_tensor = F.conv2d(input_tensor, weight_tensor, bias_tensor, stride, padding, dilation, groups)
print(output_tensor)

The stride, padding, dilation, and groups, as well as the shapes of the input, weight, and bias tensors, all match the inputs to the F.conv2d call that throws the cuDNN error; I confirmed this with print statements placed in conv.py. The above Python script runs without errors and prints the output tensor:

tensor([[[[-1.2557e+01,  6.0982e+00,  1.8704e+01,  ..., -2.4928e+01,
           -2.2377e+01, -3.5183e+00],
          [ 1.1656e+01, -1.2030e+01,  4.8116e+00,  ..., -6.8026e+00,
           -3.6519e+01,  2.6086e+01],
          [ 4.3808e+01, -3.0679e+01,  9.9522e+00,  ...,  1.2265e+01,
           -1.0603e+01,  1.0314e+01],
          ...
          ...,
          [-3.6283e+01, -7.1115e+00, -5.1546e+01,  ...,  2.2981e+01,
            2.3686e+01,  8.1396e+00],
          [ 2.3555e+01,  1.9937e+00, -3.0376e+00,  ..., -5.0531e+01,
           -3.5612e+01,  2.1503e+01],
          [-6.7170e+00, -2.6074e+01,  4.1083e+01,  ..., -1.2258e+01,
           -3.2744e+01,  6.8977e+00]]]], device='cuda:0')

eqy · May 7, 2023, 3:12am

In that case it could be another kernel that is producing the failure with cuDNN surfacing it due to asynchronous kernel launches. Could you try running the same workload with CUDA_LAUNCH_BLOCKING=1 and checking if a different exception is raised somewhere else?

Otherwise I would also check that the memory layout doesn’t change the behavior (e.g., try setting the input and convolution to channels-last).
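For reference, switching a convolution and its input to channels-last can be sketched as follows. The layer is a stand-in with the shapes from the standalone repro above, not the actual SAM module, and the code falls back to CPU when no GPU is present:

```python
import torch
from torch import nn

# Stand-in for the patch-embedding convolution; shapes follow the thread's repro.
conv = nn.Conv2d(3, 768, kernel_size=16, stride=16)
x = torch.randn(1, 3, 1024, 1024)

device = "cuda" if torch.cuda.is_available() else "cpu"
conv = conv.to(device).to(memory_format=torch.channels_last)
x = x.to(device).to(memory_format=torch.channels_last)

with torch.no_grad():
    out = conv(x)

print(out.shape, x.is_contiguous(memory_format=torch.channels_last))
```

The tensor keeps its logical NCHW shape; only the strides change, which can lead cuDNN to select different kernels.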

sameed-khan · May 7, 2023, 3:36am

No change in output when setting CUDA_LAUNCH_BLOCKING=1 or when setting the input tensor and the model’s memory format to channels_last, unfortunately.

eqy · May 7, 2023, 7:25pm

I will take another shot at reproducing this issue later, but could you check if loading the checkpoint linked in the repo @ https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth reproduces the issue or if your specific checkpoint was needed?

The environment I am using would be that of an NGC PyTorch container using docker: PyTorch | NVIDIA NGC, which hopefully you can try out without having root access on your machine.

eqy · May 9, 2023, 12:38am

I managed to download the original checkpoint from google drive and tried a few different NGC containers with cuDNN versions 8.5.0, 8.4.0, and 8.3.3 but I was still not able to reproduce the issue.

At this point my best guess is some operation prior to the cuDNN convolution itself is corrupting the CUDA context and causing all of the subsequent cuDNN calls to fail (which would produce the unable to find engine error). Unfortunately, this kind of failure would be difficult to debug and could require manually removing layers from the model and checking which one resolves the failure.
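Before removing layers by hand, one possible way to narrow the search is to synchronize after every leaf module and record which ones completed. The sketch below is an assumption about how such a trace could be done, not something tried in the thread; trace_last_completed is a hypothetical helper, and operations called functionally inside a forward method are not hooked, so a failure is only localized to "somewhere after the last completed module":

```python
import torch
from torch import nn


def trace_last_completed(model: nn.Module, *inputs):
    """Run `model`, synchronizing after every leaf module to localize a failure.

    Returns (output, completed_names). On failure, re-raises with the name of
    the last leaf module that completed.
    """
    completed = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # surface pending asynchronous CUDA errors here
            completed.append(name)
        return hook

    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            output = model(*inputs)
    except RuntimeError as exc:
        last = completed[-1] if completed else "<none>"
        raise RuntimeError(f"failed after leaf module '{last}': {exc}") from exc
    finally:
        for handle in handles:
            handle.remove()
    return output, completed
```

In the thread's setup the call would be trace_last_completed(sam_model.image_encoder, input_image).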

sameed-khan · May 13, 2023, 11:05pm

Hey I figured this out. It turns out that the segment-anything package offered by Meta has differences from the medsam fork in the basic definitions of the mask_decoder class. This was part of the problem - I was trying to run the fine-tuning code of the medsam fork on top of Meta’s version of the pip package.

When I fixed the issue and installed the medsam version, the issue was resolved. Thanks for all of your help!
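The tracebacks earlier in the thread show image_encoder.py being loaded from /mnt/beegfs/khans24/segment-anything/, which is a quick way to see which copy of a package Python picked up. A minimal standard-library sketch for checking this up front; locate_package is a hypothetical helper:

```python
import importlib.util
from importlib import metadata
from typing import Optional


def locate_package(import_name: str, dist_name: Optional[str] = None) -> dict:
    """Report where `import_name` would be loaded from and its installed version."""
    spec = importlib.util.find_spec(import_name)
    info = {
        "found": spec is not None,
        "origin": spec.origin if spec else None,
        "version": None,
    }
    try:
        info["version"] = metadata.version(dist_name or import_name)
    except metadata.PackageNotFoundError:
        pass  # importable (e.g. from a source checkout) but not an installed distribution
    return info


if __name__ == "__main__":
    # "segment_anything" is the import name used in the thread's repro script.
    print(locate_package("segment_anything"))
```

If the reported origin points at Meta's repository rather than the fork whose fine-tuning code is being run, the two are mismatched.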

eqy · May 14, 2023, 5:13am

Thanks for the follow-up. The error is still baffling though, as it implies that some kind of CUDA context corruption occurred, which seems unexpected from pure Python changes.

ptrblck · May 14, 2023, 6:30pm

I agree, especially since using CUDA_LAUNCH_BLOCKING=1 didn’t seem to improve the error message either.
@sameed-khan since you’ve narrowed it down to a specific package offered by Meta, it would be great if you could provide them a minimal code snippet to reproduce the issue as it seems their code misses error checks and then fails in unrelated libraries, such as cuDNN in this case.