Hello everyone!
I am running code from an online repository (link). I couldn't copy and paste the whole file here, as it is huge, but I can if required. When I run the code, I get the following error:
cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I use the following tools:
- CUDA Toolkit 10.0
- PyTorch v1.1.0
- Torchvision v0.3.0
- Python v3.7
- GCC 8.4
- Ubuntu 22.04
- Anaconda 23.3.1
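For completeness, here is a small sanity check I can run inside the same conda environment to confirm what the runtime itself reports (my own snippet, not from the repository; the expected values in the comments are just what I expect based on the conda list further below):

# Quick environment sanity check (my own snippet, not from the repository).
import torch

print("torch:", torch.__version__)                         # expect 1.1.0
print("CUDA built against:", torch.version.cuda)           # expect 10.0
print("cuDNN:", torch.backends.cudnn.version())            # expect 7605, i.e. cuDNN 7.6.5
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))               # expect NVIDIA RTX A4000
print("compute capability:", torch.cuda.get_device_capability(0))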
Here is the complete error:
python -m torch.distributed.launch --nproc_per_node=1 train_det_seg_OCID.py --log_dir=LOGDIR --config="/home/ozaland/newProj/grasp_det_seg_cnn/grasp_det_seg/config/defaults/det_seg_OCID.ini" --data="/home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp/data_split"
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
14:52:31 Loading configuration from /home/ozaland/newProj/grasp_det_seg_cnn/grasp_det_seg/config/defaults/det_seg_OCID.ini
14:52:31
[general]
val_interval = 25
log_interval = 10
cudnn_benchmark = no
num_classes = 18
num_stuff = 0
num_things = 18
num_semantic = 32
[body]
body = resnet101
weights = /home/ozaland/newProj/grasp_det_seg_cnn/weights_pretrained/resnet101
normalization_mode = syncbn
activation = leaky_relu
activation_slope = 0.01
gn_groups = 16
body_params = {}
num_frozen = 2
bn_frozen = yes
out_channels = {"mod1": 64, "mod2": 256, "mod3": 512, "mod4": 1024, "mod5": 2048}
out_strides = {"mod1": 4, "mod2": 4, "mod3": 8, "mod4": 16, "mod5": 32}
[fpn]
out_channels = 256
extra_scales = 0
interpolation = nearest
inputs = ["mod2", "mod3", "mod4", "mod5"]
out_strides = (4, 8, 16, 32)
[rpn]
hidden_channels = 256
stride = 1
anchor_ratios = (1., 0.1, 0.4, 0.7, 1.2)
anchor_scale = 2
nms_threshold = 0.7
num_pre_nms_train = 12000
num_post_nms_train = 2000
num_pre_nms_val = 6000
num_post_nms_val = 300
min_size = 16
num_samples = 256
pos_ratio = .5
pos_threshold = .7
neg_threshold = .3
void_threshold = 0.7
fpn_min_level = 0
fpn_levels = 3
sigma = 3.
[roi]
roi_size = (14, 14)
num_samples = 128
pos_ratio = .25
pos_threshold = .5
neg_threshold_hi = .5
neg_threshold_lo = 0.
void_threshold = 0.7
void_is_background = no
nms_threshold = 0.3
score_threshold = 0.05
max_predictions = 100
fpn_min_level = 0
fpn_levels = 4
fpn_canonical_scale = 224
fpn_canonical_level = 2
sigma = 1.
bbx_reg_weights = (10., 10., 5., 5.)
[sem]
fpn_min_level = 0
fpn_levels = 4
pooling_size = (64, 64)
ohem = .25
[optimizer]
lr = 0.03
weight_decay = 0.0001
weight_decay_norm = yes
momentum = 0.9
nesterov = yes
loss_weights = (1., 1., 1., 1.,.75)
[scheduler]
epochs = 800
type = poly
update_mode = batch
params = {"gamma": 0.9}
burn_in_steps = 500
burn_in_start = 0.333
[dataloader]
root_path = /home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp
shortest_size = 480
longest_max_size = 640
train_batch_size = 10
val_batch_size = 1
rgb_mean = (0.485, 0.456, 0.406)
rgb_std = (0.229, 0.224, 0.225)
random_flip = no
random_scale = None
rotate_and_scale = True
num_workers = 6
train_set = training_0
val_set = validation_0
test_set = validation_0
14:52:31 Creating dataloaders for dataset in /home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp/data_split
14:52:31 Creating backbone model resnet101
14:53:19 Starting epoch 1
Traceback (most recent call last):
File "train_det_seg_OCID.py", line 496, in <module>
main(parser.parse_args())
File "train_det_seg_OCID.py", line 467, in main
global_step=global_step, loss_weights=config["optimizer"].getstruct("loss_weights"))
File "train_det_seg_OCID.py", line 274, in train
losses, _, conf = model(**batch, do_loss=True, do_prediction=False)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/models/det_seg.py", line 71, in forward
x = self.body(img)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/modules/fpn.py", line 136, in forward
x = self.backbone(x)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/models/resnet.py", line 115, in forward
outs["mod4"] = self.mod4(outs["mod3"])
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/modules/residual.py", line 92, in forward
x = self.convs(x) + residual
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/home/ozaland/anaconda3/envs/detgrasp/bin/python', '-u', 'train_det_seg_OCID.py', '--local_rank=0', '--log_dir=LOGDIR', '--config=/home/ozaland/newProj/grasp_det_seg_cnn/grasp_det_seg/config/defaults/det_seg_OCID.ini', '--data=/home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp/data_split']' returned non-zero exit status 1.
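To check whether the failure is specific to this repository or happens for any cuDNN convolution on my setup, I also put together the following minimal test (my own sketch, independent of train_det_seg_OCID.py; the tensor shape is arbitrary and only roughly matches the OCID input size):

# Minimal cuDNN convolution test, independent of the repository code.
# If this also raises CUDNN_STATUS_EXECUTION_FAILED, the problem is in the
# PyTorch / cuDNN / driver stack rather than in train_det_seg_OCID.py.
import torch
import torch.nn as nn

torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = False   # matches cudnn_benchmark = no in the config

x = torch.randn(2, 3, 480, 640, device="cuda")
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3).cuda()
y = conv(x)
torch.cuda.synchronize()                 # force the kernel to actually execute
print("convolution OK:", tuple(y.shape))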
conda list returns the following output:
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.4.0 pypi_0 pypi
blas 1.0 mkl
ca-certificates 2023.05.30 h06a4308_0
certifi 2022.12.7 py37h06a4308_0 anaconda
cffi 1.15.0 py37h7f8727e_0
cudatoolkit 10.0.130 0 anaconda
cudnn 7.6.5 cuda10.0_0
expat 2.4.9 h6a678d5_0
freetype 2.12.1 h4a9f257_0
future 0.18.2 pypi_0 pypi
git 2.19.1 pl526h7fee0ce_0 anaconda
gputil 1.4.0 pypi_0 pypi
graspdetseg-cnn 0.1.dev22+g1cfeb2f.d20230711 pypi_0 pypi
grpcio 1.56.0 pypi_0 pypi
imageio 2.31.1 pypi_0 pypi
importlib-metadata 6.7.0 pypi_0 pypi
inplace-abn 1.1.1.dev7+gd7dd3e1 pypi_0 pypi
intel-openmp 2022.1.0 h9e868ea_3769
jpeg 9e h5eee18b_1
lerc 3.0 h295c915_0
libcurl 7.61.1 heec0ca6_0
libdeflate 1.17 h5eee18b_0
libedit 3.1.20221030 h5eee18b_0
libffi 3.2.1 hf484d3e_1007
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libpng 1.6.39 h5eee18b_0
libssh2 1.8.0 h9cfc8f7_4
libstdcxx-ng 11.2.0 h1234567_1
libtiff 4.5.0 h6a678d5_2
libwebp-base 1.2.4 h5eee18b_1
lz4-c 1.9.4 h6a678d5_0
markdown 3.4.3 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
mkl 2020.2 256
mkl-service 2.3.0 py37he8ac12f_0
mkl_fft 1.3.0 py37h54f3939_0
mkl_random 1.1.1 py37h0573a6f_0
msgpack-python 0.5.6 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 2.6.3 pypi_0 pypi
ninja 1.10.2 h06a4308_5
ninja-base 1.10.2 hd09550d_5
numpy 1.21.6 pypi_0 pypi
numpy-base 1.19.2 py37hfa32c7d_0
olefile 0.46 py37_0
opencv-contrib-python 4.8.0.74 pypi_0 pypi
openssl 1.0.2u h7b6447c_0 anaconda
packaging 23.1 pypi_0 pypi
perl 5.34.0 h5eee18b_1
pillow 6.1.0 py37h34e0f95_0 anaconda
pip 22.3.1 py37h06a4308_0 anaconda
protobuf 3.20.3 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
python 3.7.0 h6e4f718_3
pytorch 1.1.0 cuda100py37he554f03_0
pywavelets 1.3.0 pypi_0 pypi
readline 7.0 h7b6447c_5
scikit-image 0.19.3 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
setuptools 65.6.3 py37h06a4308_0
shapely 1.7.0 pypi_0 pypi
six 1.16.0 pyhd3eb1b0_1
sqlite 3.33.0 h62c20be_0
tensorboard 1.14.0 pypi_0 pypi
tifffile 2021.11.2 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
torchvision 0.3.0 cuda100py37h72fc40a_0
typing-extensions 4.7.1 pypi_0 pypi
umsgpack 0.1.0 pypi_0 pypi
werkzeug 2.2.3 pypi_0 pypi
wheel 0.38.4 py37h06a4308_0
xz 5.4.2 h5eee18b_0
zipp 3.15.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_0
zstd 1.5.5 hc292b87_0
nvidia-smi outputs the following:
Mon Jul 24 15:15:33 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:01:00.0 On | Off |
| 41% 42C P5 24W / 140W | 1267MiB / 16376MiB | 34% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 6292 G /usr/lib/xorg/Xorg 450MiB |
| 0 N/A N/A 6503 G /usr/bin/gnome-shell 123MiB |
| 0 N/A N/A 189952 G ...4848331,14781432929507757159,262144 364MiB |
| 0 N/A N/A 946565 G ...,WinRetrieveSuggestionsOnlyOnDemand 76MiB |
| 0 N/A N/A 1262567 G ...sion,SpareRendererForSitePerProcess 166MiB |
+---------------------------------------------------------------------------------------+
I am running the file with the command provided in the repository, i.e. `python -m torch.distributed.launch --nproc_per_node=1 FileName.py --log_dir=LOGDIR --config=CONFIGFILE --data=DATAFOLDER`
I have already tried the following:
- Reinstalled the CUDA toolkit multiple times and even purged and reinstalled all of the NVIDIA graphics drivers multiple times; it still didn't work.
- The problem is not caused by running out of memory (OOM), as many forum posts suggest for this error: I am tracking GPU memory usage (roughly as in the sketch after this list), and the maximum usage so far is around 14%.
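For reference, this is roughly how I track the memory usage during training (my own sketch; GPUtil is the gputil package already installed in the environment, and the function name log_gpu_memory is mine, not from the repository):

# Rough sketch of how I monitor GPU memory during training (names are mine).
# torch.cuda.max_memory_allocated() covers tensors allocated by PyTorch;
# GPUtil reports the overall device usage as seen by the driver.
import torch
import GPUtil

def log_gpu_memory(tag=""):
    allocated_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
    gpu = GPUtil.getGPUs()[0]
    print("[%s] torch max allocated: %.0f MiB | device: %.0f / %.0f MiB (%.0f%%)"
          % (tag, allocated_mb, gpu.memoryUsed, gpu.memoryTotal,
             100.0 * gpu.memoryUsed / gpu.memoryTotal))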
If anyone can provide any assistance, it would be great; I have been stuck on this for a month now, and no solution seems to work.