How to enable `TORCH_USE_CUDA_DSA`

Vik001 · May 15, 2024, 8:17pm

Traceback (most recent call last):
  File "D:\Vikas\Deepvanet\Deepvaner\demo_me.py", line 140, in <module>
    demo()
  File "D:\Vikas\Deepvanet\Deepvaner\demo_me.py", line 128, in demo
    train(modal=args.modal, dataset=args.dataset, epoch=args.epoch, lr=args.learn_rate, use_gpu=use_gpu,
  File "D:\Vikas\Deepvanet\Deepvaner\train_me.py", line 115, in train
    input = (data[0].float().to(device), data[1].float().to(device))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I am relatively new to deep learning, I am trying to compile it with TORCH_USE_CUDA_DSA, on windows pc. I have the following piece of code in my code snippet, which I believe should enable device-side assertions. But it does not.

import os

os.environ['CUDA_LAUNCH_BLOCKING']="1"
os.environ['TORCH_USE_CUDA_DSA'] = "1"

however, cuda gives me asynchronous stack trace, and the above mentioned trace. Any help will be appreciated.

ptrblck · May 16, 2024, 1:31am

Launch the script with blocking launches by exporting this env variable in your terminal and rerun your code to see which line of code failed. If you are stuck, feel free to post a minimal and executable code snippet reproducing the issue.

Vik001 · May 17, 2024, 8:08pm

Do you mean something like this

>> set CUDA_LAUNCH_BLOCKING = 1
>> SET TORCH_USE_CUDA_DSA = 1
>> python demo_me.py

ptrblck · May 17, 2024, 11:35pm

TORCH_USE_CUDA_DSA won’t have any effect on the runtime unless you build PyTorch with this env variable. I’m not using Windows, but guess set should work (export would be the right approach on Linux).

Vik001 · May 21, 2024, 6:48pm

C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [32,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [33,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [34,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [35,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [36,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [37,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [38,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [39,0,0] Assertion `input_val >= zero && input_val <= one` failed.
C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\cuda\Loss.cu:106: block: [0,0,0], thread: [40,0,0] Assertion `input_val >= zero && input_val <= one` failed.


........ some hundred lines 

CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

had this in my code

import os

os.environ['CUDA_LAUNCH_BLOCKING']="1"
os.environ['TORCH_USE_CUDA_DSA'] = "1"

and also this in the terminal

`>> set CUDA_LAUNCH_BLOCKING = 1

SET TORCH_USE_CUDA_DSA = 1`

(in the conda env, I was running), but it returned the asynchronous stack, nonetheless.

This is the actual error, that I am facing. Is this the vanishing or exploding gradient problem ?

trian and test accuracy for last 10 epochs are as follows:

plz, let me know, if you need any more information

ptrblck · May 21, 2024, 9:27pm

No, it’s an indexing error.

If you have trouble isolating it, feel free to post a minimal and executable code snippet reproducing the issue.

Vik001 · May 21, 2024, 9:40pm

This error will not occur at the same point always.

class DeepVANet(nn.Module):
    def __init__(self, bio_input_size=32, face_feature_size=16, bio_feature_size=64,pretrain=True):
        super(DeepVANet,self).__init__()
        self.face_feature_extractor = FaceFeatureExtractor(feature_size=face_feature_size,pretrain=pretrain)
        
        self.bio_feature_extractor =  Transformer1d(
                                        bio_input_size, 
                                        n_classes=64, 
                                        n_length=128, 
                                        d_model=32, 
                                        nhead=8, 
                                        dim_feedforward=128, 
                                        dropout=0.1, 
                                        activation='relu'
                                        )
    
        self.classifier = nn.Sequential(
            nn.Linear(face_feature_size + bio_feature_size, 50),
            nn.ReLU(inplace=True),
            nn.Linear(50, 20),
            nn.ReLU(inplace= True),
            nn.Linear(20,1),
            nn.Sigmoid()
        )

    def forward(self,x):

        img_features = self.face_feature_extractor(x[0])
        bio_features = self.bio_feature_extractor(x[1])
        features = torch.cat([img_features,bio_features.float()],dim=1)
        output = self.classifier(features)
        output = output.squeeze(-1)
        return output

BioFeatureExtractor(passed through CNN and then through LSTM, to give me 16 features) and bio_feature_extractor(gives me 64 features) are two different models, I am trying to fuse them to classify either 0 or 1.

I have run them both separately, they work perfectly fine, this problem only arises, when I am trying to fuse both the models.

The model can be found at GitHub - vvikasreddy/Deepvaner

let me know, if you need more info…

ptrblck · May 22, 2024, 12:22pm

Setting up an entire project without executable code which reproduces the issue for you, would take too much time and is not guaranteed to fail. Take a look at this post to see how your code can be adapted so we could use it for debugging.

Vik001 · August 28, 2024, 6:56pm

Figured out, because my inputs are not normalized, it was giving me those values, sometimes it is just going out of bounds, normalized and problem solved