CUDA Exception: Warp Illegal Address

I am trying to fine-tune a modified version of the BLIP-2 model on a custom dataset, training with mixed precision. As far as I can tell, the error occurs in this line: pooled_feature_map = adaptive_pool(reshaped_feature_map)

    # Reshape the upsampled feature map to combine the channel and spatial dimensions
    reshaped_feature_map = upsampled_feature_map.view(batch_size, 1280 * 32 * 32).float()
    reshaped_feature_map = reshaped_feature_map.contiguous()

    # Adaptive Pooling to match the target number of elements (257 * 1408)
    adaptive_pool = nn.AdaptiveAvgPool1d(257 * 1408)
    with autocast(enabled=False):  # Force full precision
        #pooled_tensor = adaptive_avg_pooling_layer(input_tensor)
        pooled_feature_map = adaptive_pool(reshaped_feature_map)
    #pooled_feature_map = adaptive_pool(reshaped_feature_map.float())
    pooled_feature_map = pooled_feature_map.half()

    # Final reshape to the target shape (257, 1408)
    image_embeds = pooled_feature_map.view(batch_size, 257, 1408)
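
For reference, I think the pooling step on its own can be run with something like this minimal sketch (using a random tensor in place of the real upsampled_feature_map, which I am assuming has shape (batch_size, 1280, 32, 32)):

    import torch
    import torch.nn as nn
    from torch.cuda.amp import autocast

    batch_size = 8
    # Random stand-in for the real upsampled feature map (assumed shape)
    upsampled_feature_map = torch.randn(
        batch_size, 1280, 32, 32, device="cuda", dtype=torch.float16
    )

    # Flatten channel and spatial dimensions into one long vector per sample
    reshaped_feature_map = upsampled_feature_map.view(batch_size, 1280 * 32 * 32).float()

    # Pool 1280 * 32 * 32 = 1310720 elements down to 257 * 1408 = 361856
    adaptive_pool = nn.AdaptiveAvgPool1d(257 * 1408)
    with autocast(enabled=False):  # force the pooling to run in float32
        pooled_feature_map = adaptive_pool(reshaped_feature_map)

    # Cast back to half and reshape to the target (batch_size, 257, 1408)
    image_embeds = pooled_feature_map.half().view(batch_size, 257, 1408)
    print(image_embeds.shape, image_embeds.dtype)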

Error found while debugging:

Epoch 1, Step 1, Loss: 14.890625
[New Thread 0x7af8253ff640 (LWP 7201)]
Batch 2 shapes:
input_ids: torch.Size([8, 256])
attention_mask: torch.Size([8, 256])
pixel_values: torch.Size([8, 3, 224, 224])
labels: torch.Size([8, 128])

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7af699eb1230

Thread 205 “pt_autograd_0” received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 5, grid 5392, block (0,0,0), thread (7,0,0), device 0, sm 0, warp 1, lane 7]
0x00007af699eb1370 in void at::native::(anonymous namespace)::adaptive_average_pool<float>(float const*, float*, int, int, int, int, long, long, long)<<<(8,2,1),(32,8,1)>>> ()

Which PyTorch version are you using? If an older one, could you update to the latest stable or nightly release?

I am using PyTorch 2.4, and the error mainly occurs in the following line of code:
pooled_feature_map = adaptive_pool(reshaped_feature_map)

Please advise.

Could you post the shape of reshaped_feature_map, please?

I tried to find the error using the following code. I printed the sizes of all the variables but couldn't find any issue. The batch size is 4.

    reshaped_feature_map = upsampled_feature_map.view(batch_size, 1280 * 32 * 32)
    # Check input size and data type before pooling
    print(f"Reshaped feature map shape: {reshaped_feature_map.shape}, dtype: {reshaped_feature_map.dtype}")

    # Perform the pooling
    # Adaptive Pooling to match the target number of elements (257 * 1408)
    adaptive_pool = nn.AdaptiveAvgPool1d(257 * 1408)
    try:
        pooled_feature_map = adaptive_pool(reshaped_feature_map)
        print("pooling sucessfful ")
    except Exception as e:
        print(f"Error occurred during pooling: {e}")
        raise

    print(f"pooled feature map shape: {  pooled_feature_map.shape}, dtype: {  pooled_feature_map.dtype}")
    #pooled_feature_map = adaptive_pool(reshaped_feature_map)

    # Final reshape to the target shape (257, 1408)
    image_embeds = pooled_feature_map.view(batch_size, 257, 1408)

Sizes of the variables:

Reshaped feature map shape: torch.Size([4, 1310720]), dtype: torch.float16
pooling successful
pooled feature map shape: torch.Size([4, 361856]), dtype: torch.float16
Reshaped image_embeds shape: torch.Size([4, 257, 1408]), dtype: torch.float16
image_attention_mask torch.Size([4, 257])
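
For what it's worth, the element counts behind these shapes do line up; this is just plain arithmetic, nothing model-specific:

    # Element-count check for the shapes printed above
    in_features = 1280 * 32 * 32   # 1310720 elements per sample before pooling
    out_features = 257 * 1408      # 361856 elements per sample after pooling
    assert in_features == 1310720
    assert out_features == 361856
    # so a (4, 361856) tensor reshapes cleanly to (4, 257, 1408)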

Does this mean you cannot reproduce the issue in isolation using the input shapes you have shared?

I used ChatGPT to explain the error, and it said that the adaptive_avg_pooling_layer is the issue and that it cannot perform the pooling operation because I am using float16. But when I changed the code as above, the pooling operation ran. When I trained the model, I could train it for one and a half epochs, and after that I got the same error.

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x730431eb1200

Thread 205 “pt_autograd_0” received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 5389, block (0,0,0), thread (7,0,0), device 0, sm 0, warp 2, lane 7]
0x0000730431eb1380 in void at::native::(anonymous namespace)::adaptive_average_pool<c10::Half>(c10::Half const*, c10::Half*, int, int, int, int, long, long, long)<<<(8,2,1),(32,8,1)>>> ()

Next, I increased the batch size and checked, and then I got the same error before training completed the first epoch. Is it a memory issue?

Could you post a minimal code snippet that is causing the issue or which input shape was used to trigger the error?
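
For example, something along these lines might already be enough, i.e. a standalone sketch feeding random float16 data of the reported shape into the same pooling layer (the synchronize call is there so an asynchronous CUDA error would surface right at this point):

    import torch
    import torch.nn as nn

    # Standalone check with random data in the reported shape
    batch_size = 8
    x = torch.randn(batch_size, 1280 * 32 * 32, device="cuda", dtype=torch.float16)

    adaptive_pool = nn.AdaptiveAvgPool1d(257 * 1408)
    out = adaptive_pool(x)      # runs the same adaptive_average_pool kernel
    torch.cuda.synchronize()    # surface any asynchronous CUDA error here
    print(out.shape, out.dtype)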

The error appeared after the first epoch completed.
Code:

    reshaped_feature_map = upsampled_feature_map.view(batch_size, 1280 * 32 * 32)
    # Check input size and data type before pooling
    print(f"Reshaped feature map shape: {reshaped_feature_map.shape}, dtype: {reshaped_feature_map.dtype}")

    # Perform the pooling
    # Adaptive Pooling to match the target number of elements (257 * 1408)
    adaptive_pool = nn.AdaptiveAvgPool1d(257 * 1408)
    try:
        pooled_feature_map = adaptive_pool(reshaped_feature_map)
        print("pooling sucessfful ")
    except Exception as e:
        print(f"Error occurred during pooling: {e}")
        raise

    print(f"pooled feature map shape: {  pooled_feature_map.shape}, dtype: {  pooled_feature_map.dtype}")

Error