Illegal memory access on tensors with large dimensions in a custom CUDA extension

I am trying to use the correlation function from https://github.com/NVlabs/PWC-Net and/or https://github.com/lliuz/ARFlow. I have updated both to use the static-method autograd.Function implementation required by PyTorch 1.6.0 and replaced THCudaTensor/at::Tensor with torch::Tensor.

If I test with reasonably sized tensors (e.g. 16, 8, 128, 128), it fails with an illegal memory access in the backward methods (after resizing the blobs); however, it works fine with smaller-shaped tensors (e.g. 16, 64, 64, 64), even though their total volume is greater.

I haven’t touched the CUDA kernels of either repository, and I assume they should be fine since both are published (and one is from NVIDIA)…

Both of them previously used older versions of CUDA (8 & 9) and PyTorch (0.4 and 1.1); could this also be the issue?

Source for my changes can be found here.

Cheers

Your link is not available and yields a 404.
I don’t know which custom kernels you are using, but since the illegal memory access seems to be size-dependent, I guess some int32/int64 indexing might fail.
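As an illustration only (not code from your repositories), this is the kind of size-dependent failure I mean: an offset computed entirely in 32-bit arithmetic can wrap to a negative value for large tensors, which then shows up as an illegal memory access.

```cuda
#include <cstdint>

// Illustration only (not code from either repository): an offset like
//   int off = ((b * c + ch) * h + y) * w + x;
// is evaluated entirely in 32-bit arithmetic and can wrap to a negative
// value for large tensors; promoting to 64-bit avoids that.
__global__ void scale_kernel(float* data, int n, int c, int h, int w, float s) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int b = blockIdx.z;
    if (x >= w || y >= h || b >= n) return;

    for (int ch = 0; ch < c; ++ch) {
        // 64-bit index arithmetic: the cast applies before any multiplication.
        int64_t off = (((int64_t)b * c + ch) * h + y) * w + x;
        data[off] *= s;
    }
}
```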
You could run the code via cuda-gdb and create an issue in the corresponding repository, where the kernel is provided.

I did a few more tests and some debugging via printing in the kernels, and it seems that for some combinations of input height/width and padding (more precisely, a lack of padding), the correlation forward and backward passes try to access outside of the image, i.e. index -1.
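To make the failure concrete, here is the arithmetic I believe produces the negative index. I'm assuming the usual FlowNet/PWC-Net style parameters (pad_size, kernel_size, max_displacement); the names and the formula are my working assumptions, not something taken from the kernel source.

```cpp
#include <cstdio>

// Leftmost/topmost coordinate the correlation window can read for the first
// output pixel, assuming FlowNet/PWC-Net style parameters (my assumption,
// not the exact kernel variables).
int leftmost_read(int pad_size, int kernel_size, int max_displacement) {
    int kernel_radius = (kernel_size - 1) / 2;
    // Output pixel 0 sits at column pad_size of the padded image; the search
    // window then reaches max_displacement + kernel_radius further left.
    return pad_size - max_displacement - kernel_radius;
}

int main() {
    printf("%d\n", leftmost_read(0, 1, 4));  // -4 -> reads before the image
    printf("%d\n", leftmost_read(4, 1, 4));  //  0 -> stays in bounds
}
```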

I’m not entirely sure who the original author of the correlation kernel is; the earliest version I’ve noticed is NVIDIA’s PWC-Net. The link is 404’ing because I figured this out, made the fix, noted that I need at least some padding (and to increase it if I hit this problem again), and then merged it back into my master branch.

The bad accesses occur where the boundary checks are (unless otherwise stated in a comment). If I can figure out a more elegant solution, I could submit a PR? I believe there should be a way of calculating the minimal required padding.
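As a stop-gap, something like the following host-side check in the C++ binding is what I have in mind. The parameter names (pad_size, kernel_size, max_displacement) and the minimal-padding formula are my working assumptions based on the common FlowNet-style wrapper, not something stated by the original authors.

```cpp
#include <torch/extension.h>

// Working assumption: the correlation window for any output pixel never
// reaches further than kernel_radius + max_displacement outside the image,
// so that is the minimal padding needed to avoid negative indices.
inline int minimal_required_padding(int kernel_size, int max_displacement) {
    int kernel_radius = (kernel_size - 1) / 2;
    return kernel_radius + max_displacement;
}

// Fail early in the binding instead of faulting inside the kernel.
void check_correlation_args(int pad_size, int kernel_size, int max_displacement) {
    TORCH_CHECK(pad_size >= minimal_required_padding(kernel_size, max_displacement),
                "pad_size=", pad_size, " is too small; need at least ",
                minimal_required_padding(kernel_size, max_displacement));
}
```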

That sounds like great debugging. Sure, if you have a proper fix, the authors would probably be really glad to receive it. :slight_smile: