Torch.bincount behaves differently on CPU and GPU

HAL-42 · May 5, 2022, 1:38pm

It seems that for a tensor with dtype=uint8 and device=cuda, if any element equals 255, then torch.bincount does not count any bin other than 255.
The problem only occurs when the tensor is on the GPU.

My pytorch version is 1.10.0.

suraj.pt · May 5, 2022, 2:51pm

This error doesn’t occur in pytorch 1.11; can you try updating your version?

HAL-42 · May 6, 2022, 6:38am

It seems this error still occurs in pytorch 1.11.0, torch.cuda.version=11.3.

tom · May 6, 2022, 8:33am

I can reproduce this on a dev branch from after the 1.11 cut. At first glance, it could be a bug in the CUDA implementation that only shows up with uint8; for dtype int and long, bincount seems to work as expected for me.

Best regards

Thomas

tom · May 6, 2022, 12:23pm

Staring at the code a bit, this might be suspicious:

github.com

pytorch/pytorch/blob/621ff0f9735cd8c4c5d6becb291ad050c35e01c0/aten/src/ATen/native/cuda/SummaryOps.cu#L36

            /*
              Memory types used for the 3 histogram implementations.
              See `CUDA_tensor_histogram` below.
             */
            enum class CUDAHistogramMemoryType { SHARED, MULTI_BLOCK, GLOBAL };
            namespace {
              template<typename input_t, typename IndexType>
              __device__ static IndexType getBin(input_t bVal, input_t minvalue, input_t maxvalue, int64_t nbins) {
                IndexType bin = (int)((bVal - minvalue) * nbins / (maxvalue - minvalue));
                // (only applicable for histc)
                // while each bin is inclusive at the lower end and exclusive at the higher, i.e. [start, end)
                // the last bin is inclusive at both, i.e. [start, end], in order to include maxvalue if exists
                // therefore when bin == nbins, adjust bin to the last bin
                if (bin == nbins) bin -= 1;
                return bin;
              }
            }

I will check if that is the culprit and if so, I’ll send a PR.

Best regards

Thomas

Edited: This was not it. The computation is in int64 anyways.
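
To illustrate why the numerator in getBin is not the problem, here is a minimal host-side sketch. The helper name `bin_numerator` is hypothetical and only isolates the `(bVal - minvalue) * nbins` expression from the excerpt above; it is not part of PyTorch. Because `nbins` is an `int64_t`, the usual arithmetic conversions widen the whole product to `int64_t`, even when `input_t` is `uint8_t`.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: the numerator of the getBin expression, isolated
// so that its type and value can be inspected on the host.
template <typename input_t>
auto bin_numerator(input_t bVal, input_t minvalue, int64_t nbins) {
  // (bVal - minvalue) is promoted to int, then widened to int64_t
  // because the other operand, nbins, is int64_t.
  return (bVal - minvalue) * nbins;
}

// The product is computed in int64_t, not in the narrow input type.
static_assert(
    std::is_same<decltype(bin_numerator<uint8_t>(0, 0, 1)), int64_t>::value,
    "numerator is computed in int64_t");
```

For example, `bin_numerator<uint8_t>(255, 0, 256)` evaluates to 65280, which would not fit in a uint8 but is unproblematic in int64.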

tom · May 6, 2022, 12:55pm

Now I found it.
The maxvalue computed below overflows: for a uint8 input containing 255 we have nbins = 256, and storing that in a uint8 maxvalue wraps it around to 0. Right now I’m checking whether the fix should clamp to the numeric type’s max value or set maxvalue to nbins - 1, and then I’ll send a PR.

github.com

pytorch/pytorch/blob/621ff0f9735cd8c4c5d6becb291ad050c35e01c0/aten/src/ATen/native/cuda/SummaryOps.cu#L316

              AT_ERROR("bincount only supports 1-d non-negative integral inputs.");
            }

            bool has_weights = weights.defined();
            if (has_weights && weights.size(0) != self.size(0)) {
              AT_ERROR("input and weights should have the same length");
            }

            const int64_t nbins = std::max(*self.max().cpu().data_ptr<input_t>() + (int64_t)1, minlength);
            const input_t minvalue = 0;
            const input_t maxvalue = nbins;
            // alloc output counter on GPU
            Tensor output;
            if (has_weights) {
              output = native::zeros(
                  {nbins},
                  optTypeMetaToScalarType(weights.options().dtype_opt()),
                  weights.options().layout_opt(),
                  weights.options().device_opt(),
                  weights.options().pinned_memory_opt());
              cuda::CUDA_tensor_histogram<weights_t, input_t, true>(
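
The wrap-around can be shown on the host without any CUDA code. This is a minimal sketch, not the PyTorch implementation: `narrow_upper_bound` and `wide_upper_bound` are hypothetical names, the first mirroring the `const input_t maxvalue = nbins;` line from the excerpt and the second keeping the bound in a wider type. Whether the eventual fix looks like the second function is an assumption here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the excerpt: nbins is computed in int64_t,
// but the upper bound is then stored in the (possibly narrow) input type.
template <typename input_t>
input_t narrow_upper_bound(input_t max_elem, int64_t minlength) {
  const int64_t nbins =
      std::max(static_cast<int64_t>(max_elem) + 1, minlength);
  // For input_t = uint8_t and nbins = 256 this conversion wraps to 0.
  const input_t maxvalue = static_cast<input_t>(nbins);
  return maxvalue;
}

// Hypothetical alternative: keep the bound in int64_t so it cannot wrap
// for 8-bit inputs. This is an illustration, not the actual patch.
template <typename input_t>
int64_t wide_upper_bound(input_t max_elem, int64_t minlength) {
  return std::max(static_cast<int64_t>(max_elem) + 1, minlength);
}
```

With a largest element of 254 the narrow bound is still 255, so the problem only appears once the input contains 255, which matches the report: `narrow_upper_bound<uint8_t>(255, 0)` is 0, while `wide_upper_bound<uint8_t>(255, 0)` is 256.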

Best regards

Thomas

P.S.: Thank you @HAL-42 for reporting this with repro. That is always very helpful.

P.P.S.: I sent "Fix bincount to use acc scalar for the bounds" (pytorch/pytorch Pull Request #76979, by t-vi); we will see how it goes.