cdp_simple_quicksort makes the CUDA context consume 50 MB more. Why? And what's the best way to sort in CUDA?

When the cdp_simple_quicksort function from here is compiled in, the CUDA context consumes 50 MB more than when it is not. Why as much as 50 MB?

And what's the best way to sort in CUDA?

Thanks!

The issue doesn’t seem to be related to PyTorch, so you might want to ask in the NVIDIA discussion board.
