Method for better utilization of GPU memory for K-means clustering

I have implemented the K-means clustering algorithm on the GPU using PyTorch. I have a Tesla K80 GPU (11 GB of memory).

I have used the following methods to increase the number of data points and clusters I can handle.

  • Explicitly deleted variables once they were no longer needed, which releases GPU memory that is no longer in use.

  • Used half-precision floating point.
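
For readers who want to see what these two steps can look like, here is a minimal sketch. It is not the code from this post: the function name `nearest_centroid` and its arguments are made up for illustration, and it falls back to float32 on the CPU. Note that `del` only drops the Python reference; PyTorch's caching allocator keeps the freed block for reuse, and `torch.cuda.empty_cache()` hands cached blocks back to the driver. Half precision also overflows above 65504, so large squared distances may need rescaled inputs.

```python
import torch

def nearest_centroid(points, centroids, device=None):
    """Assign each point to its nearest centroid (hypothetical helper)."""
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device = torch.device(device)
    # Assumption: use half precision only on the GPU, float32 on the CPU.
    dtype = torch.float16 if device.type == "cuda" else torch.float32

    x = points.to(device=device, dtype=dtype)
    c = centroids.to(device=device, dtype=dtype)

    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is the same
    # for every centroid, so it can be dropped before taking the argmin.
    scores = (c * c).sum(dim=1).unsqueeze(0) - 2.0 * (x @ c.t())
    labels = scores.argmin(dim=1)

    # Drop references to the large intermediates, then release cached blocks.
    del scores, x, c
    if device.type == "cuda":
        torch.cuda.empty_cache()
    return labels.cpu()
```

The largest tensor here is `scores`, with one entry per (point, centroid) pair, so it is usually the first thing that stops fitting as the number of points or clusters grows.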

Still, I was only able to cluster a maximum of 3 million data points into 500 clusters. You will be able to find the code below.

Is there a better method supported by PyTorch for using GPU memory, so that the GPU memory is used mostly for computation while the data is streamed from the CPU to the GPU? That way, while one part of the data (matrices) is being transferred to the GPU, computation can proceed on the parts that have already arrived, and the whole dataset never needs to be resident on the GPU at once. I would then be able to cluster more data points into more clusters. Please let me know if this is possible.
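
One way this kind of streaming can be sketched is shown below. It is an illustration under stated assumptions, not a PyTorch-provided K-means: the function name `kmeans_streamed` and its parameters are made up. The dataset stays in (pinned) host memory, each chunk is copied to the GPU with `non_blocking=True`, and only per-cluster sums and counts are kept on the device, so GPU memory holds one chunk and one chunk-by-K distance matrix at a time. This sketch bounds memory but does not by itself overlap copies with computation; that would additionally need separate CUDA streams or double buffering.

```python
import torch

def kmeans_streamed(data, num_clusters, num_iters=10, chunk_size=100_000,
                    device=None, seed=0):
    """Lloyd's K-means with `data` kept on the CPU and streamed in chunks."""
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device = torch.device(device)
    n, d = data.shape
    if device.type == "cuda":
        # Pinned host memory lets host-to-device copies run asynchronously.
        # Note: this makes a second host copy of the data.
        data = data.pin_memory()

    # Assumption: initialise centroids from randomly chosen data points.
    gen = torch.Generator().manual_seed(seed)
    init_idx = torch.randperm(n, generator=gen)[:num_clusters]
    centroids = data[init_idx].to(device=device, dtype=torch.float32)
    labels = torch.empty(n, dtype=torch.long)  # stays on the CPU

    for _ in range(num_iters):
        sums = torch.zeros(num_clusters, d, device=device)
        counts = torch.zeros(num_clusters, device=device)
        for start in range(0, n, chunk_size):
            stop = start + chunk_size
            chunk = data[start:stop].to(device, non_blocking=True).float()
            # Distance matrix for this chunk only: (chunk rows, num_clusters).
            assign = torch.cdist(chunk, centroids).argmin(dim=1)
            # Accumulate per-cluster sums and counts on the device.
            sums.index_add_(0, assign, chunk)
            counts.index_add_(0, assign,
                              torch.ones_like(assign, dtype=sums.dtype))
            labels[start:stop] = assign.cpu()
        # Update only non-empty clusters; empty ones keep their old centroid.
        nonempty = counts > 0
        centroids[nonempty] = sums[nonempty] / counts[nonempty].unsqueeze(1)

    # `labels` reflect the assignment made in the last pass, i.e. against the
    # centroids as they were before the final update.
    return centroids.cpu(), labels
```

With this layout the chunk size, rather than the dataset size, determines peak GPU memory, so `chunk_size` can be tuned to whatever fits alongside the centroids.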

Any help on this issue is appreciated.


Check out this GitHub repo.

Find the documentation here.

On a GPU (in Google Colab), clustering 10 million 2D samples into 3 clusters takes about 25 seconds.