Implement k-means clustering across multiple GPUs

Hi,
Thanks for reading this post. I have a question regarding how to implement the following algorithm with PyTorch distributed.
The first step of the algorithm is to randomly sample k (= 500) data points from the dataset, push them forward through the network, and get a feature of dimension 512 for each of them. Let's call this matrix of features the centroids (with shape 500 by 512).
Then, for each iteration, we push the dataset through the network and compute a loss with regard to the centroids. We also push the features of each data point into a list.
Finally, the list (with shape 51000 by 512) is used to compute new centroids (with shape 500 by 512), with k-means for example.
Then we repeat these three steps.
My question is how to implement this algorithm and what kind of data structure we should use?
Best regards!

This is going to be a pretty long reply :kissing:

So, you can use the tensor data structure; it will work just fine for this.

Here's how the k-means clustering algorithm works; I'll be using this as a reference for your implementation.

  1. An initial set of k centroids is selected / computed.

  2. A distance metric like Euclidean or Manhattan distance, or even mean squared error, is computed for each data point against each centroid.

  3. Each data point is assigned to the centroid with the shortest distance, and so k clusters are formed.

  4. The mean/average of all the data points in each of the k clusters is computed to form the new centroids.

  5. Steps 2 through 4 are repeated until the previous centroids are equal to the current centroids (a minimal reference sketch of this plain loop follows right after this list).
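
For reference, a plain single-GPU version of that loop could look something like the sketch below. The function name, the tolerance check, and the empty-cluster caveat are my own illustrative assumptions, not code from your project:

import torch

def kmeans(features, k=500, n_iters=100, tol=1e-4):
    # features: (N, D) tensor, e.g. (51000, 512)
    # step 1: randomly pick k rows as the initial centroids
    idx = torch.randperm(features.size(0))[:k]
    centroids = features[idx].clone()
    for _ in range(n_iters):
        # step 2: pairwise Euclidean distances, shape (N, k)
        distances = torch.cdist(features, centroids)
        # step 3: index of the closest centroid for every data point
        assignments = distances.argmin(dim=1)
        # step 4: the mean of each cluster becomes the new centroid
        # (empty clusters are not handled in this sketch)
        new_centroids = torch.stack(
            [features[assignments == c].mean(dim=0) for c in range(k)]
        )
        # step 5: stop once the centroids stop moving
        converged = torch.allclose(new_centroids, centroids, atol=tol)
        centroids = new_centroids
        if converged:
            break
    return centroids, assignments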

Now, in your case, you want to select random data points from your data to form the initial centroids (k = 500).

First of all, you set a seed for your network so that the initial network output, and the output for your 500 selected centroids, will always be the same for every run, like so: torch.manual_seed(0) or any other number.

Second of all, you define your neural network architecture such that the final layer gives an output size of 512, like so: nn.Linear(..., 512), so that the output shape will be (N, 512); then define your forward(self, input) method.
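
For illustration, a minimal skeleton (before the centroid bookkeeping described in the next steps) could look like this; the input size and the hidden layer are placeholders I made up, only the 512-dimensional output comes from your description:

import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        # placeholder backbone; only the 512-dim output is taken from your post
        self.hidden = nn.Linear(in_features, 1024)
        self.out = nn.Linear(1024, 512)   # final layer -> (N, 512) features

    def forward(self, input):
        x = torch.relu(self.hidden(input))
        return self.out(x)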

Thirdly, in your neural network class's def __init__(self, ...), add two attributes like so:

self.centroids = None
self.previous_centroids = None

These attributes will hold the values of the centroid matrix and the previous centroid matrix respectively.

Fourthly, you add a method to your network class called set_centroids(self, centroid_matrix) such that:

def set_centroids(self, centroid_matrix):
    self.centroids = centroid_matrix

What'll happen here, basically, is that after creating an object of your neural network class and passing your selected samples of shape (500, 1) through the network to compute a centroid matrix of shape (500, 512), you pass that centroid matrix as a parameter into the set_centroids(self, centroid_matrix) method like so:

#centroid should be of shape (500, 1)
model = Network(params)
centroid_matrix = model(centroid)
model.set_centroids(centroid_matrix)

Fifthly, add another class attribute to the __init__(self, ...) method of your network class like so:

self.distance = nn.MSELoss()

This will be responsible for computing the MSE, which will serve as the distance metric between data points and each centroid.
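
Just to make that concrete, here is what the metric does for a single feature vector and a single centroid (the tensors below are random placeholders):

import torch
import torch.nn as nn

distance_fn = nn.MSELoss()
sample = torch.randn(512)    # features of one data point (placeholder values)
centroid = torch.randn(512)  # one centroid (placeholder values)
d = distance_fn(sample, centroid)  # 0-d tensor: mean of the squared differences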

Sixthly, add another method to your network class, what_centroid(self, sample), which computes the distances between a single data point of the model output and the centroids, so it can output the index of the centroid that data point belongs to (the one with the smallest distance).

Now your what_centroid(self, sample) method will look like this:

def what_centroid(self, sample):
    # sample: one (512,) feature vector from the model output
    distance_list = []
    for c in self.centroids:
        # MSE between this sample and centroid c (a scalar tensor)
        distance = self.distance(sample, c)
        distance_list.append(distance)
    # pick the index of the closest centroid
    min_distance = min(distance_list)
    centroid_idx = distance_list.index(min_distance)
    return centroid_idx
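
A quick side note: looping over 500 centroids for each of 51000 samples will be slow. Since the MSE between a sample and a centroid is just the squared Euclidean distance divided by 512, the nearest centroid under MSE is also the nearest under Euclidean distance, so the same assignment can be done in one shot. A sketch, assuming model_output has shape (N, 512) and self.centroids has shape (500, 512):

# vectorized alternative to calling what_centroid() per sample
distances = torch.cdist(model_output, self.centroids)  # (N, 500) Euclidean distances
centroid_idx = distances.argmin(dim=1)                  # (N,) nearest-centroid index per sample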

Seventhly, add another method to form the clusters like so:

def cluster(self, model_output):
    cluster_dict = dict()
    for i in model_output:
        centroid_idx = self.what_centroid(i)
        if(centroid_idx not in cluster_dict):
            cluster_dict[centroid_idx] = [i]
        else:
            cluster_dict[centroid_idx].append(i)
    return cluster_dict

Here, there are 500 clusters with indices from 0 to 499. What this method does is find the centroid each data point belongs to and build a dictionary whose keys are the centroid indices (0 to 499) and whose values are lists of the data points belonging to the corresponding centroid.

Eighthly, compute the new centroids from the dictionary and update the value of self.previous_centroids like so:

def set_new_centroids(self, cluster_dict):
    new_centroids = []
    # iterate in centroid-index order so that row i of the new matrix
    # still corresponds to centroid i (a dict keeps insertion order, not index order)
    for i in sorted(cluster_dict):
        cluster_dict[i] = torch.cat(cluster_dict[i], dim=0).reshape(-1, 512)
        new_centroids.append(cluster_dict[i].mean(dim=0))

    # note: this assumes every one of the 500 clusters received at least one point
    new_centroids = torch.stack(new_centroids, dim=0) #shape (500, 512)

    #update centroids
    self.previous_centroids = self.centroids
    self.set_centroids(new_centroids)

Now we have defined everything required for the network class.
What is left is to compute a loss for backpropagation and optimization.

You said that the loss will be computed with regard to the centroids, right?
Now let's say the loss is the MSE between the previous and the new centroids. An implication of this is that your model will most likely not converge the way a standard k-means algorithm would.
This is simply because the model parameters change with every backward pass, so when you push the same dataset of shape (51000, 1) through the network to get an output of shape (51000, 512), the values change on every forward pass in accordance with the change in the model parameters.

I hope you’re getting the picture I’m trying to paint.

Having said all that, before training, pass your randomly selected 500 samples (the ones that'll serve as centroids) through the network to compute a centroid matrix of shape (500, 512), and store it as the network's initial centroids with the set_centroids() class method.

During training, as I've mentioned earlier, you would want to feed the data to the network in batches. For each batch iteration in an epoch you store the outputs in a list, and when the last batch of that epoch is reached you use the list to form the clusters and compute the new centroids with the corresponding class methods.

Also, make sure to turn the list of output tensors for an epoch into a single multi-dimensional tensor, pass it to the cluster() method, and then pass its output to the set_new_centroids() method.

Now, you use the self.distance attribute of the class to compute a loss/distance, and then you backpropagate like so:

optimizer.zero_grad()
loss = model.distance(model.centroids, model.previous_centroids)
loss.backward()
optimizer.step()
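
Putting all of that together, one epoch could look roughly like the sketch below. dataloader, optimizer and num_epochs are placeholder names I'm assuming on your side, and keeping the whole epoch's graph alive for this centroid-to-centroid loss is memory-hungry, so treat this purely as an illustration of the flow:

for epoch in range(num_epochs):
    epoch_outputs = []
    for batch in dataloader:          # batches of the (51000, 1) dataset
        output = model(batch)         # (B, 512) features
        epoch_outputs.append(output)

    # end of the epoch: one (51000, 512) tensor -> clusters -> new centroids
    all_features = torch.cat(epoch_outputs, dim=0)
    cluster_dict = model.cluster(all_features)
    model.set_new_centroids(cluster_dict)

    # loss between the new and the previous centroids, then backprop
    optimizer.zero_grad()
    loss = model.distance(model.centroids, model.previous_centroids)
    loss.backward()
    optimizer.step()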

I HAVEN’T TESTED THESE CODES OUT, THEY ARE JUST FOR ILLUSTRATION TO GIVE YOU AN IDEA OF WHAT TO DO. YOU’RE FREE TO COPY THEM AND EVEN OPTIMIZE THEM IF YOU WILL


Looks like I could not modify the original post.
Sorry for confusing you. Actually, I would like to store the centroids globally so that all the processes on each GPU can modify and access them. Do you have any suggestions?

You can simply make it a global variable outside of the class and access and modify it however you want.

In my opinion it's not necessary though, because remember that all processes would have to finish running across the GPUs before new centroids are computed and set.
So in the end, although each GPU process accesses the centroids separately, none of them can independently change the value until all the GPU processes are done, the new centroids are computed, and the old centroids are updated.
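
If you do go the torch.distributed route (e.g. one process per GPU under DistributedDataParallel), the usual pattern is not a shared global variable but gathering every rank's features and having each rank update its own identical copy of the centroids. A hedged sketch of that idea, where local_features is a placeholder for this rank's feature tensor:

import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) was already called by your launcher
world_size = dist.get_world_size()

# local_features: this rank's (N_local, 512) feature tensor for the epoch
gathered = [torch.zeros_like(local_features) for _ in range(world_size)]
dist.all_gather(gathered, local_features)      # every rank receives every shard
all_features = torch.cat(gathered, dim=0)      # full (51000, 512) feature matrix

# every rank runs the same clustering on the same data, so the centroid
# copies stay identical across GPUs; alternatively compute on rank 0 only
# and dist.broadcast(new_centroids, src=0) to the other ranks
cluster_dict = model.cluster(all_features)
model.set_new_centroids(cluster_dict)

One caveat: plain dist.all_gather does not carry gradients, so if your centroid-to-centroid loss needs to backpropagate through features coming from other GPUs, you would have to handle that separately.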