Pairwise correlation

Hi,

What would be the best way to calculate pairwise correlations between two tensors?
I’m trying to replicate this pandas DataFrame feature:
u = df[features_list].corrwith(df['prediction'])
https://docs.pymars.org/en/latest/reference/dataframe/generated/mars.dataframe.DataFrame.corrwith.html

I need to calculate the pairwise correlations of a matrix with a vector, and I would like to do it on the GPU (I could do it with NumPy and then convert the result to a tensor, but that is quite slow).

Thanks in advance.

I’m not super familiar with the pandas function, but if this description is what you need, you can just do this:

import torch

y = 200  # length of vector
x = 1000  # number of such vectors that make up the matrix

# generate some data for illustration
vector = torch.randn(y)
matrix = 0.5 * vector + 0.5 * torch.randn(x, y)

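# normalized dot product of each row with the vector; this matches the
# Pearson correlation here because the data is generated with zero mean
correlation = (matrix * vector).sum(dim=1) / ((matrix * matrix).sum(dim=1) * (vector * vector).sum()).sqrt()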
print(correlation.shape)
print(correlation.mean())

Output:
torch.Size([1000])  # one correlation for each vector in the matrix
tensor(0.7248)  # close to sqrt(0.5) which is the expected result
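
A side note on the snippet above: the expression is a normalized dot product (cosine similarity), so it only equals the Pearson correlation when the rows and the vector have zero mean. The following is a minimal sketch of a centered variant, not part of the original answer. The added offset of 3.0, the fixed seed, and the names mat_c, vec_c, pearson and uncentered are assumptions made purely for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the comparison is repeatable

vector = torch.randn(200)
# same construction as above, plus an offset so the rows no longer have zero mean
matrix = 0.5 * vector + 0.5 * torch.randn(1000, 200) + 3.0

# center each row and the vector, then apply the same formula
mat_c = matrix - matrix.mean(dim=1, keepdim=True)
vec_c = vector - vector.mean()
pearson = (mat_c * vec_c).sum(dim=1) / (
    (mat_c * mat_c).sum(dim=1) * (vec_c * vec_c).sum()
).sqrt()

# the uncentered formula on the same data, for comparison
uncentered = (matrix * vector).sum(dim=1) / (
    (matrix * matrix).sum(dim=1) * (vector * vector).sum()
).sqrt()

print(pearson.mean())     # stays close to sqrt(0.5)
print(uncentered.mean())  # noticeably smaller because of the offset
```

With centering, the result should not depend on any constant offset in the data, which is the behavior pandas' corrwith has as well.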

Thanks @Andrei_Cristea. The thing is that I’m having trouble with the dimensions. As output, I need the correlation of each column of the dataframe with one specific column. Here is a code example with its output:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
print("DATAFRAME")
print(df)
print("\nCORRELATIONS")
feature_cols = ['A','B','C', 'D']
corrs = df.loc[:, feature_cols].corrwith(df['D'])
print(corrs)

And the output looks like this:

DATAFRAME
   A  B  C  D
0  7  1  1  1
1  9  1  8  5
2  8  3  2  1
3  8  7  4  6
4  9  5  0  0
5  0  7  4  9
6  6  5  4  8
7  5  9  0  3
8  1  0  8  2
9  1  1  7  3

CORRELATIONS
A   -0.300373
B    0.410389
C    0.322430
D    1.000000
dtype: float64

I don’t know how to structure the tensors to get output like the one shown above.

Thanks in advance.

Just a small change to the above.

import torch

y = 10  # number of rows
x = 4  # columns A-D

# generate some data for illustration
matrix = torch.empty(y, x).random_(10)  # random integers 0-9, stored as floats
mat_dm = matrix - matrix.mean(dim=0)  # demean each column
vec = mat_dm[:, -1]  # last column (D), already demeaned

correlation = (mat_dm.T * vec).sum(dim=1) / ((mat_dm.T * mat_dm.T).sum(dim=1) * (vec * vec).sum()).sqrt()

print("TENSOR")
print(matrix)
print("CORRELATIONS")
print(correlation)

TENSOR
tensor([[3., 3., 2., 2.],
        [2., 7., 2., 9.],
        [5., 5., 4., 4.],
        [2., 0., 5., 3.],
        [7., 4., 2., 5.],
        [2., 9., 1., 6.],
        [8., 1., 2., 9.],
        [7., 9., 2., 4.],
        [9., 5., 4., 4.],
        [9., 9., 3., 7.]])
CORRELATIONS
tensor([ 0.1533,  0.2160, -0.4095,  1.0000])
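
For reuse, the same computation can be wrapped in a small function and run on the GPU when one is available. This is only a sketch under a few assumptions: the function name corr_with is made up for this example, the inputs are a float matrix of shape (rows, columns) and a vector with one entry per row, and no column is constant (a constant column would produce a division by zero). torch.corrcoef is used as a cross-check; it treats rows as variables, hence the transpose.

```python
import torch


def corr_with(matrix: torch.Tensor, vec: torch.Tensor) -> torch.Tensor:
    """Pearson correlation of every column of `matrix` with `vec`.

    matrix: shape (n_rows, n_cols); vec: shape (n_rows,).
    Hypothetical helper, named after the pandas method it mimics.
    """
    mat_dm = matrix - matrix.mean(dim=0)  # demean each column
    vec_dm = vec - vec.mean()             # demean the reference vector
    cov = mat_dm.T @ vec_dm               # unnormalized covariances, shape (n_cols,)
    norm = (mat_dm.pow(2).sum(dim=0) * vec_dm.pow(2).sum()).sqrt()
    return cov / norm


# use the GPU if there is one, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(0)
matrix = torch.randint(0, 10, (10, 4), device=device).float()

correlation = corr_with(matrix, matrix[:, -1])
reference = torch.corrcoef(matrix.T)[-1]  # last row: every column vs. the last column

print(correlation)
print(reference)
```

The two printed tensors should agree up to floating-point error, and the last entry should be 1 because the last column is correlated with itself.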

That’s exactly what I was looking for!

Thanks!