Derivative of covariance matrix

Suppose I have a bivariate covariance function c(x,y) and inputs [x1,...,xn] and [y1,...,ym], and I would like to compute the covariance matrix whose (i,j)th element is d/dx c(xi,yj). Here I have assumed that each xi and yj is scalar valued, but this could easily be extended to finding the kth partial derivative of the covariance matrix. In general, I would like this to extend to higher derivatives too (e.g. I would like to compute the matrix with (i,j)th element d/dx d/dy c(xi,yj)), but this may be an easy generalisation of the first issue.
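To make this concrete, here is the brute-force version of what I am after, with a made-up squared-exponential kernel (c_example below) standing in for c, and one autograd call per matrix entry:

import torch
from torch.autograd import grad

def c_example(x, y):
    # illustrative stand-in kernel; analytically d/dx c_example(x, y) = -(x - y) * c_example(x, y)
    return torch.exp(-0.5 * (x - y) ** 2)

x_in = torch.tensor([1., 2., 3.])
y_in = torch.tensor([4., 5.])

# brute force: one scalar autograd.grad call per (i, j) entry of the n x m matrix
dx_cov = torch.zeros(len(x_in), len(y_in))
for i in range(len(x_in)):
    for j in range(len(y_in)):
        xi = x_in[i].clone().requires_grad_(True)
        dx_cov[i, j] = grad(c_example(xi, y_in[j]), xi)[0]

This works, but I would like a vectorised way of building dx_cov directly from the broadcasted covariance matrix.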

I feel that PyTorch's automatic differentiation system should be able to handle this, but I haven't been able to come up with a nice solution. I think the issue has to do with broadcasting derivatives. For instance, to compute the non-differentiated covariance matrix I would do something like c(x.unsqueeze(1), y), but the following code doesn't do what I want (using autograd.grad):

import torch
from torch.autograd import grad

X = torch.Tensor([1, 2, 3]).requires_grad_(True).unsqueeze(1)
Y = torch.Tensor([4, 5]).requires_grad_(True)
cov_matrix = c(X, Y)
dx_cov_matrix = grad(cov_matrix, [X], grad_outputs=[torch.ones(X.size(0), Y.size(0))], create_graph=True)

Any help would be appreciated! Thanks

Here is an implementation of the covariance matrix in PyTorch, adapted from the covariance-and-gradient-support thread:

import torch

def cov(m, rowvar=False, inplace=False):
    '''Estimate a covariance matrix given data.

    Thanks :
    - https://discuss.pytorch.org/t/covariance-and-gradient-support/16217/2
    - https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L2276-L2494

    Covariance indicates the level to which two variables vary together.
    If we examine N-dimensional samples, `X = [x_1, x_2, ... x_N]^T`,
    then the covariance matrix element `C_{ij}` is the covariance of
    `x_i` and `x_j`. The element `C_{ii}` is the variance of `x_i`.

    Args:
        m: A 1-D or 2-D array containing multiple variables and observations.
            Each row of `m` represents a variable, and each column a single
            observation of all those variables.
        rowvar: If `rowvar` is True, then each row represents a
            variable, with observations in the columns. Otherwise, the
            relationship is transposed: each column represents a variable,
            while the rows contain observations.
        inplace: If True, centre `m` in place instead of allocating a copy.

    Returns:
        The covariance matrix of the variables.
    '''
    if m.dim() > 2:
        raise ValueError('m has more than 2 dimensions')
    if m.dim() < 2:
        m = m.view(1, -1)
    if not rowvar and m.size(0) != 1:
        m = m.t()
    m = m.type(torch.double)
    fact = 1.0 / (m.size(1) - 1)
    if inplace:
        m -= torch.mean(m, dim=1, keepdim=True)
    else:
        m = m - torch.mean(m, dim=1, keepdim=True)
    mt = m.t()  # if complex: mt = m.t().conj()
    return fact * m.matmul(mt).squeeze()

You can also calculate the gradient as follows.

def gradient(y, x, grad_outputs=None):
    """Compute dy/dx @ grad_outputs"""
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    grad = torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]
    return grad
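Note that with the default grad_outputs of all ones this is a vector-Jacobian product, so gradient(y, x) is really the gradient of y.sum() with respect to x. A tiny check with a toy function (just a sketch, t and u are throwaway names):

t = torch.tensor([1., 2., 3.], requires_grad=True)
u = t ** 2                    # element-wise square, so du_i/dt_j is diagonal
print(gradient(u, t))         # values [2., 4., 6.], i.e. d(u.sum())/dt
print(gradient(u.sum(), t))   # same values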

Now let’s take an example.

x, y = [90, 60, 90], [90, 90, 30]
A = torch.tensor([x, y], dtype = torch.double, requires_grad=True)
c = cov(m = A, rowvar = True)

tensor([[ 300., -300.], [-300., 1200.]], dtype=torch.float64, grad_fn=<MulBackward0>)
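As a quick sanity check (assuming NumPy is installed), this matches numpy.cov on the same data:

import numpy as np

print(np.cov(A.detach().numpy(), rowvar=True))
# [[ 300. -300.]
#  [-300. 1200.]]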

You can now calculate the gradient of c w.r.t. A like this.

gradient(y = c, x = A)

tensor([[ 30., 0., -30.], [ 30., 0., -30.]], dtype=torch.float64, grad_fn=<AddBackward0>)

But I guess this is not the behaviour you want, because torch.autograd.grad computes a vector-Jacobian product: with grad_outputs full of ones it sums the contributions of every element of c, rather than returning one gradient per element.
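One way around this, before looping over entries as below, is to pass a one-hot grad_outputs so that only a single element of c enters the vector-Jacobian product (a sketch reusing the gradient helper above):

one_hot = torch.zeros_like(c)
one_hot[0, 0] = 1.0
print(gradient(c, A, grad_outputs=one_hot))  # equals the gradient of c[0][0] w.r.t. A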

So you can calculate dc(xi, yj)/dx or dc(xi, yj)/dy entry by entry, like this:

grad_cij = gradient(y = c[i][j], x = A)

where grad_cij[0] is dc(xi, yj)/dx and grad_cij[1] is dc(xi, yj)/dy.

For example, with i = 0 and j = 0:

i = 0
j = 0
dcij_dA = gradient(y = c[i][j], x = A)

tensor([[ 10., -20., 10.], [ 0., 0., 0.]], dtype=torch.float64, grad_fn=<AddBackward0>)

Since c[0][0] is independent of y, dcij_dA[1] is all zeros. Similarly, you can check that the derivative of c[1][1] with respect to x is also zero.

i = 1
j = 1
dcij_dA = gradient(y = c[i][j], x = A)

tensor([[ 0., 0., 0.], [ 20., 20., -40.]], dtype=torch.float64, grad_fn=<AddBackward0>)

You can also notice that c[1][0] = c[0][1], so both entries share the same gradient, with non-zero derivatives with respect to both x and y.

i = 0
j = 1
dcij_dA = gradient(y = c[i][j], x = A)

tensor([[ 10., 10., -20.], [ 5., -10., 5.]], dtype=torch.float64, grad_fn=<AddBackward0>)

i = 1
j = 0
dcij_dA = gradient(y = c[i][j], x = A)
dcij_dA

tensor([[ 10., 10., -20.], [ 5., -10., 5.]], dtype=torch.float64, grad_fn=<AddBackward0>)
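Indeed, the two gradients agree exactly (a quick check):

print(torch.allclose(gradient(c[0][1], A), gradient(c[1][0], A)))  # True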

You can also calculate everything at once, like this:

n, m = c.shape
dc_dA = torch.stack([gradient(y = c[i][j], x = A) for i in range(n) for j in range(m)])
dc_dA = dc_dA.view(n, m, *A.shape)  # dc_dA[i, j] is dc[i][j]/dA
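If you are on a reasonably recent PyTorch (1.5+), torch.autograd.functional.jacobian builds the same object in a single call: the result has shape (2, 2, 2, 3) and its [i, j] slice is dc[i][j]/dA (a sketch reusing the cov function above; dc_dA_full is just an illustrative name):

from torch.autograd.functional import jacobian

dc_dA_full = jacobian(lambda A_: cov(A_, rowvar=True), A)
print(dc_dA_full.shape)    # torch.Size([2, 2, 2, 3])
print(dc_dA_full[0, 0])    # dc[0][0]/dA, matching the element-wise results above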