Implementation of Gram Matrix in Neural Style Tutorial

Hi all,
I want to implement Gram Matrix of a tensor with shape
(batch_size, channel_size, patch_size, height, width)

In numpy we can calculate it simply as A @ A.T.
My question is: why, in the neural style transfer tutorial, do we first reshape our tensor into a 2-D matrix and then use the .mm function to multiply it by its transpose?

def gram_matrix(input):
    a, b, c, d = input.size()  # a=batch size(=1)
    # b=number of feature maps
    # (c,d)=dimensions of a f. map (N=c*d)

    features = input.view(a * b, c * d)  # resize F_XL into \hat F_XL

    G = torch.mm(features, features.t())  # compute the gram product

    # we 'normalize' the values of the gram matrix
    # by dividing by the number of elements in each feature map
    return G.div(a * b * c * d)
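For example, a quick sanity check of the output shape and symmetry (the function is repeated here, with the tutorial's `torch.mm` call, so the snippet runs on its own):

```python
import torch

def gram_matrix(input):
    a, b, c, d = input.size()
    # flatten to 2-D: one row per feature map, one column per spatial position
    features = input.view(a * b, c * d)
    G = torch.mm(features, features.t())
    return G.div(a * b * c * d)

x = torch.randn(1, 3, 4, 5)  # (batch, channels, height, width)
G = gram_matrix(x)
print(G.shape)  # torch.Size([3, 3]) for batch size 1: one entry per pair of feature maps
```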

Is there any optimization consideration or anything else?


The tutorial explains it as:

F_XL is reshaped to form F̂_XL, a K×N matrix, where K is the number of feature maps at layer L and N is the length of any vectorized feature map F^k_XL.

It doesn’t seem to work, and I’m not sure what this operation would calculate in your case:

a, b, c, d = 2, 3, 4, 5
x = torch.randn(a, b, c, d)
x_np = x.numpy()
g_np = x_np @ x_np.T
> ValueError: shapes (2,3,4,5) and (5,4,3,2) not aligned: 5 (dim 3) != 3 (dim 2)
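Flattening to 2-D first makes the definition apply; a sketch of the same computation in numpy:

```python
import numpy as np

a, b, c, d = 2, 3, 4, 5
x_np = np.random.randn(a, b, c, d)

# reshape to 2-D first, (a*b, c*d); only then does A @ A.T make sense
F = x_np.reshape(a * b, c * d)
G = F @ F.T
print(G.shape)  # (6, 6)
```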

Actually, it seems I asked my question ambiguously.
By A @ A.T I meant the definition of the Gram matrix, i.e. the multiplication of a matrix by its transpose, not numpy code. My question was why we have to reshape the matrices. Now I understand that the only way to multiply two 4-D tensors in this sense is to reshape them into 2-D matrices first and then multiply them (the .t() function obviously only works on 2-D matrices).

Based on the tutorial, batch_size and channel_size are treated as the number of feature maps, and the height and width of the images as the length of each vectorized feature map.

Now, in my problem I have a 5-D tensor with an extra dimension called patch_size. I treated it as part of the number of feature maps, which only changes the size of the output matrix, not its values, so I think it will not cause any problem in computing the loss.
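A sketch of that 5-D variant, folding batch, channel, and patch dimensions together as the "feature map" axis (the dimension names here are illustrative, not from the tutorial):

```python
import torch

# hypothetical 5-D activation: (batch, channels, patches, height, width)
a, b, p, c, d = 1, 3, 2, 4, 5
x = torch.randn(a, b, p, c, d)

# batch*channels*patches rows, each row one vectorized (height*width) map
features = x.view(a * b * p, c * d)
G = torch.mm(features, features.t()) / (a * b * p * c * d)
print(G.shape)  # torch.Size([6, 6]): the Gram matrix grows with patch_size
```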

Thanks for the help.