I am trying to get the gradient of a vector (of length m, with batch size N) with respect to another vector (of length m, with batch size N). Hence, I am expecting the gradient tensor to be Nxmxm.

For example,

import torch
from torch.autograd import grad

x = torch.rand(5, 3)
x.requires_grad_(True)
x.retain_grad()
u = x * x
dudx = grad(u, x, torch.ones(x.size()[0], 3, 3), create_graph=True, retain_graph=True)

What is the mistake I am making? Many thanks in advance

First, you seem to assume that N is a batch dimension that should be treated specially, but autograd does not know about that: it computes the gradient treating N like any other dimension.

Also, autograd.grad() computes a vector-Jacobian product when grad_outputs is provided, so grad_outputs must have the same shape as the output u (here 5x3, not 5x3x3, which is why your call fails), and the result it returns has the same shape as the input x.
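For instance, a minimal sketch of the corrected call, assuming the same u = x*x setup from the question: passing torch.ones_like(u) as grad_outputs returns a tensor with x's shape, which for this elementwise function equals du/dx = 2x.

```python
import torch
from torch.autograd import grad

x = torch.rand(5, 3, requires_grad=True)
u = x * x

# grad_outputs must match u's shape (5, 3), not (5, 3, 3)
(vjp,) = grad(u, x, grad_outputs=torch.ones_like(u),
              create_graph=True, retain_graph=True)

print(vjp.shape)  # torch.Size([5, 3])
# For u = x*x, the vector-Jacobian product with ones is just 2*x
print(torch.allclose(vjp, 2 * x))  # True
```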
You can use autograd.functional.jacobian() to get the full Jacobian, but as mentioned above, you will get an NxmxNxm tensor since N is treated like any other dimension.
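A short sketch of what that looks like for the example above: because an elementwise function has no cross-sample dependencies, the off-diagonal NxN blocks are zero, and the per-sample mxm Jacobians sit on the "batch diagonal" J[i, :, i, :], which you can extract with advanced indexing.

```python
import torch
from torch.autograd.functional import jacobian

N, m = 5, 3
x = torch.rand(N, m)

# Full Jacobian: N is treated like any other dimension
J = jacobian(lambda x: x * x, x)
print(J.shape)  # torch.Size([5, 3, 5, 3])

# Extract the per-sample blocks J[i, :, i, :] -> shape (N, m, m)
idx = torch.arange(N)
J_batched = J[idx, :, idx, :]
print(J_batched.shape)  # torch.Size([5, 3, 3])

# Each per-sample Jacobian of u = x*x is diag(2 * x[i])
print(torch.allclose(J_batched, torch.diag_embed(2 * x)))  # True
```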

You can wrap jacobian() in a for loop over the N dimension if you want to handle the batch samples one at a time.
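That loop can be sketched as follows, again assuming the elementwise u = x*x from the question: each call produces one mxm Jacobian, and stacking them gives the Nxmxm tensor you were after.

```python
import torch
from torch.autograd.functional import jacobian

N, m = 5, 3
x = torch.rand(N, m)

def f(xi):
    # per-sample function; xi has shape (m,)
    return xi * xi

# One m x m Jacobian per sample, stacked into N x m x m
per_sample = torch.stack([jacobian(f, x[i]) for i in range(N)])
print(per_sample.shape)  # torch.Size([5, 3, 3])
```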