But in the training step (loss.backward()) I got this error:
grad can be implicitly created only for scalar outputs
That means the output of the loss function cannot be a vector!
Is there a solution to this problem?
I want the loss to be a vector.

Well, the issue here is that in order to take a gradient, the quantity being differentiated must be a single (scalar) value. How should the vector loss be interpreted?

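@saeed_i You can pass reduction='none' (the string, not Python's None) to MSELoss to get an unreduced, per-element loss, which seems to be what you want. But I agree with @eqy: in order to take a gradient, the quantity being differentiated must be a single value, so you still have to reduce that vector before calling backward(). When calculating the gradient, you must consider the error over the complete dataset (or batch).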
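To make that concrete, here is a minimal sketch of the reduction='none' route. The toy model, batch size, and random data are made up purely for illustration; the point is only that the per-element loss is a vector you can inspect, and that it still has to be reduced (here with .mean()) before backward().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)        # hypothetical toy model
inputs = torch.randn(4, 3)     # made-up batch of 4 samples
targets = torch.randn(4, 1)    # made-up targets

# Unreduced loss: one value per output element.
per_elem = nn.MSELoss(reduction='none')(model(inputs), targets)
print(per_elem.shape)

# Calling backward() directly on the vector reproduces the error.
raised = False
try:
    per_elem.backward()
except RuntimeError as err:
    raised = True
    print("backward() on the vector failed:", err)

# Reduce to a scalar first, then backpropagate.
loss = nn.MSELoss(reduction='none')(model(inputs), targets).mean()
loss.backward()
grad_manual = model.weight.grad.clone()

# Compare with the default reduction ('mean').
model.zero_grad()
nn.MSELoss()(model(inputs), targets).backward()
grad_default = model.weight.grad.clone()
print(torch.allclose(grad_manual, grad_default))
```

So reduction='none' followed by .mean() gives the same gradients as the default MSELoss; the unreduced form is mainly useful if you want to weight or mask individual elements before reducing.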

When you use a loss function to train a model, the loss function is
telling you which set of model parameters is “better” than other sets
of model parameters.

Let’s say you have a model, and when it has weight_A as its
parameters it produces loss_vector_A = [1.1, 4.4, 2.2].
Let’s also say that when the same model has weight_B as its
parameters it produces loss_vector_B = [2.2, 1.1, 3.3].

Is the model a better model with weight_A or weight_B? If the loss
function produced just a scalar (instead of a vector), we would just say
that the smaller scalar value corresponds to the better model. (That’s
really what “loss function” means.)

(If you say just add up the elements of your loss vectors to see
which model is better, then you would really be saying that
loss_vector_A.sum() and loss_vector_B.sum() should be your
scalar loss-function values.)
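As a quick illustration of that last point, using the two example loss vectors from above (the choice of .sum() as the reduction is just the one mentioned in the parenthetical, not the only possible one):

```python
import torch

loss_vector_A = torch.tensor([1.1, 4.4, 2.2])
loss_vector_B = torch.tensor([2.2, 1.1, 3.3])

# Reducing each vector to a scalar is what makes the two comparable.
scalar_A = loss_vector_A.sum()   # roughly 7.7
scalar_B = loss_vector_B.sum()   # roughly 6.6

better = "weight_A" if scalar_A < scalar_B else "weight_B"
print(better)
```

Under the sum, weight_B comes out as the better set of parameters, but that ranking only exists because a scalar reduction was chosen.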

Yes, I know these things.
I am confused about the mathematical calculation.
Assume we have:
x = torch.tensor([2., 3., 5.], requires_grad=True)
y_1 = x.pow(2)
y_2 = x.pow(2).sum()

You are making the hidden assumption that y_1[0] depends only
on x[0], y_1[1] only on x[1], and y_1[2] only on x[2]. This
happens to be true in your particular example of y_1 = x.pow(2).

How would you change your reasoning for the case where y_1 = x * x.roll(1)?

Please look at the concept of the Jacobian matrix and how it is
the generalization of the gradient to a vector-valued function.
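Here is a small sketch of the difference between the two cases, using torch.autograd.functional.jacobian on the same x as in the question. For x.pow(2) the Jacobian is diagonal (each output depends on only one input), while for x * x.roll(1) it has off-diagonal entries.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.tensor([2., 3., 5.])

# Element-wise square: y[i] depends only on x[i].
jac_pow = jacobian(lambda t: t.pow(2), x)

# Product with the rolled vector: y[i] = x[i] * x[i - 1] (cyclically),
# so each output depends on two different inputs.
jac_roll = jacobian(lambda t: t * t.roll(1), x)

print(jac_pow)    # diagonal, entries 2 * x[i]
print(jac_roll)   # not diagonal
```

With x = [2., 3., 5.], the first Jacobian is diag(4, 6, 10) and the second is [[5, 0, 2], [3, 2, 0], [0, 5, 3]], so there is no single "gradient vector" for y_1 unless you say how its elements are combined.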

(And to avoid any misconception, let me reiterate what I said in
my previous post: In your current example, y_1 is a vector, rather
than a scalar, so it cannot be used as a loss function.)

I think the backward() documentation explains it pretty well: you can call backward() on a vector if you specify the gradient argument (grad_tensors in torch.autograd.backward), which is the gradient of some vector-to-scalar function with respect to that vector. If this argument consists of ones, that’s the same as .sum().backward(); otherwise it’s a weighted sum.
As parameter.grad is for scalar-valued functions, a reduction to a scalar is present one way or the other.
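A minimal sketch of that equivalence, reusing x from the question and the x * x.roll(1) example from earlier in the thread. The weight vector w is an arbitrary choice for illustration.

```python
import torch

x = torch.tensor([2., 3., 5.], requires_grad=True)

# (a) gradient of ones is the same as .sum().backward()
y = x * x.roll(1)
y.backward(gradient=torch.ones_like(y))
grad_ones = x.grad.clone()

x.grad = None
(x * x.roll(1)).sum().backward()
grad_sum = x.grad.clone()

# (b) any other gradient vector acts as weights in a weighted sum
w = torch.tensor([1.0, 0.0, 0.5])   # hypothetical weights
x.grad = None
(x * x.roll(1)).backward(gradient=w)
grad_weighted = x.grad.clone()

x.grad = None
(w * (x * x.roll(1))).sum().backward()
grad_weighted_sum = x.grad.clone()

print(torch.allclose(grad_ones, grad_sum))
print(torch.allclose(grad_weighted, grad_weighted_sum))
```

In both cases x.grad ends up being the vector passed as gradient multiplied by the Jacobian (a vector-Jacobian product), which is exactly the gradient of the corresponding scalar (weighted) sum.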