# Some questions about the Adam optimizer

I use full-batch gradient descent with the Adam optimizer and obtain the following result:

As seen in the picture, the direction of the first update is not along the direction of the gradient. But during the first update no momentum has been accumulated yet, so why is the direction of the first update not the same as the direction of the gradient?

Hi Youlong!

Looking at the original Adam paper linked to in pytorch’s
`Adam` documentation, I believe this is the way `Adam` is
supposed to work. (I have not looked at pytorch’s `Adam`
implementation.)

Quoting from the abstract of the Adam paper:

> The method … is invariant to diagonal rescaling of the gradients

Note that the direction of the gradient “washing out,” in some
sense, is portrayed as a desirable feature of `Adam`.

(Some suggestive terminology: The “m” in “Adam” refers to “moment”
rather than “momentum.” It is true that `Adam` accumulates moments,
but in its very first step it moves in a direction that scales like
`gradient / sqrt (gradient**2)`, which is to say, it does not
(necessarily) move in the direction of the gradient.)
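To see why, you can plug a single gradient into the update rule from the Adam paper. Here is a sketch in plain python, assuming the default hyperparameters (`beta1 = 0.9`, `beta2 = 0.999`, `eps = 1e-8`):

```python
import math

# A sketch of Adam's *first* update, using the paper's bias-corrected
# formulas. With no history, the bias-corrected moments reduce to
# m_hat = g and v_hat = g**2, so the step is
# lr * g / (sqrt (g**2) + eps) -- essentially lr * sign (g),
# regardless of the gradient's magnitude.
def adam_first_step (g, lr = 0.1, beta1 = 0.9, beta2 = 0.999, eps = 1e-8):
    m = (1 - beta1) * g          # first moment after one accumulation
    v = (1 - beta2) * g * g      # second moment after one accumulation
    m_hat = m / (1 - beta1)      # bias correction recovers g
    v_hat = v / (1 - beta2)      # bias correction recovers g**2
    return  lr * m_hat / (math.sqrt (v_hat) + eps)

# gradients of very different sizes produce (essentially) the same step
print ([adam_first_step (g) for g in [4.0, 1.0, 0.001]])
```

All three steps come out (up to `eps`) equal to `lr`, even though the gradients differ by a factor of four thousand.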

We can verify that this behavior is displayed by pytorch’s `Adam`:

```
>>> import torch
>>> torch.__version__
'1.9.0'
>>> def fs (t):
...     return  (t * t * torch.tensor ([1.0, 1.0])).sum()
...
>>> def fa (t):
...     return  (t * t * torch.tensor ([2.0, 0.5])).sum()
...
>>> tss = torch.tensor ([1.0, 1.0], requires_grad = True)
>>> tsa = torch.tensor ([1.0, 1.0], requires_grad = True)
>>> tas = torch.tensor ([1.0, 1.0], requires_grad = True)
>>> taa = torch.tensor ([1.0, 1.0], requires_grad = True)
>>> sgds = torch.optim.SGD ([tss], lr = 0.1)
>>> sgda = torch.optim.SGD ([tsa], lr = 0.1)
>>> adams = torch.optim.Adam ([tas], lr = 0.1)
>>> adama = torch.optim.Adam ([taa], lr = 0.1)
>>> fs (tss).backward()
>>> tss.grad
tensor([2., 2.])
>>> fa (tsa).backward()
>>> tsa.grad
tensor([4., 1.])
>>> fs (tas).backward()
>>> tas.grad
tensor([2., 2.])
>>> fa (taa).backward()
>>> taa.grad
tensor([4., 1.])
>>> sgds.step()
>>> tss
tensor([0.8000, 0.8000], requires_grad=True)
>>> sgda.step()
>>> tsa
tensor([0.6000, 0.9000], requires_grad=True)
>>> adams.step()
>>> tas
tensor([0.9000, 0.9000], requires_grad=True)
>>> adama.step()
>>> taa
tensor([0.9000, 0.9000], requires_grad=True)
```
Here, `fa (t)` is a function that has a larger gradient in the `t[0]`
direction than in the `t[1]` direction (while `fs (t)` is symmetrical).
But you can see that `Adam` takes a step directly toward the origin
even for `fa (t)`. (By way of comparison, the steps taken by `SGD` are in
the direction of the gradient, and hence differ for `fs (t)` and `fa (t)`.)
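For completeness, the same arithmetic can be checked in plain python, without pytorch. This is a sketch using the gradients of `fs` and `fa` at `[1.0, 1.0]` (namely `[2., 2.]` and `[4., 1.]`) and assuming `Adam`’s default `eps = 1e-8`:

```python
import math

lr, eps = 0.1, 1e-8
grad_fs = [2.0, 2.0]   # gradient of fs at [1.0, 1.0]
grad_fa = [4.0, 1.0]   # gradient of fa at [1.0, 1.0]

# SGD's step scales with the gradient, so fa's asymmetric gradient
# produces an asymmetric step.
sgd_fs = [1.0 - lr * g for g in grad_fs]   # ~ [0.8, 0.8]
sgd_fa = [1.0 - lr * g for g in grad_fa]   # ~ [0.6, 0.9]

# Adam's first step is lr * g / (sqrt (g**2) + eps) -- roughly
# lr * sign (g) -- so both gradients yield the same symmetric step.
adam_fs = [1.0 - lr * g / (math.sqrt (g * g) + eps) for g in grad_fs]   # ~ [0.9, 0.9]
adam_fa = [1.0 - lr * g / (math.sqrt (g * g) + eps) for g in grad_fa]   # ~ [0.9, 0.9]

print (sgd_fs, sgd_fa, adam_fs, adam_fa)
```

The hand-computed values match the tensors printed by the pytorch session.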