I am new to PyTorch, so I might have misunderstood, but the three approaches don't look equivalent to me. I ran a small test with the first two. Basically, each script fits an input of four 1s to a target of 7; the difference is that the first script updates the weights on every step, while the second updates them every 5 steps.

```
import torch
import torch.nn as nn

torch.manual_seed(1)

model = nn.Sequential(
    nn.Linear(4, 8, bias=False),
    nn.ReLU(),
    nn.Linear(8, 1, bias=False),
)

x = torch.ones(4)
y0 = torch.tensor([7.])  # shape [1], matching the model output, to avoid a broadcasting warning
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
optimizer.zero_grad()

for i in range(200):
    y = model(x)
    loss = loss_fn(y, y0)
    if loss.item() < 1e-5:
        print(f'after {i} steps')
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # update the weights on every step

print('y:', y)
for label, value in model.state_dict().items():
    print(label)
    print(value)
```

```
after 126 steps
y: tensor([6.9988], grad_fn=<...>)
0.weight
tensor([[ 0.8411,  0.3628,  0.4865,  0.8181],
        [-0.4707,  0.2999, -0.1029,  0.2544],
        [-0.0641, -0.1948,  0.0051, -0.1089],
        [ 0.1826, -0.1949, -0.0365, -0.0450],
        [ 0.6402,  0.5657,  1.0048,  0.7233],
        [-0.1862, -0.3020, -0.0838, -0.2157],
        [ 0.4406,  0.6248,  0.8989,  0.8726],
        [-0.6859,  0.1128, -0.0575,  0.2771]])
2.weight
tensor([[ 0.8684, -0.3221, -0.2251, -0.1705,  0.9041, -0.0589,  0.7641,  0.0078]])
```

```
import torch
import torch.nn as nn

torch.manual_seed(1)

model = nn.Sequential(
    nn.Linear(4, 8, bias=False),
    nn.ReLU(),
    nn.Linear(8, 1, bias=False),
)

x = torch.ones(4)
y0 = torch.tensor([7.])  # shape [1], matching the model output, to avoid a broadcasting warning
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
optimizer.zero_grad()

for i in range(5000):
    y = model(x)
    loss = loss_fn(y, y0)
    if loss.item() < 1e-5:
        print(f'after {i} steps')
        break
    loss.backward()  # gradients accumulate in .grad across steps
    if (i + 1) % 5 == 0:
        optimizer.step()       # update the weights every 5 steps
        optimizer.zero_grad()  # then reset the accumulated gradients

print('y:', y)
for label, value in model.state_dict().items():
    print(label)
    print(value)
```

```
after 630 steps
y: tensor([6.9988], grad_fn=<...>)
0.weight
tensor([[ 0.8411,  0.3628,  0.4865,  0.8181],
        [-0.4707,  0.2999, -0.1029,  0.2544],
        [-0.0641, -0.1948,  0.0051, -0.1089],
        [ 0.1826, -0.1949, -0.0365, -0.0450],
        [ 0.6402,  0.5657,  1.0048,  0.7233],
        [-0.1862, -0.3020, -0.0838, -0.2157],
        [ 0.4406,  0.6248,  0.8989,  0.8726],
        [-0.6859,  0.1128, -0.0575,  0.2771]])
2.weight
tensor([[ 0.8684, -0.3221, -0.2251, -0.1705,  0.9041, -0.0589,  0.7641,  0.0078]])
```

As you can see, the first program converges five times faster than the second one (126 steps versus 630), and both end up with the same weights. The result is similar if I change the input to random numbers. It seems to me that with the second approach the weights are updated using a gradient accumulated (effectively averaged) over five steps, so the first approach is more efficient.
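A minimal sketch of what seems to be going on, using a toy two-weight problem rather than the model above (the helper `train` and its parameters are hypothetical, introduced only for illustration). Each `backward()` call adds into `.grad`, so five passes over the same input give five times the single-pass gradient. Adam divides a running mean of the gradient by the square root of a running mean of its square, so scaling every gradient by a constant leaves the update nearly unchanged (up to its `eps` term). That would explain identical weights after five times as many forward/backward passes.

```python
import torch


def train(accum_steps, num_updates):
    """Hypothetical helper: fit a 2-weight linear model to the target 7."""
    w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
    x = torch.tensor([3.0, 4.0])
    optimizer = torch.optim.Adam([w], lr=0.01)
    for _ in range(num_updates):
        optimizer.zero_grad()
        for _ in range(accum_steps):
            loss = ((w * x).sum() - 7.0) ** 2
            loss.backward()  # adds this pass's gradient into w.grad
        optimizer.step()
    return w.detach().clone()


# 1) backward() sums into .grad: five identical passes give 5x the gradient.
w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])
for _ in range(5):
    (w * x).sum().backward()
accumulated = w.grad.clone()
print(accumulated, 5 * x)

# 2) Same number of optimizer updates, with and without accumulation.
every_step = train(accum_steps=1, num_updates=50)
every_fifth = train(accum_steps=5, num_updates=50)
print(every_step, every_fifth)
```

On identical inputs the accumulated passes add no new information, so the second script mostly spends four extra forward/backward passes per update. With different mini-batches per pass, accumulation would instead emulate a larger batch.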

One more question: why does the third approach take more memory? Say,

…
loss_sum += loss

import sys
sys.getsizeof(loss)
sys.getsizeof(loss_sum)

I get the same result for both (72 in my case). So where is the associated history stored? Is there a property or function to access it? Thanks.
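A rough sketch of where the history can be inspected, assuming a toy loss rather than the original training loop. As far as I can tell, `sys.getsizeof` only reports the size of the Python wrapper object, not the tensor's storage or its autograd graph. The graph is reachable through `grad_fn` and `grad_fn.next_functions`. The helper `count_nodes` is hypothetical and simply walks those links.

```python
import sys
import torch


def count_nodes(fn, seen=None):
    """Hypothetical helper: count autograd nodes reachable from grad_fn."""
    seen = set() if seen is None else seen
    if fn is None or fn in seen:
        return 0
    seen.add(fn)
    return 1 + sum(count_nodes(nxt, seen) for nxt, _ in fn.next_functions)


w = torch.ones(3, requires_grad=True)

loss_sum = 0      # accumulates tensors, so it keeps every step's graph alive
loss_total = 0.0  # accumulates plain floats, so no graph is kept
graph_sizes = []
for step in range(3):
    loss = (w * 2.0).sum()
    loss_sum = loss_sum + loss
    loss_total += loss.item()
    graph_sizes.append(count_nodes(loss_sum.grad_fn))

# The Python wrapper sizes say nothing about the graph behind them.
print(sys.getsizeof(loss), sys.getsizeof(loss_sum))

# The history hangs off grad_fn and grows with every accumulated step.
print(loss_sum.grad_fn)
print(loss_sum.grad_fn.next_functions)
print(graph_sizes)

# Bytes of actual tensor data: element size times number of elements.
data_bytes = loss.element_size() * loss.nelement()
print(data_bytes)
```

If only the numeric value is needed for logging, accumulating `loss.item()` (or `loss.detach()`) avoids holding on to the graph.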