Extended a simple autograd example, but the manual gradient check fails with a batch

I have extended the book's sample showing how autograd works, as follows.
The original example used a single test sample with a single feature, so I extended it to 5 features and 3 samples.
The example is a very simple hand-rolled training loop: the forward pass (z = w*x + b) is coded manually, the backward pass is also computed manually, and the manual gradients are compared against what loss.backward() produces, as a demonstration of how the backward() pass really works. So far so good.

Now, instead of one sample, I push two samples through the network at once. The forward multiplication works, but the comparison of loss.backward() against my manual computation fails. What did I do wrong here?

In the third example below, I take the same two samples but still iterate over them one at a time inside the loop; there the comparison works, but I am not sure that is really how training is done. I would rather get the second example to work, where the forward pass sweeps through both samples simultaneously, except that then the backward-pass comparison fails. My current assumption about what should hold is sketched right below.
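
My working assumption (this is exactly what I would like confirmed or corrected) is that for a batch, loss.backward() accumulates the per-sample gradients into a single (FEATURE_SIZE,) vector, so my manual check would have to sum the per-row expressions over the batch dimension. A minimal standalone sketch of the check I think should hold (CPU, shapes made up to match my example):

import torch

FEATURE_SIZE = 5
w = torch.rand(FEATURE_SIZE, requires_grad=True)
b = torch.rand(FEATURE_SIZE, requires_grad=True)
x = torch.rand(2, FEATURE_SIZE)   # batch of 2 samples
y = torch.rand(FEATURE_SIZE)      # same expected output for both rows

z = w * x + b                     # w and b broadcast over the batch dimension
loss = (y - z).pow(2).sum()
loss.backward()

# per-sample manual gradients, shape (2, FEATURE_SIZE)
per_sample_dw = 2 * x * ((w * x + b) - y)
per_sample_db = 2 * ((w * x + b) - y)

# assumption: summing over the batch dimension reproduces w.grad / b.grad
print(torch.allclose(w.grad, per_sample_dw.sum(dim=0)))
print(torch.allclose(b.grad, per_sample_db.sum(dim=0)))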

EXAMPLE 1: one sample per loop

cat nn-manual-1.py 
import torch
import code
cuda = torch.device('cuda')

# Create weight and bias values.

CONFIG_ENABLE_TEST=1
FEATURE_SIZE=5
SAMPLE_SIZE=3
w=torch.rand(FEATURE_SIZE, requires_grad=True, device='cuda')
b=torch.rand(FEATURE_SIZE, requires_grad=True, device='cuda')

torch.manual_seed(1)

# Create input (x) and expected output (y).
# input (x) is used for the forward pass: z = w*x + b; z is the computed y, as opposed to the expected y; diff = (z-y)

x_data=torch.rand([SAMPLE_SIZE, FEATURE_SIZE], device='cuda')
y_data=torch.rand(FEATURE_SIZE, device='cuda')
x=x_data[0:1]
y=y_data

for i in range(0, 2):
    print("-------- ", i, " ---------")
    z=torch.add(torch.mul(w, x), b)
    loss = (y-z).pow(2).sum()

    loss.backward()
    print("loss: ", loss)
    print("w: ", w, type(w))
    print("b: ", b, type(b))
    print('dL/dw : ', w.grad, type(w.grad))
    print('dL/db : ', b.grad, type(b.grad))

    # verifying output of loss.backward...

    print("verifying output of loss.backward...(compare with DL/DW)")

    # test1 = dL/dw = dL/dz * dz/dw
    # 1. dL/dz = d/dz (y-z)**2 = d/dz (y**2 - 2yz + z**2) = -2y + 2z = 2(z-y)
    # 2. dz/dw = d/dw (w*x + b) = x
    # 3. dL/dw = dL/dz * dz/dw = 2(z-y)*x = 2x((w*x+b) - y)

    # test2 = dL/db = dL/dz * dz/db
    # 1a. same as 1.
    # 2a. dz/db = d/db (w*x + b) = 1
    # 3a. dL/db = dL/dz * dz/db = 2(z-y) * 1 = 2((w*x+b) - y)

    test1=2 * x * ((w*x+b)-y)
    print("dL/dw    : ", w.grad)
    print("t1       : ", test1)

    test2=2 * ((w*x + b) - y)
    print("dL/db    : ", b.grad)
    print("t2       : ", test2)

    # update weights

    w1 = w + w.grad
    b1 = b + b.grad
    w=w1.detach()
    w.requires_grad=True
    b=b1.detach()
    b.requires_grad=True

    print("new updated w1/b1: ")
    print("w: ", w, type(w))
    print("b: ", b, type(b))

[root@ixt-hq-180 nn-manual]# python3 nn-manual-1.py 
--------  0  ---------
loss:  tensor(0.8684, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([0.2686, 0.3329, 0.0467, 0.8056, 0.8514], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([0.2696, 0.3146, 0.6340, 0.8193, 0.2254], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([-0.2908,  0.0051, -0.0749,  0.9804, -0.0844], device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([-0.3266,  0.1840, -0.0829,  1.8201, -0.1154], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
dL/dw    :  tensor([-0.2908,  0.0051, -0.0749,  0.9804, -0.0844], device='cuda:0')
t1       :  tensor([[-0.2908,  0.0051, -0.0749,  0.9804, -0.0844]], device='cuda:0',
       grad_fn=<MulBackward0>)
dL/db    :  tensor([-0.3266,  0.1840, -0.0829,  1.8201, -0.1154], device='cuda:0')
t2       :  tensor([[-0.3266,  0.1840, -0.0829,  1.8201, -0.1154]], device='cuda:0',
       grad_fn=<MulBackward0>)
new updated w1/b1: 
w:  tensor([-0.0222,  0.3380, -0.0282,  1.7859,  0.7670], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([-0.0570,  0.4986,  0.5511,  2.6394,  0.1100], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
--------  1  ---------
loss:  tensor(11.3448, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([-0.0222,  0.3380, -0.0282,  1.7859,  0.7670], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([-0.0570,  0.4986,  0.5511,  2.6394,  0.1100], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([-1.3333,  0.0152, -0.3468,  3.5099, -0.3434], device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([-1.4976,  0.5523, -0.3840,  6.5164, -0.4697], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
dL/dw    :  tensor([-1.3333,  0.0152, -0.3468,  3.5099, -0.3434], device='cuda:0')
t1       :  tensor([[-1.3333,  0.0152, -0.3468,  3.5099, -0.3434]], device='cuda:0',
       grad_fn=<MulBackward0>)
dL/db    :  tensor([-1.4976,  0.5523, -0.3840,  6.5164, -0.4697], device='cuda:0')
t2       :  tensor([[-1.4976,  0.5523, -0.3840,  6.5164, -0.4697]], device='cuda:0',
       grad_fn=<MulBackward0>)
new updated w1/b1: 
w:  tensor([-1.3555,  0.3531, -0.3749,  5.2959,  0.4235], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([-1.5547,  1.0509,  0.1671,  9.1558, -0.3597], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>

EXAMPLE 2: two samples per loop (the only change from example 1 is the line below)

delta:
x=x_data[0:2]

output (mismatch; the relevant lines are also marked in the full run below):

dL/dw    :  tensor([-1.3059,  0.1650, -0.0976,  1.7289, -0.1745], device='cuda:0')
t1       :  tensor([[-0.6494, -0.0013, -0.0579,  0.7400, -0.1629],
        [-0.6565,  0.1663, -0.0397,  0.9889, -0.0115]], device='cuda:0',
       grad_fn=<MulBackward0>)
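
(Shape-wise, what I believe happens: x is (2, 5) and w, b are (5,), so w*x+b broadcasts and z comes out (2, 5); my test1/test2 are therefore also (2, 5), while w.grad and b.grad stay (5,). A quick check I could drop inside the loop to confirm this, purely for illustration:)

    print(x.shape, z.shape, w.grad.shape, test1.shape)
    # expecting: torch.Size([2, 5]) torch.Size([2, 5]) torch.Size([5]) torch.Size([2, 5])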
cat nn-manual-1.py ; python3 nn-manual-1.py 
import torch
import code
cuda = torch.device('cuda')

# Create weight and bias values.

CONFIG_ENABLE_TEST=1
FEATURE_SIZE=5
SAMPLE_SIZE=3
w=torch.rand(FEATURE_SIZE, requires_grad=True, device='cuda')
b=torch.rand(FEATURE_SIZE, requires_grad=True, device='cuda')

torch.manual_seed(1)

# Create input (x) and expected output (y).
# input (x) is used for the forward pass: z = w*x + b; z is the computed y, as opposed to the expected y; diff = (z-y)

x_data=torch.rand([SAMPLE_SIZE, FEATURE_SIZE], device='cuda')
y_data=torch.rand(FEATURE_SIZE, device='cuda')
x=x_data[0:2]
y=y_data

for i in range(0, 2):
    print("-------- ", i, " ---------")
    z=torch.add(torch.mul(w, x), b)
    loss = (y-z).pow(2).sum()

    loss.backward()
    print("loss: ", loss)
    print("w: ", w, type(w))
    print("b: ", b, type(b))
    print('dL/dw : ', w.grad, type(w.grad))
    print('dL/db : ', b.grad, type(b.grad))

    # verifying output of loss.backward...

    print("verifying output of loss.backward...(compare with DL/DW)")


    # test1 = dL/dw = dL/dz * dz/dw
    # 1. dL/dz = d/dz (y-z)**2 = d/dz (y**2 - 2yz + z**2) = -2y + 2z = 2(z-y)
    # 2. dz/dw = d/dw (w*x + b) = x
    # 3. dL/dw = dL/dz * dz/dw = 2(z-y)*x = 2x((w*x+b) - y)

    # test2 = dL/db = dL/dz * dz/db
    # 1a. same as 1.
    # 2a. dz/db = d/db (w*x + b) = 1
    # 3a. dL/db = dL/dz * dz/db = 2(z-y) * 1 = 2((w*x+b) - y)
    
    test1=2 * x * ((w*x+b)-y)
    print("dL/dw    : ", w.grad)
    print("t1       : ", test1)

    test2=2 * ((w*x + b) - y)
    print("dL/db    : ", b.grad)
    print("t2       : ", test2)

    # update weights

    w1 = w + w.grad
    b1 = b + b.grad
    w=w1.detach()
    w.requires_grad=True
    b=b1.detach()
    b.requires_grad=True

    print("new updated w1/b1: ")
    print("w: ", w, type(w))
    print("b: ", b, type(b))

--------  0  ---------
loss:  tensor(1.8291, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([0.1335, 0.8721, 0.6207, 0.9881, 0.2463], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([0.1884, 0.1846, 0.1250, 0.4980, 0.6141], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([-1.3059,  0.1650, -0.0976,  1.7289, -0.1745], device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([-1.4552,  0.4470, -1.2063,  2.9352, -0.7959], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
**dL/dw    :  tensor([-1.3059,  0.1650, -0.0976,  1.7289, -0.1745], device='cuda:0')**
**t1       :  tensor([[-0.6494, -0.0013, -0.0579,  0.7400, -0.1629],**
**        [-0.6565,  0.1663, -0.0397,  0.9889, -0.0115]], device='cuda:0',**
**       grad_fn=<MulBackward0>)**
**dL/db    :  tensor([-1.4552,  0.4470, -1.2063,  2.9352, -0.7959], device='cuda:0')**
**t2       :  tensor([[-0.7295, -0.0464, -0.0641,  1.3739, -0.2228],**
**        [-0.7257,  0.4934, -1.1422,  1.5612, -0.5731]], device='cuda:0',**
**       grad_fn=<MulBackward0>)**
new updated w1/b1: 
w:  tensor([-1.1725,  1.0371,  0.5231,  2.7170,  0.0718], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([-1.2667,  0.6316, -1.0813,  3.4331, -0.1818], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
--------  1  ---------
loss:  tensor(69.6964, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([-1.1725,  1.0371,  0.5231,  2.7170,  0.0718], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([-1.2667,  0.6316, -1.0813,  3.4331, -0.1818], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([-10.7375,   0.5285,  -2.5198,  10.9996,  -1.5573], device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([-11.9639,   2.3553,  -6.2146,  18.7286,  -4.2419], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
dL/dw    :  tensor([-10.7375,   0.5285,  -2.5198,  10.9996,  -1.5573], device='cuda:0')
**t1       :  tensor([[-5.3105,  0.0235, -2.3961,  4.9052, -1.5135],**
**        [-5.4271,  0.5050, -0.1237,  6.0945, -0.0437]], device='cuda:0',**
**       grad_fn=<MulBackward0>)**
**dL/db    :  tensor([-11.9639,   2.3553,  -6.2146,  18.7286,  -4.2419], device='cuda:0')**
**t2       :  tensor([[-5.9651,  0.8567, -2.6531,  9.1068, -2.0699],**
**        [-5.9989,  1.4986, -3.5615,  9.6218, -2.1720]], device='cuda:0',**
**       grad_fn=<MulBackward0>)**
new updated w1/b1: 
w:  tensor([-11.9100,   1.5656,  -1.9967,  13.7167,  -1.4854], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([-13.2307,   2.9869,  -7.2959,  22.1617,  -4.4237], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
[root@ixt-hq-180 nn-manual]#
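
One more observation from the example 2 output, which would be consistent with my assumption above: the two rows of t1 appear to add up, column by column, to dL/dw (e.g. -0.6494 + -0.6565 = -1.3059 for the first element), and likewise the rows of t2 sum to dL/db. So perhaps my check merely needs a sum over the batch dimension, something like

    test1 = (2 * x * ((w*x + b) - y)).sum(dim=0)
    test2 = (2 * ((w*x + b) - y)).sum(dim=0)

but I would like confirmation that this summing is really what loss.backward() does for a batch, and that example 2 is the right way to train rather than the per-sample loop in example 3.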

EXAMPLE 3: same two samples, but iterating over them one at a time in an inner loop

cat nn-manual-2.py ; python3 nn-manual-2.py 
import torch
import code
cuda = torch.device('cuda')

# Create weight and bias values.

CONFIG_ENABLE_TEST=1
FEATURE_SIZE=5
SAMPLE_SIZE=3
w=torch.rand(FEATURE_SIZE, requires_grad=True, device='cuda')
b=torch.rand(FEATURE_SIZE, requires_grad=True, device='cuda')

torch.manual_seed(1)

# Create input (x) and expected output (y).
# input (x) is used for the forward pass: z = w*x + b; z is the computed y, as opposed to the expected y; diff = (z-y)

x_data=torch.rand([SAMPLE_SIZE,FEATURE_SIZE], device='cuda')
y_data=torch.rand(FEATURE_SIZE, device='cuda')
x_batch=x_data[0:2]

# test1 and test2 no longer match loss.backward() when the two rows below are taken at once (example 2); here I iterate over the rows one at a time instead.
x=x_data[0:2]
y=y_data

for i in range(0, 2):
    print("epochs: -------- ", i, " ---------")
    for x in x_batch:
        z=torch.add(torch.mul(w, x), b)
        loss = (y-z).pow(2).sum()

        loss.backward()
        print("loss: ", loss)
        print("w: ", w, type(w))
        print("b: ", b, type(b))
        print('dL/dw : ', w.grad, type(w.grad))
        print('dL/db : ', b.grad, type(b.grad))

        # verifying output of loss.backward...

        print("verifying output of loss.backward...(compare with DL/DW)")

        # test1 = dL/dw = dL/dz * dz/dw
        # 1. dL/dz = d/dz (y-z)**2 = d/dz (y**2 - 2yz + z**2) = -2y + 2z = 2(z-y)
        # 2. dz/dw = d/dw (w*x + b) = x
        # 3. dL/dw = dL/dz * dz/dw = 2(z-y)*x = 2x((w*x+b) - y)

        # test2 = dL/db = dL/dz * dz/db
        # 1a. same as 1.
        # 2a. dz/db = d/db (w*x + b) = 1
        # 3a. dL/db = dL/dz * dz/db = 2(z-y) * 1 = 2((w*x+b) - y)
    
        test1=2 * x * ((w*x+b)-y)
        print("dL/dw    : ", w.grad)
        print("test1    : ", test1)
    
        test2=2 * ((w*x + b) - y)
        print("dL/db    : ", b.grad)
        print("test2    : ", test2)

        # update weights

        w1 = w + w.grad
        b1 = b + b.grad
        w=w1.detach()
        w.requires_grad=True
        b=b1.detach()
        b.requires_grad=True

        print("new updated w1/b1: ")
        print("w: ", w, type(w))
        print("b: ", b, type(b))

epochs: --------  0  ---------
loss:  tensor(2.2088, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([0.5683, 0.3380, 0.6250, 0.4771, 0.2660], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([0.7391, 0.8436, 0.9909, 0.9707, 0.8136], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([1.0203, 0.0341, 1.5132, 0.9528, 0.1499], device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([1.1461, 1.2423, 1.6755, 1.7690, 0.2050], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
dL/dw    :  tensor([1.0203, 0.0341, 1.5132, 0.9528, 0.1499], device='cuda:0')
test1    :  tensor([1.0203, 0.0341, 1.5132, 0.9528, 0.1499], device='cuda:0',
       grad_fn=<MulBackward0>)
dL/db    :  tensor([1.1461, 1.2423, 1.6755, 1.7690, 0.2050], device='cuda:0')
test2    :  tensor([1.1461, 1.2423, 1.6755, 1.7690, 0.2050], device='cuda:0',
       grad_fn=<MulBackward0>)
new updated w1/b1: 
w:  tensor([1.5887, 0.3721, 2.1382, 1.4299, 0.4159], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([1.8852, 2.0859, 2.6664, 2.7397, 1.0186], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
loss:  tensor(25.9549, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([1.5887, 0.3721, 2.1382, 1.4299, 0.4159], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([1.8852, 2.0859, 2.6664, 2.7397, 1.0186], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([4.7955, 1.3340, 0.1406, 4.1832, 0.0049], device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([5.3008, 3.9591, 4.0461, 6.6044, 0.2426], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
dL/dw    :  tensor([4.7955, 1.3340, 0.1406, 4.1832, 0.0049], device='cuda:0')
test1    :  tensor([4.7955, 1.3340, 0.1406, 4.1832, 0.0049], device='cuda:0',
       grad_fn=<MulBackward0>)
dL/db    :  tensor([5.3008, 3.9591, 4.0461, 6.6044, 0.2426], device='cuda:0')
test2    :  tensor([5.3008, 3.9591, 4.0461, 6.6044, 0.2426], device='cuda:0',
       grad_fn=<MulBackward0>)
new updated w1/b1: 
w:  tensor([6.3842, 1.7061, 2.2788, 5.6131, 0.4208], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([7.1859, 6.0449, 6.7124, 9.3440, 1.2613], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
epochs: --------  1  ---------
loss:  tensor(392.9895, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([6.3842, 1.7061, 2.2788, 5.6131, 0.4208], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([7.1859, 6.0449, 6.7124, 9.3440, 1.2613], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([21.7179,  0.3217, 14.5457, 12.9532,  0.9700], device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([24.3950, 11.7201, 16.1057, 24.0484,  1.3266], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
dL/dw    :  tensor([21.7179,  0.3217, 14.5457, 12.9532,  0.9700], device='cuda:0')
test1    :  tensor([21.7179,  0.3217, 14.5457, 12.9532,  0.9700], device='cuda:0',
       grad_fn=<MulBackward0>)
dL/db    :  tensor([24.3950, 11.7201, 16.1057, 24.0484,  1.3266], device='cuda:0')
test2    :  tensor([24.3950, 11.7201, 16.1057, 24.0484,  1.3266], device='cuda:0',
       grad_fn=<MulBackward0>)
new updated w1/b1: 
w:  tensor([28.1021,  2.0279, 16.8244, 18.5664,  1.3908], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([31.5810, 17.7650, 22.8181, 33.3925,  2.5879], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
loss:  tensor(6030.5767, device='cuda:0', grad_fn=<SumBackward0>)
w:  tensor([28.1021,  2.0279, 16.8244, 18.5664,  1.3908], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([31.5810, 17.7650, 22.8181, 33.3925,  2.5879], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
dL/dw :  tensor([1.0193e+02, 1.2276e+01, 1.5762e+00, 5.6765e+01, 6.8872e-02],
       device='cuda:0') <class 'torch.Tensor'>
dL/db :  tensor([112.6646,  36.4331,  45.3700,  89.6185,   3.4204], device='cuda:0') <class 'torch.Tensor'>
verifying output of loss.backward...(compare with DL/DW)
dL/dw    :  tensor([1.0193e+02, 1.2276e+01, 1.5762e+00, 5.6765e+01, 6.8872e-02],
       device='cuda:0')
test1    :  tensor([1.0193e+02, 1.2276e+01, 1.5762e+00, 5.6765e+01, 6.8872e-02],
       device='cuda:0', grad_fn=<MulBackward0>)
dL/db    :  tensor([112.6646,  36.4331,  45.3700,  89.6185,   3.4204], device='cuda:0')
test2    :  tensor([112.6646,  36.4331,  45.3700,  89.6185,   3.4204], device='cuda:0',
       grad_fn=<MulBackward0>)
new updated w1/b1: 
w:  tensor([130.0274,  14.3042,  18.4006,  75.3309,   1.4597], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>
b:  tensor([144.2456,  54.1982,  68.1881, 123.0110,   6.0082], device='cuda:0',
       requires_grad=True) <class 'torch.Tensor'>