How can I make one normalized matrix 'closer' to another normalized matrix using gradient descent?

Hi all! I want to try a weird idea: using gradient descent to make one normalized matrix A 'closer' to another normalized matrix B, i.e., to minimize the mean squared error between A (viewed as the output) and B (viewed as the target).
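Concretely, the loss is L(A) = mean((A / max(A) - B / max(B))^2), minimized over the raw entries of A. So I wrote the following toy code: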

import torch
from torch.autograd import Variable
import numpy as np

np.random.seed(0)
torch.manual_seed(0)

# a: the matrix we optimize; b: the fixed target
a = np.random.randint(0, 255, (5, 5)).astype(np.float32)
a = Variable(torch.from_numpy(a), requires_grad=True)

b = np.random.randint(0, 255, (5, 5)).astype(np.float32)
b = Variable(torch.from_numpy(b))

for _ in range(10):
    # normalize a by its own maximum; a_max stays in the autograd graph here
    a_max = a.max().repeat(a.size())
    c = a / a_max

    # normalize the target the same way
    b_max = b.max().repeat(b.size())
    d = b / b_max

    loss = torch.nn.MSELoss()(c, d)
    print(loss.data[0])

    loss.backward()

    # manual gradient-descent step on the raw matrix a
    a.data -= 1000000 * a.grad.data

Unfortunately, I encounter a RuntimeError: can't assign a FloatTensor to a scalar value of type float.

However, everything seems to work well when I detach the node a_max from the current graph:

import torch
from torch.autograd import Variable
import numpy as np

np.random.seed(0)
torch.manual_seed(0)

a = np.random.randint(0, 255, (5, 5)).astype(np.float32)
a = Variable(torch.from_numpy(a), requires_grad=True)

b = np.random.randint(0, 255, (5, 5)).astype(np.float32)
b = Variable(torch.from_numpy(b))

for _ in range(10):
    # detach a_max so the normalization constant is treated as a fixed value
    # and gradients do not flow back through the max/repeat ops
    a_max = a.max().repeat(a.size()).detach()
    c = a / a_max

    b_max = b.max().repeat(b.size())
    d = b / b_max

    loss = torch.nn.MSELoss()(c, d)
    print(loss.data[0])

    loss.backward()
    a.data -= 1000000 * a.grad.data

# now c and d are close enough
print(c.data)
print(d.data)
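
For reference, here is a minimal sketch of the same experiment written against the current PyTorch API (my rough translation, assuming PyTorch >= 0.4, where Variable is merged into Tensor); it uses the standard torch.optim.SGD / zero_grad() pattern instead of the manual .data update, and keeps the detach on the normalization constant:

import torch

torch.manual_seed(0)

# a is the leaf tensor we optimize; b is the fixed target
a = torch.randint(0, 255, (5, 5), dtype=torch.float32, requires_grad=True)
b = torch.randint(0, 255, (5, 5), dtype=torch.float32)

opt = torch.optim.SGD([a], lr=1e6)  # same huge step size as the toy code
loss_fn = torch.nn.MSELoss()

# the target normalization is constant, so compute it once
d = b / b.max()

for _ in range(10):
    # detach the normalization constant, as in the working version above
    c = a / a.max().detach()
    loss = loss_fn(c, d)

    opt.zero_grad()  # the toy code above never zeroes a.grad, so its gradients accumulate
    loss.backward()
    opt.step()
    print(loss.item())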

I have no idea why detaching that node makes the program runnable. I would appreciate it if you could point out what causes the runtime error. Thanks a lot!