I’ll admit I’m a bit confused about how the optimization code works under the hood. The code for SGD uses Python for loops to iterate over each parameter:
for p in group['params']:
    if p.grad is not None:
        params_with_grad.append(p)
        d_p_list.append(p.grad)
        state = self.state[p]
        if 'momentum_buffer' not in state:
            momentum_buffer_list.append(None)
        else:
            momentum_buffer_list.append(state['momentum_buffer'])
Wouldn’t this be incredibly slow? Is there some sort of just-in-time compilation going on to speed things up? If I were to implement my own optimizer, would I have access to the same performance?
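To make the question concrete, here is a minimal sketch of the kind of hand-written optimizer I have in mind, assuming a plain SGD update with no momentum. The class name `MySGD` and the toy usage at the bottom are my own illustration, not anything taken from the PyTorch source. As far as I can tell, the Python loop in it runs once per parameter tensor, while each update is a single tensor operation, so what I'm really asking is whether that per-tensor loop is the part that costs anything.

```python
import torch
from torch.optim import Optimizer


class MySGD(Optimizer):
    """Hypothetical plain-SGD optimizer (no momentum), for illustration only."""

    def __init__(self, params, lr=0.1):
        super().__init__(params, defaults=dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        # Same loop shape as the SGD excerpt above: one iteration per
        # parameter tensor, not per scalar element.
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    # One in-place tensor op updates every element of p.
                    p.add_(p.grad, alpha=-group['lr'])


# Toy usage: one step on f(w) = sum(w^2), whose gradient is 2w.
w = torch.nn.Parameter(torch.tensor([1.0, -2.0, 3.0]))
opt = MySGD([w], lr=0.1)
(w ** 2).sum().backward()
opt.step()
```

With `lr=0.1` each element of `w` should be scaled by 0.8 after the step, since the update is `w - 0.1 * 2w`.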