I’ll admit I’m a bit confused about how the optimization code works under the hood. The code for SGD uses Python for loops to iterate through each parameter:
    for p in group['params']:
        if p.grad is not None:
            params_with_grad.append(p)
            d_p_list.append(p.grad)
            state = self.state[p]
            if 'momentum_buffer' not in state:
                momentum_buffer_list.append(None)
            else:
                momentum_buffer_list.append(state['momentum_buffer'])
Wouldn’t this be incredibly slow? Is there some sort of just-in-time compilation going on to speed things up? If I were to implement my own optimizer, would I have access to the same performance?
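To make the last question concrete, here is a rough sketch of the kind of optimizer I have in mind. The class name `PlainSGD` and its details are my own assumptions for illustration, not PyTorch's actual implementation. It follows the same loop-over-parameters pattern as the snippet above, where each iteration handles one whole parameter tensor.

```python
import torch
from torch.optim import Optimizer


class PlainSGD(Optimizer):
    """Hypothetical minimal SGD with momentum, for illustration only."""

    def __init__(self, params, lr=0.01, momentum=0.0):
        defaults = dict(lr=lr, momentum=momentum)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            # One iteration per parameter tensor, not per scalar element.
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad
                if group['momentum'] != 0:
                    state = self.state[p]
                    buf = state.get('momentum_buffer')
                    if buf is None:
                        # First step: the buffer starts as a copy of the gradient.
                        buf = torch.clone(d_p).detach()
                        state['momentum_buffer'] = buf
                    else:
                        # In-place update: buf = momentum * buf + grad
                        buf.mul_(group['momentum']).add_(d_p)
                    d_p = buf
                # In-place update of the whole tensor: p = p - lr * d_p
                p.add_(d_p, alpha=-group['lr'])


# Usage on a tiny model, just to show it runs.
model = torch.nn.Linear(4, 2)
opt = PlainSGD(model.parameters(), lr=0.1, momentum=0.9)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
opt.step()
```

With this model the inner loop only runs twice per step (once for the weight, once for the bias), so what I really want to know is whether a loop like this is where the time would go.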