Custom optimizer class

I would like to implement a custom optimizer class so that I have more control during the gradient descent step. Here is my approach:

class SGD:

    def __init__(self, model, learning_rate, momentum):
        self.model = model
        self.learning_rate = learning_rate
        self.momentum = momentum

        self.v_old = [torch.zeros_like(p.data) for p in model.parameters()]

    def step(self):
        with torch.no_grad():
            update = lambda v_old, grad: self.momentum * v_old + self.learning_rate * grad
            self.v_old = [update(v_old, params.grad) for v_old, params in zip(self.v_old, self.model.parameters())]
            for parameters, v_old in zip(self.model.parameters(), self.v_old):
                parameters -= v_old

I use this class as follows:

train_loader, test_loader = get_data_loader(batch_size)
model = Model()
cross_entropy_loss = nn.CrossEntropyLoss()
optimizer = SGD(model=model, learning_rate=0.01, momentum=0.9)

for epoch in range(10):

    running_loss = 0.0
    running_counter = 0

    for i, data in enumerate(train_loader):
        inputs, labels = data[0], data[1]
        model.zero_grad()
        outputs = model(inputs)
        loss = cross_entropy_loss(outputs, labels)
        loss.backward()

        optimizer.step()

        # CrossEntropyLoss averages over the batch, so weight by the batch size
        running_loss += loss.item() * labels.size(0)
        running_counter += labels.size(0)

    running_loss = running_loss / running_counter
    print(f"epoch {epoch} loss {running_loss}")

I would like to know if I can improve the SGD class and if the approach is considered good style? Or are there more elegant ways to create custom optimizer classes?

Hi,

That looks good to me.

One thing we usually do in the core optimizers is to create the buffers (like v_old) lazily, that is, only during the first step.
That way, you can do things like:

optimizer = SGD(model=model, learning_rate=0.01, momentum=0.9)
model.cuda()
# Your training loop

With your current code, this would fail because the buffers would be on the CPU and the weights on the GPU.
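To illustrate why the creation time matters, here is a small sketch that runs without a GPU. It uses dtype as a stand-in for device (an assumption made only so the example is runnable anywhere), and `nn.Linear(4, 2)` is a hypothetical replacement for `Model()`. Buffers created eagerly keep the properties the parameters had at that moment:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in for Model()

# Buffers created eagerly copy the parameters' current dtype and device.
eager_buffers = [torch.zeros_like(p) for p in model.parameters()]

# Converting the model afterwards changes the parameters but not the buffers.
# model.cuda() has the same effect with respect to the device.
model.double()

param_dtypes = [p.dtype for p in model.parameters()]
buffer_dtypes = [b.dtype for b in eager_buffers]
print(param_dtypes)
print(buffer_dtypes)

# Buffers created now, i.e. lazily, match the converted parameters.
lazy_buffers = [torch.zeros_like(p) for p in model.parameters()]
```

A dtype mismatch is usually absorbed by type promotion, whereas a CPU/GPU mismatch raises an error, which is why the device case is the one that fails in practice.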

I'm not completely sure I understood you correctly. It is better to move the buffer to the GPU, right? But how can I do that?

Can I just do something like

self.v_old = [torch.zeros_like(p.data).to(device) for p in model.parameters()]

or is this not possible?

Oh, what I meant is to do:

# inside the __init__
self.v_old = None

# inside the step
if self.v_old is None:
    self.v_old = [torch.zeros_like(p) for p in self.model.parameters()]
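
Putting the pieces together, the class from the question might look roughly like this with lazy buffers. This is only a sketch: the in-place velocity update and the skipping of parameters without gradients are additions not present in the original code, and the toy `nn.Linear` model and random data are hypothetical, used just to exercise the class.

```python
import torch
import torch.nn as nn


class SGD:
    """Momentum SGD whose velocity buffers are created on the first step."""

    def __init__(self, model, learning_rate, momentum):
        self.model = model
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.v_old = None  # created lazily in step()

    def step(self):
        with torch.no_grad():
            params = list(self.model.parameters())
            if self.v_old is None:
                # zeros_like copies each parameter's current device and dtype
                self.v_old = [torch.zeros_like(p) for p in params]
            for p, v in zip(params, self.v_old):
                if p.grad is None:  # assumption: skip parameters without gradients
                    continue
                # v <- momentum * v + learning_rate * grad, updated in place
                v.mul_(self.momentum).add_(p.grad, alpha=self.learning_rate)
                p -= v


# Hypothetical toy setup, only to exercise the class.
torch.manual_seed(0)
model = nn.Linear(3, 2)
optimizer = SGD(model=model, learning_rate=0.01, momentum=0.9)
inputs = torch.randn(8, 3)
labels = torch.randint(0, 2, (8,))
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(20):
    model.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Because the buffers are only allocated inside the first `step()`, calling `model.cuda()` after constructing the optimizer but before training should no longer cause a device mismatch.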