I develop the spaCy NLP library. We have our own NN library, Thinc, to avoid dependencies (plus I started writing it before PyTorch was around :p), but for obvious reasons we’d like to let people use PyTorch models in spaCy as well.
The plan has been to write small shim classes that would wrap PyTorch (or other libraries’) models to have the same API as Thinc. You can find the wrapper class so far here: https://github.com/explosion/thinc/blob/master/thinc/extra/wrappers.py
How do I resize an input layer? If neurons are added, the weights for the new neurons should start at zero; if the new size is smaller, the last neurons should be truncated.
How do I resize an output layer? The same policy applies: zero weights for added neurons, and truncation of the last neurons if the layer shrinks.
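For a plain nn.Linear, this is roughly the semantics I have in mind (a sketch, with illustrative function names that aren't any existing API); the question is whether there's a blessed way to do this in PyTorch, or whether poking at .data like this is the way:

```python
import torch.nn as nn

def resize_linear_output(layer, new_nO):
    """Return a copy of `layer` with `new_nO` output neurons.
    Weights for added neurons are zero; extra neurons are truncated."""
    resized = nn.Linear(layer.in_features, new_nO)
    resized.weight.data.zero_()
    resized.bias.data.zero_()
    n = min(layer.out_features, new_nO)
    resized.weight.data[:n] = layer.weight.data[:n]
    resized.bias.data[:n] = layer.bias.data[:n]
    return resized

def resize_linear_input(layer, new_nI):
    """Return a copy of `layer` with `new_nI` input features.
    New input columns get zero weights; extra columns are truncated."""
    resized = nn.Linear(new_nI, layer.out_features)
    resized.weight.data.zero_()
    n = min(layer.in_features, new_nI)
    resized.weight.data[:, :n] = layer.weight.data[:, :n]
    resized.bias.data[:] = layer.bias.data
    return resized
```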
Thinc has a use_params() context manager, which lets you use a given set of weights for the scope of a block. Is load_state_dict() the right thing there?
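What I have in mind is roughly this (a sketch, not the Thinc implementation):

```python
import contextlib

@contextlib.contextmanager
def use_params(model, state_dict):
    """Make `state_dict` active on `model` for the scope of a block,
    restoring the original parameters on exit."""
    # state_dict() returns tensors that alias the model's storage,
    # so clone them before overwriting.
    backup = {key: value.clone() for key, value in model.state_dict().items()}
    model.load_state_dict(state_dict)
    try:
        yield model
    finally:
        model.load_state_dict(backup)
```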
The heart of the wrapper is Thinc's begin_update() method. This takes a batch of inputs, and returns a tuple with a batch of outputs and a callback to complete the backward pass. This was pretty easy to do, but I wrote it a few months ago, so hopefully it's still current?
```python
def begin_update(self, x_data, drop=0.):
    '''Return the output of the wrapped PyTorch model for the given
    input, along with a callback to handle the backward pass.
    '''
    x_var = torch.autograd.Variable(torch.Tensor(x_data), requires_grad=True)
    # Make prediction
    y_var = self._model(x_var)

    def backward_pytorch(dy_data, sgd=None):
        dy_var = torch.autograd.Variable(torch.Tensor(dy_data))
        torch.autograd.backward((y_var,), grad_variables=(dy_var,))
        dX = self.ops.asarray(x_var.grad.data)
        if sgd is not None:
            # Assumes the wrapper owns a PyTorch optimizer for its model.
            self._optimizer.step()
        return dX

    return self.ops.asarray(y_var.data), backward_pytorch
```
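For reference, the intended calling convention on the Thinc side looks roughly like this (PyTorchWrapper is the shim class from the link above; x_batch and d_loss stand in for numpy arrays):

```python
# Sketch: one forward/backward step through the wrapped model.
wrapper = PyTorchWrapper(pytorch_model)
y_batch, finish_update = wrapper.begin_update(x_batch)
d_x_batch = finish_update(d_loss, sgd=sgd)  # gradient w.r.t. the inputs
```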
The main outstanding problem with the begin_update() wrapper above is that Thinc takes a drop argument, a float between 0 and 1 that is used to drop out the outgoing activations. We shouldn't need to worry about making this auto-differentiable: it should be fine to compute the dropout mask, multiply it by the activations, and then multiply the incoming gradient by the same mask, since the mask will be stored in the enclosing scope.
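Concretely, something like this (plain numpy for the sketch; in practice we'd go through Thinc's ops):

```python
import numpy

def add_dropout(y_data, backward, drop=0.0):
    """Apply dropout to the outgoing activations outside the PyTorch
    graph, and reuse the same mask on the incoming gradient."""
    if not drop:
        return y_data, backward
    # Inverted dropout: drop units with probability `drop` and rescale
    # the survivors, so nothing needs adjusting at prediction time.
    mask = (numpy.random.uniform(0., 1., y_data.shape) > drop) / (1. - drop)

    def backward_dropout(dy_data, sgd=None):
        # The mask lives in this closure, so the backward pass sees it.
        return backward(dy_data * mask, sgd=sgd)

    return y_data * mask, backward_dropout
```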
I’ve also drafted the serialization methods. Thinc uses to_bytes()/from_bytes() and to_disk()/from_disk(). The architecture is not saved, just the parameters; we assume the architecture is reconstructed before you call from_bytes(). So this seems fairly straightforward.
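Roughly what I have, as untested method bodies on the wrapper class:

```python
import io
import torch

def to_bytes(self):
    # Serialize only the parameters, not the architecture.
    buf = io.BytesIO()
    torch.save(self._model.state_dict(), buf)
    return buf.getvalue()

def from_bytes(self, data):
    # Assumes self._model has already been constructed with the
    # matching architecture.
    self._model.load_state_dict(torch.load(io.BytesIO(data)))
    return self
```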