Hi,

In theory, we should be able to obtain a solution with a much smaller network (i.e., two hidden units plus bias); see Section 6.1 of Goodfellow et al. (2016).
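For reference, Section 6.1 of Goodfellow et al. gives an exact closed-form XOR solution with two ReLU hidden units. A quick NumPy sketch (using the book's weights) verifies it:

```python
import numpy as np

# Exact XOR solution with 2 hidden ReLU units
# (weights from Goodfellow et al., Section 6.1).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # all four inputs
W = np.array([[1., 1.], [1., 1.]])  # input-to-hidden weights
c = np.array([0., -1.])             # hidden bias
w = np.array([[1.], [-2.]])         # hidden-to-output weights
b = 0.                              # output bias

h = np.maximum(X @ W + c, 0.)       # ReLU hidden layer
y = h @ w + b
print(y.ravel())                    # [0. 1. 1. 0.]
```

So a 2-hidden-unit network can represent XOR exactly; the question is whether gradient descent finds such a solution.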

The smooth L1 loss and the SELU activation function seem to help the learning process. Below is a solution that uses the autograd example as a starting point.

```
# -*- coding: utf-8 -*-
import torch
import numpy as np
from torch.autograd import Variable
from torch import FloatTensor
import torch.nn.functional as F
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 2, 2, 1  # four XOR examples
# Create random Tensors to hold input and outputs, and wrap them in Variables.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Variables during the backward pass.
x = Variable(FloatTensor(np.array([[0, 0], [0, 1], [1, 0], [1, 1]])))
y = Variable(FloatTensor(np.array([[0.], [1.], [1.], [0.]])))  # shape (N, D_out) to match y_pred
# Create random Tensors for weights, and wrap them in Variables.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Variables during the backward pass.
W = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)
c = Variable(torch.zeros(H).type(dtype), requires_grad=True)  # hidden bias has H units
b = Variable(torch.zeros(D_out).type(dtype), requires_grad=True)
learning_rate = 1e-3
for t in range(200000):
    # Forward pass: compute predicted y using operations on Variables; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = F.selu(x.mm(W).add(c)).mm(w).add(b)
    # Compute and print loss using operations on Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
    # (1,); loss.data[0] is a scalar value holding the loss.
    # loss = (y_pred - y).pow(2).sum()
    loss = F.smooth_l1_loss(y_pred, y)
    if t % 10000 == 0:
        print(t, loss.data[0])
        print(t, y_pred.data)
        # print(t, c.data)
        # print(t, w.data)
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Variables with requires_grad=True.
    # After this call W.grad, w.grad, c.grad and b.grad will be Variables
    # holding the gradient of the loss with respect to W, w, c and b.
    loss.backward()
    # Update weights using gradient descent; W.data, w.data, etc. are Tensors,
    # W.grad, w.grad, etc. are Variables, and W.grad.data, w.grad.data, etc.
    # are Tensors.
    W.data -= learning_rate * W.grad.data
    w.data -= learning_rate * w.grad.data
    c.data -= learning_rate * c.grad.data
    b.data -= learning_rate * b.grad.data
    # Manually zero the gradients after updating the weights
    W.grad.data.zero_()
    w.grad.data.zero_()
    c.grad.data.zero_()
    b.grad.data.zero_()
print("W: ")
print(W)
print("w: ")
print(w)
```
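Aside: in PyTorch 0.4 and later the `Variable` wrapper is deprecated, and plain tensors with `requires_grad=True` do the same job. A sketch of the equivalent loop in the newer API (fewer iterations here for brevity; hyperparameters otherwise as above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# Plain tensors with requires_grad=True replace the old Variable wrapper.
W = torch.randn(2, 2, requires_grad=True)
w = torch.randn(2, 1, requires_grad=True)
c = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

learning_rate = 1e-3
for t in range(20000):
    y_pred = F.selu(x.mm(W) + c).mm(w) + b
    loss = F.smooth_l1_loss(y_pred, y)
    loss.backward()
    with torch.no_grad():  # update parameters without tracking gradients
        for p in (W, w, c, b):
            p -= learning_rate * p.grad
            p.grad.zero_()
```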