Finding Polynomial Function Questions

I’m trying to create a model that can learn any lower-order polynomial function. (Can this even be done?) So far I’ve got code that can learn the quadratic equation y = 6x^2 + 2x - 4.

I noticed that by providing [x, x^2] instead of [x] as the input the model performs much better. However, when I use [x, x^2, x^3] as the input the model doesn’t work anymore (huge test set loss). Why does providing x^3 to the model result in such a poor result?

Below is the code that I’m using. I’m quite new to PyTorch, so general tips/critique are also appreciated. Thanks!

import torch
from torch import Tensor
from torch.nn import Linear, MSELoss, functional as F
from torch.autograd import Variable
import numpy as np

def our_function(x):
    # calculate the y value using the function 6x^2 + 2x - 4
    return 6 * x * x + 2 * x - 4

def data_generator(data_size=1000):
    x = np.random.randint(-1000, 1000, size=data_size)
    y = our_function(x)

    # Adding x^2 as an input feature enables the model to fit the quadratic much better.
    inputs = np.column_stack([x, x ** 2])
    # inputs = np.column_stack([x, x ** 2, x ** 3])
    labels = y.reshape(-1, 1)  # reshape to (N, 1) so labels match the model's output; without it we get wrong results

    return inputs, labels


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = Linear(2, 50)
        # self.fc1 = Linear(3, 50)
        self.fc2 = Linear(50, 50)
        self.fc3 = Linear(50, 1)

        self.criterion = MSELoss()
        self.optimizer = torch.optim.Adam(self.parameters(), lr=0.01)

    def forward(self, x):
        # Should I add a dropout?
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def train_model(self, epochs=1000, data_size=1000):
        x_train, y_train = data_generator(data_size)
        # wrap the numpy arrays as tensors once, outside the training loop
        x_train = Variable(Tensor(x_train))
        y_train = Variable(Tensor(y_train))

        for epoch in range(epochs):
            y_pred = self(x_train)
            loss = self.criterion(y_pred, y_train)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            print(f"Epoch: {epoch} Loss: {loss / data_size:,.10f}")

model = Net()
model.train_model()

Generally, a neural net fits a piecewise-linear function of x, and there are no x*x terms anywhere in the network. Thus, adding an x^2 input is beneficial.

Because cubic polynomial root finding is more difficult than quadratic; indeed, it becomes a nonconvex problem, I believe.

If you mean a problem with the quadratic target equation, perhaps it is just a numerical issue: your x^3 column ranges over -1e9…1e9, while it should have zero effect after training.

Hi Martijn!

Yes*!

*) For some definition of “lower-order polynomial.”

As Alex noted, by providing x**2 as input, you are giving your
model very valuable raw material with which to construct your
quadratic polynomial.

In fact, including x**2 lets you train an almost trivial model that
consists of a single Linear layer (and no activation functions):

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = Linear(2, 1)

    def forward(self, x):
        x = self.fc1(x)
        return x

The Linear (2, 1) layer (with bias) is, in fact, exactly the general
quadratic polynomial when fed with [x, x**2] as its input, so this
model will train to essentially machine precision, and the weight
and bias of the Linear will train to match the coefficients of your
our_function() quadratic polynomial.
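
To make that concrete, here is a small sanity check (a sketch of my
own, not code from the thread): manually setting the Linear (2, 1)
weight to the coefficients of x and x**2, and the bias to the constant
term, reproduces our_function() exactly.

import torch
from torch.nn import Linear

lin = Linear(2, 1)
with torch.no_grad():
    lin.weight.copy_(torch.tensor([[2.0, 6.0]]))  # coefficients of x and x**2
    lin.bias.copy_(torch.tensor([-4.0]))          # constant term

    x = torch.tensor([[-3.0], [0.5], [10.0]])
    features = torch.cat([x, x ** 2], dim=1)      # the [x, x**2] input columns
    print(lin(features).squeeze(1))               # 44.0, -1.5, 616.0
    print((6 * x ** 2 + 2 * x - 4).squeeze(1))    # the same values

These are the values a trained Linear (2, 1) model should converge
toward when fed [x, x**2].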

First, adding x**3 just confuses the issue, making the training less
nimble. But more importantly, as Alex noted, x**3 is, in your case,
very large.

Given the range of your sample data, x is of order one thousand,
and x**2 is of order one million, while x**3 is of order one billion.
As training starts, the network has no idea what the three inputs
“mean” and has no sense of their relative scales. The x**3 term
has the largest effect on the output initially. So the network has to
spend many training iterations learning to ignore x**3 (or to process
it into some approximation to a quadratic), which wastes valuable
time, and potentially sends the training off on some wild goose chase
from which it never recovers.
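
To put numbers on that (my own quick check, not code from the thread),
the average magnitudes of the three input columns differ by roughly
three orders of magnitude each:

import numpy as np

x = np.random.randint(-1000, 1000, size=1000)
inputs = np.column_stack([x, x ** 2, x ** 3]).astype(float)
print(np.abs(inputs).mean(axis=0))  # roughly [5e+02, 3.3e+05, 2.5e+08]

Rescaling each column to a comparable magnitude (for example, dividing
it by its standard deviation) is one standard way to avoid this,
although the discussion here does not pursue it.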

A minor comment: MSELoss (using its defaults) takes the mean
of the loss over the batch, so loss / data_size is incorrectly
dividing by the batch size a second time.
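
Concretely, a one-line fix (just a sketch) is to drop the extra
division, since MSELoss with its default reduction already returns the
per-sample mean:

            print(f"Epoch: {epoch} Loss: {loss.item():,.10f}")  # loss is already the per-sample mean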

In terms of understanding your loss, x is of order one thousand,
so your quadratic (that is, y_train) is of order one million. If your
errors (that is, y_pred - y_train) were comparable to y_train
(that is, your relative errors were of order 1), then your errors and
squared errors would be of order one million and one trillion,
respectively. So your loss (MSELoss means “mean-squared-error”)
will be of order one trillion (10**12) as training starts. Just be aware
that even if your training works well and your loss falls a lot, it will
still be numerically large.

You should probably train on MSELoss, but maybe also print out
sqrt (MSELoss) – the so-called root-mean-square (rms) error.
Even this will be large, starting out on the order of one million. So
I would suggest also tracking the relative rms error, something like
sqrt (MSELoss / (y_train**2).mean()). This gives you a sense of
how well your predictions are doing relative to the scale of the values
you are trying to predict.
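
Here is a hedged sketch of those extra metrics (the helper name
report_metrics is mine, not from the thread), computed from the same
MSELoss value inside the training loop:

import torch

def report_metrics(epoch, loss, y_train):
    # loss is the (mean) MSELoss value; y_train is the target tensor
    rms = torch.sqrt(loss)                              # root-mean-square error
    rel_rms = torch.sqrt(loss / (y_train ** 2).mean())  # rms error relative to the target scale
    print(f"Epoch: {epoch} mse: {loss.item():,.2f} "
          f"rms: {rms.item():,.2f} relative rms: {rel_rms.item():.6f}")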

I’d like to expand on this a little. It is true that Linear layers introduce
no quadratic terms. Also the ReLU activation function is piecewise
linear. But many non-linear activation functions do have quadratic (and
higher-order) terms in their expansions, so they do introduce quadratic
terms into the overall function computed by the network.

(For example, PyTorch’s ELU (“exponential linear unit”) has a regime
where the quadratic term dominates.)
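
A quick numerical check of that (my own sketch, using PyTorch’s default
alpha = 1): for negative inputs ELU(x) = exp(x) - 1, whose leading
non-linear term is x**2 / 2.

import torch
import torch.nn.functional as F

x = torch.tensor([-0.2, -0.1, -0.01])
print(F.elu(x) - x)  # deviation from the linear part: ~[0.0187, 0.0048, 0.00005]
print(x ** 2 / 2)    # the quadratic term:             ~[0.0200, 0.0050, 0.00005]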

My quibble about activation functions introducing quadratic terms
notwithstanding, explicitly adding x**2 is, of course, beneficial in a
case like yours because the network doesn’t have to reproduce it by
digging it out of non-linearities in the activation functions.

Best.

K. Frank

Yeah, but the thing is, usual networks won’t generate quadratic and linear “features” at once, so polynomials end up approximated by something like additive models.

Thanks for the replies @KFrank @googlebot!

Indeed, a simple model with a single Linear layer does a much better job of finding the polynomial function than my original model. The beauty is in simplicity, apparently.

As both of you mentioned, this seems to be the cause of the problem I ran into in my original post. I’m still a bit surprised the network doesn’t figure out that it simply needs to ignore x**3, though.

Oops! Thanks for pointing that out; that was some leftover code from a previous version.

So in general, if one suspects that the data contains a certain relation/function, would you add said function to the input instead of letting the neural network try to find it?

Hi Martijn!

I’d go further: If I suspected that a dataset contained a certain
relation, I would try to model that relation directly, and not use
a neural network.

Things like performing linear regressions and fitting polynomials
with neural networks are nice toy problems to explore how neural
networks and gradient-descent optimization work, but if I were
doing it for real, I would avoid the complication of a neural network,
and just perform the regression or fit the polynomial.
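
For instance (a hedged sketch using numpy.polyfit, my choice of API and
not necessarily K. Frank’s), an ordinary least-squares polynomial fit
recovers the coefficients directly:

import numpy as np

x = np.random.randint(-1000, 1000, size=1000).astype(float)
y = 6 * x ** 2 + 2 * x - 4
coeffs = np.polyfit(x, y, deg=2)  # highest-degree coefficient first
print(coeffs)                     # approximately [ 6.  2. -4.]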

Best.

K. Frank
