Why can't I overfit a single batch (the loss keeps oscillating)?

I am training a simple feedforward neural network for a regression task, and as a sanity check of the implementation I am trying to overfit a single batch of 32 examples (with 9 features). However, the loss keeps oscillating no matter which learning rate and hidden size I try; it looks like this:
[image: oscillating training loss curve]

The data is standard-scaled. The network has a single hidden layer with ReLU:

BATCH_SIZE = 32
LR = 0.0001
NUM_EPOCHS = 100
HIDDEN_SIZE = 512
IN_FEATURES = 9  # number of input features

class FullyConnected(nn.Module):
    def __init__(self, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.fc1 = nn.Linear(in_features=IN_FEATURES, out_features=hidden_size)
        self.fc2 = nn.Linear(in_features=hidden_size, out_features=1)
        
    def forward(self, x): 
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        return out

The training loop is the usual one, except that for now I am only fitting a single batch:

model = FullyConnected().to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
model.train()

X, y = next(iter(train_dataloader))

for epoch in range(NUM_EPOCHS):
    y_pred = model(X)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Why can't I overfit a single batch, and why does the loss oscillate in this strange way despite the small learning rate?

Your model works fine for me and overfits random inputs to random outputs:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt

BATCH_SIZE = 32
LR = 0.0001
NUM_EPOCHS = 500
HIDDEN_SIZE = 512
device = "cuda"

class FullyConnected(nn.Module):
    def __init__(self, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.fc1 = nn.Linear(in_features=16, out_features=hidden_size)
        self.fc2 = nn.Linear(in_features=hidden_size, out_features=1)
        
    def forward(self, x): 
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        return out
    
    
model = FullyConnected().to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=LR)
model.train()

x, target = torch.randn(BATCH_SIZE, 16), torch.randn(BATCH_SIZE, 1)
x = x.to(device)
target = target.to(device)

losses = []
for epoch in range(NUM_EPOCHS):
    output = model(x)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("epoch {}, loss {:.3f}".format(epoch, loss.item()))
    losses.append(loss.item())
    
plt.plot(losses)

Loss curve:
[image: decreasing loss curve]


Hi,
Running the model on random data is a great way to differentiate between model problems and data problems; I'll remember that method for the future, thanks!
As you have shown, the issue seems to be definitely connected to the data. While experimenting further on the full dataset, I noticed an unusual thing: when I change the shuffle parameter of the DataLoader to False, the oscillation stops. To be sure, I tested it with a different optimizer, but it gives the same result:


I thought that shuffling stabilizes training, so the effect should be the opposite. I am confused by that result; do you have any idea why that could be?
I inspected the data but wasn't able to notice anything wrong with it.

Shuffling should generally help training, and be careful about analyzing the training loss alone.
Could you also check the validation loss curve for the shuffled and unshuffled training runs?
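Something along these lines would give a quick comparison (just a sketch: train_dataset, val_dataset, and the epoch count are placeholders for your actual setup, and it reuses the FullyConnected model and hyperparameters from your post above):

from torch.utils.data import DataLoader

def run_experiment(shuffle, num_epochs=50):
    # fresh model and optimizer per run so the two settings are comparable
    model = FullyConnected().to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=LR)

    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=shuffle)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

    train_losses, val_losses = [], []
    for epoch in range(num_epochs):
        model.train()
        running = 0.0
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)  # no-op if the Dataset already returns GPU tensors
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
            running += loss.item()
        train_losses.append(running / len(train_loader))

        model.eval()
        val_running = 0.0
        with torch.no_grad():
            for X, y in val_loader:
                X, y = X.to(device), y.to(device)
                val_running += criterion(model(X), y).item()
        val_losses.append(val_running / len(val_loader))
    return train_losses, val_losses

# plot the validation curves for both settings on the same figure
for shuffle in (True, False):
    _, val_losses = run_experiment(shuffle)
    plt.plot(val_losses, label=f"val loss, shuffle={shuffle}")
plt.legend()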


I was also tracking the validation loss (and MAPE), but they all show similar behavior.
For Adam:


For SGD:

I was wondering if it could be connected to the fact that the data is a time series, but I can't imagine how that would have such an impact.
However, the good thing is that the model learns in both cases; it's just a mystery why shuffling has the opposite effect.

Yes, this could be the case, and I was about to ask the same thing: i.e. whether you have any temporal dependency between your samples.
Could you describe your dataset a bit more and what the samples/targets represent?

It's energy load data, so there can be periods where the values stay quite low for some time, but these are evenly spaced in the dataset. Here's a typical slice of the data:


Input features consist of datetime variables (hour, day of month etc.) and past load variables (load 24h ago, load 1 week ago).
The 'shape' of the data does not change over time: there is no trend and the distribution looks similar throughout.

I am wondering if there are any tests I could run to get to the root cause and reveal what I am overlooking, like maybe tracking the mean target of the batches or something like that?
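For example, something like this is roughly what I had in mind (just a sketch; it assumes I can build loaders directly from the underlying train_dataset):

from torch.utils.data import DataLoader

# log the per-batch target mean for shuffled vs. unshuffled loading
for shuffle in (True, False):
    loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=shuffle)
    batch_means = [y.mean().item() for _, y in loader]
    plt.plot(batch_means, label=f"batch target mean, shuffle={shuffle}")
plt.xlabel("batch index")
plt.legend()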

How are you dealing with this time dependency?
Are you indexing the data for previous time steps using the current index with a negative offset in the __getitem__?
I’m wondering if you are destroying this time dependency when shuffling is enabled.
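I.e. a hypothetical __getitem__ along these lines is the pattern I have in mind (the column names and the hourly offsets are just made up for the example):

def __getitem__(self, index):
    row = self.dataset.iloc[index]
    X = torch.tensor(
        [row['hour'],
         row['day_of_month'],
         self.dataset['consumption'].iloc[index - 24],    # load 24h ago (hourly data assumed)
         self.dataset['consumption'].iloc[index - 168]],  # load 1 week ago
        dtype=torch.float32,
    )
    y = torch.tensor([row['consumption']], dtype=torch.float32)
    return X, y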

The features (including the past load features) are created beforehand, and the whole dataset is stored in a pkl file. In the Dataset class I simply read this file, and in __getitem__ I just index the dataframe the data was read into and convert it to tensors:

def __getitem__(self, index):
    # select the feature columns and the target for a single row
    X = self.dataset[self.X_columns].iloc[index, :]
    y = self.dataset[['consumption']].iloc[index, :]
    # convert the pandas rows to tensors (created directly on `device`)
    X = torch.tensor(X, dtype=torch.float32, device=device)
    y = torch.tensor(y, dtype=torch.float32, device=device)
    return X, y