Strange difference in performance between 2 regression programs

Hi,

Here are 2 regression programs with the same set of data and the same basic model (2 inputs and 1 output):

Program 1:

import numpy as np
import torch
import time

# Define the model
def model(x):
    return x @ w.t() + b

# MSE loss
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

def fit(num_epochs, model, loss_fn, w, b):
    for i in range(num_epochs):
        # Generate predictions and compute the loss on the full dataset
        preds = model(inputs)
        loss = loss_fn(preds, targets)
        # Compute gradients
        loss.backward()
        # Manual SGD update, then reset the gradients
        with torch.no_grad():
            w -= w.grad * lr
            b -= b.grad * lr
            w.grad.zero_()
            b.grad.zero_()

lr = 1e-3
nb_epochs = 1000

nb_data = 1000
min_x = 2.0
max_x = 3.0
min_y = 5.0
max_y = 9.0

X = np.linspace(min_x, max_x, num=nb_data, dtype=np.float32)
Y = np.linspace(min_y, max_y, num=nb_data, dtype=np.float32)
inputs = np.stack((X, Y), axis=1)
targets = X + Y
targets = targets.reshape(targets.size, 1)

# Convert inputs and targets to tensors
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)

# Weights and biases
w = torch.randn(1, 2, requires_grad=True)
b = torch.randn(1, requires_grad=True)

begin = time.time()
fit(nb_epochs, model, mse, w, b)
end = time.time()
print(f"Duration = {end-begin} s")

# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(f"Loss = {loss}")

# Calculate a prediction
pred_y = model(torch.Tensor([[2.5, 6]]))
print("predict ", pred_y.item(), " should be ===>",8.5 )

Program 2:

import torch.nn as nn
import torch
import numpy as np
from torch.utils.data import TensorDataset, DataLoader
import torch.nn.functional as F
import time

# Define a utility function to train the model
def fit(num_epochs, model, loss_fn, opt):
    for epoch in range(num_epochs):
        for xb,yb in train_dl:
            # Generate predictions
            pred = model(xb)
            loss = loss_fn(pred, yb)
            # Perform gradient descent
            loss.backward()
            opt.step()
            opt.zero_grad()

device = "cpu"
#device = "cuda:0"

lr = 1e-3
batch_size = 100
nb_epochs = 1000

nb_data = 1000
min_x = 2.0
max_x = 3.0
min_y = 5.0
max_y = 9.0

X = np.linspace(min_x, max_x, num=nb_data, dtype=np.float32)
Y = np.linspace(min_y, max_y, num=nb_data, dtype=np.float32)
inputs = np.stack((X, Y), axis=1)
targets = X + Y
targets = targets.reshape(targets.size, 1)

inputs = torch.from_numpy(inputs).to(device)
targets = torch.from_numpy(targets).to(device)

train_ds = TensorDataset(inputs, targets)
# Define data loader
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

# Define model, 2 inputs, 1 output
model = nn.Linear(2, 1).to(device)
# Define optimizer
opt = torch.optim.SGD(model.parameters(), lr=lr)
# Define loss function
loss_fn = F.mse_loss

# Train the model for some epochs
begin = time.time()
fit(nb_epochs, model, loss_fn, opt)
end = time.time()
print(f"Duration = {end-begin} s")

# Calculate final loss
preds = model(inputs)
loss = loss_fn(preds, targets)
print(f"Loss = {loss}")

# Evaluate a prediction
pred_y = model(torch.Tensor([[2.5, 6]]).to(device))
print("predict ", pred_y.item(), " should be ===>",8.5 )

They use the same amount of data (1000 samples) and the same number of epochs (1000).

When I run these programs on the same machine (Ubuntu 20.04, 32 GB RAM, Core i7, NVIDIA 2080, torch 1.12.1), here are the durations for the training (the fit function):

program 1: 0.25 s
program 2 (on CPU): 8.5 s
program 2 (on GPU): 11.8 s

Why is there such a big difference between program 1 and program 2? And why is program 2 slower on the GPU than on the CPU?

Regards,

Philippe

Your model is tiny (a single operation/layer), and you are most likely seeing the overhead of creating the DataLoader, shuffling the data, etc.
You can profile parts of your code to narrow down where the slowdown is coming from.
Just executing the DataLoader on my system takes a few seconds for 1000 epochs.
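
As a rough illustration (the timings will obviously differ between machines), iterating the DataLoader alone, without any forward or backward pass, could be timed like this:

import time
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Same synthetic data as in program 2
X = np.linspace(2.0, 3.0, num=1000, dtype=np.float32)
Y = np.linspace(5.0, 9.0, num=1000, dtype=np.float32)
inputs = torch.from_numpy(np.stack((X, Y), axis=1))
targets = torch.from_numpy((X + Y).reshape(-1, 1))

train_dl = DataLoader(TensorDataset(inputs, targets), batch_size=100, shuffle=True)

# Iterate the DataLoader only, to isolate the data loading overhead
begin = time.time()
for epoch in range(1000):
    for xb, yb in train_dl:
        pass
end = time.time()
print(f"DataLoader-only duration = {end - begin} s")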

Thank you for your answer.

I measure the duration just for the fit function, so the preparation of the data (creation of the DataLoader and shuffling) is not taken into account in the duration. But maybe this extra time is due to extracting the data from the DataLoader in the

for xb,yb in train_dl:

loop?

Another strange thing is that it takes more time when I use the GPU, but maybe that is due to the time needed to transfer the data to the GPU?

For information: at first, I wanted to understand why a simple regression example took so long to run with PyTorch Lightning. To compare with a simpler solution, I started by creating these tiny models with plain PyTorch, which led me to this question about the difference in performance between the two solutions.

Philippe

That’s not true, since iterating the DataLoader will recreate its iterator in each epoch to e.g. create a new sampler etc. If num_workers>0 is used, then the workers will also be re-spawned unless persistent_workers=True is used. Again, you can profile your code to narrow down the slowdown.
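
For example (just an illustrative configuration, not a recommendation for this tiny in-memory dataset, where num_workers=0 is usually the fastest option), keeping the worker processes alive across epochs would look like this:

from torch.utils.data import DataLoader

# Workers are spawned once and reused for every epoch instead of being
# re-created each time the DataLoader is iterated
train_dl = DataLoader(
    train_ds,
    batch_size=100,
    shuffle=True,
    num_workers=2,
    persistent_workers=True,
)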

Your GPU profiling is invalid, since CUDA operations are executed asynchronously, so you would need to synchronize the code before starting and stopping the timers. However, even with proper profiling I would not expect to see any speedup, since your entire code is already bottlenecked by the data loading. A single tiny linear layer with in_features=2 and out_features=1 would also not benefit largely from GPU execution.
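
A minimal sketch of a synchronized measurement for program 2 (assuming device = "cuda:0" and the model, loss_fn, opt and nb_epochs defined there) could look like this:

import time
import torch

# Wait for all pending CUDA kernels to finish before starting the timer
torch.cuda.synchronize()
begin = time.time()

fit(nb_epochs, model, loss_fn, opt)

# Synchronize again so the measured interval includes all launched kernels
torch.cuda.synchronize()
end = time.time()
print(f"Duration = {end - begin} s")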

Thank you for all this information.

I’ll continue my investigation with profiling and a bigger dataset.

Philippe


Sounds good!
Also, try to increase the actual workload of the model and let me know how it goes.
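
For example (an arbitrary size, purely to add compute), the single linear layer of program 2 could be replaced with a deeper stack while keeping the 2-input / 1-output interface:

import torch.nn as nn

# A heavier, arbitrarily sized model so the forward/backward passes
# dominate the runtime instead of the data loading
model = nn.Sequential(
    nn.Linear(2, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1),
).to(device)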