While converting a colleague's Keras network to PyTorch, I noticed that training became significantly slower. The conversion itself is validated: the PyTorch model reproduces the Keras results on real data.
Below, I’ve provided some minimal examples that demonstrate the behavior using random data and a simple fully-connected network.
Summary: on an RTX 2080 Super GPU (driver version 460.80, CUDA version 11.2) in an Ubuntu 18.04.5 LTS container, I get ~2 seconds/epoch from Keras and ~15 seconds/epoch from PyTorch.
While generic suggestions for making PyTorch faster are always appreciated, I particularly want to understand what Keras is doing that PyTorch isn't (or vice versa) in such a simple setup.
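For reference on methodology: the per-epoch numbers below come straight from the Keras progress bar and from tqdm. A minimal sketch of how one could check the timing independently (timed_epoch is a hypothetical helper; the torch.cuda.synchronize() calls matter because CUDA kernels launch asynchronously):
import time
import torch
def timed_epoch(run_one_epoch):
    # run_one_epoch: any callable that performs one full training epoch
    torch.cuda.synchronize()  # drain pending GPU work before starting the clock
    start = time.perf_counter()
    run_one_epoch()
    torch.cuda.synchronize()  # make sure the epoch's GPU work has actually finished
    return time.perf_counter() - start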
Keras code:
# imports and basic setup
import numpy as np
import tensorflow as tf
# print versions
print_versions = True
if print_versions:
    import platform
    print("Software versions")
    print(f" * Python: {platform.python_version()}")
    print(f" * numpy: {np.__version__}")
    print(f" * tensorflow: {tf.__version__}")
# generate random data
N_train_class0 = N_train_class1 = 1_250_000
event_dim = 8
rng_seed = 0
rng = np.random.Generator(np.random.PCG64(rng_seed))
class0_events = rng.uniform(100,500,size=(N_train_class0, event_dim))
class0_ytarget = np.zeros(shape=(N_train_class0, 1))
class1_events = rng.uniform(100,500,size=(N_train_class1, event_dim))
class1_ytarget = np.zeros(shape=(N_train_class1, 1))  # note: also zeros, so the loss collapses; irrelevant for the timing comparison
permutation = rng.permutation(N_train_class0 + N_train_class1)
events_train = np.concatenate([class0_events, class1_events])[permutation]
ytarget_train = np.concatenate([class0_ytarget, class1_ytarget])[permutation]
# setup model
tf.random.set_seed(0)
network = tf.keras.Sequential(name="event_variable")
network.add(tf.keras.layers.InputLayer(input_shape=(event_dim,)))
hidden_node_counts = [128, 64, 64, 64, 32]
for node_count in hidden_node_counts:
    network.add(tf.keras.layers.Dense(node_count, activation='relu'))
network.add(tf.keras.layers.Dense(1, activation='sigmoid'))
network.summary()
event_input_tensor = tf.keras.Input(shape=(event_dim,), name='event_input')
output_tensor = network(event_input_tensor)
model = tf.keras.Model(
    inputs = event_input_tensor,
    outputs = output_tensor
)
model.summary()
# prepare for training
model.compile(optimizer='adam', loss='binary_crossentropy')
# do training
model.fit(x=events_train, y=ytarget_train, batch_size=5000, epochs=10, validation_split=0.2)
Keras output:
Software versions
* Python: 3.6.9
* numpy: 1.19.5
* tensorflow: 2.5.0
Model: "event_variable"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 1152
_________________________________________________________________
dense_1 (Dense) (None, 64) 8256
_________________________________________________________________
dense_2 (Dense) (None, 64) 4160
_________________________________________________________________
dense_3 (Dense) (None, 64) 4160
_________________________________________________________________
dense_4 (Dense) (None, 32) 2080
_________________________________________________________________
dense_5 (Dense) (None, 1) 33
=================================================================
Total params: 19,841
Trainable params: 19,841
Non-trainable params: 0
_________________________________________________________________
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
event_input (InputLayer) [(None, 8)] 0
_________________________________________________________________
event_variable (Sequential) (None, 1) 19841
=================================================================
Total params: 19,841
Trainable params: 19,841
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
400/400 [==============================] - 2s 4ms/step - loss: 0.0222 - val_loss: 3.5520e-16
Epoch 2/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 3/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 4/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 5/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 6/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 7/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 8/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 9/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
Epoch 10/10
400/400 [==============================] - 2s 4ms/step - loss: 3.2005e-16 - val_loss: 3.5520e-16
PyTorch code:
# imports and basic setup
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, random_split
import torch.optim as optim
from tqdm import tqdm
import numpy as np
# print versions
print_versions = True
if print_versions:
    import platform
    print("Software versions")
    print(f" * Python: {platform.python_version()}")
    print(f" * numpy: {np.__version__}")
    print(f" * torch: {torch.__version__}")
# choose cpu or gpu
if torch.cuda.is_available():
    device = torch.device('cuda')
    print("Using GPU")
else:
    device = torch.device('cpu')
    print("Using CPU")
# generate random data
N_train_class0 = N_train_class1 = 1_250_000
event_dim = 8
rng_seed = 0
rng = np.random.Generator(np.random.PCG64(rng_seed))
class0_events = rng.uniform(100,500,size=(N_train_class0, event_dim))
class0_ytarget = np.zeros(shape=(N_train_class0, 1))
class1_events = rng.uniform(100,500,size=(N_train_class1, event_dim))
class1_ytarget = np.zeros(shape=(N_train_class1, 1))  # note: also zeros, matching the Keras script above
permutation = rng.permutation(N_train_class0 + N_train_class1)
dataset = TensorDataset(
    torch.Tensor(np.concatenate([class0_events, class1_events])[permutation]).to(device),
    torch.Tensor(np.concatenate([class0_ytarget, class1_ytarget])[permutation]).to(device)
)
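# For comparison, the more conventional pattern would keep the dataset on the CPU
# and move one batch at a time to the GPU inside the training loop; a sketch,
# not what produced the timings below:
# dataset = TensorDataset(
#     torch.from_numpy(np.concatenate([class0_events, class1_events])[permutation]).float(),
#     torch.from_numpy(np.concatenate([class0_ytarget, class1_ytarget])[permutation]).float()
# )
# ...and in the loop: event_data = event_data.to(device, non_blocking=True)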
# setup model
torch.manual_seed(0)
def init_weights(m):
    if type(m) == nn.Linear:
        torch.nn.init.xavier_uniform_(m.weight)  # in-place form; plain xavier_uniform is deprecated
        m.bias.data.fill_(0.0)
# note: init_weights is defined but never applied (that would be model.apply(init_weights))
hidden_node_counts = [128, 64, 64, 64, 32]
layers = [
    nn.Linear(in_features = event_dim, out_features = hidden_node_counts[0]),
    nn.ReLU()
]
for counter in range(len(hidden_node_counts)-1):
    layers.extend([
        nn.Linear(in_features = hidden_node_counts[counter], out_features = hidden_node_counts[counter+1]),
        nn.ReLU()
    ])
layers.extend([
    nn.Linear(in_features = hidden_node_counts[-1], out_features = 1),
    nn.Sigmoid()
])
model = nn.Sequential(*layers).to(device)
print(model)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Trainable params: {}".format(total_params))
# prepare for training
pct_valid = 0.2
num_valid = int(len(dataset)*pct_valid)
split_train, split_valid = random_split(dataset, [len(dataset)-num_valid, num_valid])
batch_size = 5000
loader_train = DataLoader(split_train, batch_size=batch_size, shuffle=True)
loader_valid = DataLoader(split_valid, batch_size=batch_size, shuffle=True)
criterion = nn.BCELoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
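# note: loader knobs like num_workers and pin_memory are left at their defaults
# on purpose; both target CPU-resident data, e.g.
#     DataLoader(split_train, batch_size=batch_size, shuffle=True,
#                num_workers=4, pin_memory=True)
# and neither applies to tensors that already live on the GPU as above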
# do training
epochs = 10
for epoch in range(epochs):
    print("Epoch {}/{}".format(epoch+1, epochs))
    # training
    model.train()
    train_loss = 0
    for i, data in tqdm(enumerate(loader_train), unit="batch", total=len(loader_train)):
        optimizer.zero_grad()
        event_data, target_data = data
        output = model(event_data)
        batch_loss = criterion(output, target_data)
        batch_loss.backward()
        optimizer.step()
        train_loss += batch_loss.item()
    train_loss /= len(loader_train)
    tqdm.write("loss: {}".format(train_loss))
    # validation
    model.eval()
    valid_loss = 0
    with torch.no_grad():
        for i, data in enumerate(loader_valid):
            event_data, target_data = data
            output = model(event_data)
            batch_loss = criterion(output, target_data)
            valid_loss += batch_loss.item()
    valid_loss /= len(loader_valid)
    tqdm.write("val_loss: {}".format(valid_loss))
PyTorch output:
Software versions
* Python: 3.8.8
* numpy: 1.19.2
* torch: 1.8.1
Using GPU
Sequential(
(0): Linear(in_features=8, out_features=128, bias=True)
(1): ReLU()
(2): Linear(in_features=128, out_features=64, bias=True)
(3): ReLU()
(4): Linear(in_features=64, out_features=64, bias=True)
(5): ReLU()
(6): Linear(in_features=64, out_features=64, bias=True)
(7): ReLU()
(8): Linear(in_features=64, out_features=32, bias=True)
(9): ReLU()
(10): Linear(in_features=32, out_features=1, bias=True)
(11): Sigmoid()
)
Trainable params: 19841
Epoch 1/10
100%|██████████| 400/400 [00:15<00:00, 25.90batch/s]
loss: 0.0007586651195278834
val_loss: 9.65595376399564e-11
Epoch 2/10
100%|██████████| 400/400 [00:15<00:00, 26.06batch/s]
loss: 8.720163338594642e-11
val_loss: 6.139279044338475e-11
Epoch 3/10
100%|██████████| 400/400 [00:15<00:00, 26.20batch/s]
loss: 5.653502557594059e-11
val_loss: 3.945827847622041e-11
Epoch 4/10
100%|██████████| 400/400 [00:15<00:00, 26.17batch/s]
loss: 3.585220385111595e-11
val_loss: 2.4437905700794295e-11
Epoch 5/10
100%|██████████| 400/400 [00:15<00:00, 25.85batch/s]
loss: 2.306700272415238e-11
val_loss: 1.6093254558841032e-11
Epoch 6/10
100%|██████████| 400/400 [00:15<00:00, 26.12batch/s]
loss: 1.5616419936706484e-11
val_loss: 9.655952790815769e-12
Epoch 7/10
100%|██████████| 400/400 [00:15<00:00, 26.15batch/s]
loss: 1.0102988931559586e-11
val_loss: 5.602836770923769e-12
Epoch 8/10
100%|██████████| 400/400 [00:15<00:00, 26.10batch/s]
loss: 7.271767608636043e-12
val_loss: 3.4570694570912332e-12
Epoch 9/10
100%|██████████| 400/400 [00:15<00:00, 25.95batch/s]
loss: 5.5432326155450965e-12
val_loss: 2.6226043559063328e-12
Epoch 10/10
100%|██████████| 400/400 [00:15<00:00, 26.10batch/s]
loss: 4.321337200834802e-12
val_loss: 1.907348680038612e-12
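For what it's worth, both frameworks do the same number of optimizer steps per epoch, so the gap is entirely per-step:
n_train = 2 * 1_250_000               # total events
n_steps = int(n_train * 0.8) // 5000  # 80% train split / batch size of 5000
print(n_steps)                        # 400 steps/epoch in both frameworks
print(2.0 / 400 * 1000)               # Keras: ~5 ms/step (its bar reports 4 ms/step)
print(15.0 / 400 * 1000)              # PyTorch: ~37.5 ms/step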