Hello everybody.
I’ve been experimenting with different models and frameworks, and I’ve noticed that, on CPU, training an LSTM model on the IMDB dataset is 3x to 5x slower in PyTorch (around 739 seconds) than in the Keras and TensorFlow implementations (around 201 seconds and around 135 seconds, respectively). I’ve also noticed that, in PyTorch, the first epoch takes significantly longer than the remaining epochs:
-PyTorch: Epoch 1 done in 235.0469572544098s
-PyTorch: Epoch 2 done in 125.87335634231567s
-PyTorch: Epoch 3 done in 125.26632475852966s
-PyTorch: Epoch 4 done in 126.59195327758789s
-PyTorch: Epoch 5 done in 126.00697541236877s
This doesn’t occur with the other frameworks:
Keras:
Epoch 1/5
98/98 [==============================] - 41s 408ms/step - loss: 0.5280 - accuracy: 0.7300
Epoch 2/5
98/98 [==============================] - 40s 404ms/step - loss: 0.3441 - accuracy: 0.8566
Epoch 3/5
98/98 [==============================] - 40s 406ms/step - loss: 0.2384 - accuracy: 0.9080
Epoch 4/5
98/98 [==============================] - 40s 406ms/step - loss: 0.1625 - accuracy: 0.9386
Epoch 5/5
98/98 [==============================] - 40s 406ms/step - loss: 0.1176 - accuracy: 0.9580
TensorFlow:
-TensorFlow: Epoch 1 done in 37.287458419799805s
-TensorFlow: Epoch 2 done in 36.93708920478821s
-TensorFlow: Epoch 3 done in 36.85307550430298s
-TensorFlow: Epoch 4 done in 37.23605704307556s
-TensorFlow: Epoch 5 done in 37.04216718673706s
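To narrow down where the extra first-epoch time goes, the per-batch durations could be recorded and the first few batches compared against the rest. The following is only a minimal sketch of that idea, not part of my benchmark code: `time_batches`, `summarize`, and the dummy step are hypothetical names, and the step function stands in for one forward/backward pass.

```python
import time


def time_batches(batches, step):
    """Run step(batch) for every batch and return the per-batch durations in seconds."""
    durations = []
    for batch in batches:
        start = time.perf_counter()
        step(batch)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations, warmup=1):
    """Split the timings into a warm-up part and a steady-state part."""
    warm = durations[:warmup]
    steady = durations[warmup:]
    return {
        "warmup_total": sum(warm),
        "steady_mean": sum(steady) / len(steady) if steady else 0.0,
        "total": sum(durations),
    }


if __name__ == "__main__":
    # Dummy workload standing in for one training step (hypothetical).
    def dummy_step(batch):
        sum(x * x for x in batch)

    timings = time_batches([range(1000)] * 5, dummy_step)
    print(summarize(timings))
```

If the slowdown is concentrated in the first handful of batches, that would point to a one-time setup cost rather than a uniformly slower first epoch.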
On GPU, the problem seems to disappear:
PyTorch:
-PyTorch: Epoch 1 done in 2.6681089401245117s
-PyTorch: Epoch 2 done in 2.623263120651245s
-PyTorch: Epoch 3 done in 2.6285109519958496s
-PyTorch: Epoch 4 done in 2.6813976764678955s
-PyTorch: Epoch 5 done in 2.6470844745635986s
Keras:
Epoch 1/5
98/98 [==============================] - 6s 44ms/step - loss: 0.5434 - accuracy: 0.7220
Epoch 2/5
98/98 [==============================] - 4s 44ms/step - loss: 0.4673 - accuracy: 0.7822
Epoch 3/5
98/98 [==============================] - 4s 45ms/step - loss: 0.2500 - accuracy: 0.8998
Epoch 4/5
98/98 [==============================] - 4s 46ms/step - loss: 0.1581 - accuracy: 0.9434
Epoch 5/5
98/98 [==============================] - 4s 46ms/step - loss: 0.0985 - accuracy: 0.9660
TensorFlow:
-TensorFlow: Epoch 1 done in 4.04967999458313s
-TensorFlow: Epoch 2 done in 2.443302869796753s
-TensorFlow: Epoch 3 done in 2.450983762741089s
-TensorFlow: Epoch 4 done in 2.4626052379608154s
-TensorFlow: Epoch 5 done in 2.4663102626800537s
Here’s the information on my PyTorch build:
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON
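Since this build uses OpenMP and MKL, the thread settings on the machine might be relevant to the CPU numbers. Here is a minimal sketch for recording them next to the build information. It assumes `OMP_NUM_THREADS` and `MKL_NUM_THREADS` are the variables that matter for this setup, and `threading_environment` is a hypothetical helper name. I have not verified that thread settings explain the gap.

```python
import os


def threading_environment():
    """Collect the CPU count and common OpenMP/MKL thread settings from the environment."""
    names = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")
    info = {"cpu_count": os.cpu_count()}
    for name in names:
        # "<unset>" marks variables that are not defined in this process.
        info[name] = os.environ.get(name, "<unset>")
    return info


if __name__ == "__main__":
    for key, value in threading_environment().items():
        print(f"{key}: {value}")
```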
Here’s the model’s code:
import time

import torch


class PyTorchLSTMMod(torch.nn.Module):
    """This class implements the LSTM model using PyTorch.

    Arguments
    ---------
    initializer: function
        The weight initialization function from the torch.nn.init module that is used to
        initialize the input-to-hidden weights of the LSTM.
    vocabulary_size: int
        The number of most frequent words to keep in the vocabulary.
    embedding_size: int
        The number of dimensions each word is mapped to.
    hidden_size: int
        The number of features of the hidden state.
    dropout: float
        The dropout rate that will be considered during training.
    """

    def __init__(self, initializer, vocabulary_size, embedding_size, hidden_size, dropout):
        super().__init__()
        self.embed = torch.nn.Embedding(num_embeddings=vocabulary_size, embedding_dim=embedding_size)
        self.dropout1 = torch.nn.Dropout(dropout)
        self.lstm = torch.nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, batch_first=True)
        # Input-to-hidden weights use the given initializer; hidden-to-hidden weights are orthogonal.
        initializer(self.lstm.weight_ih_l0)
        torch.nn.init.orthogonal_(self.lstm.weight_hh_l0)
        self.dropout2 = torch.nn.Dropout(dropout)
        self.fc = torch.nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, inputs, is_training=False):
        """This function implements the forward pass of the model.

        Arguments
        ---------
        inputs: Tensor
            The set of samples the model is to infer.
        is_training: boolean
            This indicates whether the forward pass is occurring during training
            (i.e., if we should consider dropout).
        """
        x = inputs
        x = self.embed(x)
        if is_training:
            x = self.dropout1(x)
        o, (h, c) = self.lstm(x)
        # Use the final hidden state of the last LSTM layer as the sequence representation.
        out = h[-1]
        if is_training:
            out = self.dropout2(out)
        f = self.fc(out)
        # Return raw logits; BCEWithLogitsLoss applies the sigmoid internally.
        return f.flatten()  # torch.sigmoid(f).flatten()

    def train_pytorch(self, optimizer, epoch, train_loader, device, data_type, log_interval):
        """This function implements a single epoch of the training process of the PyTorch model.

        Arguments
        ---------
        self: PyTorchLSTMMod
            The model that is to be trained.
        optimizer: torch.optim.Optimizer
            The optimizer to be used during the training process.
        epoch: int
            The epoch associated with the training process.
        train_loader: DataLoader
            The DataLoader that is used to load the training data during the training process.
            Note that the DataLoader loads the data according to the batch size
            defined when it was initialized.
        device: string
            The string that indicates which device is to be used at runtime (i.e., GPU or CPU).
        data_type: string
            This string indicates whether mixed precision is to be used or not.
        log_interval: int
            The number of batches between progress logs (-1 disables logging).
        """
        self.train()
        epoch_start = time.time()
        loss_fn = torch.nn.BCEWithLogitsLoss()
        if data_type == 'mixed':
            scaler = torch.cuda.amp.GradScaler()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            if data_type == 'mixed':
                # Forward pass and loss under autocast; the scaled backward pass runs outside it.
                with torch.cuda.amp.autocast():
                    output = self(data, is_training=True)
                    loss = loss_fn(output, target)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                output = self(data, is_training=True)
                loss = loss_fn(output, target)
                loss.backward()
                optimizer.step()
            if log_interval == -1:
                continue
            if batch_idx % log_interval == 0:
                print('Train set, Epoch {}\tLoss: {:.6f}'.format(
                    epoch, loss.item()))
        print("-PyTorch: Epoch {} done in {}s\n".format(epoch, time.time() - epoch_start))

    def test_pytorch(self, test_loader, device, data_type):
        """This function implements the testing process of the PyTorch model and returns the accuracy
        obtained on the testing dataset.

        Arguments
        ---------
        self: PyTorchLSTMMod
            The model that is to be tested.
        test_loader: DataLoader
            The DataLoader that is used to load the testing data during the testing process.
            Note that the DataLoader loads the data according to the batch size
            defined when it was initialized.
        device: string
            The string that indicates which device is to be used at runtime (i.e., GPU or CPU).
        data_type: string
            This string indicates whether mixed precision is to be used or not.
        """
        self.eval()
        with torch.no_grad():
            # Loss and correct prediction accumulators
            test_loss = 0.0
            correct = 0
            total = 0
            loss_fn = torch.nn.BCEWithLogitsLoss()
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                if data_type == 'mixed':
                    with torch.cuda.amp.autocast():
                        outputs = self(data)
                        loss = loss_fn(outputs, target)
                else:
                    outputs = self(data)
                    loss = loss_fn(outputs, target)
                test_loss += loss.item()
                # The model returns logits, so a probability of 0.5 corresponds to a logit of 0.
                preds = (outputs >= 0.0).float() == target
                correct += preds.sum().item()
                total += preds.size(0)
            # Print log
            # loss_fn already averages within each batch, so average over the number of batches.
            test_loss /= len(test_loader)
            print('\nTest set, Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
                test_loss, correct, len(test_loader.dataset),
                100. * (correct / total)))
            return 100. * (correct / total)
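A side note on the accuracy computation, unrelated to the timing question: since `forward` returns raw logits (the sigmoid is commented out), the decision threshold has to be expressed on the logit scale. The sketch below uses only the standard library to show how the two scales relate. `sigmoid`, `predict_from_logit`, and `logit_threshold` are hypothetical helper names, not functions from the code above.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_from_logit(logit, prob_threshold=0.5):
    """Classify a raw logit by thresholding the corresponding probability."""
    return sigmoid(logit) >= prob_threshold


def logit_threshold(prob_threshold):
    """Return the logit value equivalent to a given probability threshold."""
    return math.log(prob_threshold / (1.0 - prob_threshold))


if __name__ == "__main__":
    # A probability threshold of 0.5 corresponds to a logit threshold of 0.
    print(logit_threshold(0.5))
    # A logit of 0.2 is a positive prediction, although 0.2 < 0.5.
    print(predict_from_logit(0.2))
```

In other words, comparing logits against 0.5 is the same as requiring a probability of roughly 0.62, which labels some positive predictions as negative.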
Any ideas what could be the cause of this?
Thanks!