LSTM on CPU is significantly slower in PyTorch than in other frameworks

Hello everybody.

I’ve been experimenting with different models and frameworks, and I’ve noticed that, when using the CPU, training an LSTM model on the IMDB dataset is 3x to 5x slower in PyTorch (around 739 seconds) than with the Keras and TensorFlow implementations (around 201 seconds and around 135 seconds, respectively). Moreover, the first epoch takes significantly longer than the remaining epochs:

-PyTorch: Epoch 1 done in 235.0469572544098s
-PyTorch: Epoch 2 done in 125.87335634231567s
-PyTorch: Epoch 3 done in 125.26632475852966s
-PyTorch: Epoch 4 done in 126.59195327758789s
-PyTorch: Epoch 5 done in 126.00697541236877s

This does not happen with the other frameworks:

Keras:

Epoch 1/5
98/98 [==============================] - 41s 408ms/step - loss: 0.5280 - accuracy: 0.7300
Epoch 2/5
98/98 [==============================] - 40s 404ms/step - loss: 0.3441 - accuracy: 0.8566
Epoch 3/5
98/98 [==============================] - 40s 406ms/step - loss: 0.2384 - accuracy: 0.9080
Epoch 4/5
98/98 [==============================] - 40s 406ms/step - loss: 0.1625 - accuracy: 0.9386
Epoch 5/5
98/98 [==============================] - 40s 406ms/step - loss: 0.1176 - accuracy: 0.9580

TensorFlow:

-TensorFlow: Epoch 1 done in 37.287458419799805s
-TensorFlow: Epoch 2 done in 36.93708920478821s
-TensorFlow: Epoch 3 done in 36.85307550430298s
-TensorFlow: Epoch 4 done in 37.23605704307556s
-TensorFlow: Epoch 5 done in 37.04216718673706s

When using the GPU, the problem disappears:

PyTorch:

-PyTorch: Epoch 1 done in 2.6681089401245117s
-PyTorch: Epoch 2 done in 2.623263120651245s
-PyTorch: Epoch 3 done in 2.6285109519958496s
-PyTorch: Epoch 4 done in 2.6813976764678955s
-PyTorch: Epoch 5 done in 2.6470844745635986s

Keras:

Epoch 1/5
98/98 [==============================] - 6s 44ms/step - loss: 0.5434 - accuracy: 0.7220
Epoch 2/5
98/98 [==============================] - 4s 44ms/step - loss: 0.4673 - accuracy: 0.7822
Epoch 3/5
98/98 [==============================] - 4s 45ms/step - loss: 0.2500 - accuracy: 0.8998
Epoch 4/5
98/98 [==============================] - 4s 46ms/step - loss: 0.1581 - accuracy: 0.9434
Epoch 5/5
98/98 [==============================] - 4s 46ms/step - loss: 0.0985 - accuracy: 0.9660

TensorFlow:

-TensorFlow: Epoch 1 done in 4.04967999458313s
-TensorFlow: Epoch 2 done in 2.443302869796753s
-TensorFlow: Epoch 3 done in 2.450983762741089s
-TensorFlow: Epoch 4 done in 2.4626052379608154s
-TensorFlow: Epoch 5 done in 2.4663102626800537s

Here’s the information on my PyTorch build:

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON
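
For reference, this kind of build summary can be printed with PyTorch's introspection helpers, which also expose the CPU threading configuration (possibly relevant here):

import torch

# Full build/configuration summary (compiler, MKL, MKL-DNN, CUDA, flags, ...)
print(torch.__config__.show())

# OpenMP/MKL details and the number of intra-op threads PyTorch uses on the CPU
print(torch.__config__.parallel_info())
print(torch.get_num_threads())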

Here’s the model’s code:

import time

import torch


class PyTorchLSTMMod(torch.nn.Module):
    """This class implements the LSTM model using PyTorch.

    Arguments
    ---------
    initializer: function
        The weight initialization function from the torch.nn.init module that is used to initialize
        the initial weights of the models.
    vocabulary_size: int
        The number of words to consider among the most frequently used words.
    embedding_size: int
        The number of dimensions to which the words will be mapped.
    hidden_size: int
        The number of features of the hidden state.
    dropout: float
        The dropout rate that will be considered during training.
    """
    def __init__(self, initializer, vocabulary_size, embedding_size, hidden_size, dropout):
        super().__init__()
        
        self.embed = torch.nn.Embedding(num_embeddings=vocabulary_size, embedding_dim=embedding_size)

        self.dropout1 = torch.nn.Dropout(dropout)

        self.lstm = torch.nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, batch_first=True)
        initializer(self.lstm.weight_ih_l0)
        torch.nn.init.orthogonal_(self.lstm.weight_hh_l0)
        
        self.dropout2 = torch.nn.Dropout(dropout)
        
        self.fc = torch.nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, inputs, is_training=False):
        """This function implements the forward pass of the model.
        
        Arguments
        ---------
        inputs: Tensor
            The set of samples the model is to infer.
        is_training: boolean
            This indicates whether the forward pass is occurring during training
            (i.e., if we should consider dropout).
        """
        x = inputs
        x = self.embed(x)
        if is_training:
            x = self.dropout1(x)

        o, (h, c) = self.lstm(x)
        out = h[-1]
        if is_training:
            out = self.dropout2(out)
        f = self.fc(out)
        # Return raw logits; BCEWithLogitsLoss applies the sigmoid internally.
        return f.flatten()

    def train_pytorch(self, optimizer, epoch, train_loader, device, data_type, log_interval):
        """This function implements a single epoch of the training process of the PyTorch model.

        Arguments
        ---------
        self: PyTorchLSTMMod
            The model that is to be trained.
        optimizer: torch.optim.Optimizer
            The optimizer to be used during the training process.
        epoch: int
            The epoch associated with the training process.
        train_loader: DataLoader
            The DataLoader that is used to load the training data during the training process.
            Note that the DataLoader loads the data according to the batch size
            defined when it was initialized.
        device: string
            The string that indicates which device is to be used at runtime (i.e., GPU or CPU).
        data_type: string
            This string indicates whether mixed precision is to be used or not.
        log_interval: int
            The interval, in number of batches, at which the training progress is logged.
            A value of -1 disables logging.
        """
        self.train()

        epoch_start = time.time()
        
        loss_fn = torch.nn.BCEWithLogitsLoss()

        if data_type == 'mixed':
            scaler = torch.cuda.amp.GradScaler()

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()

            if data_type == 'mixed':
                with torch.cuda.amp.autocast():
                    output = self(data, is_training=True)
                    loss = loss_fn(output, target)

                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                output = self(data, is_training=True)
                loss = loss_fn(output, target)
                loss.backward()
                optimizer.step()

            if log_interval == -1:
                continue

            if batch_idx % log_interval == 0:
                print('Train set, Epoch {}\tLoss: {:.6f}'.format(
                    epoch, loss.item()))
        print("-PyTorch: Epoch {} done in {}s\n".format(epoch, time.time() - epoch_start))

    def test_pytorch(self, test_loader, device, data_type):
        """This function implements the testing process of the PyTorch model and returns the accuracy
        obtained on the testing dataset.

        Arguments
        ---------
        self: PyTorchLSTMMod
            The model that is to be tested.
        test_loader: DataLoader
            The DataLoader that is used to load the testing data during the testing process.
            Note that the DataLoader loads the data according to the batch size
            defined when it was initialized.
        device: string
            The string that indicates which device is to be used at runtime (i.e., GPU or CPU).
        data_type: string
            This string indicates whether mixed precision is to be used or not.

        """
        
        
        self.eval()

        with torch.no_grad():

            #Loss and correct prediction accumulators
            test_loss = 0
            correct = 0
            total = 0

            loss_fn = torch.nn.BCEWithLogitsLoss()


            for data, target in test_loader:

                data, target = data.to(device), target.to(device)

                if data_type == 'mixed':
                    with torch.cuda.amp.autocast():
                        outputs = self(data).detach()
                        test_loss += loss_fn(outputs, target).detach()
                else:
                    outputs = self(data).detach()
                    test_loss += loss_fn(outputs, target).detach()

                # The model outputs logits, so a logit >= 0 corresponds to a
                # predicted probability >= 0.5.
                preds = (outputs >= 0).float() == target
                correct += preds.sum().item()
                total += preds.size(0)

            #Print log
            test_loss /= len(test_loader.dataset)
            print('\nTest set, Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
                test_loss, correct, len(test_loader.dataset),
                100. * (correct / total)))

            return 100. * (correct / total)
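
For completeness, here is a rough sketch of how the model is driven; the hyperparameters and the random placeholder data below are illustrative only, not the exact setup used for the timings above:

# Illustrative driver with random placeholder data (the real runs use the IMDB
# dataset; all values here are placeholders, not the ones from the timings above).
import torch
from torch.utils.data import DataLoader, TensorDataset

device = 'cuda' if torch.cuda.is_available() else 'cpu'

vocabulary_size, seq_len, n_samples = 20000, 200, 1024
inputs = torch.randint(0, vocabulary_size, (n_samples, seq_len))
labels = torch.randint(0, 2, (n_samples,)).float()
train_loader = DataLoader(TensorDataset(inputs, labels), batch_size=256)
test_loader = DataLoader(TensorDataset(inputs, labels), batch_size=256)

model = PyTorchLSTMMod(initializer=torch.nn.init.xavier_uniform_,
                       vocabulary_size=vocabulary_size,
                       embedding_size=128,
                       hidden_size=128,
                       dropout=0.5).to(device)

optimizer = torch.optim.Adam(model.parameters())

for epoch in range(1, 6):
    model.train_pytorch(optimizer, epoch, train_loader, device,
                        data_type='float32', log_interval=-1)
model.test_pytorch(test_loader, device, data_type='float32')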

Any ideas what could be the cause of this?

Thanks!