DataLoader is extremely slow even with a small dataset in memory

Here is a simple numerical regression example with random data.
The input has shape (10000, 300) and the output has shape (10000, 3); they are related by a simple quadratic function, so the slowdown is not caused by the data distribution. I first ran into this problem on a real dataset of mine. The model is a 3-layer fully-connected network with batch normalization.
I tried to use the same parameters for Keras and PyTorch on CPU, since the dataset is relatively small. It turns out Keras is almost 3x as fast.
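For concreteness, the data is generated roughly like this (just a sketch of the shapes and the quadratic relationship; the exact mapping is in the Colab notebook):

import numpy as np

# 10000 samples, 300 features in, 3 targets out
# Y is a quadratic function of X (one possible mapping, for illustration only)
X = np.random.randn(10000, 300).astype(np.float32)
W = np.random.randn(300, 3).astype(np.float32) / 300.0
Y = (X ** 2) @ W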

colab script: https://colab.research.google.com/drive/1BQTCbIUOv-afuRbSn2chA4ae1bfD0bDk

from keras.layers import Input, Dense, BatchNormalization
from keras.models import Model

def build_keras_model(optimizer='adam'):
  input = Input(shape=(300,))
  x = Dense(100, activation='tanh')(input)
  x = BatchNormalization()(x)
  x = Dense(50, activation='tanh')(x)
  x = BatchNormalization()(x)
  output = Dense(3, activation='linear')(x)
  model = Model(inputs=input, outputs=output)
  model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
  return model

keras_model = build_keras_model()
keras_model.fit(X, Y, batch_size=512, epochs=200)

import torch
from torch import nn

def build_pytorch_model():
  model = nn.Sequential(
    nn.Linear(300, 100),
    nn.Tanh(),
    nn.BatchNorm1d(100, momentum=0.01),
    nn.Linear(100, 50),
    nn.Tanh(),
    nn.BatchNorm1d(50, momentum=0.01),
    nn.Linear(50, 3)
  )
  return model

loss_func = nn.MSELoss()
torch_model = build_pytorch_model()
optimizer = torch.optim.Adam(torch_model.parameters(), lr=lr)
fit_torch(torch_model, optimizer, loss_func, epochs, dl)
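
For context, dl, lr and epochs above come from the usual in-memory setup, roughly like this (lr is an assumed value here; the exact numbers are in the Colab notebook):

from torch.utils.data import TensorDataset, DataLoader

# wrap the in-memory arrays in a TensorDataset and a plain DataLoader
ds = TensorDataset(torch.tensor(X, dtype=torch.float32),
                   torch.tensor(Y, dtype=torch.float32))
dl = DataLoader(ds, batch_size=512, shuffle=True)  # same batch size as the Keras run
lr = 1e-3      # assumed value
epochs = 200   # same number of epochs as the Keras run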

The script and results are shared through Colab. Thanks to anyone who can point out the reason or tell me where I went wrong.

The speed issue seems to be caused by the DataLoader. Is there any way to speed this up? All the data is in memory, so it should be fast. I could write the random sampling myself (a sketch of what I mean is below the profiler output), but would that be faster? How does Keras manage to be that fast? I also tried using a full batch, and it's not much faster.

Here is the line_profiler output for fit_torch and get_mae:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31                                           def fit_torch(model, optimizer, loss_func, epochs, train_dl):
    32       201        309.0      1.5      0.0    for epoch in range(epochs):
    33       200        268.0      1.3      0.0      t0 = time.time()
    34      4200   21929837.0   5221.4     27.8      for x, y in train_dl:
    35      4000     493733.0    123.4      0.6        model.train()
    36      4000   11947182.0   2986.8     15.1        loss = loss_func(model(x),y)
    37      4000     326891.0     81.7      0.4        optimizer.zero_grad()
    38      4000   13885546.0   3471.4     17.6        loss.backward()
    39      4000    3682309.0    920.6      4.7        optimizer.step()
    40                                           #     mse = get_mse(model, dl)
    41       200       2280.0     11.4      0.0      mse = loss.item()
    42       200   26631701.0 133158.5     33.7      mae = get_mae(model, dl)
    43       200      73901.0    369.5      0.1      print('epoch: {} MSE:{} MAE:{} time per epoch:{}'.format(epoch,mse,mae,time.time()-t0))
    44       200        652.0      3.3      0.0      t0 = time.time()

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           def get_mae(model, dl):
    23       200      24120.0    120.6      0.1      model.eval()
    24       200      18308.0     91.5      0.1      loss_func = nn.L1Loss(reduction='sum')
    25       200        198.0      1.0      0.0      loss = 0.
    26      4200   19973969.0   4755.7     75.6      for x, y in dl:
    27      4000    6366256.0   1591.6     24.1          l = loss_func(model(x),y)
    28      4000      52603.0     13.2      0.2          loss += l.item()
    29       200       1685.0      8.4      0.0      return loss / len(dl.dataset)
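
This is the kind of manual random sampling I have in mind as a DataLoader replacement, plus a full-batch evaluation under torch.no_grad() since everything fits in memory (just a sketch, not code from the notebook):

# shuffle and slice the in-memory tensors directly, bypassing DataLoader
def iterate_minibatches(X_t, Y_t, batch_size=512):
  perm = torch.randperm(X_t.size(0))
  for i in range(0, X_t.size(0), batch_size):
    idx = perm[i:i + batch_size]
    yield X_t[idx], Y_t[idx]

# evaluate MAE in a single forward pass, without autograd bookkeeping
def get_mae_fullbatch(model, X_t, Y_t):
  model.eval()
  with torch.no_grad():
    return (model(X_t) - Y_t).abs().sum().item() / X_t.size(0)

The inner training loop would then iterate over iterate_minibatches(X_t, Y_t) instead of train_dl.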

I have the same problem on Colab. I'm not sure whether Colab's disk access is slow or whether this happens on every system.