DataLoader slows down when training with MPS on Mac

My dataset code:

# Load the image, resize it, convert it to a tensor, and read the annotation coords.
import os
import json
import cv2
import numpy as np
from torch.utils.data import Dataset

def load_coord_data(img_path, anno_path, h, w):
  img = cv2.imread(img_path, cv2.IMREAD_COLOR)
  img = cv2.resize(img, (w, h))
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  img = img2tensor(img)

  with open(anno_path, 'r') as f:
    anno = json.load(f)
  coords = np.zeros((50, 2), dtype=np.float32)  # 50 classes' (x, y) coords
  for idx in range(50):
    key = str(idx)  # JSON object keys are strings
    if key in anno:
      coords[idx, 0], coords[idx, 1] = anno[key]['x'], anno[key]['y']
    else:
      coords[idx, 0], coords[idx, 1] = -1.0, -1.0  # mark missing classes

  return img, coords

class KeyPointsDataset(Dataset):
  def __init__(self, h, w, input_dir="xxx"):
    self.h, self.w = h, w
    files = os.listdir(input_dir)
    self.img_files = sorted([os.path.join(input_dir, fn) for fn in files if fn.endswith(".jpg")])
    self.anno_files = sorted([os.path.join(input_dir, fn) for fn in files if fn.endswith(".json")])

  def __getitem__(self, idx):
    img, labels = load_coord_data(self.img_files[idx], self.anno_files[idx], self.h, self.w)
    return img, labels
  
  def __len__(self):
    return len(self.img_files)

This is my training code:

import time
import torch as th
from torch.utils.data import DataLoader

dataset = KeyPointsDataset(h, w)
dataloader = DataLoader(
  dataset,
  batch_size=1,
  shuffle=True,
  num_workers=1,
  drop_last=False,
  pin_memory=True
)
  
loss = th.nn.MSELoss()
device = th.device('mps')
self.model = self.model.to(device)
self.model.train()

for epoch in range(10):
  for step, (img, labels) in enumerate(dataloader):
    img, labels = img.to(device, non_blocking=True), labels.to(device, non_blocking=True)
    # the time the dataloader took
    if step > 0:
      dataloader_time = round(time.monotonic() - toc, 2)
    else:
      dataloader_time = -1

    tic = time.monotonic()

    pred = self.model(img)
    
    _loss = loss(pred, labels)
    _loss.backward()
    self.opt.step()
    self.opt.zero_grad()

    toc = time.monotonic()
    if step % self.config.LOG_STEPS == 0:
      print('Epoch {:03d} | Step {:05d} | Step Loss {:.6} | Train time {} | Dataloader time {}'.format(
        epoch, step, float(_loss.cpu().detach().numpy()), round(toc - tic, 2), dataloader_time))

When I train my model using MPS, the dataloader is fast for the first few steps but slows down afterwards:

Epoch 000 | Step 00000 | Step Loss 0.828719 | Train time 3.6 | Dataloader time -1.0
Epoch 000 | Step 00001 | Step Loss 0.708257 | Train time 1.09 | Dataloader time 0.13
Epoch 000 | Step 00002 | Step Loss 0.658343 | Train time 1.15 | Dataloader time 1.27
Epoch 000 | Step 00003 | Step Loss 0.493049 | Train time 1.8 | Dataloader time 2.02
Epoch 000 | Step 00004 | Step Loss 2.28905 | Train time 1.36 | Dataloader time 4.77
Epoch 000 | Step 00005 | Step Loss 0.322044 | Train time 2.05 | Dataloader time 3.58
Epoch 000 | Step 00006 | Step Loss 0.535195 | Train time 1.61 | Dataloader time 5.31
Epoch 000 | Step 00007 | Step Loss 0.647095 | Train time 1.93 | Dataloader time 4.69
Epoch 000 | Step 00008 | Step Loss 0.572585 | Train time 2.03 | Dataloader time 4.75
Epoch 000 | Step 00009 | Step Loss 0.533676 | Train time 5.66 | Dataloader time 6.76
Epoch 000 | Step 00010 | Step Loss 0.569616 | Train time 2.16 | Dataloader time 8.46
Epoch 000 | Step 00011 | Step Loss 0.527826 | Train time 1.95 | Dataloader time 6.09
Epoch 000 | Step 00012 | Step Loss 0.429697 | Train time 2.89 | Dataloader time 5.06
Epoch 000 | Step 00013 | Step Loss 0.463338 | Train time 3.53 | Dataloader time 7.06
Epoch 000 | Step 00014 | Step Loss 0.573107 | Train time 3.31 | Dataloader time 7.57
Epoch 000 | Step 00015 | Step Loss 0.664436 | Train time 2.01 | Dataloader time 6.17
Epoch 000 | Step 00016 | Step Loss 0.420959 | Train time 1.76 | Dataloader time 5.49
Epoch 000 | Step 00017 | Step Loss 0.366839 | Train time 5.88 | Dataloader time 5.55

However, if I switch to CPU (changing the device to th.device('cpu')), the dataloader is fast and consistent:

Epoch 000 | Step 00000 | Step Loss 0.768135 | Train time 6.0 | Dataloader time -1
Epoch 000 | Step 00001 | Step Loss 0.912373 | Train time 4.49 | Dataloader time 0.0
Epoch 000 | Step 00002 | Step Loss 0.678868 | Train time 5.33 | Dataloader time 0.0
Epoch 000 | Step 00003 | Step Loss 0.518494 | Train time 5.62 | Dataloader time 0.0
Epoch 000 | Step 00004 | Step Loss 0.647296 | Train time 5.34 | Dataloader time 0.0
Epoch 000 | Step 00005 | Step Loss 0.621026 | Train time 4.64 | Dataloader time 0.0
Epoch 000 | Step 00006 | Step Loss 0.611825 | Train time 5.24 | Dataloader time 0.0
Epoch 000 | Step 00007 | Step Loss 0.557198 | Train time 4.31 | Dataloader time 0.0
Epoch 000 | Step 00008 | Step Loss 0.341876 | Train time 5.15 | Dataloader time 0.0
Epoch 000 | Step 00009 | Step Loss 0.425114 | Train time 5.41 | Dataloader time 0.0
Epoch 000 | Step 00010 | Step Loss 0.526096 | Train time 5.83 | Dataloader time 0.0
Epoch 000 | Step 00011 | Step Loss 0.541208 | Train time 4.14 | Dataloader time 0.0

I’ve checked the time taken by img, labels = img.to(device, non_blocking=True), labels.to(device, non_blocking=True) on MPS; it’s tiny compared to the dataloader time:

...
t1 = time.time()
img, labels = img.to(device, non_blocking=True), labels.to(device, non_blocking=True)
print("to mps time: ", time.time() - t1)
...

It gives:

to mps time:  0.0009737014770507812
Epoch 000 | Step 00000 | Step Loss 0.779288 | Train time 3.54 | Dataloader time -1
to mps time:  0.0029366016387939453
Epoch 000 | Step 00001 | Step Loss 0.81939 | Train time 1.12 | Dataloader time 0.39
to mps time:  0.0012209415435791016
Epoch 000 | Step 00002 | Step Loss 0.755161 | Train time 1.22 | Dataloader time 1.94
to mps time:  0.0012700557708740234
Epoch 000 | Step 00003 | Step Loss 0.443458 | Train time 3.04 | Dataloader time 2.13
to mps time:  0.0002002716064453125
Epoch 000 | Step 00004 | Step Loss 7.6309 | Train time 3.05 | Dataloader time 6.62
to mps time:  0.0003991127014160156
Epoch 000 | Step 00005 | Step Loss 0.802644 | Train time 1.99 | Dataloader time 6.63
to mps time:  0.0002162456512451172
Epoch 000 | Step 00006 | Step Loss 0.690297 | Train time 1.95 | Dataloader time 5.55
to mps time:  0.00043082237243652344
Epoch 000 | Step 00007 | Step Loss 0.805149 | Train time 3.16 | Dataloader time 6.13
to mps time:  0.00084686279296875
Epoch 000 | Step 00008 | Step Loss 0.729108 | Train time 1.84 | Dataloader time 6.87
to mps time:  0.00031113624572753906
Epoch 000 | Step 00009 | Step Loss 0.575548 | Train time 1.73 | Dataloader time 4.74
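
One caveat: MPS executes asynchronously, so a plain wall-clock timing like this mostly measures how long the copy takes to enqueue, not to finish. A synchronized variant (assuming a PyTorch build that exposes torch.mps.synchronize()) would be:

t1 = time.time()
img, labels = img.to(device, non_blocking=True), labels.to(device, non_blocking=True)
th.mps.synchronize()  # block until the queued copy actually completes
print("to mps time (synced): ", time.time() - t1)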

I don’t know what slows down the dataloader when using MPS, or why it is so much faster on CPU. It seems that the to(device) call is not the problem.

@rtwolfe94022 It turns out that the dataloader’s speed is fine. Most of the time comes from _loss.cpu().detach().numpy(), which synchronizes with the GPU, and my timing code folded that wait into dataloader_time. In my case, making the batch size smaller relieves the problem.
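
For anyone timing this themselves, here is a minimal sketch of the corrected loop (same names as above with the self. prefixes dropped; it assumes torch.mps.synchronize() is available). Synchronizing before reading the clock keeps the queued GPU work from leaking into the next step’s dataloader time:

for epoch in range(10):
  for step, (img, labels) in enumerate(dataloader):
    img, labels = img.to(device, non_blocking=True), labels.to(device, non_blocking=True)
    dataloader_time = round(time.monotonic() - toc, 2) if step > 0 else -1

    tic = time.monotonic()
    pred = model(img)
    _loss = loss(pred, labels)
    _loss.backward()
    opt.step()
    opt.zero_grad()
    th.mps.synchronize()  # wait for the queued MPS kernels before reading the clock
    toc = time.monotonic()

    if step % LOG_STEPS == 0:
      # the sync already happened above, so .item() no longer stalls the next step
      print('Epoch {:03d} | Step {:05d} | Step Loss {:.6} | Train time {} | Dataloader time {}'.format(
        epoch, step, _loss.item(), round(toc - tic, 2), dataloader_time))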

ref: Calling loss.item() is very slow
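
Related to that thread: a sketch of the complementary pattern, keeping the loss on the device and paying the sync cost only once per logging window (LOG_STEPS as above):

running = th.zeros((), device=device)  # scalar accumulator that lives on the MPS device
for step, (img, labels) in enumerate(dataloader):
  img, labels = img.to(device), labels.to(device)
  pred = model(img)
  _loss = loss(pred, labels)
  _loss.backward()
  opt.step()
  opt.zero_grad()
  running += _loss.detach()  # no sync: everything stays on-device
  if (step + 1) % LOG_STEPS == 0:
    # a single .item() forces one sync per LOG_STEPS steps instead of one per step
    print('mean loss: {:.6f}'.format((running / LOG_STEPS).item()))
    running.zero_()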
