Majority of training time spent copying to mps unit

When trying to train a custom classifier (4 output features) based on the torchvision.models.resnet18 model on an M1 MacBook Air, I notice that the majority of the time logged when profiling is spent copying the image batches (standard 224x224 3-channel ImageNet resolution) to the mps device. This happens regardless of whether the model is compiled or not. Is this expected behavior?

Using torch.profiler with both a wait+skip step, a typical cycle takes about 165 ms. The table and picture below shows how this time is distributed with a batch size of 16 images.

Process                                         |   Time (ms)| Cycle ratio (%)
DataLoader (read img, uint8->float32, normalize)|    24.8 ms |           15 %
Copy to GPU                                     |   105.2 ms |           64 %
Forward pass (incl. loss)                       |    16.7 ms |           10 %
Backprop (incl. use_grad)                       |    17.9 ms |           11 %

As shown, almost 2/3 of the training consists of the aten__::copy_ operator, which I find pretty strange.

Hard to say without seeing an exact repro but I believe this is actually expected behavior for a small model like resnet18, I would try something much larger and try a larger batch size and see if the cycle ration for copy reduces

1 Like

In the case of the ResNet-18 model, increasing the batch size from 16 to 128 actually increases the the copy time ratio from 64 % to 83 % (while the total time per sample per epoch goes up from 10.3 ms to 12.9 ms).

Trying to plug in a Visual Transformer (ViT_B_16, batch size 16) instead, the cycle ratio does indeed decrease. If the batch size is increased (to 64), the profiling tool causes the Mac to deplete the whole 16 GB of RAM, making it hard to obtain any reliable results. Although mps gives a nice speed boost compared to running on just the bare cpu, one might suspect that the MacBook Air may not be the right machine for heavy model training. :smile:

Process                                         |   Time (ms)| Cycle ratio (%)
DataLoader (read img, uint8->float32, normalize)|      29 ms |          < 2 %
Copy to GPU                                     |     270 ms |           13 %
Forward pass (incl. loss)                       |     692 ms |           34 %
Backprop (incl. use_grad)                       |    1031 ms |           51 %

1 Like