Using randperm and arange on MPS vs CPU

Hey there,

I am running some benchmarks on a VGG19 model that was written from scratch and trained on the CIFAR-10 dataset.

The pretrained model is loaded onto the GPU, and then, for every layer in the model, some random operations are performed on the weights. I cannot go into the exact details of the operations or the reason for doing them, but essentially they consist of the following:

  • torch.randperm(DIM) for linear layers
  • torch.stack([torch.stack([torch.randperm(filter_size) for _ in range(n_filters)], dim=0) for _ in range(batch_size)], dim=0) for conv2d layers

Here I refer to them as linear and conv2d layers, although I know that in VGG-like architectures they are usually grouped into the features and classifier blocks.
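
For concreteness, here is a small sketch of the conv2d case with made-up dimensions (the variable names mirror the bullet above); the result is one independent permutation per (batch, filter) pair:

import torch

# Made-up dimensions for illustration; the real values come from each layer.
batch_size, n_filters, filter_size = 4, 3, 9

perms = torch.stack(
    [torch.stack([torch.randperm(filter_size) for _ in range(n_filters)], dim=0)
     for _ in range(batch_size)],
    dim=0,
)
print(perms.shape)  # torch.Size([4, 3, 9])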

Thus, in a simple for loop, these operations are applied sequentially, depending on the type of each layer:

for k, v in pretrained_vgg19.state_dict().items():
    if "weight" in k:
        if len(v.shape) == 2:
            # linear layers
            v.copy_(...torch.randperm()...)
        elif len(v.shape) > 2:
            # conv2d layers
            v.copy_(...torch.stack([torch.stack([torch.randperm...)
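
Since the exact update is elided, here is a minimal runnable sketch of the same pattern, assuming torchvision's vgg19 as a stand-in for the scratch-trained model and a simple gather-based shuffle as a placeholder for the real operation:

import torch
import torchvision

# Stand-in for the scratch-trained model; assumes an MPS-capable machine.
model = torchvision.models.vgg19(weights=None).to('mps')

with torch.no_grad():
    for k, v in model.state_dict().items():
        if "weight" not in k:
            continue
        if v.dim() == 2:
            # Linear layer: permute the rows (placeholder operation).
            perm = torch.randperm(v.shape[0], device=v.device)
            v.copy_(v[perm])
        elif v.dim() > 2:
            # Conv2d layer: shuffle each flattened kernel independently.
            flat = v.flatten(2)  # (out_ch, in_ch, kH*kW)
            perms = torch.stack(
                [torch.stack([torch.randperm(flat.shape[2], device=v.device)
                              for _ in range(flat.shape[1])], dim=0)
                 for _ in range(flat.shape[0])],
                dim=0,
            )
            v.copy_(torch.gather(flat, 2, perms).view_as(v))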

Now, my problem is related to the execution time of these operations, which depends on the device passed to torch.randperm() and torch.arange(). More precisely, if the randomization calls use the same device as the model (i.e., the GPU), the computations are slow. On the other hand, if randomization is called with device="cpu", the iterations are much faster. See the comparison in the table below (averaged over 10 executions in both scenarios).

Layer      CPU rands (s)   GPU rands (s)
conv1      0.0014           0.0307
conv2      0.0060           0.7935
conv3      0.0111           0.9574
conv4      0.0205           1.9211
conv5      0.0486           3.8381
conv6      0.0924           7.7111
conv7      0.0799           7.6586
conv8      0.0790           7.6429
conv9      0.1689          17.5667
conv10     0.3066          30.7506
conv11     0.3052          30.8070
conv12     0.3180          30.7441
conv13     0.3049          30.7902
conv14     0.3074          30.9785
conv15     0.3156          30.9083
conv16     0.3059          47.7401
linear17   0.0021           0.0014
linear18   0.1672           0.0001
linear19   0.0113           0.0006

Shouldn’t it be the other way around? That is, shouldn’t generating new tensors via torch.randperm and torch.arange be much faster on the GPU, since the model is also loaded onto it?

Later edit: I forgot to mention that I am running all benchmarks on an M3 Pro MacBook, so my GPU is the MPS device.

I have performed a simple benchmark of torch.randperm().

import torch
import time

for device in ['cpu', 'mps']:
    for size in [10, 100, 1_000, 10_000, 100_000, 1_000_000]:
        since = time.time()
        for _ in range(1000):
            _ = torch.randperm(size, device=device)
        if device == 'mps':
            # MPS ops are queued asynchronously; wait for them to finish
            # before reading the clock.
            torch.mps.synchronize()
        elapsed = time.time() - since
        print(f'{device} {size} {elapsed}')

Results on M1 Pro:

cpu 10 0.0018110275268554688
cpu 100 0.0016379356384277344
cpu 1000 0.004246711730957031
cpu 10000 0.030288219451904297
cpu 100000 0.35094308853149414
cpu 1000000 4.289053916931152
mps 10 0.1772630214691162
mps 100 0.1574409008026123
mps 1000 0.2291860580444336
mps 10000 0.34720897674560547
mps 100000 0.44776439666748047
mps 1000000 1.5657782554626465

So, for sizes below roughly 100,000 elements, randperm should be done on the CPU.
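
Based on that crossover, a simple workaround is a small wrapper that generates on the CPU below the cutoff and transfers the result; the helper name and the cutoff value here are just illustrations based on the table above:

import torch

def fast_randperm(n, device, cpu_cutoff=100_000):
    # Hypothetical helper: below the measured crossover, generating on the
    # CPU and paying for one transfer beats generating directly on MPS.
    if str(device) != 'cpu' and n < cpu_cutoff:
        return torch.randperm(n).to(device)
    return torch.randperm(n, device=device)

idx = fast_randperm(512, 'mps')  # generated on CPU, lives on MPS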

You should generate the permutations on the CPU and then move the final, stacked tensor to the MPS device:

import torch
import time

for device in ['cpu', 'mps']:
    since = time.time()
    # Generate all permutations on the CPU, then move the stacked result
    # to the target device in a single transfer.
    _ = torch.stack([torch.randperm(10, device='cpu') for _ in range(1000)]).to(device)
    if device == 'mps':
        torch.mps.synchronize()  # make sure the transfer has completed
    elapsed = time.time() - since
    print(f'{device} {elapsed}')

Results on M1 Pro:

cpu 0.0049359798431396484
mps 0.010468244552612305
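
Applied to the conv2d pattern from the question, this means building the whole nested permutation tensor on the CPU and paying for a single host-to-device copy at the end (the dimension values here are made up):

import torch

batch_size, n_filters, filter_size = 64, 64, 9  # made-up layer dimensions
perms = torch.stack(
    [torch.stack([torch.randperm(filter_size) for _ in range(n_filters)], dim=0)
     for _ in range(batch_size)],
    dim=0,
).to('mps')  # one transfer instead of thousands of tiny MPS kernel launches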