Hey there,
I am running some benchmarks on a VGG19 model that was written from scratch and trained on the CIFAR10 dataset.
The pretrained model is loaded onto the GPU, and then, for every layer in the model, some random operations have to be performed on the weights. I cannot go into the exact details of the operations or the reason for doing so, but essentially they consist of the following:
- `torch.randperm(DIM)` for linear layers
- `torch.stack([torch.stack([torch.randperm(filter_size) for _ in range(n_filters)], dim=0) for _ in range(batch_size)], dim=0)` for conv2d layers
Here I refer to them as linear and conv2d layers, though I know that in VGG-like architectures they usually live in the `classifier` and `features` modules, respectively.
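To make the device dependence discussed below concrete, here is a minimal sketch of the two patterns with an explicit `device=` argument (all sizes here are placeholders, not my actual dimensions):

```python
import torch

rand_device = "cpu"  # or the model's device, e.g. "cuda" / "mps"

# linear layers: a single permutation of one weight dimension
DIM = 512  # placeholder size
lin_idx = torch.randperm(DIM, device=rand_device)

# conv2d layers: one permutation per filter, stacked into a
# (batch_size, n_filters, filter_size) index tensor
batch_size, n_filters, filter_size = 64, 64, 9  # placeholder sizes
conv_idx = torch.stack([
    torch.stack([torch.randperm(filter_size, device=rand_device)
                 for _ in range(n_filters)], dim=0)
    for _ in range(batch_size)], dim=0)
```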
Thus, in a simple `for` loop, these operations are applied sequentially, depending on the type of the layer:
```python
for k, v in pretrained_vgg19.state_dict().items():
    if "weight" in k:
        if len(v.shape) == 2:
            # linear layers
            v.copy_(...torch.randperm()...)
        elif len(v.shape) > 2:
            # conv2d layers
            v.copy_(...torch.stack([torch.stack([torch.randperm...)
```
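The `...` above are elided on purpose, but to give a self-contained, reproducible version of the loop, here is a hypothetical instantiation in which the index tensors simply shuffle the weights (the gather-based shuffle is a stand-in I picked for illustration, not the actual operation):

```python
import torch

@torch.no_grad()
def randomize_weights(model, rand_device):
    for k, v in model.state_dict().items():
        if "weight" not in k:
            continue
        if v.dim() == 2:
            # linear layers: shuffle the rows with one permutation
            idx = torch.randperm(v.shape[0], device=rand_device)
            v.copy_(v[idx.to(v.device)])
        elif v.dim() > 2:
            # conv2d layers: an independent permutation of the kernel
            # elements for every (out_channel, in_channel) pair
            out_ch, in_ch, kh, kw = v.shape
            idx = torch.stack([
                torch.stack([torch.randperm(kh * kw, device=rand_device)
                             for _ in range(in_ch)], dim=0)
                for _ in range(out_ch)], dim=0)
            flat = v.view(out_ch, in_ch, kh * kw)
            v.copy_(torch.gather(flat, 2, idx.to(v.device)).view_as(v))
```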
Now, my issue/problem is related to the execution time of these operations, depending on the device passed to `torch.randperm()` and `torch.arange()`. More precisely, if the randomization uses the same device as the model (i.e., the GPU), the computations are slow. On the other hand, if the randomization is called with `device="cpu"`, the iterations are much faster. See the comparison in the table below (each value averaged over 10 executions).
Layer | CPU randomization (seconds) | GPU randomization (seconds) |
---|---|---|
conv1 | 0.0014 | 0.0307 |
conv2 | 0.0060 | 0.7935 |
conv3 | 0.0111 | 0.9574 |
conv4 | 0.0205 | 1.9211 |
conv5 | 0.0486 | 3.8381 |
conv6 | 0.0924 | 7.7111 |
conv7 | 0.0799 | 7.6586 |
conv8 | 0.0790 | 7.6429 |
conv9 | 0.1689 | 17.5667 |
conv10 | 0.3066 | 30.7506 |
conv11 | 0.3052 | 30.8070 |
conv12 | 0.3180 | 30.7441 |
conv13 | 0.3049 | 30.7902 |
conv14 | 0.3074 | 30.9785 |
conv15 | 0.3156 | 30.9083 |
conv16 | 0.3059 | 47.7401 |
linear17 | 0.0021 | 0.0014 |
linear18 | 0.1672 | 0.0001 |
linear19 | 0.0113 | 0.0006 |
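For reference, a minimal sketch of how a single call can be timed in isolation (`time_randperm` is just an illustrative helper; the important detail is synchronizing the device before reading the clock, because GPU kernels are dispatched asynchronously):

```python
import time
import torch

def sync(device):
    # wait for all queued kernels to finish before reading the clock
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()

def time_randperm(n, device, reps=10):
    # average wall-clock time of a single torch.randperm call
    total = 0.0
    for _ in range(reps):
        sync(device)
        t0 = time.perf_counter()
        torch.randperm(n, device=device)
        sync(device)
        total += time.perf_counter() - t0
    return total / reps
```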
Shouldn’t this be the other way around? That is, shouldn’t generating new tensors via `torch.randperm` and `torch.arange` be much faster on the GPU, since the model is also loaded onto it?
Later edit: I forgot to mention that I am running all benchmarks on an M3 Pro MacBook, so my GPU is `mps`.