Why does a PyTorch model on the GPU slow down Python's subprocess module?

I have been using Python's subprocess module while training a neural network in PyTorch, and I noticed that subprocess.run() is many times slower if a network is initialized on a GPU. Here is an example script: it builds a very simple linear network, loops a trivial subprocess call 100 times, and profiles the times with line_profiler:

import torch
import torch.nn as nn
import subprocess
from line_profiler import LineProfiler

class TestNN(nn.Module):
    def __init__(self, device):
        super(TestNN, self).__init__()
        self.fc1 = nn.Linear(5, 16)
        self.device = device
        self.to(self.device)

def test_subprocess():
    device = torch.device('cuda:0')  # change to 'cpu' for the CPU run below
    testNet = TestNN(device)

    for i in range(100):
        subprocess.run(["ls", "-l"], capture_output=True)

if __name__ == '__main__':

    lprofiler = LineProfiler()
    lp_wrapper = lprofiler(test_subprocess)
    
    lp_wrapper()
    lprofiler.print_stats()

Just moving this small network to the GPU makes subprocess.run() more than 4x slower.
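To rule out line_profiler itself as the cause, the per-call cost can be cross-checked with a plain perf_counter loop. This is a minimal sketch (the time_subprocess_runs helper is my own, not from the script above); calling it once before and once after the network is created should give numbers comparable to the profiler output:

```python
import statistics
import subprocess
import time

def time_subprocess_runs(n=20):
    """Time n subprocess.run() calls; return the mean seconds per call."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        subprocess.run(["ls", "-l"], capture_output=True)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)

print(f"{time_subprocess_runs():.6f} s per call")
```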

My line_profiler results when the network is on the CPU:

Total time: 1.46088 s
File: test_subprocess_gpu.py
Function: test_subprocess at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           def test_subprocess():
    14         1        172.0    172.0      0.0      device = torch.device('cpu')
    15         1        806.0    806.0      0.1      testNet=TestNN(device)
    16                                           
    17       101       1235.0     12.2      0.1      for i in range(100):
    18       100    1458671.0  14586.7     99.8          subprocess.run(["ls",  "-l"], capture_output=True)

My results when the network is initialized on the GPU:

Timer unit: 1e-06 s

Total time: 8.63406 s
File: test_subprocess_gpu.py
Function: test_subprocess at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    13                                           def test_subprocess():
    14         1        174.0    174.0      0.0      device = torch.device('cuda:0')
    15         1    2084937.0 2084937.0     24.1      testNet=TestNN(device)
    16                                           
    17       101       1163.0     11.5      0.0      for i in range(100):
    18       100    6547789.0  65477.9     75.8          subprocess.run(["ls",  "-l"], capture_output=True)

Does anyone know what causes this slowdown, and how to keep subprocess.run() fast with the network initialized on the GPU? I am very puzzled why initializing a neural network on the GPU has any effect on the speed of subprocess.run(). Any help is greatly appreciated!
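One diagnostic I have been considering: as far as I can tell, subprocess.run() with its default close_fds=True goes through a fork+exec in CPython, so timing os.posix_spawn (which on glibc avoids a full fork) might show whether process creation itself is the bottleneck. This is only a sketch (the time_posix_spawn helper and the /bin/true target are my own choices, assuming a Linux box where /bin/true exists), not something I have confirmed explains the numbers above:

```python
import os
import time

def time_posix_spawn(n=20, path="/bin/true"):
    """Time n posix_spawn+waitpid cycles; return mean seconds per spawn."""
    t0 = time.perf_counter()
    for _ in range(n):
        # posix_spawn takes (path, argv, env); /bin/true exits immediately
        pid = os.posix_spawn(path, [path], os.environ)
        os.waitpid(pid, 0)
    return (time.perf_counter() - t0) / n

print(f"{time_posix_spawn():.6f} s per spawn")
```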