Slowdown in CPU-based Preprocessing After Loading Model Weights onto GPU

I’m encountering an unexpected issue in my machine learning pipeline: after I load the model weights onto the GPU, a CPU-based preprocessing function that uses only NumPy ndarray operations, and does not touch the GPU or the model at all, slows down significantly.

# super slow

def preprocess(arr):
    # ... some NumPy operations on arr
    return arr

device = "cuda"
model = model.to(device)
ckpt = torch.load(saved_checkpoint_path, map_location=device)
model_state = ckpt['model']
model.load_state_dict(model_state)

src = []
for data in file_list:
    ndarray = read_data(data)
    output = preprocess(ndarray)
    src.append(output)

There is no data transfer between the CPU and GPU for the data used in preprocessing. Strangely, running the preprocessing first and loading the model weights onto the GPU afterwards is much faster than loading the weights before the preprocessing step. This suggests that loading the model weights onto the GPU somehow affects the CPU-based preprocessing, even though there should be no direct interaction between the two.

# super fast

src = []
for data in file_list:
    ndarray = read_data(data)
    output = preprocess(ndarray)
    src.append(output)

device = "cuda"
model = model.to(device)
ckpt = torch.load(saved_checkpoint_path, map_location=device)
model_state = ckpt['model']
model.load_state_dict(model_state)
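
For reference, this is roughly how I compare the two orderings; the helper below is only a timing harness around the same preprocess / read_data / file_list shown above, not part of my pipeline:

import time

def timed_preprocessing(file_list):
    # run the CPU-side preprocessing loop and report the wall-clock time
    start = time.time()
    src = [preprocess(read_data(data)) for data in file_list]
    print(f"preprocessing took {time.time() - start:.1f}s")
    return src

Calling this helper before model.to(device) gives the fast timing; calling it after the weights are on the GPU gives the slow one.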

I’ve ruled out data transfer issues and confirmed that the preprocessing itself is not GPU-dependent. Has anyone encountered a similar situation, and if so, what could be causing this unexpected slowdown? I would appreciate any insights or suggestions on how to troubleshoot and resolve this issue.

Could you post a minimal and executable code snippet reproducing the issue?

I discovered that the issue is related to the subprocess module, as discussed in "Torch models on GPU slow down python subprocess module?". In my preprocessing code I use the subprocess module as shown below.

from subprocess import PIPE, Popen

# pipe the raw samples through sox in a child process
input = sample.tobytes()
cmd = [
    "/usr/bin/sox",
    ...  # remaining sox arguments omitted here
]
p = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=PIPE)
out, err = p.communicate(input)
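
Here is a minimal, self-contained sketch of the pattern that reproduces the slowdown for me. The sox arguments and the dummy sample are placeholders (my real pipeline has a longer effect chain), and torch.zeros(1).cuda() simply stands in for loading the model weights onto the GPU:

import time

import numpy as np
import torch
from subprocess import PIPE, Popen

# placeholder sox command: read 16 kHz, 16-bit, mono raw audio from stdin, write raw to stdout
CMD = [
    "/usr/bin/sox",
    "-t", "raw", "-r", "16000", "-e", "signed", "-b", "16", "-c", "1", "-",
    "-t", "raw", "-",
]

def run_sox(sample):
    # spawn one sox process per call, exactly like the preprocessing loop does
    p = Popen(CMD, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    out, _ = p.communicate(sample.tobytes())
    return out

sample = (np.random.randn(16000) * 1000).astype(np.int16)

# time the subprocess calls before any CUDA work
start = time.time()
for _ in range(20):
    run_sox(sample)
print("before CUDA init:", time.time() - start)

# create a CUDA context (stands in for moving the model / weights to the GPU)
torch.zeros(1).cuda()

# time the same subprocess calls again
start = time.time()
for _ in range(20):
    run_sox(sample)
print("after CUDA init:", time.time() - start)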

I also ran a test loading the model weights onto the CPU instead, and observed the same slowdown described earlier. What is your perspective on how loading the model weights could affect the subprocess calls? Could you offer any advice or insights on this matter?
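
In the meantime, one workaround I’m considering is to drop the subprocess call entirely and run the sox effect chain in-process through torchaudio’s sox bindings. A rough sketch, assuming a 1-second 16 kHz mono waveform and a placeholder effect chain (not my real one):

import numpy as np
import torch
import torchaudio

# dummy 1-second, 16 kHz mono waveform standing in for the real sample
sample = np.random.randn(16000).astype(np.float32)
waveform = torch.from_numpy(sample).unsqueeze(0)  # (channels, frames)
effects = [["rate", "8000"]]                      # placeholder effect chain
out, out_sr = torchaudio.sox_effects.apply_effects_tensor(waveform, 16000, effects)

This would avoid spawning a new process per sample altogether, though I have not yet verified it matches the output of my existing sox command.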


+ version information

[pip3] numpy==1.26.2
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu113