Higher CPU usage while using torchvision.transforms

CPU usage is around 250% (Ubuntu top command) when I use torchvision transforms to convert a cv2 image to a torch tensor:

normalize_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])

def normalizeCvImage(image_cv, device):
    return normalize_transform(image_cv).unsqueeze(0).to(device)

But usage drops to around 100% when I do the operation manually:

def normalizeCvImage(image_cv, device):
    # Convert the HxWxC cv2 image to a float tensor and move it to the device
    image = torch.Tensor(image_cv).to(device)
    # Reorder to CxHxW and add a batch dimension
    image = image.permute(2, 0, 1).unsqueeze(0)
    # Scale pixel values from [0, 255] to [-1, 1]
    image = (image - 127.5) / 127.5
    return image

More interestingly, this happens only on version 1.0.1.post2 (CUDA 10); if I stick with version 1.0.0 (CUDA 10), the difference is minimal.

Maybe it’s doing some more multithreading or something similar? Have you benchmarked for speed? Maybe the higher CPU utilization is desirable if you have multiple CPU cores?
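
If intra-op threading is the suspect, one quick thing to try (just a sketch, nothing specific to your setup) is to cap PyTorch’s CPU thread count before running the transform version and watch whether the top reading changes:

import torch

# Limit PyTorch's intra-op parallelism to a single CPU thread. If the 250%
# reading comes from worker threads in the CPU-side transform path, top
# should drop towards 100% after this call.
torch.set_num_threads(1)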

Actually, the speed (FPS) remains the same! Both run at ~30 FPS (which is my camera’s max). Eventually I’ll run this on a Jetson TX2, which has a weak CPU compared to conventional ones. So if it’s 250% on my i7 + 1070, it would be worse on my TX2.

Not that it makes a difference, probably, but conceptually you also have an extra step in normalize_transform compared to your other function. I.e., transforms.ToTensor() converts the image to a tensor in the [0, 1] range, so it’s basically first doing that conversion and then, in an additional operation, computing (x - 0.5) * 2.
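
Just to illustrate the equivalence, here is a minimal sketch with a random dummy array standing in for your cv2 frame (not code from your pipeline); both routes should produce the same values:

import numpy as np
import torch
from torchvision import transforms

# Dummy HxWxC uint8 "frame" standing in for a cv2 image
dummy = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Route 1: ToTensor scales to [0, 1], then Normalize maps to [-1, 1]
t = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
a = t(dummy)

# Route 2: manual version, scaling [0, 255] directly to [-1, 1]
b = (torch.Tensor(dummy).permute(2, 0, 1) - 127.5) / 127.5

print(torch.allclose(a, b, atol=1e-6))  # expected: True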

How exactly are you running the two functions to compare them, in a Python for-loop?

I’ve implemented MTCNN face detection, and this is the part that feeds images from the camera to the network. The high CPU usage was bugging me (even though everything is on CUDA), and I was playing around until I stumbled upon this. Everything else was kept the same except this function, and, surprise, CPU usage dropped to 100%.

I see. So you are not using DataLoader or anything like that but just calling that function on each image separately?

In your function

def normalizeCvImage(image_cv, device):
    image = torch.Tensor(image_cv).to(device)
    image = image.permute(2, 0, 1).unsqueeze(0)
    image = (image - 127.5) / 127.5
    return image

you have this “early” to(device), whereas in the other version you only move to the device at the end of the pipeline, after the CPU has done all the work. Since you are using CUDA, this could maybe explain the difference. I.e., for the second, “lower CPU usage” version, you could do

def normalizeCvImage(image_cv, device):
    # Do everything on the CPU first
    image = torch.Tensor(image_cv)
    image = image.permute(2, 0, 1).unsqueeze(0)
    image = (image - 127.5) / 127.5
    # Move to the device only at the very end
    return image.to(device)

and see what happens.
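
If you want numbers rather than just the top reading, a rough timing helper could look like this (just a sketch; the benchmark name and iteration count are placeholders, and torch.cuda.synchronize is there so queued GPU work is included in the measurement):

import time
import torch

def benchmark(fn, image_cv, device, n_iters=200):
    fn(image_cv, device)          # warm-up (CUDA context, caching allocator)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(image_cv, device)
    torch.cuda.synchronize()      # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / n_iters

Calling it once with each of the two normalizeCvImage variants should tell you whether they actually differ in per-frame latency, independent of what top shows.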

Woohoo, you are right @rasbt: when I do that, CPU usage jumps back to the 250% range. That makes sense as well. Is there a better way to implement this in torchvision.transforms?

I see. So you are not using DataLoader or anything like that but just calling that function on each image separately?

Yes

I am not sure what the issue with the high CPU usage is exactly. The reason you got lower CPU usage in the other case is not that the implementation was more efficient, but that the work was done on the GPU instead. What really matters, I’d say, is how fast your processing step finishes altogether. If you do steps on the GPU, sure, your CPU usage will drop, but that doesn’t mean it’s faster or more efficient, as these may be steps that a CPU is better at (also, you need to consider that data transfer to the GPU is slow as well).
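
If you want to put a rough number on the transfer part alone, something like this would do (a sketch; the 640x480 RGB frame size is only an assumption about your camera):

import time
import torch

frame = torch.empty(1, 3, 480, 640)   # assumed size of one float32 camera frame
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    frame.to("cuda")
torch.cuda.synchronize()
print((time.perf_counter() - start) / 100)   # average host-to-device copy time per frame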

Agreed. In my case, my TX2 has ample GPU to spare, but not CPU, so this serves the purpose as long as performance isn’t hurt (and it looks like it is not in this case). What I was asking is: is there a way to move the tensor to the GPU and then do the normalization in torchvision.transforms, or is it not meant for this at all?

Oh, and I forgot to thank you. Thank you!!

Hm, I am not sure if that’s possible, since the API is mainly targeted at DL training, where the GPU is busy running the model (also, some people like to put transforms in there that are not implemented on GPUs, e.g., PIL or OpenCV stuff). I think your case is maybe not a good fit for a data loader with a custom transform. I would maybe use the data loader just for iterating and then do the processing via a manual function like you did.
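
Roughly what I mean (just a sketch; cv2.VideoCapture(0) and the commented-out model call are stand-ins for your actual camera and MTCNN setup):

import cv2
import torch

device = torch.device("cuda")
cap = cv2.VideoCapture(0)                 # placeholder camera source

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Manual normalization with the GPU, as in your function above
    batch = normalizeCvImage(frame, device)
    # detections = mtcnn(batch)           # your MTCNN forward pass would go here

cap.release()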
