Simple for loop on GPU

Hello,
I would like to ask about parallelizing a simple for loop (or any loop) so that it runs on the GPU. I currently have the following code, but it is very slow on the CPU. I cannot find out how to get it working on the GPU. Is it even possible? I have searched the web without any luck. Any help will be appreciated.
Thank you in advance.

img_height_vec = [0] * detected_bright_regions_image.shape[0]
img_width_vec = [0] * detected_bright_regions_image.shape[1]
for i in range(0, detected_bright_regions_image.shape[0]):
    for j in range(0, detected_bright_regions_image.shape[1]):
        if detected_bright_regions_image[i][j] > 180:
            img_height_vec[i] = 1
            img_width_vec[j] = 1

I guess one way would be to create a mask, drop the for loop, and use a vectorized implementation to get what you want.
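A minimal sketch of that mask idea, assuming the image is a 2D tensor of shape (height, width). The function name bright_rows_and_cols and the tiny sample image are made up for illustration; moving the tensor to the GPU is just a matter of creating it on (or sending it to) a CUDA device, if one is available.

```python
import torch

def bright_rows_and_cols(image, threshold=180):
    """Return two bool tensors: which rows and which columns contain
    at least one pixel strictly greater than `threshold`."""
    mask = image > threshold   # bool tensor, same shape as image
    rows = mask.any(dim=1)     # reduce over columns -> length = height
    cols = mask.any(dim=0)     # reduce over rows    -> length = width
    return rows, cols

# Use the GPU when present, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical 2x3 image with a single bright pixel at row 0, column 1.
image = torch.tensor([[0, 200, 0],
                      [0,   0, 0]], device=device)

rows, cols = bright_rows_and_cols(image)
```

The results are bool tensors on the same device as the input; rows.int() and cols.int() give the 0/1 vectors that the original loop builds.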

Python loops are slow. If you can't find a way to replace these loops with PyTorch methods, you can write them in C++ or use TorchScript. I would suggest first trying to replace the loops with built-in methods. I think the following code should achieve your goal:

# max over dim 1 gives one value per row, max over dim 0 gives one value per column
h_max, w_max = image.max(1)[0], image.max(0)[0]
h_vec, w_vec = h_max > 180, w_max > 180
# these 2 tensors are bool tensors, you can cast them to other types

Thank you very much.
I have found the following solution:

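res = (image > 180)                      # same strict threshold as the original loop
res = res.type(torch.int)
hv = (torch.sum(res, dim=1) >= 1)        # per row, length = image height
wv = (torch.sum(res, dim=0) >= 1)        # per column, length = image width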
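One way to confirm that a vectorized version matches the original loop, including which dim corresponds to rows and which to columns, is to compare both on a small random image. This is only a sketch: the helper names loop_reference and vectorized are made up, the image is assumed to be a 2D integer tensor of shape (height, width), and the threshold is taken to be strict (> 180) as in the question.

```python
import torch

def loop_reference(image, threshold=180):
    """The nested loop from the question, used as ground truth."""
    height, width = image.shape
    rows, cols = [0] * height, [0] * width
    for i in range(height):
        for j in range(width):
            if image[i][j] > threshold:
                rows[i] = 1
                cols[j] = 1
    return rows, cols

def vectorized(image, threshold=180):
    """Sum-based version: count bright pixels per row and per column."""
    res = (image > threshold).type(torch.int)
    rows = torch.sum(res, dim=1) >= 1   # one entry per row
    cols = torch.sum(res, dim=0) >= 1   # one entry per column
    return rows, cols

torch.manual_seed(0)
image = torch.randint(0, 256, (4, 6))   # non-square, so swapped dims show up
image[2] = 0                            # force one row with no bright pixels

ref_rows, ref_cols = loop_reference(image)
vec_rows, vec_cols = vectorized(image)

matches = (vec_rows.int().tolist() == ref_rows
           and vec_cols.int().tolist() == ref_cols)
```

Using a non-square image matters here: with a square one, swapping dim=0 and dim=1 would still produce vectors of the right length and the mistake could go unnoticed.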

PyTorch is great. Much friendlier than TensorFlow. It requires a different mindset, but the results are great.

Thanks for the help.