Program stuck for a long time (30 s) when trying to print the result of scatter_add

Hi!

I’m trying to implement an algorithm using tensor operators, but I found something strange when I run it on CUDA. The processing appears very fast (it took only 50 ms to process inputs with a length of 6M). The result tensors each have a length of only 4, but printing them cost me 30 s!

My result consists of ten (1×4) tensors on CUDA. When I try to print all of them, the program hangs for 30 seconds before printing the first tensor, then prints the rest in less than 1 ms.

When I set the device to ‘cpu’, this strange behavior does not happen.

I found it is caused by the scatter_add operator (at line 39 and line 50). If I remove all scatter_add calls, the program behaves normally and prints all results without waiting.

If I print the result of every scatter_add immediately after its execution, the program also freezes for 30 s before doing anything. The freeze occurs only once, even when I call the scatter_add operator several times in the program; it happens immediately after the first call.
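
To illustrate, here is a minimal sketch of the pattern I mean (not my actual script; the tensor names and sizes are placeholders matching the numbers above):

```python
import time
import torch

device = "cuda"
n = 6_000_000                                      # ~6M input elements
src = torch.rand(n, device=device)
index = torch.randint(0, 4, (n,), device=device)   # only 4 distinct targets
out = torch.zeros(4, device=device)

t0 = time.perf_counter()
out = out.scatter_add(0, index, src)   # returns almost immediately (async launch)
print(f"scatter_add 'took' {time.perf_counter() - t0:.4f} s")

t0 = time.perf_counter()
print(out)                             # blocks until the CUDA kernel finishes
print(f"printing took {time.perf_counter() - t0:.4f} s")
```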

My code:

My Dataset:
My dataset is too big and GitHub refused to upload it, so I truncated it and uploaded the first 200k lines of the dataset CSV file.

How to reproduce:
Please change the file path on line 57 to the real path on your own device.
The buggy scatter_add calls are at lines 39 and 50.

My environment:
torch 2.0, pandas 1.5.3, CUDA 11.7

Please help me with this problem.

Sorry, I’m new here and can only upload one image at a time.
Execution on CPU:

If I print the result of every scatter_add immediately after its execution, the program also freezes for 30 s before doing anything. The freeze occurs only once, even when I call the scatter_add operator several times in the program; it happens immediately after the first call.

The script isn’t freezing; it is waiting for the actual CUDA kernels to finish executing, which makes your profiling invalid.
Since CUDA kernels are executed asynchronously, you would need to synchronize the code before starting and stopping the timers. Otherwise you would profile the dispatching, the kernel launches, or implicit synchronizations via e.g. print statements of CUDA tensors, which is the case here.
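
A minimal sketch of what such synchronized timing could look like (the tensor setup here is illustrative, not the code from the original post):

```python
import time
import torch

device = "cuda"
src = torch.rand(6_000_000, device=device)
index = torch.randint(0, 4, (6_000_000,), device=device)
out = torch.zeros(4, device=device)

torch.cuda.synchronize()               # drain all pending GPU work before timing
t0 = time.perf_counter()
out = out.scatter_add(0, index, src)
torch.cuda.synchronize()               # wait for the kernel itself to finish
print(f"scatter_add took {time.perf_counter() - t0:.4f} s")
```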

Dear ptrblck,

Thanks for your helpful and prompt reply.

I found it’s because the index of the scatter operator has too many collisions (6M elements scattered to only 4 distinct indices). Maybe there is a locking mechanism when updating tensor values in the CUDA scatter operator? But the CPU version runs much faster.
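
Here is a sketch (my own, with made-up sizes) of how one could test that hypothesis: time the same scatter_add with 4 distinct target indices versus 1M, synchronizing around each call as you suggested:

```python
import time
import torch

device = "cuda"
n = 6_000_000
src = torch.rand(n, device=device)

# Compare a heavily colliding index (4 targets) with a spread-out one.
for num_targets in (4, 1_000_000):
    index = torch.randint(0, num_targets, (n,), device=device)
    out = torch.zeros(num_targets, device=device)

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = out.scatter_add(0, index, src)
    torch.cuda.synchronize()
    print(f"{num_targets} distinct targets: {time.perf_counter() - t0:.4f} s")
```

If the 4-target case turns out to be dramatically slower, that would support the idea that concurrent updates to the same few output elements are being serialized on the GPU.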

Thanks again for your kind and helpful reply!