Does CUDA_LAUNCH_BLOCKING=1 reduce training speed?

Hi, I was trying to debug the error

"RuntimeError: cuda runtime error (710) : device-side assert triggered"

by setting CUDA_LAUNCH_BLOCKING=1.

Setting this environment variable to 1 seems to slow down training for the whole script.

Should I set CUDA_LAUNCH_BLOCKING=1 only while debugging and leave it off for regular training?
Does it actually reduce the speed?

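Yes. CUDA kernels are normally launched asynchronously, so an error such as a device-side assert is often reported at a later, unrelated call. CUDA_LAUNCH_BLOCKING=1 is a debug environment variable that makes each kernel launch block until it finishes, so the stack trace points to the call that actually triggered the assert. That forced synchronization is also why training runs slower. Use it only while debugging, not in production or regular training runs.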
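
One way to keep the variable out of normal runs is to set it per process instead of exporting it globally. The snippet below is only a minimal sketch of that idea: `build_env`, `run_training`, and the script name `train.py` are hypothetical names made up for illustration, not part of PyTorch or CUDA.

```python
import os
import subprocess
import sys


def build_env(debug, base_env=None):
    """Return a copy of the environment for a training subprocess.

    CUDA_LAUNCH_BLOCKING=1 is set only when debug is True; otherwise it is
    removed so a leftover export from a debugging session does not slow
    down a regular run. (Hypothetical helper, for illustration only.)
    """
    env = dict(os.environ if base_env is None else base_env)
    if debug:
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        env.pop("CUDA_LAUNCH_BLOCKING", None)
    return env


def run_training(script, debug=False):
    """Run a training script (e.g. a hypothetical train.py) in a child process."""
    result = subprocess.run([sys.executable, script], env=build_env(debug))
    return result.returncode
```

With this, `run_training("train.py", debug=True)` would launch the script with blocking kernel launches, and the default call would run it at normal speed. From a shell, the equivalent one-off form is `CUDA_LAUNCH_BLOCKING=1 python train.py`. If you set the variable inside the training script itself, do it before CUDA is initialized.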

