If I want to replace part of a tensor x with y based on a condition (a Boolean tensor), I think the most suitable method is masked_scatter, since it is designed for exactly this usage. However, I found that torch.where is even faster:
```
         4 function calls in 0.112 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.112    0.112 <string>:1(<module>)
        1    0.000    0.000    0.112    0.112 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.112    0.112    0.112    0.112 {method 'masked_scatter_' of 'torch._C._TensorBase' objects}
```
```
         4 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method where}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
Using the CPU gives a similar result. Besides, torch.where has lower memory usage (not shown in the result above) even though I am using the in-place masked_scatter_. Why would this happen?
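For reference, here is a minimal sketch of the variants being compared (shapes are made up for illustration). One subtlety worth noting: masked_scatter_ consumes its source tensor *sequentially*, so a positional replacement needs the masked elements pre-selected with y[cond].

```python
import torch

x = torch.randn(1000, 1000)
y = torch.randn(1000, 1000)
cond = torch.rand(1000, 1000) > 0.5

# masked_scatter_ fills the True positions (in row-major order) with
# source elements taken in order, so we pass y[cond] to make the
# replacement positional rather than sequential.
a = x.clone()
a.masked_scatter_(cond, y[cond])

# torch.where picks positionally: y where cond is True, x elsewhere.
b = torch.where(cond, y, x)

# Indexed assignment, the third variant mentioned later in this thread.
c = x.clone()
c[cond] = y[cond]

assert torch.equal(a, b) and torch.equal(b, c)
```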
Note that CUDA operations are executed asynchronously, so you would have to synchronize the code before starting and stopping timers.
I’m not deeply familiar with cProfile, but I assume it doesn’t synchronize the GPU internally.
The torch.utils.benchmark utils might be useful to compare different methods, as they will add warmup iterations and synchronize internally.
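Something along these lines should give comparable numbers (shapes are arbitrary placeholders; Timer handles warmup and CUDA synchronization internally):

```python
import torch
import torch.utils.benchmark as benchmark

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)
cond = torch.rand(1000, 1000, device=device) > 0.5

# Each Timer runs the statement with the given globals, adding warmup
# iterations and synchronizing CUDA before reading the clock.
t_scatter = benchmark.Timer(
    stmt="x.clone().masked_scatter_(cond, y[cond])",
    globals={"x": x, "y": y, "cond": cond},
)
t_where = benchmark.Timer(
    stmt="torch.where(cond, y, x)",
    globals={"torch": torch, "x": x, "y": y, "cond": cond},
)

print(t_scatter.timeit(100))
print(t_where.timeit(100))
```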
Yes, basically the same as what I got. Comparing where and x[condition] = y[condition], the advantage of where is its lower memory usage. I do not know how to benchmark memory properly, so I observed it from my task manager.
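Instead of watching the task manager, peak GPU memory can be measured with torch.cuda.max_memory_allocated. A minimal sketch (peak_memory is a hypothetical helper, and this requires a CUDA device):

```python
import torch

def peak_memory(fn):
    """Run fn once and return the peak CUDA memory (bytes) it allocated."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()

def where_variant(x, y, cond):
    return torch.where(cond, y, x)

def index_variant(x, y, cond):
    z = x.clone()
    z[cond] = y[cond]
    return z

if torch.cuda.is_available():
    x = torch.randn(4000, 4000, device="cuda")
    y = torch.randn(4000, 4000, device="cuda")
    cond = torch.rand(4000, 4000, device="cuda") > 0.5

    print("torch.where peak bytes:", peak_memory(lambda: where_variant(x, y, cond)))
    print("indexed assign peak bytes:", peak_memory(lambda: index_variant(x, y, cond)))
```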