Is evaluating the network thread-safe?

Batching will help in the sense that instead of calling a small op X times, you call a single op that is X times larger. Even though in theory these do the same work, the second one reduces the framework overhead, since you pay that overhead once instead of X times (and PyTorch's per-call overhead is known to be expensive).
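Something like this, schematically (the tiny net and the sizes here are made up just to show the two call patterns):

```cpp
// Schematic only: a made-up tiny net, inference mode, no timing code.
#include <torch/torch.h>

int main() {
  torch::NoGradGuard no_grad;  // inference only
  auto net = torch::nn::Sequential(
      torch::nn::Linear(8, 32), torch::nn::ReLU(), torch::nn::Linear(32, 1));
  net->eval();

  const int64_t X = 1000;

  // X small calls: the framework overhead (dispatch, allocations, ...) is paid X times.
  for (int64_t i = 0; i < X; ++i) {
    auto out = net->forward(torch::randn({1, 8}));
  }

  // One X-times-larger call: same arithmetic, overhead paid once.
  auto out = net->forward(torch::randn({X, 8}));
  return 0;
}
```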


So I ran some more tests based on your remark that batching should help. I can confirm that the small batch size (of 1), together with the small model, is the culprit here. When I switch to a decent batch size (e.g. 100 or so), the parallelization speedup is close to 16 times (= the number of hardware threads). So C++ PyTorch allows for parallelization as advertised; great!
Apparently the “overhead” part of PyTorch does not parallelize very well, and I am still at a loss as to why. It might have to do with very frequent memory allocations and deallocations, but honestly, my guess is as good as anybody’s. In any case, the overhead only really matters for very small models with too-small batch sizes.
@albanD: Thanks a lot!
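For reference, roughly the kind of setup I tested (a sketch only; the model, thread count, and batch size here are placeholders, not my actual benchmark):

```cpp
// Placeholder model and numbers; the point is one shared module evaluated
// from several threads, each forward call using a decent batch size.
#include <torch/torch.h>
#include <ATen/Parallel.h>
#include <thread>
#include <vector>

int main() {
  // Common precaution: keep intra-op threading at 1 when parallelizing
  // at this outer level, to avoid thread oversubscription.
  at::set_num_threads(1);

  auto net = torch::nn::Sequential(
      torch::nn::Linear(8, 32), torch::nn::ReLU(), torch::nn::Linear(32, 1));
  net->eval();

  const int num_workers = 16;  // e.g. number of hardware threads
  const int64_t batch = 100;   // "decent" batch size per forward call

  std::vector<std::thread> workers;
  for (int t = 0; t < num_workers; ++t) {
    workers.emplace_back([&net, batch] {
      torch::NoGradGuard no_grad;  // grad mode is thread-local
      for (int i = 0; i < 1000; ++i) {
        auto out = net->forward(torch::randn({batch, 8}));
      }
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```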


Hi,
I hadn’t found this topic before because of its slightly misleading title, but it seems I ran into the same issue (only in my case the network is even smaller than the one used by @Willem). Batching does not help me much, because each worker calls forward and grad independently and in an unsynchronized manner (roughly the pattern sketched below). Does this “overhead” problem have any resolution? Maybe thread-local allocation?
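To be concrete, something like this (illustrative only; my real network, loss, and gradient targets differ, and here the gradient is taken w.r.t. the input just to keep the example self-contained):

```cpp
// Illustrative sketch: made-up net; each worker runs completely
// independently with batch size 1 and computes its own gradient.
#include <torch/torch.h>
#include <thread>
#include <vector>

int main() {
  auto net = torch::nn::Sequential(
      torch::nn::Linear(8, 32), torch::nn::ReLU(), torch::nn::Linear(32, 1));

  std::vector<std::thread> workers;
  for (int t = 0; t < 4; ++t) {
    workers.emplace_back([&net] {
      for (int i = 0; i < 1000; ++i) {
        // Batch size 1, unsynchronized: each worker builds its own graph.
        auto x = torch::randn({1, 8}, torch::requires_grad());
        auto out = net->forward(x);
        // Gradient of the scalar output w.r.t. this worker's input.
        auto grads = torch::autograd::grad({out.sum()}, {x});
      }
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
```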