How to scale/warmup the learning rate for large batch size?

I am trying to run ImageNet training on a large number of GPUs (64 or more) with the help of PyTorch DDP and a batch size of 64 per GPU. I am unsure how to scale and warm up the learning rate:

  • the original PyTorch DDP ImageNet example does not scale the learning rate at all and only decays it every 30 epochs
  • the DALI dataloader with PyTorch DDP implementation scales the learning rate with the number of workers (in relation to a base batch size 256 and also uses 5 epochs of warm-up)
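
The second bullet refers to the linear scaling rule: the learning rate grows proportionally with the global batch size relative to a reference batch size of 256. A minimal sketch of that rule (the helper name `scaled_lr` and the base LR of 0.1 are my own illustrative choices, not from either example):

```python
def scaled_lr(base_lr: float, per_gpu_batch: int, world_size: int,
              base_batch: int = 256) -> float:
    """Scale the base learning rate linearly with the global batch size."""
    global_batch = per_gpu_batch * world_size
    return base_lr * global_batch / base_batch

# e.g. 64 GPUs x 64 samples per GPU = global batch 4096,
# so a base LR of 0.1 becomes 0.1 * 4096 / 256 = 1.6
```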

However, in my case both approaches stay below 70% validation accuracy when trained with a global batch size larger than 4096. As a comparison, Horovod reaches ~74% validation accuracy out of the box up to a global batch size of 32k using the exact same LR schedule as in the DALI example. How do I need to tweak the LR for PyTorch to work in this case?

@caesar025 thanks for posting!

There are some previous discussions about how to adjust the learning rate when scaling up the batch size. Did you try them already? Should we split batch_size according to ngpu_per_node when DistributedDataparallel - #19 by junb

I was already scaling the learning rate with the number of workers, so that was not the issue. My mistake was in the warm-up of the learning rate. As I figured out, the correct way to do it is:

    if epoch < args.warmup_epochs:
        # ramp the (already scaled) LR linearly from near zero to its
        # full value, advancing it every step rather than every epoch
        lr = lr * float(1 + step + epoch * len_epoch) / (args.warmup_epochs * len_epoch)

where len_epoch = len(train_loader). With this fix I get ~74% validation accuracy for a global batch size of 32k, so everything is working now!
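
To make the fix concrete, here is a standalone sketch of that per-step warmup (assumptions: `base_lr` is the already-scaled learning rate, `len_epoch` is the number of steps per epoch, and the helper name `warmup_lr` is mine):

```python
def warmup_lr(base_lr: float, epoch: int, step: int,
              len_epoch: int, warmup_epochs: int = 5) -> float:
    """Ramp the LR linearly from ~0 to base_lr over the warmup epochs.

    `step` is the batch index within the current epoch, so the ramp
    advances every iteration, not just at epoch boundaries.
    """
    if epoch >= warmup_epochs:
        return base_lr  # warmup finished: use the full scaled LR
    progress = float(1 + step + epoch * len_epoch) / (warmup_epochs * len_epoch)
    return base_lr * progress
```

The key point is that warming up per step rather than per epoch avoids the large LR jumps at epoch boundaries that can destabilize training at very large batch sizes.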