(CPU only)
Initializing a tensor directly as float16 or bfloat16 with a normal distribution is slower than sampling in float32 and converting:
```
$ python -m timeit -s "import torch" "torch.randn(10, 100, 100, 1000, dtype=torch.float32)"
1 loop, best of 5: 680 msec per loop
$ python -m timeit -s "import torch" "torch.randn(10, 100, 100, 1000, dtype=torch.bfloat16)"
1 loop, best of 5: 1.57 sec per loop
$ python -m timeit -s "import torch" "torch.randn(10, 100, 100, 1000, dtype=torch.float32).to(torch.bfloat16)"
1 loop, best of 5: 744 msec per loop
```
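For reference, the faster path can be wrapped in a small helper (the helper name is mine, not a PyTorch API):

```python
import torch

# Hypothetical workaround helper: sample in float32, then cast to the
# target low-precision dtype, which benchmarks faster on CPU than
# sampling directly in bfloat16/float16.
def randn_via_fp32(*size, dtype=torch.bfloat16):
    return torch.randn(*size, dtype=torch.float32).to(dtype)

t = randn_via_fp32(4, 4)
print(t.dtype)  # torch.bfloat16
```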
Is there a convenient way to profile this more deeply? I'm also not sure what exactly is happening in Distribution.cpp.
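As a first step, torch.profiler can at least show which ATen ops dominate the CPU time (a minimal sketch, not a full answer):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a (smaller) bfloat16 randn on CPU to see per-op timings.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(10, 100, 100, dtype=torch.bfloat16)

# Table of the most expensive ops, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```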
The same slowdown shows up for init.xavier_normal_, while the uniform initializers are fine.
Or should I just avoid initializing models in fp16 or bfloat16 altogether?
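If so, the obvious workaround would be to build and initialize the model in float32 and cast afterwards (a sketch of what I mean):

```python
import torch
import torch.nn as nn

# Initialize in float32 (fast normal sampling), then cast the
# parameters to bfloat16 once initialization is done.
model = nn.Linear(16, 16)
nn.init.xavier_normal_(model.weight)
model = model.to(torch.bfloat16)
print(model.weight.dtype)  # torch.bfloat16
```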