Question about Bernoulli sampling speed

Hi,
I found that Bernoulli sampling can be very slow when building a mask matrix.
Here is my test code:


and the timings are:
0.35076022148132324
0.194166421890258

When running on GPU, the difference can be even larger. I assumed that Bernoulli sampling is done on the CPU, while a random matrix can be created directly on the GPU. But one question remains: why is there a speed difference when both run on the CPU?
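
To make the comparison concrete, here is a minimal sketch of the two ways of building the mask that I mean. It assumes PyTorch, and the helper names (`bernoulli_mask`, `uniform_mask`, `time_it`), the shape, and the probability are all made up for illustration. It is not the exact script that produced the timings above.

```python
import time
import torch

def bernoulli_mask(shape, p, device="cpu"):
    # Assumed approach 1: fill a tensor with the probability p and let
    # torch.bernoulli draw a 0/1 sample for every entry.
    probs = torch.full(shape, p, device=device)
    return torch.bernoulli(probs)

def uniform_mask(shape, p, device="cpu"):
    # Assumed approach 2: draw uniform values in [0, 1) and threshold them.
    # An entry is 1 with probability p because P(u < p) = p.
    return (torch.rand(shape, device=device) < p).float()

def time_it(fn, repeats=10):
    # Average wall-clock time per call over a few repeats.
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    shape, p = (1000, 1000), 0.5  # hypothetical size and keep probability
    print("bernoulli:", time_it(lambda: bernoulli_mask(shape, p)))
    print("uniform  :", time_it(lambda: uniform_mask(shape, p)))
```

Both functions should give a mask with the same distribution, so any gap would come from the sampling routine itself. If this is timed on GPU, I think `torch.cuda.synchronize()` has to be called before reading the timer, because CUDA kernels are launched asynchronously.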

Can someone help? Thanks!