Question about Bernoulli sampling speed

I found that Bernoulli sampling can be very slow when used to build a mask matrix.
Here is my test code:

and the speed is:

When running on the GPU, the difference can be even larger. My guess is that Bernoulli sampling is done on the CPU, whereas a random matrix can be created directly on the GPU. But that still leaves a question: why is there a speed difference when both run on the CPU?
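
To make the comparison concrete, here is a minimal, library-agnostic sketch in plain Python of the two ways of building a mask. It is only an illustration, not the test code from this post: the function names (`bernoulli_mask`, `threshold_mask`, `keep_fraction`) and the sizes are hypothetical, and pure-Python timings will not match any tensor library or GPU. The point it shows is that thresholding a uniform random matrix at `p` gives the same kind of mask as a per-entry Bernoulli(p) draw.

```python
import random
import timeit


def bernoulli_mask(rows, cols, p, rng):
    """Build a 0/1 mask with one explicit Bernoulli(p) trial per entry."""
    return [[1 if rng.random() < p else 0 for _ in range(cols)]
            for _ in range(rows)]


def threshold_mask(rows, cols, p, rng):
    """Build a 0/1 mask by thresholding a uniform random matrix at p."""
    # Step 1: fill a matrix with uniform samples in [0, 1).
    uniform = [[rng.random() for _ in range(cols)] for _ in range(rows)]
    # Step 2: an entry becomes 1 exactly when its sample is below p,
    # which happens with probability p.
    return [[1 if u < p else 0 for u in row] for row in uniform]


def keep_fraction(mask):
    """Return the fraction of entries in the mask that are 1."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    rows, cols, p = 200, 200, 0.5  # hypothetical sizes for illustration
    for fn in (bernoulli_mask, threshold_mask):
        seconds = timeit.timeit(lambda: fn(rows, cols, p, rng), number=10)
        print(f"{fn.__name__}: {seconds:.4f} s for 10 runs")
```

In a tensor library the same two approaches would normally be single vectorized calls, so any gap between them would come from how each call is implemented rather than from the mathematics.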

Can someone help me? Thanks!