ReLU6 is capped at 6, but why 6, and why not 5 or 4 or 3? And is there any paper comparing them?

I believe Alex Krizhevsky used it in "Convolutional Deep Belief Networks on CIFAR-10", where he describes it as follows:

Our ReLU units differ from those of [8] in two respects. First, we cap the units at 6, so our ReLU activation function is

y = min(max(x, 0), 6).

In our tests, this encourages the model to learn sparse features earlier. In the formulation of [8], this is equivalent to imagining that each ReLU unit consists of only 6 replicated bias-shifted Bernoulli units, rather than an infinite amount. We will refer to ReLU units capped at n as ReLU-n units.
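The quoted activation is straightforward to implement. Here is a minimal NumPy sketch (the function name relu_n is my own, chosen to match the paper's "ReLU-n" terminology):

```python
import numpy as np

def relu_n(x, n=6):
    """ReLU capped at n: y = min(max(x, 0), n)."""
    return np.minimum(np.maximum(x, 0), n)

# ReLU-6 on a few sample inputs: negatives clamp to 0, values above 6 clamp to 6
print(relu_n(np.array([-2.0, 3.0, 8.0])))  # -> [0. 3. 6.]
```

The cap value n is just a parameter here, so relu_n(x, 4) or relu_n(x, 8) would give the ReLU-4 or ReLU-8 variants the question asks about comparing.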