ReLU6 is capped at 6, but why 6 rather than 5, 4, or 3? Is there any paper comparing these choices?
I believe Alex Krizhevsky used it in Convolutional Deep Belief Networks on CIFAR-10, where he describes it as follows:
Our ReLU units differ from those of […] in two respects. First, we cap the units at 6, so our ReLU activation function is
y = min(max(x, 0), 6).
In our tests, this encourages the model to learn sparse features earlier. In the formulation of […], this is equivalent to imagining that each ReLU unit consists of only 6 replicated bias-shifted Bernoulli units, rather than an infinite amount. We will refer to ReLU units capped at n as ReLU-n units.
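As a minimal sketch of the formula above, a ReLU-n activation can be written in a couple of lines of NumPy (the function name `relu_n` is just an illustrative choice, not from the paper):

```python
import numpy as np

def relu_n(x, n=6.0):
    """ReLU-n activation: clamp values to the range [0, n].

    With n=6 this is exactly y = min(max(x, 0), 6), i.e. ReLU6.
    """
    return np.minimum(np.maximum(x, 0.0), n)

# Negative inputs go to 0, inputs above the cap saturate at n.
x = np.array([-2.0, 0.5, 3.0, 7.5])
y = relu_n(x)           # ReLU6 by default
y3 = relu_n(x, n=3.0)   # a ReLU-3 variant, per the paper's ReLU-n naming
```

Any cap n gives the same shape; 6 is simply the value Krizhevsky reports using.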