On the A10, we can enable or disable ECC. Disabling it would give us a bit more usable memory and slightly better performance. By default it is enabled, so I assume most people just leave it that way. So what should we do?
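If you want to check what a given GPU is actually set to before deciding, here is a minimal sketch in Python, assuming the pynvml (nvidia-ml-py) bindings are installed; it only queries the current and pending ECC mode and the reported memory. Actually toggling ECC (e.g. `nvidia-smi -e 0` to disable, `-e 1` to enable) needs admin rights and only takes effect after a GPU reset or reboot.

```python
# Sketch: query ECC mode and available memory via NVML (pynvml assumed installed).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# "current" is what the GPU runs with now; "pending" applies after the next reset.
current, pending = pynvml.nvmlDeviceGetEccMode(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

state = {pynvml.NVML_FEATURE_ENABLED: "enabled", pynvml.NVML_FEATURE_DISABLED: "disabled"}
print("ECC current :", state[current])
print("ECC pending :", state[pending])
print(f"Total memory: {mem.total / 2**30:.2f} GiB")

pynvml.nvmlShutdown()
```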
Error correction is ideal for very precise workloads, where being off by even a percent would devastate the results.
That said, neural networks tolerate, and can even benefit from, reduced precision in the mantissa; it acts as an indirect form of regularization. Hence bfloat16 being a preferred dtype.
Additionally, it’s standard practice to use Dropout layers, which simply zero out a given activation 10-50% of the time. Bit flips without ECC have a similar effect, though it’s closer to a model-wide Dropout with a p value of roughly 0.004.
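Just to make the analogy concrete, here is a minimal PyTorch sketch of the two cases; `perturb_parameters` is a made-up helper, and the 0.004 rate is only the rough estimate above, not something measured.

```python
# Sketch: ordinary Dropout on activations vs. a very sparse, one-off zeroing of
# parameters, as a stand-in for rare memory errors when ECC is off.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.Dropout(p=0.3),   # standard regularization: zeroes ~30% of activations each pass
    nn.ReLU(),
    nn.Linear(512, 10),
)

@torch.no_grad()
def perturb_parameters(model: nn.Module, p: float = 0.004) -> None:
    """Zero a random fraction p of all parameters, once, model-wide (illustrative)."""
    for param in model.parameters():
        mask = torch.rand_like(param) < p
        param[mask] = 0.0

perturb_parameters(model, p=0.004)
```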
On the other hand, on rare occasions you might get exploding gradients, for example if the flipped bit lands in the exponent. In my experience, though, that only happens in maybe 1 in 100 training runs.
I assume most people keep ECC enabled, since it’s the default and they would not think to change it. So if disabling it really is free memory and speed for everyone doing NN training, why isn’t this more widely known? When I search for it, hardly anything shows up. So either people don’t care about a few extra percent of speed and memory (but why not?), or it’s not as safe as you say.
Did you directly compare ECC enabled vs disabled?
“1% of training runs” is very relative: on a single GPU? Trained for how long, on what kind of model and data, and how big was the model?
Exploding gradients are not necessarily caused by missing ECC; you can get them with ECC enabled too, just from the nature of NN training.
When I was messing with ECC on/off, it was on a server I built with 6 Tesla K80s (12 GB per unit x 6 = 72 GB). The model was around 300-500M parameters, but trained with large batches (in fact, the model size matters less than the number of FLOPs), and a training run took about a day and a half. So I made sure periodic model saves took place regularly. As I recall, disabling ECC gave a noticeable speedup, although I don’t have any exact numbers.
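For what it’s worth, the periodic saves were nothing fancy; here is a sketch of the kind of checkpointing I mean (PyTorch assumed, names and interval are placeholders):

```python
# Sketch: save a checkpoint every N steps so a crashed or corrupted run
# costs at most N steps of work.
import torch

SAVE_EVERY = 1000  # steps between saves; tune to how much work you can afford to lose

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

# inside the training loop:
# if step % SAVE_EVERY == 0:
#     save_checkpoint(model, optimizer, step, path=f"ckpt_step{step}.pt")
```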
You could probably avoid exploding gradients by making use of gradient clipping, data clipping, and parameter clipping.
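For concreteness, here is a minimal PyTorch sketch of those three kinds of clipping; the thresholds are placeholders you would tune per model:

```python
# Sketch: data, gradient, and parameter clipping in one training step.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(x, y):
    # data clipping: bound the inputs before they reach the model
    x = x.clamp(-5.0, 5.0)

    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()

    # gradient clipping: cap the global gradient norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    # parameter clipping: keep weights in a bounded range after the update
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(-10.0, 10.0)

    return loss.item()
```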