Looking for a sanity check re: optimization

After laying down a lot of infrastructure (C++ frontend), I am just starting to attempt training. I am working with a 4-class image classification problem, with several thousand 41×41 images. The model is a couple of chained convolutions with rectifiers in between, followed by a MaxPool and a flatten feeding into a single Linear layer. (I know the model has not been “tuned” in any way and is likely primed for serious overfitting, but this is just a shake-down cruise.)
When using Adam optimization, MSELoss (reduction=sum), and a moderate batch size, the total loss moves downward pretty nicely, with the expected amount of oscillation due to the stochastic nature of the optimization. When I inspect the activations in the Linear layer after a decent number of iterations, I see values of both signs (not in itself concerning), with absolute values ranging from fairly small up to order 1. In addition, there are a lot of extremely small values hovering right around ±10^(-42), with exponents in that general range. (I am inspecting these values by converting them from the original Float32 tensors to double precision through my own adaptor system.)
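For concreteness, a rough sketch of that kind of setup in the C++ frontend is below; the channel counts, kernel sizes, and learning rate are illustrative placeholders, not my exact values.

```cpp
#include <torch/torch.h>

// Rough sketch of the model described above (channel counts and kernel sizes
// are placeholders): two convolutions with ReLUs, a MaxPool, a flatten, and a
// single Linear layer producing 4 class scores.
struct NetImpl : torch::nn::Module {
  NetImpl()
      : conv1(torch::nn::Conv2dOptions(/*in=*/1, /*out=*/8, /*kernel=*/3)),
        conv2(torch::nn::Conv2dOptions(/*in=*/8, /*out=*/16, /*kernel=*/3)),
        // 41x41 -> 39x39 -> 37x37 after the convolutions, 18x18 after a
        // 2x2 max-pool, so the Linear layer sees 16 * 18 * 18 features.
        fc(16 * 18 * 18, /*num_classes=*/4) {
    register_module("conv1", conv1);
    register_module("conv2", conv2);
    register_module("fc", fc);
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::relu(conv1->forward(x));
    x = torch::relu(conv2->forward(x));
    x = torch::max_pool2d(x, /*kernel_size=*/2);
    x = x.flatten(/*start_dim=*/1);
    return fc->forward(x);
  }

  torch::nn::Conv2d conv1, conv2;
  torch::nn::Linear fc;
};
TORCH_MODULE(Net);

int main() {
  Net model;
  // Optimizer and loss as described: Adam plus MSELoss with summed reduction.
  torch::optim::Adam optimizer(model->parameters(),
                               torch::optim::AdamOptions(/*lr=*/1e-3));
  torch::nn::MSELoss criterion(torch::nn::MSELossOptions().reduction(torch::kSum));

  // One dummy step just to show the shapes line up (batch of 8, 1x41x41).
  auto x = torch::randn({8, 1, 41, 41});
  auto target = torch::zeros({8, 4});
  auto loss = criterion(model->forward(x), target);
  optimizer.zero_grad();
  loss.backward();
  optimizer.step();
}
```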
I was just wondering whether, to those more experienced in this, this behavior raises any flags as a symptom of a specific issue. As I said, the overall progress of the optimization seems quite satisfactory. Is the smallness of these values just an indicator that they are essentially “irrelevant”, or is there some numerical artifact at work?
Thanks,
Eric

I wouldn’t worry about these small values and would assume these activations are zero.
Do you have any references for this concern or is this just a skeptical point of view?

Thanks for the response. I would like to add that you have been an absolute stalwart in addressing questions “all over the place” in these forums; just know that this is noted and very much appreciated.

Excuse the long answer; I just wanted to articulate why I thought a “sanity check” was warranted.

The source of my concern is rooted in over forty years of dealing with numerical mathematics, which I guess fuels a generally skeptical attitude. My first concern was about the validity of the source of these numbers: at the end of the optimization run, I am just inspecting the values directly, using the tensors returned by the model->parameters() call. I had earlier questions about the “custody” of these values (presumably maintained via shared_ptr/intrusive_ptr semantics), i.e. whether they are in fact a direct reflection of the values currently being held by the optimizer. They apparently evolve from the originally random values assigned when the model is “registered”, which suggests this is the case.
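For reference, the inspection amounts to roughly the following (a sketch; my own Float32-to-double adaptor is approximated here by a plain .to(torch::kDouble) conversion):

```cpp
#include <torch/torch.h>
#include <iostream>

// Dump every parameter value of a module after converting to double
// precision (stand-in for my own adaptor system).
void dump_parameters(const torch::nn::Module& module) {
  for (const auto& p : module.parameters()) {
    torch::Tensor flat = p.detach().to(torch::kDouble).flatten().contiguous();
    auto acc = flat.accessor<double, 1>();
    for (int64_t i = 0; i < acc.size(0); ++i) {
      std::cout << acc[i] << "\n";
    }
  }
}
```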

Secondly, the set of values I refer to is well separated from the others in log space, and the values are often contiguous within the parameter tensor. Given that the values are supposedly maintained as Float32, they all fall, in absolute value, just below the nominal “smallest” normal value (about 1.2e-38). If I let PyTorch print out the tensor itself (i.e. without conversion to my double-precision space), the tensor is identified as “CPUFloatType”, yet it prints out values such as 3.7737e-42, which simply do not fall within the “usual” range of possible values. They do, however, fall in the range of denormalized (subnormal) values, where the mantissa progressively loses precision, which extends down to about 1e-45.
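One way to confirm that these really are subnormal Float32 values (and not just a printing artifact) is to classify the raw values directly; a sketch using std::fpclassify:

```cpp
#include <torch/torch.h>
#include <cmath>

// Count entries of a Float32 tensor that are subnormal ("denormalized"),
// i.e. non-zero but smaller in magnitude than FLT_MIN (~1.18e-38).
int64_t count_subnormals(const torch::Tensor& t) {
  torch::Tensor flat = t.detach().to(torch::kFloat).flatten().contiguous();
  auto acc = flat.accessor<float, 1>();
  int64_t n = 0;
  for (int64_t i = 0; i < acc.size(0); ++i) {
    if (std::fpclassify(acc[i]) == FP_SUBNORMAL) {
      ++n;
    }
  }
  return n;
}
```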

I also made a rather interesting/weird observation: I am using Adam optimization, and when I use the default (0) value for the weight_decay parameter, absolutely none of these extreme values appear; in fact, the dynamic range of activations appears to be no more than a factor of 1000 or so. My original observations were based on assigning a value of 0.01 to this parameter, which produces these odd FP values. (The two optimizations progress in very similar fashion.) I’ve now seen this consistent and distinct behavior toggling back and forth between the two values of this parameter several times. It’s likely that this behavior is readily explained in terms of how the Adam algorithm uses this parameter, and it’s possible that my choice of non-zero value is somehow inappropriate.
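For reference, the toggle is just the weight_decay field in the Adam options (a sketch, assuming the model object from the earlier snippet; the learning rate is an arbitrary placeholder):

```cpp
// Default: no weight decay.
torch::optim::Adam opt_plain(model->parameters(), torch::optim::AdamOptions(1e-3));

// The configuration that produces the denormal-range values.
torch::optim::Adam opt_decay(model->parameters(),
                             torch::optim::AdamOptions(1e-3).weight_decay(0.01));
```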

All part of my ongoing education process…

Thanks again,
Eric


Thanks for the detailed explanation and I think your “sanity check” makes sense.

Note that I don’t have a background in numerical mathematics, so please take this post with a grain of salt.

Yes, this is correct. The optimizer stores references to all parameters, and you could inspect the id() of a parameter inside the model, which should match the stored reference inside the optimizer.
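In the C++ frontend you don't have id(), but a comparable check (a sketch, assuming a single param group and matching parameter order) is to ask whether the tensors are literally the same objects:

```cpp
#include <torch/torch.h>
#include <iostream>

// Verify that the optimizer holds the very same tensors as the module's
// registered parameters (shared TensorImpl, so updates are reflected in both).
void check_shared_parameters(const torch::nn::Module& module,
                             const torch::optim::Optimizer& optimizer) {
  const auto model_params = module.parameters();
  const auto& opt_params = optimizer.param_groups().at(0).params();
  bool shared = model_params.size() == opt_params.size();
  for (size_t i = 0; shared && i < model_params.size(); ++i) {
    shared = model_params[i].is_same(opt_params[i]);
  }
  std::cout << "optimizer shares parameter storage: "
            << std::boolalpha << shared << "\n";
}
```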

Would this not be expected, assuming the weight should be updated towards a zero value?
During training the optimizer might scale the effective step size (e.g. Adam uses internal running estimates of the gradient moments), so that further updates become smaller in magnitude. Could this explain the small denormal values, or what exactly is the concern about the denormals?

Weight decay effectively adds an L2 penalty on all parameters to the loss (in Adam's case it is added directly to the gradient), such that the magnitude of these parameters is limited (or they converge towards zero).
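For what it's worth, a sketch of the (non-decoupled) Adam update with L2 weight decay, where $\lambda$ is the weight_decay value:

$$
\begin{aligned}
g_t &= \nabla_\theta L(\theta_{t-1}) + \lambda\,\theta_{t-1}\\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t\\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2\\
\theta_t &= \theta_{t-1} - \eta\,\hat m_t / \left(\sqrt{\hat v_t} + \epsilon\right)
\end{aligned}
$$

with the usual bias-corrected moments $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$. With $\lambda > 0$ every weight feels a constant pull towards zero even when its gradient from the loss is negligible, which would be consistent with the tiny, denormal-range values showing up only in your weight_decay=0.01 runs.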

Let me know what you think and what your concerns are.

Thanks again. As I said, it is quite likely that the overall behavior is consistent with the workings of the Adam procedure; I am still getting up to speed on certain details. I’m certainly not alarmed, just a bit curious. One trigger is that, from endless experience inspecting things in a debugger, odd FP values (for example the dreaded xxx.e-308 for double precision) are often the result of either uninitialized values where initialization was expected, or of somebody pushing an integer into what is supposed to be an FP location.
So similar “near-zero but not quite zero” Float32 values do trigger a bit of suspicion. (But not alarm; that is reserved for inf or nan values :-).) The apparent absence of true zero values added to that, I guess. Because of the large overhead of valgrind runs, I’ve only done limited checks on the optimization, and these did not raise any flags.
Given that the optimization seems to advance properly, I will just proceed with the assumption that these values were on their way to zero, but just didn’t quite make it. No problem marking this as “solved”.
