Why doesn't the adaptive learning rate vary when using the Adam optimizer?

Problem

I am trying to use Adam to optimize my network and am running into two issues:

  1. Each layer is set up as its own parameter group, yet all the layers end up with the same step size. Why are the learning rates seemingly linked when they should be adjusted based on each group's gradients?
  2. The learning rate seems to converge to the initial value I set rather than changing adaptively. Is this normal?

Details

I understand that Adam adapts the learning rate based on the network's gradients. However, when I print out the step_size at every optimizer step, I find that the values all just converge to the initial rates I set.
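For context, this is roughly how I read the step_size values out (a simplified sketch that recomputes the bias-corrected step size from each group's optimizer state; the helper name is mine):

import math

def adam_step_sizes(optimizer):
    '''Recompute, per parameter group, the bias-corrected step size
    lr * sqrt(1 - beta2^t) / (1 - beta1^t) that torch.optim.Adam applies.'''
    step_sizes = {}
    for i, group in enumerate(optimizer.param_groups):
        beta1, beta2 = group['betas']
        state = optimizer.state[group['params'][0]]
        if len(state) == 0:        # no optimizer.step() has run yet
            continue
        t = int(state['step'])     # number of Adam steps taken so far
        step_sizes[i] = group['lr'] * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return step_sizes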

After ~30,000 samples, I get the step_sizes below for each parameter group (the dict keys are the parameter group indices):

{0: 0.00019659575703128232, 1: 0.00039319151406256463, 2: 0.00019659575703128232, 
3: 0.00039319151406256463, 4: 0.00019659575703128232, 5: 0.00039319151406256463, 
6: 0.00019659575703128232, 7: 0.00039319151406256463, 8: 0.00019659575703128232, 
9: 0.00039319151406256463, 10: 0.00019659575703128232, 11: 0.00039319151406256463, 
12: 0.00019659575703128232, 13: 0.00039319151406256463, 14: 0.00019659575703128232, 
15: 0.00039319151406256463, 16: 0.00019659575703128232, 17: 0.00039319151406256463, 
18: 0.00019659575703128232, 19: 0.00039319151406256463, 20: 0.00019659575703128232, 
21: 0.00039319151406256463, 22: 0.00019659575703128232, 23: 0.00039319151406256463, 
24: 0.00019659575703128232, 25: 0.00039319151406256463, 26: 0.00019659575703128232, 
27: 0.00039319151406256463, 28: 0.00019659575703128232, 29: 0.00039319151406256463, 
30: 0.00019659575703128232, 31: 0.00039319151406256463, 32: 0.00019659575703128232, 
33: 0.00039319151406256463, 34: 0.00019659575703128232, 35: 0.00039319151406256463, 
36: 0.00019659575703128232, 37: 0.00039319151406256463, 38: 0.00019659575703128232, 
39: 0.00039319151406256463, 40: 0.00019659575703128232, 41: 0.00039319151406256463, 
42: 0.00019659575703128232, 43: 0.00039319151406256463, 44: 0.00019659575703128232, 
45: 0.00039319151406256463, 46: 0.00019659575703128232, 47: 0.00039319151406256463, 
48: 0.00019659575703128232, 49: 0.00039319151406256463}

The initial LR was 2e-4, with an LR multiplier of 2 for the biases (which is why the groups alternate by a factor of 2).

But when I first start training, I get different values:

{0: 4.706334506549056e-05, 1: 9.412669013098112e-05, 2: 4.706334506549056e-05, 
3: 9.412669013098112e-05, 4: 4.706334506549056e-05, 5: 9.412669013098112e-05, 
6: 4.706334506549056e-05, 7: 9.412669013098112e-05, 8: 4.706334506549056e-05, 
9: 9.412669013098112e-05, 10: 4.706334506549056e-05, 11: 9.412669013098112e-05, 
12: 4.706334506549056e-05, 13: 9.412669013098112e-05, 14: 4.706334506549056e-05, 
15: 9.412669013098112e-05, 16: 4.706334506549056e-05, 17: 9.412669013098112e-05, 
18: 4.706334506549056e-05, 19: 9.412669013098112e-05, 20: 4.706334506549056e-05, 
21: 9.412669013098112e-05, 22: 4.706334506549056e-05, 23: 9.412669013098112e-05, 
24: 4.706334506549056e-05, 25: 9.412669013098112e-05, 26: 4.706334506549056e-05, 
27: 9.412669013098112e-05, 28: 4.706334506549056e-05, 29: 9.412669013098112e-05, 
30: 4.706334506549056e-05, 31: 9.412669013098112e-05, 32: 4.706334506549056e-05, 
33: 9.412669013098112e-05, 34: 4.706334506549056e-05, 35: 9.412669013098112e-05, 
36: 4.706334506549056e-05, 37: 9.412669013098112e-05, 38: 4.706334506549056e-05, 
39: 9.412669013098112e-05, 40: 4.706334506549056e-05, 41: 9.412669013098112e-05, 
42: 4.706334506549056e-05, 43: 9.412669013098112e-05, 44: 4.706334506549056e-05, 
45: 9.412669013098112e-05, 46: 4.706334506549056e-05, 47: 9.412669013098112e-05, 
48: 4.706334506549056e-05, 49: 9.412669013098112e-05}

I am confused about why training does not start right at 2e-4 or 4e-4, and why the step sizes converge to those values as training continues. I am also confused about why the parameter groups (which correspond to layers in the network) don't show more variance in their step_size; it looks as though the layers are linked together. Perhaps I am mistaken, but I thought Adam used an adaptive learning rate for each individual parameter. Where am I going wrong?
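For reference, my mental model of the per-parameter Adam update (a rough pseudocode sketch of the PyTorch implementation, weight decay omitted) is:

# exponential moving averages of the gradient and the squared gradient
exp_avg = beta1 * exp_avg + (1 - beta1) * grad
exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad ** 2

# step_size is the quantity I am printing above; the element-wise division
# by sqrt(exp_avg_sq) is what I understood to make the effective rate
# adaptive for each parameter
param = param - step_size * exp_avg / (exp_avg_sq.sqrt() + eps)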

I set up the parameter groups as follows:

from torch.optim import Adam

def set_caffe_param_mult(m, base_lr, base_weight_decay):
    '''Assign an LR multiplier of 2 and a weight decay of 0 to the
    bias parameters (a common convention in Caffe).'''
    param_list = []
    for name, params in m.named_parameters():
        if 'bias' in name:
            # biases: double learning rate, no weight decay
            param_list.append({'params': params, 'lr': 2 * base_lr, 'weight_decay': 0.0})
        else:
            # weights: default learning rate, standard weight decay
            param_list.append({'params': params, 'weight_decay': base_weight_decay})
    return param_list


param_list = set_caffe_param_mult(model, 2e-4, 0.005)
optimizer = Adam(param_list, lr=2e-4, weight_decay=0.005)
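If it is relevant, the configured base rates per group can be listed directly from the optimizer (just a quick sanity check; this prints the base lr of each group, not the step_size values above):

for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'])   # alternates between 2e-4 (weights) and 4e-4 (biases)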