Why does the initialisation function of a custom PyTorch optimiser affect neural network performance?

I am trying to create a custom optimiser that extends torch.optim.Optimizer and wraps a base optimiser (SGD in my case).

This was my first attempt at the initialisation function:

import torch

class minimiser_1(torch.optim.Optimizer):
    def __init__(self, params, base_optimizer, rho=0.05, **kwargs):
        # params may be a generator, so materialise it to use it twice
        params_copy = list(params)
        defaults = dict(rho=rho, **kwargs)
        # construct the wrapped base optimiser first, then merge its
        # defaults into ours before initialising the parent Optimizer
        self.base_optimizer = base_optimizer(params_copy, **kwargs)
        defaults = {**self.base_optimizer.defaults, **defaults}
        super().__init__(params_copy, defaults)

With this, the test accuracy of a ResNet trained on CIFAR-10 is about 88%.
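
To inspect what the optimiser actually ends up with, I also build a tiny toy network. The exact architecture is unimportant; the sketch below just picks layer sizes so the parameter shapes match the tensors in the dumps that follow:

import torch.nn as nn

# Any tiny model works here; these layer sizes simply give parameters
# with the shapes (2x4, 2, 2x2, 2) that appear in the dumps below.
model = nn.Sequential(nn.Linear(4, 2), nn.Linear(2, 2))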

For reference, if I call minimiser_1(model.parameters(), optim.SGD, lr=0.1, momentum=0.9, weight_decay=5e-4) on this toy network, the relevant attributes are:

------------------------------------
Optimiser.defaults:
{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0005, 'nesterov': False, 'maximize': False, 'foreach': None, 'rho': 0.05}
------------------------------------
Optimiser.param_groups:
[{'params': [Parameter containing:
tensor([[-0.1430, -0.0121,  0.3725, -0.2907],
        [ 0.4572,  0.0989,  0.1075, -0.0367]], requires_grad=True), Parameter containing:
tensor([-0.0995,  0.0171], requires_grad=True), Parameter containing:
tensor([[ 0.1915,  0.6281],
        [ 0.4855, -0.1574]], requires_grad=True), Parameter containing:
tensor([ 0.3206, -0.0072], requires_grad=True)], 'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0005, 'nesterov': False, 'maximize': False, 'foreach': None, 'rho': 0.05}]
------------------------------------
Optimiser.base_optimizer.defaults:
{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0005, 'nesterov': False, 'maximize': False, 'foreach': None}
------------------------------------
Optimiser.base_optimizer.param_groups:
[{'params': [Parameter containing:
tensor([[-0.1430, -0.0121,  0.3725, -0.2907],
        [ 0.4572,  0.0989,  0.1075, -0.0367]], requires_grad=True), Parameter containing:
tensor([-0.0995,  0.0171], requires_grad=True), Parameter containing:
tensor([[ 0.1915,  0.6281],
        [ 0.4855, -0.1574]], requires_grad=True), Parameter containing:
tensor([ 0.3206, -0.0072], requires_grad=True)], 'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0005, 'nesterov': False, 'maximize': False, 'foreach': None}]
------------------------------------

However, if I change the initialisation function as follows, the performance is consistently better, with a test accuracy of around 95%:

class minimiser_2(torch.optim.Optimizer):
    def __init__(self, params, base_optimizer, rho=0.05, **kwargs):
        defaults = dict(rho=rho, **kwargs)
        # initialise the parent Optimizer first, then construct the base
        # optimiser directly from the parent's param_groups
        super().__init__(params, defaults)
        self.base_optimizer = base_optimizer(self.param_groups, **kwargs)
        self.param_groups = self.base_optimizer.param_groups
        self.defaults.update(self.base_optimizer.defaults)

If I call minimiser_2(model.parameters(), optim.SGD, lr=0.1, momentum=0.9, weight_decay=5e-4) on the same toy network, the relevant attributes are:

------------------------------------
Optimiser.defaults:
{'rho': 0.05, 'lr': 0.1, 'momentum': 0.9, 'weight_decay': 0.0005, 'dampening': 0, 'nesterov': False, 'maximize': False, 'foreach': None}
------------------------------------
Optimiser.param_groups:
[{'params': [Parameter containing:
tensor([[-0.1430, -0.0121,  0.3725, -0.2907],
        [ 0.4572,  0.0989,  0.1075, -0.0367]], requires_grad=True), Parameter containing:
tensor([-0.0995,  0.0171], requires_grad=True), Parameter containing:
tensor([[ 0.1915,  0.6281],
        [ 0.4855, -0.1574]], requires_grad=True), Parameter containing:
tensor([ 0.3206, -0.0072], requires_grad=True)], 'rho': 0.05, 'lr': 0.1, 'momentum': 0.9, 'weight_decay': 0.0005, 'dampening': 0, 'nesterov': False, 'maximize': False, 'foreach': None}]
------------------------------------
Optimiser.base_optimizer.defaults:
{'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0005, 'nesterov': False, 'maximize': False, 'foreach': None}
------------------------------------
Optimiser.base_optimizer.param_groups:
[{'params': [Parameter containing:
tensor([[-0.1430, -0.0121,  0.3725, -0.2907],
        [ 0.4572,  0.0989,  0.1075, -0.0367]], requires_grad=True), Parameter containing:
tensor([-0.0995,  0.0171], requires_grad=True), Parameter containing:
tensor([[ 0.1915,  0.6281],
        [ 0.4855, -0.1574]], requires_grad=True), Parameter containing:
tensor([ 0.3206, -0.0072], requires_grad=True)], 'rho': 0.05, 'lr': 0.1, 'momentum': 0.9, 'weight_decay': 0.0005, 'dampening': 0, 'nesterov': False, 'maximize': False, 'foreach': None}]
------------------------------------

I am struggling to understand this discrepancy in performance, because the relevant attributes printed above (param_groups, defaults, etc.) appear to be identical. The only difference is the order of the key-value pairs in the dictionaries: rho appears at the end with minimiser_1 and towards the start with minimiser_2.
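
Since the printouts only compare values, the one extra check I can think of is whether the wrapper and the base optimiser actually share the same group objects rather than merely holding equal copies. A rough sketch of that check, reusing the toy model above:

import torch.optim as optim

opt1 = minimiser_1(model.parameters(), optim.SGD, lr=0.1, momentum=0.9, weight_decay=5e-4)
opt2 = minimiser_2(model.parameters(), optim.SGD, lr=0.1, momentum=0.9, weight_decay=5e-4)

# Compare object identity (not just value equality) of the param_groups
# containers held by the wrapper and by its base optimiser.
for name, opt in (("minimiser_1", opt1), ("minimiser_2", opt2)):
    same_list = opt.param_groups is opt.base_optimizer.param_groups
    same_dicts = all(g is bg for g, bg in zip(opt.param_groups,
                                              opt.base_optimizer.param_groups))
    print(name, "- same list object:", same_list, "- same group dicts:", same_dicts)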

Can anyone provide any insight?