This is a request for a new feature in PyTorch: a linear layer with noisy weights, as introduced in Noisy Networks for Exploration. The feature is useful for exploration in RL problems. I've already opened an issue on the PyTorch GitHub, but it seems to have fallen through the cracks over there. The issue links to a basic implementation of the layer I've created. It needs some tweaking and improvement, but I want to get the okay to go ahead before I put any more time in.
It seems to be possible to create a NoisyModule from any module by looping over its parameters:

    class NoisyModule(nn.Module):
        def __init__(self, module):
            super(NoisyModule, self).__init__()
            self.module = module
            for name, param in self.module.named_parameters():
                sigma_param = Parameter(torch.Tensor(param.size()))
                epsilon_buffer = torch.Tensor(param.size())
                self.register_parameter("sigma_" + name, sigma_param)
                self.register_buffer("epsilon_" + name, epsilon_buffer)

(Note: `named_parameters()` is needed to get the parameter names, and the epsilon noise should be a plain tensor registered as a buffer, not a `Parameter`, since it isn't optimized.)
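For comparison, here is a minimal self-contained sketch of a single noisy linear layer using the factorised-Gaussian scheme from the Noisy Networks paper. The class and method names (`NoisyLinear`, `reset_noise`) are my own choices, not an existing PyTorch API:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Sketch of a linear layer with factorised Gaussian weight noise."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super(NoisyLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Learnable means and noise scales.
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        # Noise samples are buffers: not optimized, but moved/saved with the module.
        self.register_buffer("weight_epsilon", torch.empty(out_features, in_features))
        self.register_buffer("bias_epsilon", torch.empty(out_features))
        self.sigma0 = sigma0
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(self.sigma0 / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(self.sigma0 / math.sqrt(self.in_features))

    @staticmethod
    def _f(x):
        # f(x) = sgn(x) * sqrt(|x|), as in the factorised scheme.
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # Factorised noise: one vector per input, one per output, combined
        # via an outer product instead of a full noise matrix.
        eps_in = self._f(torch.randn(self.in_features))
        eps_out = self._f(torch.randn(self.out_features))
        self.weight_epsilon.copy_(eps_out.ger(eps_in))
        self.bias_epsilon.copy_(eps_out)

    def forward(self, x):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_epsilon
            bias = self.bias_mu + self.bias_sigma * self.bias_epsilon
        else:
            # Noise off at evaluation time: use the mean weights only.
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```

Noise is resampled only when `reset_noise()` is called, so the caller controls how long a noise sample stays fixed, and `model.eval()` disables the noise entirely.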
How has that noisy network been doing in your RL tests?
Another change you can make to get some of that noisy effect is to use the randomized ReLU activation, RReLU. It performed quite well on Atari when I tried it, better than the ELU a lot of people are using: it not only learned faster but also led to a more robust and stable model.
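For anyone who wants to try this, `nn.RReLU` is already in PyTorch: it samples the negative-side slope uniformly from `[lower, upper]` during training and uses the fixed average slope at evaluation time. A minimal drop-in example (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# RReLU with PyTorch's default slope range [1/8, 1/3].
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.RReLU(lower=1.0 / 8, upper=1.0 / 3),
    nn.Linear(16, 4),
)

x = torch.randn(5, 8)
model.train()
y_train = model(x)  # stochastic: negative slopes sampled per element
model.eval()
y_eval = model(x)   # deterministic: slope fixed at (lower + upper) / 2
```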
Also, scaling the noise in proportion to the variance it causes in the outputs, as in OpenAI's version, might help as well:
^look at the paragraph titled "Adaptive Noise Scaling" in Section 3
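The idea in that paragraph can be sketched in a few lines. This is my own hypothetical rendering (function and parameter names are made up): grow the noise scale when the perturbation moves the policy output less than a target distance, and shrink it when it moves it more:

```python
import torch

def adapt_noise_scale(policy_out, perturbed_out, sigma, target_distance, alpha=1.01):
    """Hypothetical sketch of adaptive noise scaling: compare the unperturbed
    and perturbed policy outputs, then grow or shrink sigma multiplicatively."""
    # Distance between the two outputs (RMS over elements).
    distance = torch.sqrt(((policy_out - perturbed_out) ** 2).mean())
    if distance < target_distance:
        sigma *= alpha  # perturbation too weak: increase noise
    else:
        sigma /= alpha  # perturbation too strong: decrease noise
    return sigma
```

The appeal is that you pick a target distance in action space rather than a noise scale in parameter space, which is much harder to set by hand.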
This isn’t quite so trivial. First, we need to be able to turn off the noise for testing. Second, we need to be able to choose when to resample the noise tensors, since for NoisyNet-A3C it’s supposed to be resampled after every k-step policy rollout. But yes, it should be possible to implement for other kinds of layers, although defining the forward pass might be challenging for some.
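Those two requirements can be met by convention rather than by PyTorch machinery: let each noisy layer hold its noise in buffers, expose a resampling method, and ignore the noise in eval mode. A sketch of the control flow, assuming each noisy layer defines a `reset_noise()` method (a hypothetical name, not an existing API):

```python
def resample_noise(model):
    """Resample noise in every submodule that supports it.

    Assumes noisy layers store their noise in buffers and only draw fresh
    samples when reset_noise() is called, so noise stays fixed across the
    forward passes of a k-step rollout.
    """
    for m in model.modules():
        if hasattr(m, "reset_noise"):
            m.reset_noise()

# Intended usage in a NoisyNet-A3C-style loop (pseudocode):
# for rollout in range(num_rollouts):
#     resample_noise(model)          # fresh noise once per rollout
#     for step in range(k):
#         action = model(obs)        # same noise sample for all k steps
# model.eval()                       # turns the noise off for testing
```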
I hadn't heard of this paper, thanks! One convenient feature of the noisy networks paper is that you optimize the variance parameters along with the other parameters of the network, which removes the need for an adaptive scaling schedule like that. Looking forward to seeing a rigorous comparison of all these methods.