How to make the parameter of torch.nn.Threshold learnable?

dlmacedo · July 9, 2017, 2:20pm

In the following code, I would like to make delta a parameter learnable by the model instead of a fixed scalar value.

I would like to have the option of a learnable delta for each component of the input tensor or for each layer.

 self.features = nn.Sequential(
      nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2),
      nn.Threshold(-delta, -delta, inplace=True),
      nn.MaxPool2d(kernel_size=3, stride=2),
      nn.Conv2d(64, 192, kernel_size=5, padding=2),
      nn.Threshold(-delta, -delta, inplace=True),
      nn.MaxPool2d(kernel_size=3, stride=2),
 )

hughperkins · July 9, 2017, 2:55pm

I tried using torch.clamp, but it also seems to be non-differentiable with respect to the threshold:

import torch
from torch import autograd

threshold = autograd.Variable(torch.rand(1), requires_grad=True)
print('threshold', threshold)
# m = torch.nn.Threshold(threshold, threshold)
input = autograd.Variable(torch.rand(1, 5), requires_grad=True) - 0.5
print('input', input)
# out = m(input)
out = torch.clamp(input, min=threshold)
print('out', out)
out.backward(torch.ones(1, 5))
print('threshold.grad.data', threshold.grad.data)

> Traceback (most recent call last):
>   File "4729.py", line 11, in <module>
>     out = torch.clamp(input, min=threshold)
>   File "/Users/hugh2/conda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/variable.py", line 396, in clamp
>     return CmaxConstant(min)(self)
>   File "/Users/hugh2/conda3/envs/pytorch/lib/python3.6/site-packages/torch/autograd/_functions/pointwise.py", line 232, in forward
>     self._max_buffer = i.gt(self.constant).type_as(i)
> TypeError: gt received an invalid combination of arguments - got (Variable), but expected one of:
>  * (float value)
>       didn't match because some of the arguments have invalid types: (Variable)
>  * (torch.FloatTensor other)
>       didn't match because some of the arguments have invalid types: (Variable)

I tried it in TensorFlow, and it seemed to work OK:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    input_t = tf.placeholder(tf.float32, [None], 'input')
    threshold_t = tf.Variable(0.05)
    out_t = tf.minimum(input_t, threshold_t)
    sess = tf.Session()
    with sess.as_default():
        sess.run(tf.global_variables_initializer())
        print('out', sess.run(out_t, feed_dict={input_t: [-0.3, 0.0, 0.7]}))

        # get grad of out_t wrt threshold_t
        grad_out_t = tf.gradients(out_t, [threshold_t])[0]
        print('d(out)/d(threshold)', sess.run(grad_out_t, feed_dict={input_t: [-0.3, 0.0, 0.7]}))
        print('d(out)/d(threshold)', sess.run(grad_out_t, feed_dict={input_t: [-0.3, 0.0, -0.7]}))
        print('d(out)/d(threshold)', sess.run(grad_out_t, feed_dict={input_t: [-0.3, 0.5, 0.7]}))

out [-0.30000001  0.          0.05      ]
d(out)/d(threshold) 1.0
d(out)/d(threshold) 0.0
d(out)/d(threshold) 2.0
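
For comparison, here is a rough PyTorch sketch of the same experiment. It assumes a recent PyTorch release in which torch.minimum accepts two tensors and broadcasts them (this was not available in the version used above); the helper name grad_wrt_threshold is made up for illustration.

```python
import torch

# Same threshold value as in the TensorFlow snippet.
threshold = torch.tensor(0.05, requires_grad=True)

def grad_wrt_threshold(values):
    """Return d(sum(min(x, threshold)))/d(threshold) for a list of inputs."""
    x = torch.tensor(values)
    out = torch.minimum(x, threshold)  # element-wise min, threshold is broadcast
    (grad,) = torch.autograd.grad(out.sum(), threshold)
    return grad.item()

# Each element above the threshold contributes 1 to the gradient.
print(grad_wrt_threshold([-0.3, 0.0, 0.7]))
print(grad_wrt_threshold([-0.3, 0.0, -0.7]))
print(grad_wrt_threshold([-0.3, 0.5, 0.7]))
```

The gradient counts the elements that were replaced by the threshold, which is the same pattern as the TensorFlow output.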

Edit: I guess maybe this needs the Scalar thing that’s on the way, in order to be solved?

tom · July 9, 2017, 6:52pm

out = input.max(threshold)

Best regards

Thomas

dlmacedo · July 10, 2017, 1:57am

 threshold = autograd.Variable(torch.rand(1), requires_grad=True)
 self.features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2),
    nn.Max(threshold),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.Max(threshold),
    nn.MaxPool2d(kernel_size=3, stride=2),
 )

Do you think the code above should work?

hughperkins · July 11, 2017, 1:04pm

As far as I can tell, it’s not possible. Also see Creating a custom loss function

So, it looks like you could create a custom autograd module to handle this. If it were me, I might consider logging it on the PyTorch issues page and/or submitting the custom autograd module as a PR.

albanD · July 11, 2017, 1:09pm

Hi,

The thing is that the threshold operation is not differentiable wrt the threshold value.
More specifically, the operation it performs for each element is:

if inp[el] <= threshold:
    out[el] = thresholded_value
else:
    out[el] = inp[el]

What is d(out)/d(threshold) here?

hughperkins · July 11, 2017, 1:11pm

It works in TensorFlow. I reckon it’s not differentiable at the threshold itself, but it’s differentiable almost everywhere?

albanD · July 11, 2017, 1:16pm

Well, the threshold_value will have a gradient that accumulates the grad_out for every element where it has been thresholded. So this one you could learn in theory, even though I am not sure what that means in practice.

The threshold is definitely not learnable with pure gradients, or maybe I am missing something? What would be the gradient “almost everywhere”?

hughperkins · July 11, 2017, 1:28pm

So, we have:

hughperkins · July 11, 2017, 1:30pm

(by the way, you can see that the theoretical result I’ve proposed matches the results I’m getting from tensorflow)

tom · July 11, 2017, 1:31pm

Sum of the output_grad for things below the threshold, zero otherwise.
You can see this by looking sternly at the max formulation or, if you prefer, rewrite as relu(x-t)+t.
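
A small sketch to check this numerically, assuming a recent PyTorch release where torch.maximum is available (the variable names and sample values are only for illustration):

```python
import torch

x = torch.tensor([-0.3, 0.0, 0.7])
t = torch.tensor(0.05, requires_grad=True)

# Two formulations of "clamp x from below at t".
out_max = torch.maximum(x, t)      # max formulation
out_relu = torch.relu(x - t) + t   # relu(x - t) + t formulation

# Gradient of the summed output wrt t, i.e. output_grad is all ones.
(g_max,) = torch.autograd.grad(out_max.sum(), t)
(g_relu,) = torch.autograd.grad(out_relu.sum(), t)

print(out_max, out_relu)
print(g_max.item(), g_relu.item())  # two elements of x lie below t
```

With output_grad equal to one everywhere, both gradients equal the number of elements below the threshold.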

Best regards

Thomas

albanD · July 11, 2017, 2:36pm

@tom good point when threshold == threshold_value.
But can you get a similar expression for the general formula of threshold when they are not equal?

tom · July 11, 2017, 7:04pm

Hi @albanD,

No, I do not know what to do then with respect to the input cut-off. And with my bias towards theory, using a discontinuous function seems unintuitive, too.
In fact, I prefer to think about this as shrinkage (i.e. relu(x-t), with its well-studied connections e.g. to a quadratic activation penalty or regression with noisy observation) plus a bias and don’t really like to think about thresholding. If you fed the output to (optionally) relu plus a layer that uses bias, I would think you do not need the offset +t outside the relu at all.

But that’s me.

Best regards

Thomas

hughperkins · July 12, 2017, 12:13am

I’m like 80% sure it should be differentiable. What makes you feel it would not be?

By the way, one of the nice things about tf is, I’ve tried some really convoluted bizarre cost functions, and they’ve all been differentiable. It’s quite nice…

hughperkins · July 12, 2017, 12:19am

Oh, you’re right. Fair enough.

dlmacedo · July 12, 2017, 3:23am

What is wrong below?

import torch
from torch import autograd

import torch.nn as nn

__all__ = [
    'CNN', 'cnn',
]

class CNN(nn.Module):
    def __init__(self, dataset):
        super(CNN, self).__init__()
        self.threshold = autograd.Variable(torch.rand(1), requires_grad=True)
        self.threshold.data.fill_(0.05)
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=None, padding=0)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.maxpool(x)
        #x = x + self.threshold.expand_as(x)
        x = x + self.threshold
        x = self.relu(x)
        x = self.conv2(x)
        x = self.conv2_drop(x)
        x = self.maxpool(x)
        x = self.relu(x)
        x = x.view(-1, 320)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        return x

def cnn(dataset):
    model = CNN(dataset)
    return model

Error:

Traceback (most recent call last):
  File "train.py", line 390, in <module>
    main()
  File "train.py", line 165, in main
    cnn(args.epochs, train_loader, val_loader, model, criterion, optimizer, experiment)
  File "train.py", line 200, in cnn
    training_time += train(train_loader, model, criterion, optimizer, epoch)
  File "train.py", line 260, in train
    output = model(input_var)
  File "/home/dlm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlm/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/dlm/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dlm/code/deeplearninglab/sem/models/cnn.py", line 27, in forward
    x = x + self.threshold
  File "/home/dlm/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 745, in __add__
    return self.add(other)
  File "/home/dlm/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 283, in add
    return self._add(other, False)
  File "/home/dlm/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 277, in _add
    return Add(inplace)(self, other)
  File "/home/dlm/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/basic_ops.py", line 20, in forward
    return a.add(b)
TypeError: add received an invalid combination of arguments - got (torch.FloatTensor), but expected one of:
 * (float value)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.cuda.FloatTensor other)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.cuda.sparse.FloatTensor other)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (float value, torch.cuda.FloatTensor other)
 * (float value, torch.cuda.sparse.FloatTensor other)

tom · July 12, 2017, 3:57am

I’d use nn.Parameter instead of Variable for the Parameter. You did call model.cuda, probably.

Best regards

Thomas

albanD · July 12, 2017, 8:50am

You should do self.threshold = nn.Parameter(torch.rand(1)).
All parameters of an nn.Module must be nn.Parameters; otherwise they won’t appear when you call .parameters() and won’t move when you call .cuda() (which is your problem here).
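
A minimal sketch of that advice, assuming a recent PyTorch release. The module name LearnableThreshold, its arguments, and the initial value are made up for illustration and do not come from the thread. It registers delta as an nn.Parameter, either as one scalar per layer or as one value per channel, and uses the relu form of max(x, -delta) discussed earlier:

```python
import torch
import torch.nn as nn

class LearnableThreshold(nn.Module):
    """Hypothetical layer computing max(x, -delta) with a learnable delta."""

    def __init__(self, num_channels=None, init=0.05):
        super().__init__()
        # One scalar per layer, or one value per channel of an NCHW input.
        shape = (1,) if num_channels is None else (1, num_channels, 1, 1)
        self.delta = nn.Parameter(torch.full(shape, init))
        # Plain tensor attribute for contrast: it is NOT registered.
        self.not_registered = torch.zeros(1)

    def forward(self, x):
        # max(x, -delta) written as relu(x + delta) - delta
        return torch.relu(x + self.delta) - self.delta

layer = LearnableThreshold(num_channels=64)
names = [name for name, _ in layer.named_parameters()]
print(names)  # only the nn.Parameter is listed

x = torch.randn(2, 64, 8, 8)
layer(x).sum().backward()
print(layer.delta.grad.shape)
```

Because delta is an nn.Parameter, it is returned by .parameters(), so an optimizer built from model.parameters() updates it, and it moves with the module on .cuda() or .to(device).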

dlmacedo · November 11, 2017, 1:09am

class DDReLU(nn.Module):
    def __init__(self):
        super(DDReLU, self).__init__()
        self.threshold = nn.Parameter(torch.rand(1), requires_grad=True)
        self.register_backward_hook(lambda module, grad_i, grad_o: (grad_i[0], grad_i[1]*0.01))
        #self.threshold.data.fill_(0.1)
        self.ReLU = nn.ReLU(True)

    def forward(self, x):
        print(self.threshold.data[0])
        return self.ReLU(x + self.threshold) - self.threshold
        #return self.ReLU(x) + self.threshold

Is the code above a correct way to change the relative learning rate of the new parameter?

By relative learning rate, I mean that the new parameter has a learning rate that is 0.01 times the one used for the model’s other parameters.
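
One alternative that does not rely on a backward hook is to use the optimizer's per-parameter options (parameter groups). The following is only a sketch under assumptions: the small nn.Sequential model, base_lr, and the name-based filter are made up for illustration, and DDReLU is reduced to its forward computation.

```python
import torch
import torch.nn as nn

class DDReLU(nn.Module):
    def __init__(self):
        super().__init__()
        self.threshold = nn.Parameter(torch.rand(1))

    def forward(self, x):
        return torch.relu(x + self.threshold) - self.threshold

# Hypothetical toy model, only to have some "other" parameters.
model = nn.Sequential(nn.Linear(4, 8), DDReLU(), nn.Linear(8, 2))

base_lr = 0.1
threshold_params = [p for n, p in model.named_parameters() if n.endswith("threshold")]
other_params = [p for n, p in model.named_parameters() if not n.endswith("threshold")]

optimizer = torch.optim.SGD(
    [
        {"params": other_params},                           # uses the default lr
        {"params": threshold_params, "lr": base_lr * 0.01},  # 0.01 x base lr
    ],
    lr=base_lr,
)

for group in optimizer.param_groups:
    print(len(group["params"]), group["lr"])
```

With plain SGD, scaling a parameter's gradient by 0.01 and scaling its learning rate by 0.01 give the same update; with adaptive optimizers such as Adam they generally do not, so a separate parameter group states the intent more directly.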

sparseinference · December 3, 2017, 12:50pm

I’ve been experimenting with learning the threshold parameters for expressions like .clamp(min=lower) where ‘lower’ is a Module Parameter.

Here is a function for accomplishing it for clamping to zero or negative values:

def Clamp(x, minval):
    """
    Clamps Variable x to minval.
    minval <= 0.0
    """
    return x.clamp(max=0.0).sub(minval).clamp(min=0.0).add(minval) + x.clamp(min=0.0)

With some extra work the same could be done for .clamp(max=upper).
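
One possible way to get the upper-bound version, sketched under assumptions: reuse the function above through the identity min(x, u) = -max(-x, -u), which requires upper >= 0.0. The names clamp_min_learnable and clamp_max_learnable are made up for illustration, and the example assumes a recent PyTorch release.

```python
import torch

def clamp_min_learnable(x, minval):
    """Formulation from the post above; assumes minval <= 0.0."""
    return x.clamp(max=0.0).sub(minval).clamp(min=0.0).add(minval) + x.clamp(min=0.0)

def clamp_max_learnable(x, maxval):
    """Hypothetical mirror image; assumes maxval >= 0.0."""
    # min(x, u) = -max(-x, -u)
    return -clamp_min_learnable(-x, -maxval)

x = torch.tensor([-2.0, -0.5, 0.5, 2.0])
upper = torch.tensor(1.0, requires_grad=True)

out = clamp_max_learnable(x, upper)
out.sum().backward()

print(out)         # only the last element exceeds upper and is clamped
print(upper.grad)  # gradient counts the clamped elements
```

Only elements that are actually clamped contribute to the gradient of upper, mirroring the behaviour of the lower-bound version.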