Is there a way to mix many different activation functions efficiently?

Suppose I want a layer to use 10 different activation functions. Right now the only way to do this seems to be creating multiple layers and concatenating them, which obviously introduces some computational inefficiency.

Is there some clever hack to apply point-wise nonlinearities based on masks or something?

The way I envision this possibly happening is by doing the linear transformation as usual, but then stacking multiple nonlinearities on top such that each nonlinearity layer ignores all but its own portion of the layer. Not sure how to approach that.


Activation functions or general pointwise functions? Everything applied at the same time on the same tensor or a string of pointwise functions on subtensors?

I'm not sure what distinction you are making between activation functions and “general pointwise functions”.

The goal is basically to have different neurons use different nonlinearities, not to apply different functions to the same neurons.

So why not have a series of tensors with value 0 or 1 to pre-multiply the input before each activation and accumulate in an output?
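Concretely, that multiply-and-accumulate idea might look like the sketch below (the shapes and the split of units across `relu`/`tanh`/`sigmoid` are my own example, not from the thread). Note that the mask has to be applied *after* each nonlinearity rather than to its input, since e.g. `sigmoid(0) = 0.5` would otherwise leak into the masked-out units:

```python
import torch

torch.manual_seed(0)
preactivation = torch.randn(4, 10)  # e.g. batch of 4, layer width 10

# One 0/1 mask per nonlinearity; together they partition the 10 units.
masks = torch.zeros(3, 10)
masks[0, :4] = 1   # units 0-3 -> relu
masks[1, 4:7] = 1  # units 4-6 -> tanh
masks[2, 7:] = 1   # units 7-9 -> sigmoid
nonlinearities = [torch.relu, torch.tanh, torch.sigmoid]

# Each nonlinearity is applied to the FULL tensor, then zeroed outside
# its mask and accumulated -- simple, but it does redundant work on the
# units that other masks own.
output = sum(mask * f(preactivation)
             for f, mask in zip(nonlinearities, masks))
```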

Would that be an efficient solution? If I understood you correctly, you would end up with a lot of function evaluations on zero inputs, or would those be optimized away?

That won’t be optimized away. If you really want to optimize it this should work:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

nonlinearities = [F.relu, F.tanh, F.sigmoid]
masks = generate_masks(input)  # a list of boolean masks, one per nonlinearity
preactivation = module(input)
# preallocate output
# IMPORTANT: it shouldn't require grad!
# you don't care about the grad w.r.t. the original (uninitialized) contents!
output = Variable(preactivation.data.new(preactivation.size()).zero_())
for nonlinearity, mask in zip(nonlinearities, masks):
    output[mask] = nonlinearity(preactivation[mask])

But I think the best way to see what’s fastest is to benchmark 🙂 You might find the simplest solution fastest.
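A minimal benchmark along those lines could compare the masked-assignment approach against the multiply-and-accumulate one (the sizes, the three-way split, and the helper names `masked_assign`/`multiply_accumulate` are my own illustration):

```python
import time
import torch

torch.manual_seed(0)
x = torch.randn(256, 1024)
nonlinearities = [torch.relu, torch.tanh, torch.sigmoid]

# Boolean masks splitting the 1024 features into three contiguous chunks.
bounds = [0, 341, 682, 1024]
masks = []
for i in range(3):
    m = torch.zeros(1024, dtype=torch.bool)
    m[bounds[i]:bounds[i + 1]] = True
    masks.append(m)

def masked_assign(x):
    # Apply each nonlinearity only to its own slice of the features.
    out = torch.empty_like(x)
    for f, m in zip(nonlinearities, masks):
        out[:, m] = f(x[:, m])
    return out

def multiply_accumulate(x):
    # Apply every nonlinearity to the full tensor, mask, and sum.
    return sum(m.float() * f(x) for f, m in zip(nonlinearities, masks))

# Both strategies should agree numerically.
assert torch.allclose(masked_assign(x), multiply_accumulate(x))

# Crude wall-clock comparison; results depend heavily on tensor sizes.
for name, fn in [("masked assign", masked_assign),
                 ("multiply+accumulate", multiply_accumulate)]:
    t0 = time.perf_counter()
    for _ in range(100):
        fn(x)
    print(name, time.perf_counter() - t0)
```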

Hello, I have the same question. Have you solved this problem?


Did you wind up solving this? I’m hoping to do the same thing to an existing, pre-trained network (VGG11), but I’m not sure I follow @apaszke’s suggested approach.

It sounds like I’ll need to implement a new mixed layer to replace an existing ReLU layer. In my new mixed layer I’ll need to generate a set of masks for each activation function I intend to use. Then I need to pass each masked input to the corresponding activation function, and assign the output of these to the corresponding masked output.

Does that all sound correct/possible?
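For what it’s worth, the mixed layer described above could be sketched as a small module like this (the name `MixedActivation` and the example shapes are my own, not from the thread; the VGG11 replacement line at the end assumes a torchvision-style `model.features` container):

```python
import torch
import torch.nn as nn

class MixedActivation(nn.Module):
    """Applies a different nonlinearity to each slice of the feature dim.

    `masks` is a list of boolean tensors, one per nonlinearity, that
    together partition the channel/feature dimension.
    """
    def __init__(self, nonlinearities, masks):
        super().__init__()
        self.nonlinearities = nonlinearities
        # Register masks as buffers so they move with .to(device).
        for i, m in enumerate(masks):
            self.register_buffer(f"mask_{i}", m)

    def forward(self, x):
        out = torch.empty_like(x)
        for i, f in enumerate(self.nonlinearities):
            m = getattr(self, f"mask_{i}")
            out[:, m] = f(x[:, m])  # only this slice sees this function
        return out

# Example: split 8 features between relu and tanh.
masks = [torch.arange(8) < 4, torch.arange(8) >= 4]
layer = MixedActivation([torch.relu, torch.tanh], masks)
inp = torch.randn(2, 8)
y = layer(inp)
```

To drop this into a pretrained network you would then replace an existing activation module, e.g. something like `model.features[1] = MixedActivation(...)`, with masks sized to that layer’s channel count.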

This is what I wound up doing: