Is there a way to mix many different activation functions efficiently?

Suppose I want a layer to use 10 different activation functions. Right now the only way to do this seems to be creating multiple layers and concatenating them, which obviously introduces some computational inefficiency.

Is there some clever hack to apply point-wise nonlinearities based on masks or something?

The way I envision this possibly happening is by doing the linear transformation as usual, but then stacking multiple nonlinearities on top such that each nonlinearity layer ignores all but its own portion of the layer. Not sure how to approach that.


Activation functions or general pointwise functions? Everything applied at the same time on the same tensor or a string of pointwise functions on subtensors?

I'm not sure what distinction you are making between activation functions and “general pointwise functions”.

The goal is basically to have different neurons use different nonlinearities, not to apply different functions to the same neurons.

So why not have a series of tensors with value 0 or 1 to pre-multiply the input before each activation and accumulate in an output?
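Concretely, that multiply-and-accumulate idea might look like the sketch below (the shapes and the split of units across `relu`/`tanh`/`sigmoid` are my own example, not from the thread). Note that the mask has to be applied *after* each nonlinearity rather than to its input, since e.g. `sigmoid(0) = 0.5` would otherwise leak into the masked-out units:

```python
import torch

torch.manual_seed(0)
preactivation = torch.randn(4, 10)  # e.g. batch of 4, layer width 10

# One 0/1 mask per nonlinearity; together they partition the 10 units.
masks = torch.zeros(3, 10)
masks[0, :4] = 1   # units 0-3 -> relu
masks[1, 4:7] = 1  # units 4-6 -> tanh
masks[2, 7:] = 1   # units 7-9 -> sigmoid
nonlinearities = [torch.relu, torch.tanh, torch.sigmoid]

# Each nonlinearity is applied to the FULL tensor, then zeroed outside
# its mask and accumulated -- simple, but it does redundant work on the
# units that other masks own.
output = sum(mask * f(preactivation)
             for f, mask in zip(nonlinearities, masks))
```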

Would that be an efficient solution? If I understood you correctly, you would end up with a lot of function evaluations on zero inputs, or would those be optimized away?

That won’t be optimized away. If you really want to optimize it this should work:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

nonlinearities = [F.relu, F.tanh, F.sigmoid]
masks = generate_masks(input)  # a list of boolean masks, one per nonlinearity
preactivation = module(input)
# preallocate output
# IMPORTANT: it shouldn't require grad!
# you don't care about the grad w.r.t. the original (uninitialized) contents!
output = Variable(preactivation.data.new(preactivation.size()).zero_())
for nonlinearity, mask in zip(nonlinearities, masks):
    output[mask] = nonlinearity(preactivation[mask])

But I think the best way to see what’s fastest is to benchmark 🙂 You might find the simplest solution fastest.
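A minimal benchmark along those lines could compare the masked-assignment approach against the multiply-and-accumulate one (the sizes, the three-way split, and the helper names `masked_assign`/`multiply_accumulate` are my own illustration):

```python
import time
import torch

torch.manual_seed(0)
x = torch.randn(256, 1024)
nonlinearities = [torch.relu, torch.tanh, torch.sigmoid]

# Boolean masks splitting the 1024 features into three contiguous chunks.
bounds = [0, 341, 682, 1024]
masks = []
for i in range(3):
    m = torch.zeros(1024, dtype=torch.bool)
    m[bounds[i]:bounds[i + 1]] = True
    masks.append(m)

def masked_assign(x):
    # Apply each nonlinearity only to its own slice of the features.
    out = torch.empty_like(x)
    for f, m in zip(nonlinearities, masks):
        out[:, m] = f(x[:, m])
    return out

def multiply_accumulate(x):
    # Apply every nonlinearity to the full tensor, mask, and sum.
    return sum(m.float() * f(x) for f, m in zip(nonlinearities, masks))

# Both strategies should agree numerically.
assert torch.allclose(masked_assign(x), multiply_accumulate(x))

# Crude wall-clock comparison; results depend heavily on tensor sizes.
for name, fn in [("masked assign", masked_assign),
                 ("multiply+accumulate", multiply_accumulate)]:
    t0 = time.perf_counter()
    for _ in range(100):
        fn(x)
    print(name, time.perf_counter() - t0)
```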

Hello, I have the same question. Have you solved this problem?


Did you wind up solving this? I’m hoping to do the same thing to an existing, pre-trained network (VGG11), but I’m not sure I follow @apaszke’s suggested approach.

It sounds like I’ll need to implement a new mixed layer to replace an existing ReLU layer. In my new mixed layer I’ll need to generate a set of masks for each activation function I intend to use. Then I need to pass each masked input to the corresponding activation function, and assign the output of these to the corresponding masked output.

Does that all sound correct/possible?
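For what it’s worth, the mixed layer described above could be sketched as a small module like this (the name `MixedActivation` and the example shapes are my own, not from the thread; the VGG11 replacement line at the end assumes a torchvision-style `model.features` container):

```python
import torch
import torch.nn as nn

class MixedActivation(nn.Module):
    """Applies a different nonlinearity to each slice of the feature dim.

    `masks` is a list of boolean tensors, one per nonlinearity, that
    together partition the channel/feature dimension.
    """
    def __init__(self, nonlinearities, masks):
        super().__init__()
        self.nonlinearities = nonlinearities
        # Register masks as buffers so they move with .to(device).
        for i, m in enumerate(masks):
            self.register_buffer(f"mask_{i}", m)

    def forward(self, x):
        out = torch.empty_like(x)
        for i, f in enumerate(self.nonlinearities):
            m = getattr(self, f"mask_{i}")
            out[:, m] = f(x[:, m])  # only this slice sees this function
        return out

# Example: split 8 features between relu and tanh.
masks = [torch.arange(8) < 4, torch.arange(8) >= 4]
layer = MixedActivation([torch.relu, torch.tanh], masks)
inp = torch.randn(2, 8)
y = layer(inp)
```

To drop this into a pretrained network you would then replace an existing activation module, e.g. something like `model.features[1] = MixedActivation(...)`, with masks sized to that layer’s channel count.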

This is what I wound up doing: