Is there a way to mix many different activation functions efficiently?

Suppose I want a layer to use 10 different activation functions. Right now the only way to do this seems to be creating multiple layers and concatenating them, which obviously introduces some computational inefficiency.

Is there some clever hack to apply point-wise nonlinearities based on masks or something?

The way I envision this possibly happening is by doing the linear transformation as usual, but then stacking multiple nonlinearities on top such that each nonlinearity layer ignores all but its own portion of the layer. Not sure how to approach that.


Activation functions or general pointwise functions? Everything applied at the same time on the same tensor or a string of pointwise functions on subtensors?

I'm not sure what distinction you are making between activation functions and “general pointwise functions”.

The goal is basically to have different neurons use different nonlinearities, not to apply different functions to the same neurons.

So why not have a series of tensors with value 0 or 1 to pre-multiply the input before each activation and accumulate in an output?
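Concretely, that multiply-and-accumulate idea might look like the sketch below (the shapes and the split of units across `relu`/`tanh`/`sigmoid` are my own example, not from the thread). Note that the mask has to be applied *after* each nonlinearity rather than to its input, since e.g. `sigmoid(0) = 0.5` would otherwise leak into the masked-out units:

```python
import torch

torch.manual_seed(0)
preactivation = torch.randn(4, 10)  # e.g. batch of 4, layer width 10

# One 0/1 mask per nonlinearity; together they partition the 10 units.
masks = torch.zeros(3, 10)
masks[0, :4] = 1   # units 0-3 -> relu
masks[1, 4:7] = 1  # units 4-6 -> tanh
masks[2, 7:] = 1   # units 7-9 -> sigmoid
nonlinearities = [torch.relu, torch.tanh, torch.sigmoid]

# Each nonlinearity is applied to the FULL tensor, then zeroed outside
# its mask and accumulated -- simple, but it does redundant work on the
# units that other masks own.
output = sum(mask * f(preactivation)
             for f, mask in zip(nonlinearities, masks))
```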

Would that be an efficient solution? If I understood you correctly, you would end up with a lot of function evaluations on zero inputs, or would those be optimized away?

That won’t be optimized away. If you really want to optimize it this should work:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

nonlinearities = [F.relu, F.tanh, F.sigmoid]
masks = generate_masks(input)  # a list of boolean masks, one per nonlinearity
preactivation = module(input)
# preallocate output
# IMPORTANT: it shouldn't require grad!
# you don't care about the grad w.r.t. the original (uninitialized) contents!
output = Variable(preactivation.data.new(preactivation.size()).zero_())
for nonlinearity, mask in zip(nonlinearities, masks):
    output[mask] = nonlinearity(preactivation[mask])

But I think the best way to see what’s fastest is to benchmark 🙂 You might find the simplest solution fastest.
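A minimal benchmark along those lines could compare the masked-assignment approach against the multiply-and-accumulate one (the sizes, the three-way split, and the helper names `masked_assign`/`multiply_accumulate` are my own illustration):

```python
import time
import torch

torch.manual_seed(0)
x = torch.randn(256, 1024)
nonlinearities = [torch.relu, torch.tanh, torch.sigmoid]

# Boolean masks splitting the 1024 features into three contiguous chunks.
bounds = [0, 341, 682, 1024]
masks = []
for i in range(3):
    m = torch.zeros(1024, dtype=torch.bool)
    m[bounds[i]:bounds[i + 1]] = True
    masks.append(m)

def masked_assign(x):
    # Apply each nonlinearity only to its own slice of the features.
    out = torch.empty_like(x)
    for f, m in zip(nonlinearities, masks):
        out[:, m] = f(x[:, m])
    return out

def multiply_accumulate(x):
    # Apply every nonlinearity to the full tensor, mask, and sum.
    return sum(m.float() * f(x) for f, m in zip(nonlinearities, masks))

# Both strategies should agree numerically.
assert torch.allclose(masked_assign(x), multiply_accumulate(x))

# Crude wall-clock comparison; results depend heavily on tensor sizes.
for name, fn in [("masked assign", masked_assign),
                 ("multiply+accumulate", multiply_accumulate)]:
    t0 = time.perf_counter()
    for _ in range(100):
        fn(x)
    print(name, time.perf_counter() - t0)
```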

Hello, I have the same question. Have you solved this problem?


Did you wind up solving this? I’m hoping to do the same thing to an existing, pre-trained network (VGG11), but I’m not sure I follow @apaszke’s suggested approach.

It sounds like I’ll need to implement a new mixed layer to replace an existing ReLU layer. In my new mixed layer I’ll need to generate a set of masks for each activation function I intend to use. Then I need to pass each masked input to the corresponding activation function, and assign the output of these to the corresponding masked output.

Does that all sound correct/possible?
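For what it’s worth, the mixed layer described above could be sketched as a small module like this (the name `MixedActivation` and the example shapes are my own, not from the thread; the VGG11 replacement line at the end assumes a torchvision-style `model.features` container):

```python
import torch
import torch.nn as nn

class MixedActivation(nn.Module):
    """Applies a different nonlinearity to each slice of the feature dim.

    `masks` is a list of boolean tensors, one per nonlinearity, that
    together partition the channel/feature dimension.
    """
    def __init__(self, nonlinearities, masks):
        super().__init__()
        self.nonlinearities = nonlinearities
        # Register masks as buffers so they move with .to(device).
        for i, m in enumerate(masks):
            self.register_buffer(f"mask_{i}", m)

    def forward(self, x):
        out = torch.empty_like(x)
        for i, f in enumerate(self.nonlinearities):
            m = getattr(self, f"mask_{i}")
            out[:, m] = f(x[:, m])  # only this slice sees this function
        return out

# Example: split 8 features between relu and tanh.
masks = [torch.arange(8) < 4, torch.arange(8) >= 4]
layer = MixedActivation([torch.relu, torch.tanh], masks)
inp = torch.randn(2, 8)
y = layer(inp)
```

To drop this into a pretrained network you would then replace an existing activation module, e.g. something like `model.features[1] = MixedActivation(...)`, with masks sized to that layer’s channel count.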

This is what I wound up doing: