If I reshape the input to [batch_num * temporal_length, channel_dim] and then apply nn.BatchNorm1d(num_features=channel_dim),
will this work correctly?
I would permute
the input to the shape [batch_size, channels, seq_len]
and apply the batchnorm layer.
This would normalize each channel over the batch and temporal dimensions, using the per-channel stats.
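As a minimal sketch of how the two approaches relate (the sizes and the variable names `bn_flat` and `bn_perm` are illustrative assumptions, and both layers are assumed to be freshly initialized and in training mode), the reshape from the question and the permute suggested here should compute the same per-channel statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, C = 4, 5, 3  # batch size, sequence length, channels
x = torch.randn(N, L, C) * 10 + 5  # channels-last input

# Option A: flatten batch and time, so each channel is normalized over N*L rows
bn_flat = nn.BatchNorm1d(C)
out_flat = bn_flat(x.reshape(N * L, C)).reshape(N, L, C)

# Option B: permute to [N, C, L], normalize, then permute back
bn_perm = nn.BatchNorm1d(C)
out_perm = bn_perm(x.permute(0, 2, 1)).permute(0, 2, 1)

# Both variants reduce over the batch and temporal dimensions per channel
same = torch.allclose(out_flat, out_perm, atol=1e-5)
print(same)
```

Either layout reduces over the same N*L values per channel, so the choice mostly comes down to which shape the rest of the model expects.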
Yes, but in this way, isn't the feature normalized within a seq_len rather than across the batch?
I'm not sure what "within a seq_len not a batch" means, but this code snippet shows how the normalization is applied internally:
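import torch
import torch.nn as nn
N, C, L = 2, 3, 4
x = torch.randn(N, C, L) * 10 + 5
bn = nn.BatchNorm1d(C)
out = bn(x)
# per-channel stats over the batch (0) and temporal (2) dimensions
out_manual = (x - x.mean([0, 2], keepdim=True)) / x.std([0, 2], unbiased=False, keepdim=True)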
print(torch.allclose(out_manual, out))
> True
Hi,
Is there any way to use a functional call like F.relu
in an nn.Sequential container? I ran into this problem while implementing VGG. The official implementation does the following:
if batch_norm:
    layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
else:
    layers += [conv2d, nn.ReLU(inplace=True)]
However, I don't want to use nn.ReLU; I want to use F.relu instead. How can I do that?
Would something like this work, or is there another way to do it?
if batchnorm:
    layers += [conv2d, F.relu(nn.BatchNorm2d(v), inplace=True)]
else:
    layers += [F.relu(conv2d, inplace=True)]
Thanks for any ideas.
No, this will most likely not work, as an nn.Module
would be needed.
You could wrap the functional call into a custom module or just use nn.ReLU()
(which would be the same).
Why don't you want to use nn.ReLU()
directly?
I am passing these activation functions as a tuple (F.relu, F.selu, …) in a function argument so that they can be used after the conv layers.
The problem with nn.Module classes is that custom activation classes would be messy to write. Instead, other people could just pass a function in the argument, which would be helpful for future experiments with new custom functions.
So please tell me how I can solve this.
I don't understand the use case. How would you like to pass e.g. F.relu
as an argument to a conv layer?
I have used it before in other architectures. Here is a snippet:
class LeNet(nn.Module):
    ....
    def forward(self, x, activations=None):
        """
        Parameters:
            x: input tensor
            activations: set of 5 activation functions for each conv and linear layer.
        """
        act1 = act2 = act3 = act4 = act5 = F.relu
        if activations is not None:
            (act1, act2, act3, act4, act5) = activations
        x = act1(self.conv1(x))
        x = self.maxpool1(x) if self.pool1 == 'max' else self.avgpool1(x)
        ....
and I called the train function, which takes LeNet as the model:
model, train_acc = train(train_loader, LeNet, epochs=10, lr=rate, use_cuda=True, pools=pool, activations=activation,
The train function takes the model class and the activations, and applies the given combination of activations to the model:
def train(trainloader, model, epochs, activations):
    model = model(1, pools)
    pred = model(image, activations)
The problem is how to use a similar technique in the make_layers
function, which builds an nn.Sequential, so that it can be used for any model in the VGG family.
For example:
def make_layers(...., activations=(act1, act2....))
    else:
        conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
        if batch_norm:
            layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
        else:
            layers += [conv2d, nn.ReLU(inplace=True)]
so that the activations argument could replace nn.ReLU
with any built-in or custom activation function.
I wanted to use the functional API because other activations worked there while their nn counterparts did not. nn.ReLU can be passed in this case, but some other activation functions wouldn't work.
How can I solve this?
Help!! @ptrblck
You could assign the class to act
and create the module instance in the list
before wrapping it in an nn.Sequential
container:
act = nn.ReLU
layers += [conv2d, nn.BatchNorm2d(v), act(inplace=True)]
Thanks, but that means I can't use the functional API. You mentioned earlier that the functional call could be wrapped in a custom module.
How can I do that?
You would just recreate the nn.ReLU()
module, which wouldn't make any difference (or maybe it fits your use case perfectly and I misunderstand it):
class MyAct(nn.Module):
    def __init__(self, act):
        super(MyAct, self).__init__()
        self.act = act

    def forward(self, x):
        x = self.act(x)
        return x

layers += [conv2d, nn.BatchNorm2d(v), MyAct(F.relu)]
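To connect this wrapper to the make_layers use case, here is a rough sketch. The signature below (a `cfg` list, one callable per conv layer in `activations`) and the name `FunctionalAct` are assumptions made for illustration, not the torchvision implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FunctionalAct(nn.Module):
    """Wraps a plain callable so it can be placed inside nn.Sequential."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x)


def make_layers(cfg, activations, batch_norm=False, in_channels=3):
    # Hypothetical signature: `activations` holds one callable per conv layer.
    layers = []
    acts = iter(activations)
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            continue
        layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))
        if batch_norm:
            layers.append(nn.BatchNorm2d(v))
        # Consume the next activation in order and wrap it as a module
        layers.append(FunctionalAct(next(acts)))
        in_channels = v
    return nn.Sequential(*layers)


# Two conv layers, so two activation functions are passed
features = make_layers([8, "M", 16], activations=(F.relu, F.selu), batch_norm=True)
out = features(torch.randn(2, 3, 8, 8))
print(out.shape)
```

Any callable that maps a tensor to a tensor can be passed this way, including custom functions, as long as the tuple has one entry per conv layer in `cfg`.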
Thanks, that will solve the problem.
Is there a way to take the mean/std across only the batch dimension, and thus elementwise normalize every element in each tensor of shape [C, L]
? I.e., the equivalent of this:
out_manual = (x - mean(x, dim=0)) / std(x, dim=0)
Would reshaping to [N, C*L]
, and then nn.BatchNorm1d(num_features=C*L)
do the trick?
Or, although it's an abuse of the concept of layer normalization, would this be better/more performant:
x = x.permute(1, 2, 0) # [C, L, N]
nn.LayerNorm(N)
The problem in this latter case is that the model has to be initialized with the batch size (and thus this must stay constant for the entire training).
Reshaping the input and applying the batchnorm layer might work, if I understand the use case correctly.
Note that since you are normalizing each element individually, the stats would be calculated from the batch dimension only. For small batch sizes your stats might be very shaky, but maybe it fits your use case.
Let us know how your experiments go.
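A small check of the reshape idea, as a sketch under the assumption that the layer is in training mode with its default initialization (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, L = 8, 3, 4
x = torch.randn(N, C, L) * 10 + 5

# One feature per element of the [C, L] tensor
bn = nn.BatchNorm1d(C * L)
out = bn(x.reshape(N, C * L)).reshape(N, C, L)

# Manual element-wise normalization across the batch dimension only
mean = x.mean(dim=0, keepdim=True)
var = x.var(dim=0, unbiased=False, keepdim=True)
out_manual = (x - mean) / torch.sqrt(var + bn.eps)

matches = torch.allclose(out, out_manual, atol=1e-5)
print(matches)
```

Each of the C*L features sees only N values per step here, which is why the estimates get noisy for small batches.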
Hi, thanks for the answer.
Is there a point in applying a batch norm layer directly to the input, even instead of a normalizing transform?
I haven't seen many implementations using this approach and, as so often, it might depend on your use case. However, if the Normalize
transformation already normalizes the inputs to zero mean and unit variance, the batchnorm layer wouldn't do much more besides adding its affine parameters (if used).
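To illustrate, here is a sketch in which the batch is standardized with its own per-channel statistics. That is a simplifying assumption: a real Normalize transform would use fixed dataset statistics, so the match would only be approximate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8) * 10 + 5

# Standardize each channel, standing in for a Normalize transform with exact stats
mean = x.mean(dim=(0, 2, 3), keepdim=True)
std = x.std(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_norm = (x - mean) / std

# Freshly created layer: training mode, affine weight=1 and bias=0
bn = nn.BatchNorm2d(3)
out = bn(x_norm)

# The layer leaves already-standardized inputs almost unchanged
max_diff = (out - x_norm).abs().max().item()
print(max_diff)
```

The remaining difference comes from the eps term in the denominator; only the learnable affine parameters would change the output further once training starts.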
If the Normalize transform doesn't normalize the inputs, for example if the input is an online stream, or if preprocessing is expensive, would it make sense to use batch norm on the input?
It could work and you should definitely experiment with it.
Note that the running stats will be updated in each forward pass using the current batch statistics. If the statistics of your input data change after a while, the running stats of the batchnorm layer would also "track" these changes, and the layer might thus perform badly in evaluation mode on data resembling the first iterations.
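A toy sketch of this tracking behavior, assuming the default momentum of 0.1 and a synthetic stream whose mean shifts from about 0 to about 5 (the data and step counts are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(1)  # default momentum=0.1

with torch.no_grad():
    for _ in range(100):
        bn(torch.randn(256, 1))         # early data: mean around 0
    mean_early = bn.running_mean.item()

    for _ in range(100):
        bn(torch.randn(256, 1) + 5.0)   # later data: mean around 5
    mean_late = bn.running_mean.item()

    # In eval mode the layer normalizes with the running stats,
    # so early-style data now ends up far from zero mean
    bn.eval()
    eval_out_mean = bn(torch.randn(256, 1)).mean().item()

print(mean_early, mean_late, eval_out_mean)
```

A smaller momentum would make the running stats follow the shift more slowly, at the cost of adapting less to the newer data.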