Hey all. (This is my first post, so please be patient.)

I am trying to do prototype-based metric learning with prototorch, a torch extension, where the prototypes live in a subspace.

An nn.Parameter projects these prototypes and the samples into this subspace. However, it is not updated during training, although gradients are computed for it.

The network is:

```
import torch
from torch import nn


class Model(torch.nn.Module):
    def __init__(self, num_classes, init_data, tangent_projection_type="local",
                 prototypes_per_class=2, bottleneck_dim=128):
        super().__init__()
        self.tpt = tangent_projection_type
        # Feature extractor
        self.fe = nn.Sequential(
            nn.Conv2d(1, 32, 3, 1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(9216, 128),
            nn.ReLU(),
            nn.Dropout(0.5))
        # Initial subspace: right singular vectors of the initial batch, shape d x d
        self.subspaces = torch.nn.Parameter(
            self.init_gobal_subspace(init_data)
            .clone().detach().requires_grad_(True))
        self.glvq = Prototypes1D(input_dim=128,
                                 prototypes_per_class=prototypes_per_class,
                                 nclasses=num_classes,
                                 prototype_initializer='zeros')
```
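For reference, the subspace initialization mentioned in the comment could look roughly like the sketch below. The helper name `right_singular_subspace` is hypothetical (the real `init_gobal_subspace` is not shown), and it assumes the input is already a feature matrix of shape (n_samples, d) rather than raw images.

```python
import torch


def right_singular_subspace(features: torch.Tensor) -> torch.Tensor:
    """Return a (d, d) matrix whose columns are the right singular vectors.

    `features` is assumed to have shape (n_samples, d).
    """
    # full_matrices=True makes Vh have shape (d, d) even when n_samples < d.
    _, _, vh = torch.linalg.svd(features, full_matrices=True)
    # Transpose so that the singular vectors are the columns.
    return vh.T.contiguous()


torch.manual_seed(0)
batch_features = torch.randn(64, 128)  # stand-in for one batch of 128-dim features
subspace = torch.nn.Parameter(right_singular_subspace(batch_features))
```

Wrapping the result in `torch.nn.Parameter` already sets `requires_grad=True`, so the extra `clone().detach().requires_grad_(True)` chain should not be strictly needed in a setup like this.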

and the feed forward pass is:

```
    def forward(self, x):
        # Feature extraction
        x = self.fe(x)
        # Tangent projection and distance
        x = x @ self.subspaces
        projected_prototypes = self.glvq.prototypes @ self.subspaces
        dis = euclidean_distance(x, projected_prototypes)
        # ... (rest of forward omitted)
```
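The projection step can be reproduced in isolation to confirm that a gradient reaches the subspace matrix. This is only a sketch with random stand-in tensors; `torch.cdist` is used in place of prototorch's `euclidean_distance`, whose exact behaviour (for example squared versus non-squared distances) may differ.

```python
import torch

torch.manual_seed(0)
features = torch.randn(8, 128)                          # stand-in for self.fe(x)
prototypes = torch.nn.Parameter(torch.randn(20, 128))   # stand-in for self.glvq.prototypes
subspace = torch.nn.Parameter(torch.eye(128))           # stand-in for self.subspaces

# Project samples and prototypes with the same matrix.
projected_x = features @ subspace
projected_prototypes = prototypes @ subspace

# Pairwise Euclidean distances between samples and prototypes, shape (8, 20).
distances = torch.cdist(projected_x, projected_prototypes)

# Backpropagate a dummy scalar and check that the subspace received a gradient.
distances.sum().backward()
has_gradient = subspace.grad is not None and subspace.grad.abs().sum().item() > 0
print(distances.shape, has_gradient)
```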

Using a prototype-based approach, the distance between samples and their correct prototypes is minimized.

Currently I train it with the following loop:

```
for epoch in range(n_epochs):
    for batch_idx, (x_train, y_train) in enumerate(train_loader):
        # Compute loss.
        distances, plabels = model(x_train)
        loss = criterion([distances, plabels], y_train)
        control = model.subspaces.clone()
        # Take a gradient descent step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        model.subspaces = nn.Parameter(orthogonalization(model.subspaces))
```
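One thing that may be worth checking in a loop like this: assigning a fresh `nn.Parameter` to `model.subspaces` creates a new tensor object, while an optimizer built earlier from `model.parameters()` keeps a reference to the old one. The sketch below illustrates that with a hypothetical `Tiny` module and an `is_tracked` helper, and uses QR as a stand-in for the `orthogonalization` function, which is not shown. Since removing the orthogonalization reportedly does not change the behaviour, this may not be the whole story here.

```python
import torch
from torch import nn


class Tiny(nn.Module):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.subspaces = nn.Parameter(torch.eye(4))


def is_tracked(optimizer, param):
    """True if `param` is the very tensor object the optimizer updates."""
    return any(p is param for g in optimizer.param_groups for p in g["params"])


model = Tiny()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
tracked_before = is_tracked(optimizer, model.subspaces)

# Reassigning registers a brand-new Parameter; the optimizer still holds the old one.
model.subspaces = nn.Parameter(model.subspaces.detach().clone())
tracked_after_reassign = is_tracked(optimizer, model.subspaces)

# In-place alternative: overwrite the values while keeping the same tensor object.
model2 = Tiny()
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
with torch.no_grad():
    q, _ = torch.linalg.qr(model2.subspaces)  # QR as a stand-in orthogonalization
    model2.subspaces.copy_(q)
tracked_after_copy = is_tracked(optimizer2, model2.subspaces)

print(tracked_before, tracked_after_reassign, tracked_after_copy)
```

Copying the orthogonalized values in place under `torch.no_grad()` keeps the parameter the optimizer already knows about, so later `optimizer.step()` calls still act on it.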

Note that the orthogonalization is necessary, and that removing it does not change the behaviour.

When I remove the subspace from the forward pass, the network learns fine. However, with the subspace included, the difference between the subspace before and after the optimizer step is zero:

```
Epoch: 01/50 Epoch Progress: 1.07 % Loss: 20.11 Subspace Difference: 0.00
Epoch: 01/50 Epoch Progress: 1.60 % Loss: 21.93 Subspace Difference: 0.00
Epoch: 01/50 Epoch Progress: 2.13 % Loss: 21.23 Subspace Difference: 0.00
Epoch: 01/50 Epoch Progress: 2.67 % Loss: 21.22 Subspace Difference: 0.00
Epoch: 01/50 Epoch Progress: 3.20 % Loss: 20.94 Subspace Difference: 0.00
Epoch: 01/50 Epoch Progress: 3.73 % Loss: 23.43 Subspace Difference: 0.00
Epoch: 01/50 Epoch Progress: 4.27 % Loss: 21.33 Subspace Difference: 0.00
```
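The code that produces the "Subspace Difference" column is not shown, so the following is only a sketch of one way to measure it, assuming it is the norm of the change across a single optimizer step. It uses a stand-in parameter and a deliberately small learning rate to show that a two-decimal format can print 0.00 for an update that is small but nonzero.

```python
import torch
from torch import nn

torch.manual_seed(0)
weight = nn.Parameter(torch.randn(4, 4))        # stand-in for model.subspaces
optimizer = torch.optim.SGD([weight], lr=1e-4)  # deliberately small learning rate

before = weight.detach().clone()                # snapshot taken before the step
loss = (weight ** 2).sum()                      # dummy loss for illustration
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Frobenius norm of the change caused by this single step.
difference = (weight.detach() - before).norm().item()
two_decimals = f"{difference:.2f}"
scientific = f"{difference:.3e}"
print("Subspace Difference:", two_decimals, "vs", scientific)
```

Printing in scientific notation, or comparing with `torch.equal`, distinguishes "exactly unchanged" from "changed by less than the print precision".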

Am I doing something wrong in general? Or is this the wrong way to use nn.Parameter?

I would be very glad if you could help me further.

Thank you very much