What is wrong with my model?

witl · October 26, 2020, 7:36pm

Edit: Sorry, I’m a bit confused about all concepts and the code.

Could you guys help me figure out what I am doing wrong?

Note the entire code is here: https://gist.github.com/Willtl/4250d6391d40397b7d7d335510190802

I’m working with reinforcement learning. At each step t, I feed the current state (input) to the network, pick an action based on the output, and move to the new state. I then feed the new state to the network to get the maximal value from that new state, which is used to calculate the target.

input: tensor([[2., 0., 1., 1.]])
output: tensor([[-0.7809,  0.4925,  0.1809, -0.4934]], grad_fn=<AddmmBackward>)
target tensor([[-0.7809,  0.4357,  0.1809, -0.4934]], grad_fn=<CopySlices>)
q_target tensor(0.4357, grad_fn=<AddBackward0>)
loss tensor(0.0008, grad_fn=<MeanBackward0>)

After ANN.backward() (see the ANN class below), if I print the output obtained with the same “example” that I just used to train the model, the values are exactly the same. See below:

input: tensor([[2., 0., 1., 1.]])
output: tensor([[-0.7809,  0.4925,  0.1809, -0.4934]], grad_fn=<AddmmBackward>)

Q1: Is it ok in this situation to use MSELoss?

Q2: Why are the results not changing? Is it OK to feed the network multiple times? (The second feed, as I am doing it, is not supposed to count as a training step; it is only there to get the results for the new state.)

Q3: MSE should not be used here, right? Wouldn't it lead to a single value that is propagated to all the outputs? Wouldn't it be more correct to have a tensor such as loss = target - output, so that we know the error with respect to each output?

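import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

learning_rate = 1e-3  # Placeholder value; the post does not show the one actually used

class ANN(nn.Module):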
    # ANN's layer architecture
    def __init__(self):
        # Initialize superclass
        super().__init__()
        # Fully connected layers
        self.inputs = 4
        self.outputs = 4
        self.l1 = nn.Linear(self.inputs, 4)  # To disable bias use bias=False
        self.l2 = nn.Linear(4, 4)
        self.l3 = nn.Linear(4, 4)
        self.l4 = nn.Linear(4, self.outputs)

        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        self.loss_criterion = nn.MSELoss()

    # Define how the data passes through the layers
    def forward(self, x):
        # Pass x through each hidden layer and activate with the rectified linear unit (ReLU)
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = F.relu(self.l3(x))
        # Linear output layer
        x = self.l4(x)
        return x

    def feed(self, x):
        # Calling the module runs forward() (and any registered hooks)
        outputs = self(x)
        return outputs

    # Train the network with one state
    def backward(self, output, target):
        # Zero gradients
        self.optimizer.zero_grad()
        # Calculate loss
        loss = self.loss_criterion(output, target)
        # Perform a backward pass, and update the weights.
        loss.backward()
        self.optimizer.step()
        return loss
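
The thread never identifies why the printed values did not change, so the following is only a sketch of one common way to structure such an update, not the fix the author ended up with. The printouts above show that both target and q_target carry a grad_fn, meaning they are still attached to the computation graph. A frequent pattern is to build the target under torch.no_grad() and to run a fresh forward pass after optimizer.step() before printing. The stand-in network, gamma, the next state, the action, and the reward are all made-up values for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Small stand-in network (hypothetical; not the ANN class from the post)
net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
optimizer = optim.Adam(net.parameters(), lr=1e-2)
criterion = nn.MSELoss()

gamma = 0.9                                    # assumed discount factor
state = torch.tensor([[2., 0., 1., 1.]])
next_state = torch.tensor([[2., 1., 1., 1.]])  # made-up next state
action, reward = 1, 1.0                        # made-up transition

before = net(state).detach().clone()

# Forward pass that will be trained
output = net(state)

# Build the target without tracking gradients
with torch.no_grad():
    q_target = reward + gamma * net(next_state).max()
    target = output.detach().clone()
    target[0, action] = q_target               # only the chosen action differs

loss = criterion(output, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# A fresh forward pass is needed to see the updated values
after = net(state).detach()
changed = not torch.equal(before, after)
print("output changed:", changed)
```

Because target equals output everywhere except at the chosen action, the MSE gradient with respect to the outputs is zero for the other three entries, so only the selected action's Q-value is pushed toward q_target.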

ptrblck · October 29, 2020, 10:23am

I’m not familiar with your use case and cannot comment if the used methods are right or not.
However, if you are concerned about the correctness of the gradient calculation, you could check the .grad attribute of all parameters after the backward call.
Before the first backward they should be set to None, afterwards they should contain a valid value.
If all param.grad attributes are filled, then the computation graph is not detached and the gradients are calculated.
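
A minimal sketch of that check, using a small stand-in model and a made-up target rather than the ANN class from the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in model (hypothetical); any nn.Module can be checked the same way
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))

# Before the first backward call, every .grad attribute is None
grads_before = [p.grad for p in model.parameters()]

x = torch.tensor([[2., 0., 1., 1.]])
target = torch.zeros(1, 4)  # made-up target
loss = F.mse_loss(model(x), target)
loss.backward()

# After backward, each parameter should hold a gradient tensor
for name, p in model.named_parameters():
    grad_sum = None if p.grad is None else p.grad.abs().sum().item()
    print(name, grad_sum)
```

A parameter whose .grad is still None after backward was not part of the graph that produced the loss.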

witl · October 31, 2020, 6:08pm

I managed to fix it. Honestly, I rewrote the code and now it is working. It looks the same, except that now it works. I really appreciate your reply.

I have one or two more questions.

In addition to Q-learning and deep Q-learning, I am planning to use metaheuristics (e.g., differential evolution, genetic algorithms) instead of gradient descent to train the network.

As you probably know, in this context each metaheuristic solution is represented by a d-dimensional array, where d is the number of parameters of the ANN. Given that, I have a few questions:

How can I transform all the parameters of the ANN into two one-hot vectors (one vector for weights and one for biases)? Is there a more elegant way than:

    def get_parameters(self):
        one_hot = [] 
        for key in self.state_dict():
            for tensor in self.state_dict()[key]:
                if not tensor.size():
                    one_hot.append(float(tensor))
                else:
                    for value in tensor:
                        one_hot.append(float(value))
        return one_hot

How can I define a straightforward function that sets the ANN parameters given two one-hot vectors (weights, biases)?
Are there any bounds on the weights or biases? Should I constrain them to a specific range? What do you suggest?

This is more related to your experience in the field.

I wonder when it is recommended (or mandatory) to use metaheuristics instead of gradient descent. I believe it is related to the cases where you don’t have an input, output, and target. If you don’t have a target, you cannot calculate the loss, and then you cannot perform gradient descent. Is that right?
Do you have any example of problems where the model can only be trained by the usage of metaheuristics?

The idea of this project was for me to learn exactly that, the trade-offs between the usage of each strategy. With deep Q-learning, I am trying to approximate the Q-values. However, I can easily formulate it as an optimization problem where the metaheuristic is used to optimize the weights of the ANN and the objective function is to maximize the score (or it can be a multi-objective function such as maximize score and minimize moves, etc.).

To me, it looks like there is much more freedom in the metaheuristic case. Maybe that is because I am more used to formulating problems like that. Do you have any experience with that?
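
To make the optimization-problem formulation concrete, here is a rough sketch of a very simple metaheuristic (a mutation-based hill climber) searching over the flattened weights. It is not differential evolution or a genetic algorithm, and the fitness function is a made-up placeholder standing in for a game score. It relies on torch.nn.utils.parameters_to_vector and vector_to_parameters to move between the network and a flat vector.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

torch.manual_seed(0)

# Stand-in network (hypothetical; not the ANN class from the post)
net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))

def fitness(model):
    # Placeholder objective (higher is better). In the real project this
    # would be the score obtained by playing with the current weights.
    with torch.no_grad():
        x = torch.tensor([[2., 0., 1., 1.]])
        return -model(x).pow(2).sum().item()

# Start from the network's current parameters as a flat vector
best = parameters_to_vector(net.parameters()).detach().clone()
best_fit = fitness(net)
initial_fit = best_fit

for _ in range(200):
    candidate = best + 0.1 * torch.randn_like(best)    # Gaussian mutation
    vector_to_parameters(candidate, net.parameters())  # load candidate weights
    fit = fitness(net)
    if fit >= best_fit:                                # keep improvements only
        best, best_fit = candidate, fit

# Leave the best solution found in the network
vector_to_parameters(best, net.parameters())
print(initial_fit, best_fit)
```

No gradients are computed anywhere in this loop; the only requirement is that a candidate weight vector can be scored.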

ptrblck · November 1, 2020, 8:56am

Unfortunately, I’m not experienced with RL and cannot be of much help here.

However, for the first question regarding the creation of one-hot encoded tensors:
I don’t know which shape tensor and one_hot should have in your example exactly, but you could try to use F.one_hot or one_hot.scatter to create a one-hot encoded tensor, which might be faster than a loop.
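
For reference, a minimal sketch of what F.one_hot produces, with made-up class indices. It encodes integer labels as indicator rows, which is a different operation from flattening a network's parameters into a single vector, so it may not be what the question above was after.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])            # made-up class indices
encoded = F.one_hot(labels, num_classes=3)  # one indicator row per label
print(encoded)
```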

witl · November 11, 2020, 3:23pm

Thank you for the feedback. I ended up using something like:

    def flatten_params(self):
        l = [torch.flatten(p) for p in self.parameters()]
        indices = []
        s = 0
        for p in l:
            size = p.shape[0]
            indices.append((s, s+size))
            s += size
        flat = torch.cat(l).view(-1, 1)
        return {"params": flat, "indices": indices}

    def recover_flattened(self, flat_params, indices):
        l = [flat_params[s:e] for (s, e) in indices]
        for i, p in enumerate(self.parameters()):
            l[i] = l[i].view(*p.shape)
            p.data = l[i]
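
As a possible alternative to hand-written helpers like the ones above, PyTorch ships torch.nn.utils.parameters_to_vector and torch.nn.utils.vector_to_parameters. The sketch below uses a stand-in network and splits weights from biases by parameter name, which assumes the default "weight"/"bias" naming of nn.Linear.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# Stand-in network (hypothetical; not the ANN class from the post)
net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))

# All parameters as one flat 1-D vector
flat = parameters_to_vector(net.parameters()).detach().clone()

# Separate vectors for weights and biases, selected by parameter name
weights = parameters_to_vector(
    p for n, p in net.named_parameters() if n.endswith("weight")
).detach().clone()
biases = parameters_to_vector(
    p for n, p in net.named_parameters() if n.endswith("bias")
).detach().clone()

# Write a new flat vector back into the network (zeros, for illustration)
new_flat = torch.zeros_like(flat)
vector_to_parameters(new_flat, net.parameters())

print(flat.shape, weights.shape, biases.shape)
```

The order of entries follows the order in which net.parameters() yields the tensors, so the same model must be used for flattening and for restoring.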

hopeful-coder · January 29, 2021, 6:49pm

Hi @witl

First day on the forum because of your post. This is something I am experienced with and interested in.

I wonder when it is recommended (or mandatory) to use metaheuristics instead of gradient descent. I believe it is related to the cases where you don’t have an input, output, and target. If you don’t have a target, you cannot calculate the loss, and then you cannot perform gradient descent. Is that right?

Good question, I will look more into this. Learning the theory along the journey!

Do you have any example of problems where the model can only be trained by the usage of metaheuristics?

Any test subject will work: GLMs, deep learning, image/video/text analysis, reinforcement learning.

To me, it looks like there is much more freedom in the metaheuristic case. Maybe that is because I am more used to formulating problems like that. Do you have any experience with that?

Yeah, metaheuristics are dope AF.