[resolved] Actor Critic with a large number of possible actions

I’m working on a hexapod (18 servos) that teaches itself how to walk. I’ve gotten pretty familiar with actor-critic implementations and how they work, but just about everything I’ve seen applies to situations with a small set of possible actions (e.g. left, right, forward). In my situation, there are 18 servos with at least 100 possible values each, so a single categorical output over all joint actions would be massive (100^18 = 10^36).
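For scale, the combinatorial blow-up above can be checked directly (the servo counts here are the rough figures from the post, not hardware specs):

```python
# 18 servos, each with ~100 discrete positions: a flat categorical
# action space would need one output unit per joint action.
n_servos = 18
positions_per_servo = 100
n_joint_actions = positions_per_servo ** n_servos
print(n_joint_actions == 10 ** 36)  # True
```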

My ideal output would be just one vector of 18 servo values. I know that the probability distribution over actions is crucial, but what if I were to remove it? What if I took the softmax out of the last layer and just used the 18 output values directly? Would this work? If not, what other solutions could I try? Should I move to a different algorithm?

Could you first define, say, 50 specific types of movements for the hexapod and then use those as the possible actions?

Someone recommended that to me earlier and I thought about it. The issue I have with doing that is that my goal for the project was for the robot to learn the servo values on its own. I thought it might end up doing better than a hand-written gait would. If I can’t find any other solutions, I may end up doing that, but ideally not.

So the robot will only end up making a few actions (turn left or right, move forward or backward). Ideally turning would be synchronized with the movements, but that might be a later problem. What if I were to create an output of four 18x1 vectors, one per movement? Would it end up learning that each one is a specific movement? Or is that too bold an assumption?

If I understand correctly, instead of learning the full 18-dimensional categorical distribution, you could learn an 18-dimensional Gaussian distribution by learning the mean and the diagonal of the covariance matrix instead. It seems like you would need to round the output of the Gaussian to the nearest valid discrete values but that should work.

@mjacar So instead of choosing an action from the distribution, I would find the covariance matrix of the distribution and use that?

Right now your model outputs a softmax that represents a categorical distribution. Instead of doing that, have your model output the mean and standard deviation of a Gaussian that you can then sample from to choose your action.
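A minimal sketch of such a policy head in PyTorch, assuming a hypothetical 24-dimensional state vector and a single hidden layer (both are placeholders, not from the original posts). Following the A3C paper's continuous-control setup, the standard deviation comes from a softplus so it stays positive:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianPolicyHead(nn.Module):
    """Outputs the mean and std of a diagonal Gaussian over 18 servo actions."""

    def __init__(self, state_dim=24, hidden_dim=128, n_servos=18):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mu = nn.Linear(hidden_dim, n_servos)     # one mean per servo
        self.sigma = nn.Linear(hidden_dim, n_servos)  # pre-activation std per servo

    def forward(self, state):
        h = self.body(state)
        mu = torch.tanh(self.mu(h))                   # keep means in [-1, 1]
        sigma = F.softplus(self.sigma(h)) + 1e-5      # strictly positive std
        return mu, sigma
```

The critic can share `self.body` and add its own value head; only the actor output changes from softmax to (mu, sigma).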


In the A3C paper, section 9 has a great description. This seems to work.

How do you calculate the action from this mu and sigma?

This was a long time ago and I’ve forgotten most of it, but I will say the best I could get working was based on GitHub - ikostrikov/pytorch-a3c: PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning". Hope that’s helpful.
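To make the mu/sigma question concrete: one common pattern (a sketch, not the exact method from the linked repo) is to sample from `torch.distributions.Normal`, keep the log-probability for the policy-gradient loss, and then map the sample to the valid servo range. The `low`/`high` limits of 0 to 99 are placeholders:

```python
import torch
from torch.distributions import Normal

def select_action(mu, sigma, low=0, high=99):
    """Sample a continuous action per servo, then round to discrete positions."""
    dist = Normal(mu, sigma)
    action = dist.sample()                    # one continuous value per servo
    log_prob = dist.log_prob(action).sum(-1)  # diagonal Gaussian: sum per-dim log-probs
    # squash to [-1, 1], rescale to the servo range, round to nearest position
    squashed = torch.tanh(action)
    servo = ((squashed + 1) / 2 * (high - low) + low).round()
    return servo, log_prob
```

Note that `log_prob` here is for the pre-squash Gaussian sample, which is the simplification many basic A3C implementations make; the rounding and squashing are treated as part of the environment rather than the policy.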