I’m working on a hexapod (18x servos) that teaches itself how to walk. I’ve gotten pretty familiar with actor-critic implementations and how they work, but just about everything I’ve seen applies to situations with a small set of discrete actions (i.e. left, right, forward). In my situation, there are 18 servos with at least 100 possible values each, so the joint action space is massive (100^18 = 10^36 combinations) and far too large for a softmax output.
My ideal output would be just one vector of 18 servo values. I know that the probability distribution over actions is crucial, but what if I were to remove it? What if I took the softmax off the last layer and just used the 18 output values directly? Would this work? If not, what other solutions could I try? Should I move to a different algorithm?
Someone recommended that to me earlier and I thought about it. The issue I have with doing that is that my goal for the project was for the robot to learn the servo values on its own; I thought it might end up doing better than a program written by hand would. If I can’t find any other solution I may end up doing that, but ideally not.
EDIT
So the robot will only end up making a few actions (turn left or right, move forward or backward). Ideally turning would be synchronous with the other movements, but that might be a later problem. What if I were to create an output of four 18x1 vectors, one per movement? Would it end up learning that each one is a specific movement? Or is that too bold an assumption?
If I understand correctly: instead of learning a categorical distribution over the full joint action space, you could learn an 18-dimensional Gaussian by learning the mean and the diagonal of the covariance matrix. It seems like you would need to round samples from the Gaussian to the nearest valid discrete servo values, but that should work.
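A minimal sketch of that sampling-and-rounding step, assuming (hypothetically) that each servo accepts integer positions 0–99; the names here are illustrative, not from any particular library:

```python
import numpy as np

# Hypothetical servo setup: 18 servos, each taking integer positions 0-99.
N_SERVOS = 18
SERVO_MIN, SERVO_MAX = 0, 99

def sample_action(mean, std, rng=np.random.default_rng()):
    """Draw one 18-dim action from a diagonal Gaussian, then clip and
    round it to the nearest valid discrete servo position."""
    raw = rng.normal(mean, std)  # independent draw per servo
    return np.clip(np.rint(raw), SERVO_MIN, SERVO_MAX).astype(int)

# Example: policy centred at mid-range with moderate exploration noise.
mean = np.full(N_SERVOS, 50.0)
std = np.full(N_SERVOS, 5.0)
action = sample_action(mean, std)  # one vector of 18 servo values
```

As the policy improves, the learned standard deviations typically shrink, so exploration narrows around the gait it has found.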
Right now your model outputs a softmax that represents a categorical distribution. Instead of doing that, have your model output the mean and standard deviation of a Gaussian that you can then sample from to choose your action.
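To make the swap concrete, here is a framework-free sketch (plain numpy, assumed shapes and names are mine) of the quantity that replaces the log-softmax in the actor loss: the log-density of the sampled action under the diagonal Gaussian the network outputs.

```python
import numpy as np

def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian. In the actor loss this term
    multiplies the advantage, in place of log softmax(chosen action)."""
    std = np.exp(log_std)
    return np.sum(
        -0.5 * np.log(2 * np.pi) - log_std
        - 0.5 * ((action - mean) / std) ** 2
    )

# The network's last layer would emit these two 18-dim vectors:
mean = np.zeros(18)
log_std = np.full(18, np.log(5.0))

# Sample an action, then score it under the current policy.
action = np.random.default_rng().normal(mean, np.exp(log_std))
lp = gaussian_log_prob(action, mean, log_std)
```

Outputting log-std rather than std is a common trick so the network's raw output can be any real number while the std stays positive.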