I'm working on a hexapod (18 servos) that teaches itself to walk. I've become fairly familiar with actor-critic implementations and how they work, but almost everything I've seen applies to problems with a small set of possible actions (e.g. left, right, forward). In my case there are 18 servos, each with at least 100 possible values, so a single softmax over every combination of servo values would need a massive network output (100^18 = 10^36 entries).
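To show where that number comes from, here is a quick back-of-the-envelope check. The constant names are just mine for illustration, and 100 is the lower bound on values per servo mentioned above.

```python
# Size of a joint discrete action space for the hexapod.
NUM_SERVOS = 18          # one output per servo
VALUES_PER_SERVO = 100   # assumed lower bound on positions per servo

# A single softmax over every combination of servo values:
joint_actions = VALUES_PER_SERVO ** NUM_SERVOS

# What I would like instead: one number per servo.
vector_outputs = NUM_SERVOS

print(f"joint softmax size: {joint_actions:.1e}")
print(f"vector output size: {vector_outputs}")
```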
My ideal output would be a single vector of 18 servo values. I know the probability distribution over actions is crucial, but what if I removed it? What if I took the softmax out of the last layer and just used the 18 output values directly? Would this work? If not, what other solutions could I try? Should I move to a different algorithm?
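To make the idea concrete, here is a rough sketch of what I mean by an 18-value output, in plain Python with no ML library. Everything in it is hypothetical: the single linear layer, the 4-dimensional state, the tanh squashing, and the fixed standard deviation are placeholders I made up, not a working walker. It also shows one way I could imagine keeping a probability distribution without a softmax, by treating the 18 outputs as the means of independent Gaussians, which is my assumption about how this could be done rather than something I have tested on the robot.

```python
import math
import random

NUM_SERVOS = 18
STATE_DIM = 4  # hypothetical state size, chosen only for this sketch


def make_actor(state_dim, num_servos, rng):
    """Random weights for a single linear layer (stand-in for a real network)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
            for _ in range(num_servos)]


def actor_means(weights, state):
    """One output per servo; tanh keeps each value in [-1, 1]."""
    return [math.tanh(sum(w * s for w, s in zip(row, state)))
            for row in weights]


def sample_action(means, std, rng):
    """Treat each output as the mean of an independent Gaussian and sample."""
    return [rng.gauss(m, std) for m in means]


def log_prob(action, means, std):
    """Log-probability of the whole action: sum of per-servo Gaussian log-densities."""
    return sum(
        -0.5 * ((a - m) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(action, means)
    )


rng = random.Random(0)
weights = make_actor(STATE_DIM, NUM_SERVOS, rng)
state = [0.1, -0.2, 0.3, 0.0]   # made-up observation

means = actor_means(weights, state)          # the 18 raw outputs
action = sample_action(means, 0.1, rng)      # noisy action for exploration
lp = log_prob(action, means, 0.1)            # what a policy-gradient loss would use

print(len(action), lp)
```

The sampled values would still need to be scaled and clipped to each servo's real range before being sent to the hardware.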