I am trying to train an RL agent for a production process, where the agent has to decide between 3 actions:
0 - continue production
1 - harvest
2 - exchange the filter
I use a linear (fully connected) network to process the state signal, consisting of:
x - Product Mass - from 0 to 4
y - side product (currently not important) from 1 to 3
z - Filter level - from 1 to 0.2
d - Time after production start - bounded to 9 days
t - overall time - max 183 days
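Since all five components have known bounds but very different scales (0 to 4 versus 0 to 183), a common preprocessing step is min-max normalization before feeding the vector into the linear network. A minimal sketch, assuming exactly the bounds listed above (the `LOW`/`HIGH` names and treating 0.2 as the lower bound for z are my choices, not from the original setup):

```python
import numpy as np

# Per-component bounds (x, y, z, d, t) taken from the description above.
LOW  = np.array([0.0, 1.0, 0.2, 0.0, 0.0])
HIGH = np.array([4.0, 3.0, 1.0, 9.0, 183.0])

def normalize(state) -> np.ndarray:
    """Scale each state component into [0, 1] for the MLP input."""
    s = np.asarray(state, dtype=np.float32)
    return (s - LOW) / (HIGH - LOW)
```

This keeps every input feature in [0, 1], so no single dimension (like t) dominates the first layer's activations.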
So basically you have to continue production for some days until you reach the perfect harvest point, where you get a reward for the accumulated product mass.
The starting state is (0, 1, 1, 0, 0). After each action there is a state transition.
When you harvest, you return to the initial state (0, 1, z, 0, t) and start a new production cycle. Harvesting yields a reward for the product, and it uses the filter, which causes the filter level to decay.
When you continue production, the product grows for one day.
If the filter level falls below the 0.2 threshold, you should exchange the filter, which resets z to 1.
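The dynamics above could be sketched as a toy environment like the following. The growth rate, filter-decay amount, and reward magnitude here are placeholders I made up; only the bounds, the reset rules, and the action semantics come from the description:

```python
import numpy as np

CONTINUE, HARVEST, EXCHANGE_FILTER = 0, 1, 2

class ProductionEnv:
    """Minimal sketch of the production process described above.

    Assumptions (not from the original post): linear growth of 0.5
    mass per day, filter decay of 0.1 per harvest, reward equal to
    the harvested mass.
    """

    def __init__(self, horizon=183, max_cycle_days=9, z_threshold=0.2):
        self.horizon = horizon
        self.max_cycle_days = max_cycle_days
        self.z_threshold = z_threshold
        self.reset()

    def reset(self):
        # Starting state (x, y, z, d, t) = (0, 1, 1, 0, 0).
        self.x, self.y, self.z, self.d, self.t = 0.0, 1.0, 1.0, 0, 0
        return self._obs()

    def _obs(self):
        return np.array([self.x, self.y, self.z, self.d, self.t],
                        dtype=np.float32)

    def step(self, action):
        reward = 0.0
        if action == CONTINUE and self.d < self.max_cycle_days:
            self.x = min(self.x + 0.5, 4.0)   # placeholder growth rate
            self.d += 1
        elif action == HARVEST:
            reward = self.x                   # reward for accumulated mass
            self.z = max(self.z - 0.1, 0.0)   # placeholder filter decay
            self.x, self.d = 0.0, 0           # back to cycle start
        elif action == EXCHANGE_FILTER:
            self.z = 1.0                      # fresh filter resets z to 1
        self.t += 1
        done = self.t >= self.horizon
        return self._obs(), reward, done
```

Having the real dynamics written out like this also makes it easy to compute a hand-crafted baseline policy to compare the agents against.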
Should I just use the vector input, or would a convolution make more sense? Why?
I am trying to learn the optimal production policy over the time horizon of 183 days.
I tried PPO and Rainbow reinforcement-learning implementations, using a linear input layer sized to the state vector.
While PPO learns to harvest at the right time, it does not manage to learn to exchange the filters efficiently. The Rainbow implementation settles into a local optimum after some time, which is way below the baseline, and I am not sure it will escape it. The total reward should be in the range of 4.0e9.
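One thing worth checking, independent of the algorithm: returns on the order of 4.0e9 make the value-regression targets enormous, which can destabilize both PPO's critic and Rainbow's Q-network (Rainbow's distributional head in particular assumes returns inside a fixed support). A running-statistics reward scaler is a standard trick for this; the class below is my sketch, not part of your setup:

```python
class RewardScaler:
    """Normalize rewards by a running standard deviation (Welford's
    online algorithm), so value targets stay roughly O(1) even when
    raw rewards are on the order of 1e9."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations
        self.eps = eps

    def update(self, r: float) -> None:
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)

    def scale(self, r: float) -> float:
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return r / (std + self.eps)
```

Call `update` on every observed reward during rollout collection and train on `scale(r)` instead of the raw reward; the greedy/argmax policy is unchanged by the rescaling.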
Any suggestions on how to tackle this problem? I would appreciate any help.