How would I build a deep Q network where each observation is a 2D matrix input?

Hi how’s it going?

I’m familiar with supervised learning in PyTorch, where each observation is a row and the size of the input layer is just the number of columns in the data.

However, I’m trying to build a network that takes in a “state” and returns a choice of “action”. The state is how close a bunch of drivers are to the pickup restaurant and to the delivery address, and the action returned should be the best driver for the job given their current location. I will take care of the reward function logic and whatnot, but I’m just trying to figure out the model architecture in PyTorch.

Here is an example of what would be one “state” that I would like as an input to the network:

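[Screenshot: table with one row per driver (four drivers) and two feature columns, distance to the restaurant and distance to the delivery address]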

So every observation would be a 4x2 matrix (excluding name as a feature), and I would like the output layer to return the row index of the driver who is the best fit for the job. Obviously at first it’ll have no clue, but as I build the reward function it will learn that selecting the driver closest to the restaurant and closest to the delivery address leads to maximized reward.
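To make the shape of the problem concrete, here is a minimal sketch of one possible architecture, not a definitive answer. It assumes the simplest approach: flatten the 4x2 state into 8 inputs and output one Q-value per driver, so the chosen action is the argmax over the 4 outputs. The class name `DriverQNetwork`, the helper `select_action`, the hidden size of 64, the epsilon value, and the sample distances are all hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn

N_DRIVERS = 4   # rows of the state matrix (one per driver)
N_FEATURES = 2  # columns: distance to restaurant, distance to delivery address


class DriverQNetwork(nn.Module):
    """Hypothetical Q-network: maps a (4, 2) state to one Q-value per driver."""

    def __init__(self, n_drivers=N_DRIVERS, n_features=N_FEATURES, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                               # (batch, 4, 2) -> (batch, 8)
            nn.Linear(n_drivers * n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_drivers),               # one Q-value per row index
        )

    def forward(self, state):
        # state: float tensor of shape (batch, n_drivers, n_features)
        return self.net(state)


def select_action(q_net, state, epsilon=0.1):
    """Epsilon-greedy choice of a driver row index for a single (4, 2) state."""
    if torch.rand(1).item() < epsilon:
        # Explore: pick a random driver index
        return int(torch.randint(N_DRIVERS, (1,)).item())
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))            # add batch dim -> (1, 4)
    # Exploit: pick the driver with the highest estimated Q-value
    return int(q_values.argmax(dim=1).item())


# Made-up distances, for illustration only
state = torch.tensor([[1.2, 3.4],
                      [0.5, 2.0],
                      [4.1, 0.9],
                      [2.2, 2.2]])

q_net = DriverQNetwork()
action = select_action(q_net, state, epsilon=0.0)       # row index in 0..3
```

With this layout the network never outputs the row index directly. It outputs four Q-values, and the index of the largest one is the selected driver. One caveat of flattening is that the network treats row order as meaningful, so the same drivers listed in a different order look like a different state.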

Any help would be greatly appreciated!