Help with model architecture for a racing game

I’m working on a model for a racing game using PyTorch. The model takes a frame from the game as input and produces a controller state as output. The dataset consists of game frames and the corresponding controller states (button states + joystick state).
Until now I’ve been using a CNN connected to linear layers to predict the output. The problem is that this model cannot establish a connection between the button-state output and the joystick-state output, so it can’t learn how to drift or aim items (when the player drifts using B, the joystick output needs to be updated accordingly).
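For reference, the current setup looks roughly like the sketch below. It is only illustrative: the class name `BaselineDriver`, the layer sizes, the frame resolution, the number of buttons, and the `tanh` on the joystick axes are placeholder assumptions, not my exact values. It shows that the button and joystick outputs are produced side by side from the same features, so the joystick values are never conditioned on the button prediction.

```python
import torch
import torch.nn as nn


class BaselineDriver(nn.Module):
    """CNN encoder followed by linear layers, predicting one flat output vector.

    All sizes here are placeholders chosen for illustration.
    """

    def __init__(self, num_buttons: int = 4, num_axes: int = 2):
        super().__init__()
        self.num_buttons = num_buttons
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1) for any frame size
            nn.Flatten(),             # -> (batch, 32)
        )
        self.head = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, num_buttons + num_axes),
        )

    def forward(self, frames: torch.Tensor):
        out = self.head(self.encoder(frames))
        # Both slices come from the same linear layer, computed in parallel:
        # the joystick slice never sees which buttons were predicted.
        button_logits = out[:, :self.num_buttons]         # one logit per button
        joystick = torch.tanh(out[:, self.num_buttons:])  # axes squashed to [-1, 1]
        return button_logits, joystick


model = BaselineDriver()
frames = torch.randn(8, 3, 96, 128)  # dummy batch of RGB frames
button_logits, joystick = model(frames)
```

With this layout I would train the buttons with something like `nn.BCEWithLogitsLoss` and the joystick with a regression loss, but each part of the output is still predicted independently given the frame.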
I thought about the following architecture:

I’m new to deep learning and would like to know whether something like that makes sense, because I haven’t seen anything like it before. If it does not make sense, can you suggest another model architecture?
Thanks in advance.