Transfer learning in reinforcement learning environment for different observation and action spaces

Hi Pytorch community!

I am currently tasked with carrying out transfer learning in a reinforcement learning setting across different observation and action spaces. In other words, I would like to transfer the weights from one model (PPO) to another model whose input and output layers differ from the original model's (a different number of nodes, with discrete observation and action spaces).

I am using the Stable-Baselines3 library, which is implemented in PyTorch (that's why I am here :slight_smile:). I asked a similar question on their repository regarding this issue: [Question] Does SB3 support transfer learning for a different environment with different observation and action spaces?

I was advised either to carry out model surgery using the state_dict, or to use imitation learning/fill the replay buffer with the pre-trained agent, but I have little to no experience with either. I am unsure how imitation learning/filling the replay buffer with the pre-trained agent would work (since the agent still has the same observation and action spaces), so the first option might be the better approach for me.

Question
Could someone advise/show me how I can carry out this "model surgery using the state_dict" in Stable-Baselines3? The input and output layers are the main points of modification.

If your input and output layers are not the bulk of the model, you could just replace them with new layers of the appropriate input and output sizes, and then freeze the intermediate layers during training.

Here is a tutorial:
https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
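
For the simpler case, here is a minimal plain-PyTorch sketch (not SB3-specific; the layer sizes and indices below are just placeholders) of swapping the first and last Linear layers for new sizes and freezing everything in between:

import torch.nn as nn

# Pretrained actor with the old observation/action sizes (placeholder architecture)
old_actor = nn.Sequential(
    nn.Linear(100, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 10), nn.Softmax(dim=-1),
)

new_state_dim, new_action_dim = 200, 20  # placeholder sizes for the new environment

# Swap the input and output layers for ones matching the new spaces
old_actor[0] = nn.Linear(new_state_dim, 64)
old_actor[4] = nn.Linear(64, new_action_dim)

# Freeze everything except the freshly created layers
for name, param in old_actor.named_parameters():
    if not (name.startswith("0.") or name.startswith("4.")):
        param.requires_grad = False

# Only the new layers' parameters get passed to the optimizer
trainable = [p for p in old_actor.parameters() if p.requires_grad]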

However, if you want to retain the input and output layers but expand/prune their sizes or the sizes of the intermediate layers (this is what I would refer to as model surgery), that can be a bit more complicated both in code and in mathematics, but doable. It's not easily explained in a single post, though.
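
As a bare-bones starting point for the state_dict route in Stable-Baselines3, though, you could copy over only the shape-compatible weights and let the mismatched input/output layers keep their fresh initialisation. This is only a sketch, assuming the default MlpPolicy on both models; env_old and env_new are placeholders for your old and new environments:

from stable_baselines3 import PPO

old_model = PPO("MlpPolicy", env_old)   # pretrained agent (placeholder env)
new_model = PPO("MlpPolicy", env_new)   # fresh agent for the new spaces (placeholder env)

old_sd = old_model.policy.state_dict()
new_sd = new_model.policy.state_dict()

# Keep only the weights whose names and shapes match the new policy;
# the mismatched input/output layers stay freshly initialised.
transferable = {
    k: v for k, v in old_sd.items()
    if k in new_sd and v.shape == new_sd[k].shape
}
new_sd.update(transferable)
new_model.policy.load_state_dict(new_sd)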


Hi @J_Johnson thank you for your reply!

If your input and output layers are not the bulk of the model, you could just replace them with new layers of the appropriate input and output sizes, and then freeze the intermediate layers during training.

For clarity, the input and output layers are just simple discrete layers, but together they have far more nodes than the intermediate layers. Proximal Policy Optimisation (PPO) usually has 2 hidden layers with 64 nodes each, but in my application the input and output layers can have over 10,000 nodes combined, due to the combinations of actions that can be taken.

Let me have a look at the tutorial you suggested thanks!

You may need to consider using a larger hidden_size for your intermediate layers if you’re not able to simplify the input/output size.

But it sounds like you're separating the inputs/outputs by combinations of actions. Usually, each input/output would be designated for a given action, e.g. "Left", "Right", "Up", "Down", "Jump", etc. would each have their own probability output.
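
For example, if the combined action count really comes from a few independent choices, a MultiDiscrete action space keeps one output per choice rather than one per combination. A quick sketch, assuming gymnasium and purely illustrative numbers:

from gymnasium import spaces

# Three independent sub-actions with 11, 17, and 59 options each would need
# 11 * 17 * 59 = 11,033 outputs if flattened into a single Discrete space,
# but only 11 + 17 + 59 = 87 outputs as a MultiDiscrete space.
action_space = spaces.MultiDiscrete([11, 17, 59])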

Hey @J_Johnson, I was actually trying to stick to the originally proposed PPO architecture (2 layers of 64 nodes each). That being said, I should look into different architectures for PPO.

You mentioned:

If your input and output layers are not the bulk of the model, …

Could I ask what happens if it is the bulk of the model?

Perhaps I wasn’t clear on your prior post. What are the in_features of the first layer and out_features of the last layer?

https://pytorch.org/docs/stable/generated/torch.nn.Linear.html

I apologise for that :sweat_smile:. I am new to the field of Reinforcement Learning, so let me put up some code for clarity. As Stable-Baselines3 is a complex package, I will share some code from when I previously tried to implement PPO in PyTorch alone.

Here is a code block in PyTorch:

import torch.nn as nn

# Current dimensions; these change with different environments, and I am still
# figuring out how to simplify them so that they stay consistent and constant.
state_dim = 8503
action_dim = 11027

# Actor network (discrete, i.e. non-continuous, action space)
actor = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, action_dim),
    nn.Softmax(dim=-1),
)

# Critic network
critic = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 1),
)

Hope this helps!

The traditional 64-unit hidden size with 2 hidden layers was based on a very limited number of moves per time step and possible states.

At the other extreme, AlphaZero was a chess-playing program developed by Google DeepMind that had 40 convolutional layers with 256 filters per layer. And even in that case, the input state at each timestep couldn't involve more than 64 unique values (i.e. one value for each square). It took an estimated $25 million to train on the best TPUs at the time.

So if your state size is 8k, that should give you some notion of what kind of model might be commensurate to the task.
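
If it helps, in SB3 you can widen the hidden layers without touching the internals via policy_kwargs. The sizes below are only illustrative, env is a placeholder for your environment, and the exact net_arch format can vary slightly between SB3 versions:

from stable_baselines3 import PPO

# Wider MLP for both the policy and value networks; sizes are illustrative.
model = PPO("MlpPolicy", env, policy_kwargs=dict(net_arch=[512, 512]))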


Thank you for your insights on this! That’s a pretty huge architecture that they used.

Simplifying the inputs and outputs of the network seems to be a necessary step on my side :slight_smile: