Concatenating observations that include image, pose and sensor readings

edowson · March 28, 2019, 7:01am

@ptrblck What would the correct way be to concatenate observations (image (84x84x1), pose(x,y,z,r,p,y), sonar(range)), first using numpy, and then converting it to a torch tensor?

I have to do the processing in numpy first, so that no PyTorch dependencies are added to the OpenAI Gym get_obs() method, and then convert the observations to a Tensor once I get them from the Gym environment.

Omegastick · March 28, 2019, 7:24am

You probably shouldn’t concatenate them directly, but put each one through an appropriate feature extractor (a CNN for the image, and an MLP for the others) before concatenating the outputs of the feature extractors.

edowson · March 28, 2019, 7:36am

@Omegastick The example that I’m working on is a reinforcement learning example, using DQN. The observations from the environment are the visual camera input and the drone’s sensor reading. Since this is an end-to-end learning example, the inputs are the observations, and the outputs are the actions, with rewards guiding the agent’s learning process.

It is true that, for another example, I would use a CNN to do some sort of feature extraction, such as locating a 3D waypoint marker. I could then concatenate the waypoint marker's location to the list of observations I mentioned earlier. But for now, I am working on a simpler version of the problem: the next step after the standard DQN reinforcement learning example that uses only visual inputs as observations.

At the moment, I am trying to do something like this:

# concatenate observations
image = camera.processed_image
print("image.shape: {}".format(image.shape))

x, y, z, roll, pitch, yaw = 1.1, 1.2, 1.3, 2.1, 2.2, 2.3
pose = [x, y, z, roll, pitch, yaw]
sensor_range = [3.1]

obs = np.concatenate((image, pose, sensor_range), axis=0)

But I am having a problem with arranging the array dimensions correctly:

image.shape: (180, 320)
    obs = np.concatenate((image, pose, sensor_range), axis=0)
ValueError: all the input arrays must have same number of dimensions

Omegastick · March 28, 2019, 7:49am

Yes, I’ve used the exact same technique in reinforcement learning (using PPO, rather than DQN, but the concepts still apply).

Concatenating a 2D image and a 1D vector isn’t going to work unless you flatten the image. But you don’t want to do that, because then you lose all the benefits of CNNs.

Here’s some code to show what I mean (I haven’t run it, so no guarantees it won’t throw errors):

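import numpy as np
import torch
import torch.nn as nn

image_extractor = nn.Sequential(
    nn.Conv2d(1, 32, 8, stride=4),
    nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2),
    nn.ReLU(),
    nn.Conv2d(64, 32, 3, stride=1),
    nn.ReLU(),
    nn.Flatten(),  # requires PyTorch 1.2 or later
    nn.Linear(32 * 7 * 7, 64),  # Assuming 84x84x1 image (7x7 feature map after the convs)
    nn.ReLU()
)

linear_extractor = nn.Sequential(
    nn.Linear(7, 64),  # 6 pose values + 1 sonar range
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU()
)

q_network = nn.Sequential(
    nn.Linear(128, 64),  # 64 image features + 64 linear features
    nn.ReLU(),
    nn.Linear(64, 4)  # Assuming 4 actions
)

# Add batch and channel dimensions: (H, W) -> (1, 1, H, W)
image_obs = torch.as_tensor(camera.processed_image, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
# Join pose and sonar into one vector and add a batch dimension: (7,) -> (1, 7)
linear_obs = torch.from_numpy(np.concatenate((pose, sensor_range), axis=0)).float().unsqueeze(0)

image_features = image_extractor(image_obs)
linear_features = linear_extractor(linear_obs)
# Concatenate along the feature dimension, not the batch dimension
features = torch.cat([image_features, linear_features], dim=1)

q_values = q_network(features)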
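
The same idea can be packaged as a single module, so the DQN code only sees one network. This is a minimal sketch, not a tested implementation: the names TwoBranchQNetwork and obs_to_tensors are hypothetical, and the 84x84 image size, 7 vector inputs and 4 actions are the assumptions used elsewhere in this thread. The environment side stays NumPy-only, and the conversion to tensors happens outside it.

```python
import numpy as np
import torch
import torch.nn as nn


class TwoBranchQNetwork(nn.Module):
    """Hypothetical wrapper: CNN branch for the image, MLP branch for pose + sonar."""

    def __init__(self, vector_dim=7, num_actions=4):
        super().__init__()
        self.image_extractor = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 32, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 64), nn.ReLU(),  # 7x7 map assumes an 84x84 input
        )
        self.vector_extractor = nn.Sequential(
            nn.Linear(vector_dim, 64), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 + 64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, image, vector):
        # image: (batch, 1, 84, 84), vector: (batch, vector_dim)
        features = torch.cat(
            [self.image_extractor(image), self.vector_extractor(vector)], dim=1
        )
        return self.head(features)


def obs_to_tensors(image, pose, sensor_range):
    """Convert NumPy observations into float32 tensors with a batch size of 1."""
    image = np.asarray(image, dtype=np.float32)
    height, width = image.shape[:2]
    image_t = torch.from_numpy(image).reshape(1, 1, height, width)
    vector = np.concatenate((pose, sensor_range)).astype(np.float32)
    vector_t = torch.from_numpy(vector).unsqueeze(0)
    return image_t, vector_t


net = TwoBranchQNetwork()
image_t, vector_t = obs_to_tensors(
    np.zeros((84, 84)), [1.1, 1.2, 1.3, 2.1, 2.2, 2.3], [3.1]
)
q_values = net(image_t, vector_t)  # one Q-value per action
```

Keeping the conversion in a helper like obs_to_tensors means get_obs() can return plain NumPy arrays and lists, with no PyTorch import inside the environment.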

edowson · March 28, 2019, 9:06am

@Omegastick Do you have a link to a paper that does what you have suggested?

It shouldn’t be the case that you’re using a CNN for the image and an MLP for the pose and sensor values, just to make the inputs compatible to an RL algorithm’s function approximator.

Omegastick · March 28, 2019, 10:10am

@edowson I think I miscommunicated. The CNN and MLP aren’t making the inputs compatible for the function approximator, they are part of the function approximator. Putting each separate input through an embedding is pretty standard practice (see OpenAI Five).

If you have a 2-dimensional image, and you don’t want to flatten it (which is very understandable, since computation quickly becomes infeasible if you do), then your only remaining option (outside of a few very experimental techniques) is to use a 2D CNN. However, 2D CNNs simply aren’t compatible with 1D inputs, so the inputs need to be converted to compatible shapes (usually by using a CNN to project the 2D image onto a 1D embedding) before they can be concatenated and used together.
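
The size of that 1D embedding depends on the input image size, which determines the input width of the first linear layer after the convolutions. As a rough sketch, the helper below (conv_out_size and feature_map_size are hypothetical names) applies the standard output-size formula for a convolution without dilation, floor((n + 2p - k) / s) + 1, to the three convolutions in the code posted earlier in the thread (kernel 8 with stride 4, kernel 4 with stride 2, kernel 3 with stride 1).

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output length along one axis of a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_size(size, layers):
    """Apply a list of (kernel, stride) convolutions to one spatial dimension."""
    for kernel, stride in layers:
        size = conv_out_size(size, kernel, stride)
    return size


layers = [(8, 4), (4, 2), (3, 1)]

side_84 = feature_map_size(84, layers)  # 7, so the flattened size is 32 * 7 * 7
side_80 = feature_map_size(80, layers)  # 6, so the flattened size is 32 * 6 * 6

# For the 180x320 image reported earlier in the thread:
height = feature_map_size(180, layers)  # 19
width = feature_map_size(320, layers)   # 36
```

So a linear layer with 32 * 7 * 7 inputs matches an 84x84 image. Other image sizes need a different input width, or the image has to be resized first.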

edowson · March 28, 2019, 10:18am

@Omegastick Thanks, the OpenAI Five architecture diagram is just the sort of thing that I need for my experiments, albeit with fewer observations. Is there a reference PyTorch implementation of OpenAI Five that you know of, so that I can learn from it?

edowson · March 28, 2019, 10:35am

On a related note, this post by @ptrblck talks about concatenating a layer output with additional input data. I need to think and read a bit more about what I need to do, though.

ptrblck · March 28, 2019, 12:08pm

While this approach might generally work, I had some trouble concatenating the outputs of a pre-trained CNN and a fully-connected model in the past, maybe due to different output value statistics. It seemed the whole model just ignored the FC part and used only the CNN outputs.
After carefully rescaling the outputs it worked, so you might want to consider this as well.
Let me know how it works out.
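
The post does not say how the outputs were rescaled. One possible way, shown here purely as an assumption, is to normalize each branch with its own nn.LayerNorm before concatenating, so that both halves of the feature vector have comparable statistics. The random tensors below are stand-ins for real CNN and FC outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for branch outputs with very different scales (batch of 8, 64 features each)
image_features = 50.0 * torch.randn(8, 64)   # e.g. large activations from a CNN
vector_features = 0.1 * torch.randn(8, 64)   # e.g. small activations from an MLP

# One normalization layer per branch (an assumed choice, not the method from the post)
norm_image = nn.LayerNorm(64)
norm_vector = nn.LayerNorm(64)

features = torch.cat(
    [norm_image(image_features), norm_vector(vector_features)], dim=1
)

image_half_std = features[:, :64].std().item()
vector_half_std = features[:, 64:].std().item()
```

After normalization, both halves have a standard deviation close to 1, so neither branch dominates the concatenated input purely because of its scale.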

edowson · March 28, 2019, 5:30pm

@ptrblck If we set aside running the observations through a NN for the moment, what sort of data structure should I use to treat the image, pose and sonar sensor reading as a single sample, using numpy?

The word ‘Tensor’ comes to mind, since at one data point, you have multiple quantities being represented: image (84x84x1), pose (6x1) + sensor reading (1x1).

Only the last dimension of these three quantities matches.

How can I create a ‘Tensor’ using numpy, as a first step? This is so that I treat observations as tensors within the OpenAI Gym environment.

ptrblck · March 28, 2019, 7:12pm

As @Omegastick explained, you would need to flatten all the data so that you can concatenate the inputs in the feature dimension.
The structure of the image tensor will get lost if you are using linear layers. On the other hand, while nn.Conv1d might capture some of the image structure, the “temporal” dimension between the different data samples might cause other issues.
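
A minimal NumPy-only sketch of that flattening step, assuming an 84x84 image, 6 pose values and 1 sonar reading. The helper names pack_obs and unpack_obs are hypothetical. The unpacking helper is an addition: because the layout of the flat vector is fixed, the image can be sliced out and reshaped again on the PyTorch side, for example to feed a CNN.

```python
import numpy as np

IMAGE_SHAPE = (84, 84)  # assumed image size
POSE_DIM = 6            # x, y, z, roll, pitch, yaw


def pack_obs(image, pose, sensor_range):
    """Flatten everything into one 1D float32 vector: [image..., pose..., range]."""
    return np.concatenate([
        np.asarray(image, dtype=np.float32).ravel(),
        np.asarray(pose, dtype=np.float32),
        np.asarray(sensor_range, dtype=np.float32),
    ])


def unpack_obs(obs):
    """Split the flat vector back into image, pose and sensor range."""
    n_pixels = IMAGE_SHAPE[0] * IMAGE_SHAPE[1]
    image = obs[:n_pixels].reshape(IMAGE_SHAPE)
    pose = obs[n_pixels:n_pixels + POSE_DIM]
    sensor_range = obs[n_pixels + POSE_DIM:]
    return image, pose, sensor_range


image = np.random.rand(*IMAGE_SHAPE)
obs = pack_obs(image, [1.1, 1.2, 1.3, 2.1, 2.2, 2.3], [3.1])
image_back, pose_back, range_back = unpack_obs(obs)
```

The packed array has 84 * 84 + 7 = 7063 elements and can be turned into a tensor outside the environment with torch.from_numpy(obs).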