I have a dataset shaped B x T x C, where B is the batch dimension, T is the number of timesteps (uneven across datapoints), and C is the number of characters (uneven across timesteps). I would like to use EmbeddingBag to get a mean embedding of each timestep's characters.
For example, let's say I have three datapoints in my batch:
- [[], [0, 4], [1, 1], [5]]
- This has 4 timesteps, with 0, 2, 2, and 1 characters per timestep, respectively.
- [[1], [2, 3]]
- This has 2 timesteps, with 1 and 2 characters per timestep, respectively.
- [[2, 4, 5], []]
- This has 2 timesteps, with 3 and 0 characters per timestep, respectively.
So let’s init that:
all_tensors = [[[], [0, 4], [1, 1], [5]], [[1], [2,3]], [[2, 4, 5], []]]
And I know this is what I want my embedder to look like:
embedder = torch.nn.EmbeddingBag(num_embeddings = 6, embedding_dim = 2, mode = 'mean')
And…this is where I am stuck. Is there a good tutorial for how this problem should be approached?
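One shape I've been experimenting with is EmbeddingBag's native interface, which takes a flat 1D tensor of indices plus an offsets tensor marking where each bag starts (a sketch on a single datapoint; per the docs, empty bags come back as zero vectors):

```python
import torch

embedder = torch.nn.EmbeddingBag(num_embeddings=6, embedding_dim=2, mode='mean')

# One datapoint: [[], [0, 4], [1, 1], [5]]
timesteps = [[], [0, 4], [1, 1], [5]]

# Flatten all characters into one 1D index tensor: [0, 4, 1, 1, 5]
flat = torch.tensor([c for ts in timesteps for c in ts], dtype=torch.long)

# Offsets = start position of each bag in `flat`; computed as the
# cumulative sum of bag lengths, shifted right: [0, 0, 2, 4]
lengths = torch.tensor([0] + [len(ts) for ts in timesteps], dtype=torch.long)
offsets = lengths.cumsum(0)[:-1]

out = embedder(flat, offsets)  # shape (4, 2): one mean embedding per timestep
```

This avoids padding entirely for a single datapoint, though it still leaves the question of batching.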
Edit: Think I got a bit closer…
import numpy as np
import torch
import torch.nn.utils.rnn as rnn_utils

def pad_array(base_input):
    # Zero-pad every timestep's character list to length 5, then turn each
    # datapoint into a (num_timesteps, 5) LongTensor.
    for index1, datapoint in enumerate(base_input):
        base_input[index1] = torch.LongTensor(
            np.asarray([np.pad(a, (0, 5 - len(a)), 'constant', constant_values=0)
                        for a in datapoint], dtype=np.int64))
    return base_input
all_tensors = [[[], [0, 4], [1, 1], [5]], [[1], [2,3]], [[2, 4, 5], []]]
paddedchar_tensors = pad_array(all_tensors)
paddedchar_tensors = rnn_utils.pad_sequence(paddedchar_tensors, batch_first=True)
This gives me paddedchar_tensors as:
tensor([[[0, 0, 0, 0, 0],
         [0, 4, 0, 0, 0],
         [1, 1, 0, 0, 0],
         [5, 0, 0, 0, 0]],

        [[1, 0, 0, 0, 0],
         [2, 3, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]],

        [[2, 4, 5, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]])
But once again, I am stuck: running this through the EmbeddingBag gives me this error: ValueError: input has to be 1D or 2D Tensor, but got Tensor of dimension 3
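A workaround I'm considering, since EmbeddingBag only accepts 1D or 2D input: collapse the batch and time dimensions into one before embedding, then reshape back (a sketch; note that with mode='mean' the 0-padding gets averaged in as if it were real character 0, so the flat-indices + offsets interface, or a reserved padding index via EmbeddingBag's padding_idx argument, would be needed for exact means):

```python
import torch

embedder = torch.nn.EmbeddingBag(num_embeddings=6, embedding_dim=2, mode='mean')

# The padded batch from above, shape (B=3, T=4, L=5)
padded = torch.tensor([
    [[0, 0, 0, 0, 0], [0, 4, 0, 0, 0], [1, 1, 0, 0, 0], [5, 0, 0, 0, 0]],
    [[1, 0, 0, 0, 0], [2, 3, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
    [[2, 4, 5, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
])

B, T, L = padded.shape
# Fold B and T together so the input is 2D (one bag per row),
# embed, then unfold back to (B, T, embedding_dim).
out = embedder(padded.view(B * T, L)).view(B, T, -1)  # shape (3, 4, 2)
```

Mechanically this gets past the dimension error; the caveat above about padding polluting the mean still applies.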