Encoding names for ML

Hi!
This may be a stupid question, but I am manually encoding string data (such as city names, airports, and aircraft classes) that I later use for ML.

Does it matter what number I choose for any given ‘city’, or no?

The model should calculate a coefficient from city to city.
While I have dates, most of my input data isn’t numerical, so I encoded it.

Now, running this model, I am coming back to the question: should I do some sort of sorting before encoding?
Major cities such as Dubai, Moscow, and London are ‘responsible’ for higher coefficients. Does that mean their encoding number must be higher than that of, let’s say, Wellington or Dublin?

I don’t know what “coefficients” refers to, but whether the encoding and its ordering matter depends on the use case and how the loss is calculated.
E.g. if you are working on a multi-class classification use case, the order wouldn’t matter, as each class corresponds to an integer label without any relation to the others.
E.g. while “Dubai - 0” and “Moscow - 1” would be close to each other, the actual “distance” between these labels doesn’t matter in e.g. nn.CrossEntropyLoss (only how wrong/right your model predictions are for the current class index).
However, if you are using e.g. nn.MSELoss and are encoding the cities also with labels, then of course the distance would matter, but I don’t know if this would fit your use case.
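A minimal sketch of the first point (the city names and label numbers here are just illustrative): nn.CrossEntropyLoss only looks at the probability assigned to the target index, so scoring the same logits against two different wrong targets with equal logits yields the same loss, regardless of how “far apart” the label numbers are.

```python
import torch
import torch.nn as nn

# Illustrative labels: 0 = Dubai, 1 = Moscow, 2 = Dublin
criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, 0.5]])  # model strongly favors class 0

loss_vs_1 = criterion(logits, torch.tensor([1]))  # target "Moscow"
loss_vs_2 = criterion(logits, torch.tensor([2]))  # target "Dublin"

# Both targets get the same logit (0.5), so both losses are identical:
# the numeric distance between label 1 and label 2 plays no role.
print(loss_vs_1.item(), loss_vs_2.item())
```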


Thank you! This is very informative and helpful.

and are encoding the cities also with labels

What does that mean?

In my calculation there are almost 1,000 cities. Is that too many for multi-class?

This sentence refers to the usage of nn.MSELoss, which would show a larger loss for a prediction of 2 - Dublin against a target of 0 - Dubai than for an equally wrong prediction of 1 - Moscow against 0 - Dubai. However, I would not recommend using nn.MSELoss for a classification use case, and I also don’t know what you are working on.
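A quick sketch of that comparison (again with illustrative label numbers): with nn.MSELoss on integer labels, the loss grows with the numeric distance between prediction and target, so the otherwise arbitrary city numbering would suddenly carry meaning.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
target = torch.tensor([0.0])  # 0 = Dubai

loss_moscow = mse(torch.tensor([1.0]), target)  # predicted 1 = Moscow
loss_dublin = mse(torch.tensor([2.0]), target)  # predicted 2 = Dublin

# (2 - 0)^2 = 4 vs (1 - 0)^2 = 1: Dublin is penalized more, even though
# both predictions are "equally wrong" as class labels.
print(loss_moscow.item(), loss_dublin.item())  # 1.0 4.0
```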


I did try to use nn.CrossEntropyLoss.

I get the error:
*IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)*

I tried reading other posts regarding this error, but I can’t figure out where I went wrong.

criterion = torch.nn.CrossEntropyLoss()
before_train = criterion(y_pred.squeeze(), test_output)

Now for the variables:

y_pred:
tensor([0.5098, 0.5024, 0.5043,  ..., 0.5023, 0.5060, 0.5058], device='cuda:0',
       grad_fn=<SqueezeBackward0>)
torch.Size([4608])

test_output:
tensor([1., 0., 1.,  ..., 0., 1., 1.], device='cuda:0')
torch.Size([4608])

They are the same size and both tensors, so what is causing the error?

Your prediction shape is wrong as [batch_size, nb_classes] is expected.
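Since your y_pred is a single value per sample (apparently already passed through a sigmoid) and your targets are 0./1., this looks like binary classification. A sketch of the two usual options, assuming that setup (the shapes below mirror your [4608] example):

```python
import torch
import torch.nn as nn

batch_size = 4608

# Option 1: keep one output per sample and use a binary loss.
# If y_pred already went through sigmoid, nn.BCELoss works; for raw
# logits, nn.BCEWithLogitsLoss is numerically preferable.
y_pred = torch.rand(batch_size)                      # shape [batch_size], values in [0, 1]
target = torch.randint(0, 2, (batch_size,)).float()  # float targets for BCE
loss_binary = nn.BCELoss()(y_pred, target)

# Option 2: output one logit per class and keep nn.CrossEntropyLoss,
# which expects predictions of shape [batch_size, nb_classes] and
# integer class indices as targets.
logits = torch.randn(batch_size, 2)                  # shape [batch_size, 2]
class_target = torch.randint(0, 2, (batch_size,))    # dtype long, shape [batch_size]
loss_ce = nn.CrossEntropyLoss()(logits, class_target)
```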

Sorry once again.

My inputs are:

'month', 'day', 'week_day', 'classname_ml', 'departure_country', 'departure_city',
'departure', 'arrival', 'arrival_city', 'arrival_country'

month, day, and week_day are int type and pretty straightforward.
I understand that I need to encode the cities as ints too, and that’s fine.

But I have the same cities in both departure and arrival columns.
I did it by simply numbering each city (and country and airport).
Is that a good solution? Apparently not, according to an article I read, which suggests using OneHotEncoder instead, but my issue with that is that it would encode a departure Moscow differently from an arrival Moscow (or does it not matter?).
Also, since the encoding doesn’t remember the original values, how can I convert the values back for testing? (Imagine I want to get the output for 7-11-2022 Dubai-Moscow.)
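One way to handle both concerns, sketched here with scikit-learn’s LabelEncoder (an assumption; any mapping you persist yourself works too): fit a single encoder on the union of departure and arrival cities so that “Moscow” gets the same integer in both columns, and keep the fitted encoder around so inverse_transform can map the integers back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the departure_city / arrival_city columns
departure_city = ["Dubai", "Moscow", "Wellington"]
arrival_city = ["Moscow", "London", "Dubai"]

# One shared encoder, fit on all city names from both columns
city_encoder = LabelEncoder().fit(departure_city + arrival_city)

dep_encoded = city_encoder.transform(departure_city)
arr_encoded = city_encoder.transform(arrival_city)

# The same city maps to the same integer in either column, and
# inverse_transform recovers the original names for testing.
print(city_encoder.inverse_transform(dep_encoded))  # ['Dubai' 'Moscow' 'Wellington']
```

The same pattern applies to the country and airport columns, each with its own shared encoder.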