Feature importance / Regression trees

bsalanon · October 3, 2019, 7:21am

Good morning all,

I am doing some machine learning on the Metro Interstate Traffic Volume dataset from here: http://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

The aim is to predict the hourly traffic volume from various features.

I trained a random forest and got a 20% error on the traffic prediction, which is OK for me. Then I plotted the feature importances and got:

Feature ranking:

holiday (0.736229)
temp (0.137736)
rain_1h (0.029498)
snow_1h (0.028497)
clouds_all (0.023778)
weather_main (0.019016)
hours (0.004792)
wd_0 (0.004365)
…
wd_6 0 (0.000018)

Comments :

‘holiday’ is very important because, of course, traffic is very low the whole day on holidays!
‘hours’ seems to be almost insignificant.

Action: I REMOVED ‘hours’ from the features because it seemed so insignificant, and then ran the random forest again.

Result: a disaster (the error jumped from 20% to 60%). And YES, that is to be expected, because traffic volume depends so much on the hour of the day!

Question:

Why was ‘hours’ ranked so low in the feature importance analysis?
It seems that removing features based on the .feature_importances_ attribute of RandomForestRegressor() is not a good idea at all. Is that correct?
So what does .feature_importances_ measure, precisely?

Thanks in advance for your comments !

ptrblck · October 3, 2019, 11:19am

Which library are you using?
The library's documentation should explain how the feature importance is calculated.

You are posting in the PyTorch forum, which might not be the best place to ask about other toolkits.

bsalanon · October 3, 2019, 2:03pm

Yes, you are correct. I was confused… I am using scikit-learn (sklearn), so this is not the right place!
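
For reference: in scikit-learn, the .feature_importances_ attribute of RandomForestRegressor is impurity-based. It reports how much each feature reduced the split criterion across the trees, normalized to sum to 1, and it is computed from the training data only. It does not directly say how much the test error would change if a feature were removed. Two common cross-checks are permutation importance on held-out data and simply retraining without the feature, as was done above. The sketch below compares all three. It uses synthetic data, not the metro dataset: the column names, the target formula and every parameter value are assumptions for illustration, and it does not try to reproduce or explain the ranking posted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the real metro dataset).
rng = np.random.default_rng(0)
n = 2000
hour = rng.integers(0, 24, size=n)      # hour of day, 0..23
holiday = rng.integers(0, 2, size=n)    # 1 = holiday, 0 = regular day
noise = rng.normal(size=n)              # feature unrelated to the target

# Hypothetical target: strong dependence on hour, weaker on holiday.
volume = (3000 + 2000 * np.sin(np.pi * hour / 24)
          - 500 * holiday + rng.normal(scale=100, size=n))

names = ["hour", "holiday", "noise"]
X = np.column_stack([hour, holiday, noise])
X_train, X_test, y_train, y_test = train_test_split(X, volume, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# 1) Impurity-based importances (computed on the training data).
mdi = model.feature_importances_

# 2) Permutation importances: drop in test score when one column is shuffled.
perm = permutation_importance(model, X_test, y_test,
                              n_repeats=5, random_state=0)

for name, a, b in zip(names, mdi, perm.importances_mean):
    print(f"{name:8s} impurity={a:.3f} permutation={b:.3f}")

# 3) Drop-column check: retrain without "hour" and compare test errors.
mae_full = mean_absolute_error(y_test, model.predict(X_test))
reduced = RandomForestRegressor(n_estimators=50, random_state=0)
reduced.fit(X_train[:, 1:], y_train)
mae_no_hour = mean_absolute_error(y_test, reduced.predict(X_test[:, 1:]))
print(f"MAE with hour: {mae_full:.1f}, without hour: {mae_no_hour:.1f}")
```

If the impurity-based ranking and the permutation or drop-column results disagree strongly on the real data, that is a hint to check how the features were encoded before trusting the ranking enough to remove anything.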