Good morning all,
I am doing some machine learning on the Metro Interstate Traffic Volume dataset from here: http://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume
The aim is to predict the hourly traffic volume from the following features:
- holiday: Yes/No
- temperature in °C
- rainfall and snowfall in mm
- cloud cover in %
- hour of the day: integer from 0 to 23
- day of the week, one-hot encoded (wd_i, i=0…6).
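For context, here is a minimal sketch of how the ‘hours’ and wd_i columns can be derived with pandas. It assumes a timestamp column named date_time and that wd_0 means Monday; both are assumptions for illustration, and the three rows are made-up timestamps, not rows from the dataset.

```python
import pandas as pd

# Made-up timestamps for illustration only (not taken from the dataset).
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2012-10-02 09:00:00",  # a Tuesday
        "2012-10-06 17:00:00",  # a Saturday
        "2012-10-07 23:00:00",  # a Sunday
    ])
})

# Hour of the day as an integer from 0 to 23.
df["hours"] = df["date_time"].dt.hour

# Day of the week as 0..6 (pandas convention: Monday=0), declared as a
# categorical with all seven levels so every wd_i column is created even
# when a given day is absent from the data.
weekday = pd.Series(
    pd.Categorical(df["date_time"].dt.dayofweek, categories=list(range(7))),
    index=df.index,
)
wd = pd.get_dummies(weekday, prefix="wd")

features = pd.concat([df[["hours"]], wd], axis=1)
print(features)
```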
I trained a random forest and got a 20% error on the traffic prediction, which is OK for me. Then I printed the feature importances and got:
Feature ranking:
- holiday (0.736229)
- temp (0.137736)
- rain_1h (0.029498)
- snow_1h (0.028497)
- clouds_all (0.023778)
- weather_main (0.019016)
- hours (0.004792)
- wd_0 (0.004365)
- …
- wd_6 (0.000018)
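A ranking of this kind can be printed roughly as in the sketch below. The data is a synthetic stand-in (the target formula and the reduced column set are assumptions for illustration), so the numbers will not match the ones above. Only RandomForestRegressor and its feature_importances_ attribute come from the actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): NOT the real traffic dataset.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "holiday": rng.integers(0, 2, n),
    "temp": rng.normal(10, 10, n),
    "rain_1h": rng.exponential(0.5, n),
    "hours": rng.integers(0, 24, n),
})
# Hypothetical target: a daily cycle driven by the hour, lower on holidays.
y = (4000
     + 2000 * np.sin((X["hours"] - 6) * np.pi / 12)
     - 1500 * X["holiday"]
     + rng.normal(0, 100, n))

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one value per column, normalised to sum to 1.
importances = model.feature_importances_

# Column positions sorted from most to least important.
order = np.argsort(importances)[::-1]

print("Feature ranking:")
for i in order:
    # Look the name up through `order` so each label stays paired
    # with its own importance value.
    print(f"- {X.columns[i]} ({importances[i]:.6f})")
```

The detail worth checking in a loop like this is that the feature name and the importance value are both indexed with the same sorted position.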
Comments:
- ‘holiday’ is very important because, of course, traffic is very light all day on holidays.
- ‘hours’ seems to be almost insignificant.
Action: I REMOVED ‘hours’ from the features because it looked so insignificant, and then ran the random forest again.
Result: a disaster (the error jumped from 20% to 60%). And yes, that is to be expected, because traffic volume depends so much on the hour of the day!
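The with/without comparison can be reproduced in miniature with the sketch below. It again uses synthetic stand-in data, and it assumes the error is a mean absolute percentage error on a held-out split. The helper name holdout_error is hypothetical, and the printed values are not the 20% and 60% quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): NOT the real traffic dataset.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "holiday": rng.integers(0, 2, n),
    "temp": rng.normal(10, 10, n),
    "rain_1h": rng.exponential(0.5, n),
    "hours": rng.integers(0, 24, n),
})
# Hypothetical target: a daily cycle driven by the hour, lower on holidays.
y = (4000
     + 2000 * np.sin((X["hours"] - 6) * np.pi / 12)
     - 1500 * X["holiday"]
     + rng.normal(0, 100, n))


def holdout_error(features):
    """Mean absolute percentage error on a held-out split (assumed metric)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[features], y, test_size=0.25, random_state=0
    )
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return float(np.mean(np.abs((y_te - pred) / y_te)))


all_features = list(X.columns)
without_hours = [c for c in all_features if c != "hours"]

err_full = holdout_error(all_features)
err_drop = holdout_error(without_hours)

print(f"error with 'hours':    {err_full:.1%}")
print(f"error without 'hours': {err_drop:.1%}")
```

Retraining without a feature and comparing the held-out error is a direct check of how much the model relies on it, independent of any importance score.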
Questions:
- Why was ‘hours’ ranked so low in the feature importance analysis?
- It does not seem to be a good idea at all to drop features based on the .feature_importances_ attribute of RandomForestRegressor(). Is that correct?
- So what precisely does .feature_importances_ measure?
Thanks in advance for your comments!