Feature importance / Regression trees

I am doing some machine learning on the Metro Interstate Traffic Volume dataset, available here: http://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

The aim is to predict the hourly traffic volume from the following features:

  • holiday: yes/no
  • temperature in °C
  • rainfall and snowfall in mm
  • cloud cover in %
  • hour of the day: integer from 0 to 23
  • day of the week (wd_i, i = 0…6)

I trained a random forest and got a 20% error on the traffic prediction, which is good enough for me. I then plotted the feature importances and got:

Feature ranking:

  1. holiday (0.736229)
  2. temp (0.137736)
  3. rain_1h (0.029498)
  4. snow_1h (0.028497)
  5. clouds_all (0.023778)
  6. weather_main (0.019016)
  7. hours (0.004792)
  8. wd_0 (0.004365)
  9. wd_6 (0.000018)
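
For reference, a ranking of this kind can be produced along the following lines. This is only a minimal sketch on synthetic data, not the actual code or dataset behind the numbers above: the `names` list, the toy target and the forest settings are all assumptions made for illustration. Note that the importances come back in the column order of `X`, so each value has to be paired with its name through the same index.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real data (assumption: invented values and names)
rng = np.random.default_rng(0)
n = 2000
names = ["holiday", "temp", "hours"]
holiday = rng.integers(0, 2, n)
temp = rng.normal(10, 8, n)
hours = rng.integers(0, 24, n)
X = np.column_stack([holiday, temp, hours])

# Toy target: volume peaks around midday and drops on holidays
y = 3000 * np.sin(np.pi * hours / 24) - 1000 * holiday + rng.normal(0, 100, n)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# feature_importances_ is an attribute: one value per column of X, in column order
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]  # indices from most to least important

print("Feature ranking:")
for rank, idx in enumerate(order, start=1):
    # names[idx] and importances[idx] share the same index, so labels stay aligned
    print(f"  {rank}. {names[idx]} ({importances[idx]:.6f})")
```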

Comments:

  • ‘holiday’ is very important because, of course, traffic is very low the whole day on holidays!
  • ‘hours’ seems to be almost insignificant.

Action: I removed ‘hours’ from the features because it seemed so insignificant, and then ran the random forest again.

Result: a disaster (the error jumped from 20% to 60%). And yes, that is to be expected, because traffic volume depends so much on the hour of the day!


  • Why was ‘hours’ ranked so low in the feature importance analysis?
  • It seems to be a bad idea to remove features based on the .feature_importances_ attribute of RandomForestRegressor(). Is that correct?
  • So what exactly does .feature_importances_ measure?
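
For context on the last question: as far as the scikit-learn documentation describes it, `.feature_importances_` is the impurity-based importance, i.e. the total reduction of the split criterion (variance, for a regressor with the default settings) brought by each feature, normalised so the values sum to 1. One way to cross-check it is `sklearn.inspection.permutation_importance`, which measures how much the score on held-out data drops when a single column is shuffled. The sketch below compares the two on synthetic data; the data, the names and the settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: invented values and names)
rng = np.random.default_rng(0)
n = 2000
names = ["holiday", "temp", "hours"]
holiday = rng.integers(0, 2, n)
temp = rng.normal(10, 8, n)
hours = rng.integers(0, 24, n)
X = np.column_stack([holiday, temp, hours])
y = 3000 * np.sin(np.pi * hours / 24) - 1000 * holiday + rng.normal(0, 100, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Impurity-based importance, computed from the training-time splits
mdi = forest.feature_importances_

# Permutation importance: mean drop in the model's score (R^2 here) on held-out
# data when one column at a time is randomly shuffled
perm = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)

for name, impurity, permuted in zip(names, mdi, perm.importances_mean):
    print(f"{name:8s} impurity={impurity:.3f}  permutation={permuted:.3f}")
```

If the two rankings disagree strongly, or if a ranking contradicts what happens when a feature is actually removed, it is worth checking that the feature names are attached to the right columns before trusting either of them.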

