Feature importance / Regression trees

Good morning all,

I am doing some machine learning on the Metro Interstate Traffic Volume dataset from here: http://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

The aim is to predict the hourly traffic volume from the following features:

  • holiday: Yes/No
  • temperature in °C
  • rainfall and snowfall in mm
  • cloud cover in %
  • hour of the day: integer from 0 to 23
  • day of the week (wd_i, i=0…6).

I trained a random forest and got a 20% error on the traffic prediction, which is OK for me. Then I plotted the feature importances and got:

Feature ranking:

  1. holiday (0.736229)
  2. temp (0.137736)
  3. rain_1h (0.029498)
  4. snow_1h (0.028497)
  5. clouds_all (0.023778)
  6. weather_main (0.019016)
  7. hours (0.004792)
  8. wd_0 (0.004365)
  9. wd_6 (0.000018)
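
For reference, a ranking like the one above is typically produced along the following lines. This is only a minimal sketch on synthetic data, not the code behind the numbers above: the three columns, the data-generating rule and the model settings are all made up for illustration. The one point it is meant to show is that `feature_importances_` comes back in column order, so the names have to be reordered with the same sorted index as the scores. If the names are printed in their original column order next to sorted scores, the labels end up attached to the wrong numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# Hypothetical stand-in columns (not the real dataset)
feature_names = ["holiday", "temp", "hours"]
holiday = rng.integers(0, 2, size=n)
temp = rng.normal(10.0, 8.0, size=n)
hours = rng.integers(0, 24, size=n)
X = np.column_stack([holiday, temp, hours])

# Made-up target: volume driven mostly by the hour of the day
y = (3000.0
     + 2000.0 * np.sin((hours - 6) * np.pi / 12)
     - 500.0 * holiday
     + rng.normal(0.0, 100.0, size=n))

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Importances are returned in the same order as the columns of X
importances = model.feature_importances_

# Sort indices by decreasing importance, then use the SAME indices
# to look up both the name and the score
order = np.argsort(importances)[::-1]
ranking = [(feature_names[i], float(importances[i])) for i in order]

print("Feature ranking:")
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"  {rank}. {name} ({score:.6f})")
```

On this synthetic data ‘hours’ should come out on top, since the target was built to depend mainly on it.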

Comments :

  • ‘holiday’ is very important because, of course, on holidays traffic is very light the whole day!
  • ‘hours’ seems to be almost insignificant.

Action: I REMOVED ‘hours’ from the features because it looked so insignificant, and then ran the random forest again.

Result: a disaster (the error jumped from 20% to 60%). And yes, that makes sense, because traffic volume depends so much on the hour of the day!

Question:

  • Why was ‘hours’ ranked so low in the feature importance analysis?
  • It seems it is not a good idea at all to remove features based on the .feature_importances_ attribute of RandomForestRegressor(). Is that correct?
  • So what does .feature_importances_ measure, precisely?

Thanks in advance for your comments!

Which library are you using?
The library's documentation should explain how the feature importance is calculated.

You are posting in the PyTorch forum, which might not be the best place to ask about other toolkits. :wink:

Yes, you are correct. I was confused… I am using scikit-learn (sklearn), so this is not the right place!
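
For anyone landing here with the same question: in scikit-learn, `feature_importances_` on a random forest is the impurity-based importance (the mean decrease in impurity over the trees, normalized to sum to 1), computed from the training data. A common cross-check is permutation importance on a held-out split, via `sklearn.inspection.permutation_importance`, together with simply retraining without the column. The following is a minimal sketch on synthetic data, not an analysis of the traffic dataset: the columns, the target rule and the model settings are assumptions made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical stand-in columns (not the real dataset)
feature_names = ["holiday", "temp", "hours"]
holiday = rng.integers(0, 2, size=n)
temp = rng.normal(10.0, 8.0, size=n)
hours = rng.integers(0, 24, size=n)
X = np.column_stack([holiday, temp, hours])

# Made-up target: volume driven mostly by the hour of the day
y = (3000.0
     + 2000.0 * np.sin((hours - 6) * np.pi / 12)
     - 500.0 * holiday
     + rng.normal(0.0, 100.0, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: drop in the test score (R^2 by default)
# when one column is shuffled, averaged over n_repeats shuffles
perm = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0)
perm_scores = dict(zip(feature_names, perm.importances_mean))

# Direct check: retrain without 'hours' and compare test R^2
hours_col = feature_names.index("hours")
reduced = RandomForestRegressor(n_estimators=50, random_state=0)
reduced.fit(np.delete(X_train, hours_col, axis=1), y_train)

r2_full = model.score(X_test, y_test)
r2_without = reduced.score(np.delete(X_test, hours_col, axis=1), y_test)

for name, score in perm_scores.items():
    print(f"{name}: permutation importance {score:.4f}")
print(f"R^2 with hours: {r2_full:.3f}, without hours: {r2_without:.3f}")
```

If a feature scores low on one measure but removing it wrecks the test score, as described above for ‘hours’, it is worth checking how the importances were matched to the feature names before trusting the ranking.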