By Phone or Online

Access the help you need to use our software from representatives who are knowledgeable in data mining and predictive analytics

Can TEST partition performance be better than “LEARN” in RF?

One of the great benefits of RandomForests(R) is that you do not need to reserve any data for testing purposes. The built-in bootstrap resampling automatically holds back about 37% of the data (the OOB or “Out of Bag” data) when growing each tree, and naturally data that is not used to grow a tree can be used to evaluate it. Given that OOB data is perfectly fine for evaluating a model, we really do not need a test partition, and every RF model is guaranteed to have OOB performance stats.
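The 37% figure follows from the bootstrap itself: when n records are drawn with replacement n times, the chance that a given record is never drawn is (1 - 1/n)^n, which approaches 1/e, or about 0.368, as n grows. The following is a minimal sketch that checks this by simulation. It is not RandomForests code; the helper name `oob_fraction` and the sizes used are illustrative assumptions.

```python
import math
import random


def oob_fraction(n_records, rng):
    """Draw one bootstrap sample and return the fraction of records left out."""
    # Draw n_records indices with replacement; the set keeps the distinct ones.
    in_bag = {rng.randrange(n_records) for _ in range(n_records)}
    return 1 - len(in_bag) / n_records


rng = random.Random(0)  # fixed seed so the run is repeatable
n_records = 1999        # illustrative data set size (an assumption)
n_trees = 500           # illustrative forest size (an assumption)

# One bootstrap sample per tree, as in a forest.
fractions = [oob_fraction(n_records, rng) for _ in range(n_trees)]
mean_oob = sum(fractions) / n_trees

print(f"simulated mean OOB fraction: {mean_oob:.3f}")
print(f"exact (1 - 1/n)^n:           {(1 - 1 / n_records) ** n_records:.3f}")
print(f"limit 1/e:                   {math.exp(-1):.3f}")
```

The three printed values should agree closely, which is why the OOB share is quoted as "about 37%" regardless of data set size.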

But what if we also hold back an explicit test sample? What kind of performance should we expect to see? What surprises most users is that performance on the TEST partition is sometimes better than performance on the OOB data. Our discussion here is intended to explain why.

 Below we show a table in which each row corresponds to a record in our data set and each column represents a RandomForests tree.  A value of 1 indicates that the record (row) was used to train the tree and a value of 0 indicates that the record was OOB and thus not used.

Record    Tree_1    Tree_2    ...    Tree_500
001       1         0         ...    1
002       0         1         ...    1
003       1         1         ...    0
...
1999      0         0         ...    1
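A table like the one above can be reproduced with a few lines of code. The sketch below is only an illustration of the bookkeeping, not the RandomForests implementation; the function name `in_bag_matrix`, the seed, and the sizes are assumptions made for this example, so the individual 0/1 values will differ from those shown in the table.

```python
import random


def in_bag_matrix(n_records, n_trees, seed=0):
    """Return a 0/1 matrix: matrix[r][t] == 1 if record r was used to train tree t."""
    rng = random.Random(seed)
    matrix = [[0] * n_trees for _ in range(n_records)]
    for t in range(n_trees):
        # One bootstrap sample per tree: n_records draws with replacement.
        for _ in range(n_records):
            matrix[rng.randrange(n_records)][t] = 1
    return matrix


matrix = in_bag_matrix(n_records=1999, n_trees=500)

# Show the first three records against the first two trees and the last tree.
print("Record    Tree_1    Tree_2    ...    Tree_500")
for r in range(3):
    row = matrix[r]
    print(f"{r + 1:03d}       {row[0]}         {row[1]}         ...    {row[-1]}")

# Each record should be OOB (value 0) for roughly 37% of the trees.
oob_shares = [row.count(0) / len(row) for row in matrix]
avg_oob_share = sum(oob_shares) / len(oob_shares)
print(f"average share of trees for which a record is OOB: {avg_oob_share:.3f}")
```

Reading the matrix by column gives the OOB records for one tree; reading it by row gives the trees that may be used to score one record.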

To make a prediction for record 001 we do not use all of the trees grown in the forest. Instead we use only the approximately 37% of the trees for which the record was OOB.
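The OOB scoring rule just described can be sketched as a vote restricted to the trees that never saw the record. This is a simplified illustration under stated assumptions: the function `oob_prediction` and the five-tree example are hypothetical, and a plain majority vote stands in for whatever aggregation the software actually applies.

```python
from collections import Counter


def oob_prediction(in_bag_row, tree_votes):
    """Majority vote over only the trees for which the record was OOB.

    in_bag_row: 0/1 flags, 1 if the record was used to train that tree.
    tree_votes: the class each tree predicts for this record.
    Returns (predicted_class, trees_used), or (None, 0) if no tree is OOB.
    """
    oob_votes = [vote for flag, vote in zip(in_bag_row, tree_votes) if flag == 0]
    if not oob_votes:
        return None, 0
    winner, _ = Counter(oob_votes).most_common(1)[0]
    return winner, len(oob_votes)


# Hypothetical five-tree forest: the record was OOB for trees 2 and 5 only.
in_bag_row = [1, 0, 1, 1, 0]
tree_votes = ["yes", "no", "yes", "yes", "no"]

oob_class, trees_used = oob_prediction(in_bag_row, tree_votes)
print(oob_class, trees_used)   # prints: no 2

# A record from a separate test partition appears in no bootstrap sample,
# so every flag is 0 and every tree takes part in the vote.
test_class, test_trees = oob_prediction([0] * 5, tree_votes)
print(test_class, test_trees)  # prints: yes 5
```

In a 500-tree forest this means an OOB prediction rests on roughly 185 trees, while a record held out in a test partition can be scored by all 500.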
