Data Mining for Statisticians
An approach to data mining from a statistical point of view.
Data Mining for Statisticians Pt. I
This video begins with some definitions of data mining and machine learning. We take a look at some well-known classical approaches. We then move on to running a conventional regression model on the Boston housing data using MARS and examine the shortcomings of conventional regression.
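As a point of reference for the conventional-regression baseline discussed above, here is a minimal ordinary-least-squares sketch. It uses synthetic data in place of the Boston housing set (the names and coefficients are illustrative, not from the video).

```python
import numpy as np

# A minimal sketch of conventional (ordinary least squares) regression,
# on synthetic data standing in for the Boston housing set.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=(n, 2))                  # two hypothetical predictors
y = 3.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.5, n)

# Add an intercept column and solve the least-squares problem directly.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)        # estimates should be close to [3.0, 1.5, -0.8]
```

A fit like this is globally linear: one slope per predictor over the whole data range, which is exactly the limitation the later videos address with adaptive methods.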
Data Mining for Statisticians Pt. II
This video is dedicated to MARS (Multivariate Adaptive Regression Splines). It starts with an overview of splines, specifically smoothing splines. We discuss MARS's built-in automation for finding ‘knots’ and how MARS accomplishes this through the use of basis functions. We then move on to the three main stages of the MARS algorithm: the forward stage, the backward stage, and the selection stage. This is followed by a live run of MARS, comparing the results with those obtained in the previous video using linear regression.
Data Mining for Statisticians Pt. III
This video introduces regression trees as recursive, piecewise-constant fits used to identify an underlying response surface. It then walks through an example of a CART model set up on the Boston housing data and discusses the results. Finally, it summarizes how this approach simplifies the underlying structure by constructing piecewise-constant models: the underlying population is segmented into a set of mutually exclusive smaller segments, such that within each segment the prediction is a constant, and as you go from segment to segment, the prediction changes by a fixed amount.
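The recursive, piecewise-constant idea can be illustrated with a single split step, the building block CART applies recursively (a toy sketch, not the CART software itself): pick the threshold that minimizes total squared error, then predict the mean of the response on each side.

```python
import numpy as np

# One CART-style split: find the threshold on x minimizing the total
# squared error, then predict the mean of y within each segment.
# A full regression tree applies this step recursively to each segment.
def best_split(x, y):
    best_t, best_sse = None, np.inf
    for t in np.unique(x)[1:]:                       # candidate thresholds
        left, right = y[x < t], y[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.0, 5.0, 20.0, 20.0, 20.0])
t = best_split(x, y)
# Predictions are constant within each segment.
pred = np.where(x < t, y[x < t].mean(), y[x >= t].mean())
```

Here the best split lands at 10.0, cleanly separating the two groups, and the prediction jumps by a fixed amount (from 5 to 20) as you cross from one segment to the other.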
Data Mining for Statisticians Pt. IV
This video introduces the concept of tree ensembles. It starts with an overview of the ensemble modeling process and discusses the different methods for growing multiple trees, the bootstrap sampling procedure, and the RandomForests algorithm. The following video details how to work with RandomForests.
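The bootstrap sampling procedure mentioned above can be sketched in a few lines: each tree is grown on a sample of n rows drawn with replacement from the original n, which leaves roughly 1/e ≈ 37% of rows "out of bag" for that tree.

```python
import numpy as np

# One bootstrap sample of row indices: n draws with replacement from n rows.
rng = np.random.default_rng(42)
n = 1000
idx = rng.integers(0, n, size=n)

# Rows never drawn are "out of bag" for this tree; the expected
# fraction is (1 - 1/n)^n, which approaches 1/e ~= 0.368.
in_bag = np.unique(idx)
oob_fraction = 1.0 - in_bag.size / n
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

The out-of-bag rows act as a built-in holdout for each tree, which is how bagged ensembles can estimate prediction error without a separate test set.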
Data Mining for Statisticians Pt. V
This video focuses on RandomForests, an ensemble of independently grown, large regression trees. It begins with the RandomForests settings and the building of a model. We discuss the results and highlight the advantages and disadvantages of the approach, along with the rationale for building multiple trees.
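The averaging mechanism behind the forest can be sketched with a toy ensemble: grow many very simple trees (depth-1 "stumps" here, for brevity) on bootstrap samples and average their predictions. Real RandomForests grow large trees and also sample predictors at each split; this sketch only illustrates why averaging many noisy trees gives a stable prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.where(x < 5, 1.0, 3.0) + rng.normal(0, 0.3, x.size)  # step signal + noise

def fit_stump(x, y):
    # Best single split minimizing total squared error.
    best, best_sse = None, np.inf
    for t in x[1:]:
        l, r = y[x < t], y[x >= t]
        if l.size == 0 or r.size == 0:
            continue
        sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (t, l.mean(), r.mean()), sse
    return best

# Grow each stump on a bootstrap sample, then average the predictions.
n_trees = 25
preds = np.zeros_like(x)
for _ in range(n_trees):
    b = rng.integers(0, x.size, x.size)              # bootstrap sample
    t, lo, hi = fit_stump(x[b], y[b])
    preds += np.where(x < t, lo, hi)
ensemble = preds / n_trees
```

Each individual stump places its split at a slightly different point, but the average settles near the true step at x = 5 with predictions close to the true levels of 1 and 3.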
Data Mining for Statisticians Pt. VI
The last video in this series covers stochastic gradient boosting, or TreeNet, also referred to as Multiple Additive Regression Trees (MART). It begins with an introduction to loss functions, an integral part of the TreeNet methodology. We then go through an overview of the TreeNet algorithm without delving into too much detail. A TreeNet model is set up and run. We observe the results, then run a second model to push performance further and compare the two. We explore the variable dependence plots produced by TreeNet. We finish by noting the strengths of TreeNet and how utilizing data mining tools can give you an advantage over conventional modeling.
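The core loop of stochastic gradient boosting can be sketched for the squared-error loss (a toy illustration, not the TreeNet implementation): each small tree, a stump here, is fit to the residuals of the current model, which are the negative gradient of the squared loss, on a random subsample, and added with a shrinkage factor.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.1, x.size)

def fit_stump(x, y):
    # Best single split minimizing total squared error.
    best, best_sse = None, np.inf
    for t in x[1:]:
        l, r = y[x < t], y[x >= t]
        if l.size == 0 or r.size == 0:
            continue
        sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (t, l.mean(), r.mean()), sse
    return best

pred = np.zeros_like(y)
shrink = 0.1                                         # learning rate
for _ in range(200):
    resid = y - pred                                 # negative gradient of squared loss
    sub = rng.choice(x.size, x.size // 2, replace=False)  # "stochastic": random subsample
    t, lo, hi = fit_stump(x[sub], resid[sub])
    pred += shrink * np.where(x < t, lo, hi)

rmse = np.sqrt(np.mean((y - pred) ** 2))
```

Small steps on many weak trees, plus the subsampling, are what distinguish this from growing one big tree: the ensemble approximates the smooth target closely while no single stump does.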