(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)
Classification on the German Credit Database.
In our data science course, this morning, we used random forests to improve predictions on the German Credit dataset. The dataset is available online.
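Something like the following should import it (a sketch: the URL is an assumption based on where the Freakonometrics site usually hosts data files; the same dataset is also distributed by the UCI Machine Learning Repository):

```r
# Import the German Credit data (URL is an assumption;
# the dataset is also on the UCI Machine Learning Repository)
credit <- read.csv("http://freakonometrics.free.fr/german_credit.csv",
                   header = TRUE)
str(credit)  # most columns are imported as integers
```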
Almost all variables are treated as numeric, but actually, most of them are factors (etc.). Let us convert the categorical variables to factors.
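A sketch of the conversion, continuing from the `credit` data frame imported above (the column indices below are an assumption, to be adapted to the actual layout of the file):

```r
# Column indices of the categorical variables
# (assumed layout; adjust to your copy of the dataset)
factor_cols <- c(1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20)
for (i in factor_cols) credit[, i] <- as.factor(credit[, i])
```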
Let us now create our training/calibration and validation/testing datasets, with a 1/3-2/3 split (one third of the observations kept aside for validation).
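For instance (a sketch; the seed and the index names are choices made here for illustration):

```r
set.seed(123)  # for reproducibility
n <- nrow(credit)
i_test        <- sample(1:n, size = round(n / 3))  # one third for validation
i_calibration <- (1:n)[-i_test]                    # two thirds for calibration
```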
The first model we can fit is a logistic regression, on a few selected covariates.
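A sketch of that first fit, on five covariates (the variable names below follow the commonly distributed CSV version of the German Credit data and are assumptions):

```r
# Logistic regression on five covariates
# (variable names are assumptions based on the usual CSV export)
LogisticModel <- glm(Creditability ~ Account.Balance +
                       Payment.Status.of.Previous.Credit +
                       Purpose + Credit.Amount + Age..years.,
                     family = binomial, data = credit[i_calibration, ])
summary(LogisticModel)
```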
Based on that model, it is possible to draw the ROC curve, and to compute the AUC (on the validation dataset).
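One way to do it, using the ROCR package (a package choice assumed here; pROC would work just as well), continuing from the model and indices above:

```r
library(ROCR)

# Predicted default probabilities on the validation dataset
fitLog <- predict(LogisticModel, type = "response",
                  newdata = credit[i_test, ])
pred   <- prediction(fitLog, credit$Creditability[i_test])
plot(performance(pred, "tpr", "fpr"))  # ROC curve
AUCLog1 <- performance(pred, "auc")@y.values[[1]]
AUCLog1
```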
An alternative is to consider a logistic regression on all explanatory variables
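That is simply the formula `Creditability ~ .` (a sketch, continuing from the calibration indices above):

```r
# Logistic regression on all explanatory variables
LogisticModel.tot <- glm(Creditability ~ ., family = binomial,
                         data = credit[i_calibration, ])
fitLog.tot <- predict(LogisticModel.tot, type = "response",
                      newdata = credit[i_test, ])
```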
We might overfit here, and we should observe that on the ROC curve.
There is a slight improvement here, compared with the previous model, where only five explanatory variables were considered.
Consider now a regression tree (on all covariates).
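With the rpart package (since `Creditability` was converted to a factor above, rpart will actually grow a classification tree):

```r
library(rpart)  # recursive partitioning trees

ArbreModel <- rpart(Creditability ~ ., data = credit[i_calibration, ])
```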
We can visualize the tree using
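For instance, with base graphics (the rpart.plot package's `prp()` function gives a prettier rendering, if installed):

```r
# Base-R plot of the fitted tree
plot(ArbreModel, margin = 0.05)
text(ArbreModel, cex = 0.8)
```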
The ROC curve for that model is
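Obtained the same way as before (a sketch, continuing from the tree and the ROCR-based code above):

```r
library(ROCR)

# With a factor response, predict() on an rpart tree returns one
# probability column per class; keep the second one
fitArbre <- predict(ArbreModel, newdata = credit[i_test, ],
                    type = "prob")[, 2]
pred     <- prediction(fitArbre, credit$Creditability[i_test])
plot(performance(pred, "tpr", "fpr"))
AUCArbre <- performance(pred, "auc")@y.values[[1]]
```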
As expected, a single tree has lower performance than a logistic regression. A natural idea is to grow several trees using some bootstrap procedure, and then to aggregate their predictions.
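That is what a random forest does (bagging plus random feature selection at each split). A sketch with the randomForest package, continuing from the objects above:

```r
library(randomForest)
library(ROCR)

RF <- randomForest(Creditability ~ ., data = credit[i_calibration, ])
fitForet <- predict(RF, newdata = credit[i_test, ], type = "prob")[, 2]
pred     <- prediction(fitForet, credit$Creditability[i_test])
plot(performance(pred, "tpr", "fpr"))
AUCRF <- performance(pred, "auc")@y.values[[1]]
```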
Here this model is (slightly) better than the logistic regression. Actually, if we create many training/validation samples and compare the AUCs, we can observe that – on average – random forests perform better than logistic regressions.
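A sketch of that comparison (20 replications here, kept small for speed; the original study may have used more), reusing the packages loaded above:

```r
library(randomForest)
library(ROCR)

# Repeat the 1/3-2/3 split and store both AUCs each time
AUC <- matrix(NA, 20, 2)
for (s in 1:20) {
  i_test        <- sample(1:nrow(credit), size = round(nrow(credit) / 3))
  i_calibration <- (1:nrow(credit))[-i_test]
  glm_fit <- glm(Creditability ~ ., family = binomial,
                 data = credit[i_calibration, ])
  rf_fit  <- randomForest(Creditability ~ ., data = credit[i_calibration, ])
  p_glm <- predict(glm_fit, type = "response", newdata = credit[i_test, ])
  p_rf  <- predict(rf_fit, newdata = credit[i_test, ], type = "prob")[, 2]
  AUC[s, 1] <- performance(prediction(p_glm, credit$Creditability[i_test]),
                           "auc")@y.values[[1]]
  AUC[s, 2] <- performance(prediction(p_rf, credit$Creditability[i_test]),
                           "auc")@y.values[[1]]
}
colMeans(AUC)  # average AUC: logistic regression vs random forest
```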
To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...