Acta Oeconomica Pragensia 2015, 23(3):3-17 | DOI: 10.18267/j.aop.472

Data representativeness problem in credit scoring

Josef Ditrich
University of Economics, Prague (VŠE), Faculty of Informatics and Statistics (email: pepa.ditrich@seznam.cz).

When building models, it is common to split the whole dataset into a development sample and a validation sample. In some cases, using simple random sampling instead of stratified sampling can cause the resulting samples to lose representativeness. A model built on such data then gives different or unexpected results when its performance is measured on the validation sample. In business practice, a lack of representativeness can cause problems of interpretation and can have a large financial impact when a biased model is used in the credit granting process. The aim of this paper is to examine and explain why representativeness should be checked before modelling starts. The paper deals with methods for identifying selection bias over time and recommends three tests as a standard part of the data preparation process.
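The contrast between random and stratified splitting described above can be illustrated with a minimal sketch. It is not the paper's procedure and does not reproduce the three recommended tests, which the abstract does not name. The function names, the 70/30 split, and the portfolio of 1000 loans with a 5% bad rate are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def random_split(labels, dev_fraction, seed):
    """Simple random split: returns (development indices, validation indices)."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    cut = round(len(indices) * dev_fraction)
    return indices[:cut], indices[cut:]


def stratified_split(labels, dev_fraction, seed):
    """Split each label group separately so both samples keep the class mix."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    dev, val = [], []
    for y in sorted(by_label):
        group = by_label[y]
        rng.shuffle(group)
        cut = round(len(group) * dev_fraction)
        dev.extend(group[:cut])
        val.extend(group[cut:])
    return dev, val


def bad_rate(labels, indices):
    """Share of defaulted loans (label 1) among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Hypothetical portfolio (assumption): 1000 loans, 50 defaults, a 5% bad rate.
labels = [1] * 50 + [0] * 950

dev_r, val_r = random_split(labels, dev_fraction=0.7, seed=1)
dev_s, val_s = stratified_split(labels, dev_fraction=0.7, seed=1)

print("random:     dev", bad_rate(labels, dev_r), "val", bad_rate(labels, val_r))
print("stratified: dev", bad_rate(labels, dev_s), "val", bad_rate(labels, val_s))
```

With stratification on the default flag, both samples reproduce the overall bad rate by construction, whereas the random split only matches it on average and may drift for any single draw. Stratifying on the target alone does not guarantee that other characteristics, such as the application date, are equally represented in both samples.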

Keywords: credit scoring, credit risk models, selection bias, random sampling, stratified sampling, data splitting
JEL classification: C18, C80, C83

Published: June 1, 2015

Ditrich, J. (2015). Data representativeness problem in credit scoring. Acta Oeconomica Pragensia, 23(3), 3-17. doi: 10.18267/j.aop.472

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.