C80 - Data Collection and Data Estimation Methodology; Computer Programs: General

Results 1 to 2 of 2:

Data representativeness problem in credit scoring

Josef Ditrich

Acta Oeconomica Pragensia 2015, 23(3):3-17 | DOI: 10.18267/j.aop.472

When building models, it is common to split the whole dataset into a development sample and a validation sample. In some cases, using random sampling instead of stratified sampling can make the resulting samples unrepresentative. A model built on such data then gives different or unexpected results when its performance is measured on the validation sample. In business practice, a lack of representativeness can cause problems of interpretation and can have a large financial impact when a biased model is used in the credit-granting process. The aim of this paper is to examine why representativeness should be checked before modelling starts. The paper deals with methods for identifying selection bias over time and recommends three tests as a standard part of the data preparation process.
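As a rough illustration of the sampling issue this abstract describes, the following minimal Python sketch contrasts a simple random split with a stratified split on a made-up portfolio. The portfolio size, the 5% bad rate, the 70/30 split, and all function names are assumptions chosen for illustration. The sketch does not reproduce the three tests the paper recommends, which the abstract does not name.

```python
import random
from collections import defaultdict


def random_split(labels, dev_fraction, rng):
    """Simple random split; returns (development, validation) index lists."""
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    cut = round(len(indices) * dev_fraction)
    return indices[:cut], indices[cut:]


def stratified_split(labels, dev_fraction, rng):
    """Split each class separately so both samples keep the class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    dev, val = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * dev_fraction)
        dev.extend(members[:cut])
        val.extend(members[cut:])
    return dev, val


def bad_rate(labels, indices):
    """Share of 'bad' cases (label 1) among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Hypothetical portfolio (assumption): 1000 loans, 50 of them bad (label 1).
labels = [1] * 50 + [0] * 950
rng = random.Random(0)

r_dev, r_val = random_split(labels, 0.7, rng)
s_dev, s_val = stratified_split(labels, 0.7, rng)

# The random split may give different bad rates in the two samples;
# the stratified split keeps them equal to the portfolio bad rate.
print("random:     dev", bad_rate(labels, r_dev), "val", bad_rate(labels, r_val))
print("stratified: dev", bad_rate(labels, s_dev), "val", bad_rate(labels, s_val))
```

With rare outcomes such as defaults, the gap between the development and validation bad rates under random sampling tends to grow as the sample shrinks, which is one simple way to see why representativeness is worth checking before modelling.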

Identifying the most Informative Variables for Decision-Making Problems - a Survey of Recent Approaches and Accompanying Problems

Pavel Pudil, Petr Somol

Acta Oeconomica Pragensia 2008, 16(4):37-55 | DOI: 10.18267/j.aop.131

We provide an overview of problems related to variable selection (also known as feature selection) techniques in decision-making problems based on machine learning, with particular emphasis on recent findings. Several popular methods are reviewed and placed in a taxonomy. Issues related to the generalization-versus-performance trade-off, inherent in currently used variable selection approaches, are addressed and illustrated with real-world examples.