MODELLING OF INCOME DISTRIBUTION OF CZECH HOUSEHOLDS IN THE YEARS 1996 – 2005

After the 1989 Velvet Revolution, the transformation to a market economy, and mainly the formation of new income sources and a significant differentiation in wages, caused crucial changes to the income distribution. There have been systematic changes in the model parameters as well as a growing number of discrepancies between the empirical and theoretical income distributions; these have contaminated the model. This paper concentrates on verifying the validity of the statistical model of income distribution currently used in the Czech Republic. In order to assess as precisely as possible the suitability of the log-normal distribution as the model for income distribution in the Czech Republic, optimisation of the parameters is necessary. The accuracy of the model construction is crucial and is determined, to some extent, by the sensitivity and adaptability of the procedures used for estimating the model parameters. The paper presents methods suitable for achieving this goal and the results obtained by applying them to data files from the years 1996–2005.

conclusions that were based on them. Household income distribution in some other transitional countries (e.g., Slovakia, Poland, and Hungary) is in a similar situation.
Analyses and modelling of the state and development of incomes in transitional economies are the main objective of several PhD theses. Modelling of income distribution in Slovakia in 2002 by means of the generalised lambda distribution (GLD) is the theme of the thesis by Sipková (2005), while the theses by Bartošová (2006a) and Bílková (1996) are devoted to the complex analysis of the properties of income distribution in the Czech Republic and its modelling by means of the three-parameter log-normal distribution.

Advanced Methods of Analysis and Modelling of Living Standards
There are many economic and statistical publications in specialised and scientific journals and proceedings (domestic and international) dedicated to the social situation of the population and households. Several of them also contain results of professional statistical analyses based on available data on social situations and at the same time provide valuable information about the applicability and methods of such analyses. Contemporary research aims in particular at analysing the dynamics of income development, its stability, and the detection of important factors affecting income levels. Works concentrating on the description of income development dynamics include Paap and van Dijk (1998) and Di Prete and McManus (2000); its stability in EU countries is considered in Longford and Pittau (2006). The articles by Kneip and Utikal (2001) and Pittau (2004) focus on identifying the most influential factors and regional diversities in the EU.
Present income distributions are rather complicated, and thus many authors use more general classes of models (see e.g. Longford and Pittau, 2006). Theoretical foundations of the modern mathematical and statistical classification methods (modelling with the use of mixtures, hidden Markov models, generalised linear and additive models) can be found in McCullagh and Nelder (1994) and McLachlan and Peel (2000). Many authors give priority to modelling with kernel density estimates (Kneip and Utikal, 2001), which allow them to choose a suitable degree of curve smoothing and thus construct detailed models of the density function. The progressive method of modelling based on the properties of the quantile function of the generalised lambda distribution (RS GLD) (see Ramberg and Schmeiser, 1974, pp. 78-82; Sipková, 2005, pp. 90-164; Pacáková and Sodomová, 2003, pp. 30-44) or on the generalised Pareto distribution (see Luceno, 2006) is also widely applied nowadays.

Income Models
Income models may be used directly for assessment of the living standard or for comparison of living standards among individual areas or nations. For the purpose of statistical analyses of the living standard, the focus must be only on its measurable elements. In order to quantify the element of the living standard of a population that is directly dependent on income, it is necessary to determine the correct level and structure of the population income, i.e., to find suitable statistical models of income distribution for individual social classes as well as for the population as a whole, irrespective of social classes. Knowledge of the statistical model, which in many cases represents a simple approximation of a complicated sample distribution, and knowledge of the development trends of its parameters may be used to predict the behaviour of a particular variable in the upcoming period.
For the construction of the statistical model, it is important first to find a theoretical distribution function that describes the empirical frequency distribution well, and to choose suitable methods for estimating the parameters of the model. The methods used so far have been based on the assumption of homogeneity of a population's income within and among social classes. The wage differentiation process, which is currently occurring in transitional economies, results in increasing inhomogeneity of income distribution. These changes require verification of the methods used for describing the state of income distribution and constructing its parametric model, or alternatively, the choice of new methods. It must be noted that methods for estimating distribution characteristics and parameters of the chosen model must satisfy not only the requirements of consistency and efficiency, but also that of robustness.
In choosing a parametric model of household income distribution, logical criteria and experience with the behaviour of the given variable are given priority over the level of agreement of the model with the empirical frequency distribution. This does not mean, of course, that the chosen model should not at the same time reflect the empirical income distribution to the highest possible extent. Important discrepancies always require an analysis of their causes, or alternatively a change of the model (see Bartošová, 2006b).
To express a theoretical income distribution, it is possible to use some known distributions such as the log-normal, Pareto or Weibull distribution, and some others, or to approximate the empirical distribution by means of one of the theoretical curves within Pearson's or Johnson's system (see Pearson, 1895, pp. 343-414; Johnson, 1949, pp. 149-176). For the purpose of analytical illustration of empirical income distribution, a suitable convergent line may also be used. In economics, the log-normal model is the most widely used model of the distribution of wages and income. The biggest "competitor" of the log-normal distribution in modelling population income distribution is the Pareto distribution. The log-normal curve corresponds to the empirical distribution in its central part; however, it diverges markedly in the extremes. In contrast, the Pareto curve represents a suitable model of income distribution in the extremes (see Johnson, Kotz and Balakrishnan, 1994).
The empirical income distribution of all households, irrespective of social classes, is complicated and forms a mixture of two (or more) unimodal curves (see Bartošová and Bína, 2007). This circumstance, evidenced by the increasing inconsistency between the empirical income distributions and the model, may also be observed in the income distribution of some social classes.
Economic quantities, such as income, wages, turnover, profits, expenses etc., are bounded below by non-negative values. In the past the three-parameter log-normal distribution with parameters μ, σ² and γ, where γ is the theoretical minimum, represented a good approximation of income distribution. Therefore, the probability density function of the chosen model is determined by the following relation
$$f(x;\mu,\sigma^2,\gamma)=\begin{cases}\dfrac{1}{(x-\gamma)\,\sigma\sqrt{2\pi}}\exp\left(-\dfrac{\left(\ln(x-\gamma)-\mu\right)^2}{2\sigma^2}\right), & x>\gamma,\\[2ex] 0 & \text{otherwise,}\end{cases}$$
where f(x; μ, σ², γ) is the probability density function and μ, σ², γ are the parameters of the model.
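As a minimal sketch, the density of the three-parameter log-normal model can be evaluated directly; it is zero at or below the theoretical minimum γ. The function name and the parameter values below are illustrative assumptions, not estimates from this paper.

```python
import math


def lognormal3_pdf(x, mu, sigma2, gamma):
    """Density of the three-parameter log-normal distribution at x.

    mu and sigma2 are the mean and variance of ln(x - gamma);
    gamma is the theoretical minimum (the density is 0 for x <= gamma).
    """
    if x <= gamma:
        return 0.0
    sigma = math.sqrt(sigma2)
    z = (math.log(x - gamma) - mu) / sigma
    return math.exp(-0.5 * z * z) / ((x - gamma) * sigma * math.sqrt(2.0 * math.pi))


# Made-up parameter values, used only to show the call.
mu, sigma2, gamma = 12.0, 0.25, -20000.0
for income in (50_000, 150_000, 400_000):
    print(income, lognormal3_pdf(income, mu, sigma2, gamma))
```

With γ = 0 the function reduces to the ordinary two-parameter log-normal density.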

.2 Construction
The basic aim in constructing the theoretical model is its maximum correspondence to the empirical distribution (Bartošová, 2006a; Bartošová, 2006b; Bílková, 1996). Consequently, sufficient flexibility is among the conditions for the choice of the model. A three-parameter log-normal model is often used for modelling empirical income distribution. Since the sample files of household incomes in the years 1996-2005 are sufficiently large for the construction of log-normal models with parameters μ, σ² and γ, the maximum likelihood method was applied. It is based on searching for the argument at which the likelihood function attains its supremum,
$$\hat{\vec{\theta}}=\arg\sup_{\vec{\theta}}L(\vec{\theta};x_1,\dots,x_n),\qquad L(\vec{\theta};x_1,\dots,x_n)=\prod_{i=1}^{n}f(x_i;\mu,\sigma^2,\gamma),$$
where $\vec{\theta}=(\mu,\sigma^2,\gamma)$ is the parameter vector, $x_1,\dots,x_n$ are the values of net annual financial incomes of households in particular groups, and n is the sample size.
The system of likelihood equations for the estimation of the parameter vector $\vec{\theta}=(\mu,\sigma^2,\gamma)$ is derived by maximisation of the respective likelihood function and can only be solved numerically. Significant simplification can be achieved by substituting the maximum likelihood estimates $\hat{\mu}(\gamma)$ and $\hat{\sigma}^2(\gamma)$ into the respective log-likelihood function. Thus the reduced formula for the log-likelihood function is obtained:
$$\tilde{l}(\gamma)=-\frac{n}{2}\left[\ln\left(2\pi\hat{\sigma}^2(\gamma)\right)+1\right]-n\,\hat{\mu}(\gamma),$$
where $\hat{\mu}(\gamma)$, $\hat{\sigma}^2(\gamma)$ are the maximum likelihood estimates of the parameters for the chosen value of the parameter γ, γ is the chosen value of the theoretical minimum in the model, and n is the sample size.
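The substitution step can be sketched as follows, assuming the standard form of the log-likelihood of the three-parameter log-normal distribution; the intermediate expressions are a reconstruction for illustration and are not quoted from the cited theses.

```latex
% Full log-likelihood for a sample x_1, ..., x_n with all x_i > \gamma:
l(\mu,\sigma^2,\gamma) = -\frac{n}{2}\ln\left(2\pi\sigma^2\right)
  - \sum_{i=1}^{n}\ln(x_i-\gamma)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(\ln(x_i-\gamma)-\mu\right)^2

% For a fixed \gamma, setting the partial derivatives with respect to
% \mu and \sigma^2 to zero gives closed-form conditional estimates:
\hat{\mu}(\gamma) = \frac{1}{n}\sum_{i=1}^{n}\ln(x_i-\gamma), \qquad
\hat{\sigma}^2(\gamma) = \frac{1}{n}\sum_{i=1}^{n}
  \left(\ln(x_i-\gamma)-\hat{\mu}(\gamma)\right)^2
```

Inserting these two expressions back into the log-likelihood turns the last sum into the constant n/2 and the middle sum into n times the conditional estimate of μ, so that γ remains the only free argument and a one-dimensional numerical search suffices.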
The maximum likelihood estimate of the parameter γ of the three-parameter log-normal distribution is calculated numerically using the following two methods:
• search for the maximum of the modified log-likelihood function $\tilde{l}(\gamma)$,
• search for the minimum of the likelihood ratio (see Anděl, 1985; Bartošová, 2006a).
Considering the character of the particular feature, we obtained the estimate by searching for the maximum value of the function $\tilde{l}(\gamma)$ (minimum of the function LR(γ|n)) in the interval $[-x_{\max}, x_{\min})$, where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of incomes. The task was solved by an iterative method on a grid, which is refined in each iteration step (see Bartošová and Bína, 2008). The corresponding iterative procedure was implemented as a script in the R language.
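The grid search described above might look roughly like the following Python sketch. It is not the authors' R script: the function names, the number of grid points, the number of refinement steps and the small safety margin below the sample minimum are all assumptions made for illustration, and the simulated sample is not real income data. The sketch also assumes positive incomes with at least two distinct values.

```python
import math
import random


def profile_loglik(gamma, xs):
    """Reduced log-likelihood: mu and sigma^2 replaced by their ML estimates for fixed gamma."""
    n = len(xs)
    logs = [math.log(x - gamma) for x in xs]
    mu_hat = sum(logs) / n
    s2_hat = sum((y - mu_hat) ** 2 for y in logs) / n
    return -0.5 * n * (math.log(2.0 * math.pi * s2_hat) + 1.0) - n * mu_hat


def refine_grid_max(xs, points=41, iterations=12, margin=1e-6):
    """Search [-x_max, x_min) on a grid that is narrowed around the best point each step."""
    x_min, x_max = min(xs), max(xs)
    lower = -x_max
    upper = x_min - margin * (x_max - x_min)  # stay strictly below x_min
    lo, hi = lower, upper
    best_gamma, best_value = lo, profile_loglik(lo, xs)
    for _ in range(iterations):
        step = (hi - lo) / (points - 1)
        for i in range(points):
            gamma = lo + i * step
            value = profile_loglik(gamma, xs)
            if value > best_value:
                best_gamma, best_value = gamma, value
        # Narrow the search window to one grid step around the current best point.
        lo = max(lower, best_gamma - step)
        hi = min(upper, best_gamma + step)
    return best_gamma


# Simulated sample with made-up parameters, used only to exercise the search.
random.seed(0)
true_mu, true_sigma, true_gamma = 12.0, 0.5, -20000.0
sample = [true_gamma + random.lognormvariate(true_mu, true_sigma) for _ in range(2000)]
gamma_hat = refine_grid_max(sample)
print("estimated theoretical minimum:", gamma_hat)
```

Keeping the best point found so far across iterations means the returned value is never worse than any point of the initial coarse grid.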
Since we treat only finite samples, the maximum likelihood estimates are not guaranteed to be of sufficient quality. More detailed information about the accuracy of the maximum likelihood estimates of the parameter vector $\hat{\vec{\theta}}=(\hat{\mu},\hat{\sigma}^2,\hat{\gamma})$ in a log-normal model of income distribution is presented in the dissertation thesis by Bartošová (2006a).
The level of conformity of the empirical household income distribution to the log-normal model was quantified by the likelihood ratio. The LR statistic was chosen because the method of maximum likelihood and its modifications were used to estimate the parameters of the model (see Bartošová, 2006b). The results are also influenced by the number of classes into which the data are grouped during the calculation. The problem of the optimal number of classes m is the subject of many papers. In this case we chose $m = 15\times(n/100)^{2/5}$, which is suitable for a sufficiently large sample, i.e., for n > 80 (see Williams, 2001).
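The class-count rule and a grouped-data likelihood ratio can be sketched as follows. The expression LR = 2 Σ n_j ln(n_j / (n p_j)) is the usual likelihood-ratio statistic for grouped data and is assumed here, since the exact expression used by the authors is not reproduced in this section; the function names and the toy counts are hypothetical.

```python
import math


def number_of_classes(n):
    """Number of classes m = 15 * (n / 100)^(2/5), rounded to an integer."""
    return max(1, round(15 * (n / 100) ** 0.4))


def lr_statistic(counts, probs):
    """Likelihood-ratio statistic for observed class counts against model class probabilities.

    counts[j] is the observed frequency in class j and probs[j] is the
    probability of class j under the fitted model; empty classes contribute 0.
    """
    n = sum(counts)
    return 2.0 * sum(c * math.log(c / (n * p)) for c, p in zip(counts, probs) if c > 0)


# Sample sizes of roughly the magnitude mentioned in the text.
for n in (28_000, 8_000, 4_000):
    print(n, number_of_classes(n))

# Toy example: three classes, observed counts against assumed model probabilities.
print(lr_statistic([12, 25, 23], [0.25, 0.40, 0.35]))
```

The statistic equals zero when the observed relative frequencies coincide with the model probabilities and grows as they diverge.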

.3 Non-parametric
In order to assess the quality of the parametric model, non-parametric density estimates can also be used. Since the income distribution is continuous, we used a kernel estimate to fit its density:
$$\hat{f}(x)=\frac{1}{NH}\sum_{i=1}^{N}K\left(\frac{x-x_i}{H}\right),$$
where K is the kernel of the estimate, H is the curve smoothing coefficient, and N is the sample size.
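A kernel estimate with a bounded-support kernel can be sketched as below. The cosine kernel is written here in one common form, K(u) = (π/4) cos(πu/2) for |u| ≤ 1 and 0 otherwise; whether this is exactly the variant used by the authors is an assumption, as are the function names, the bandwidth and the toy sample.

```python
import math


def cosine_kernel(u):
    """One common cosine kernel: (pi/4) * cos(pi * u / 2) on [-1, 1], zero outside."""
    if abs(u) > 1.0:
        return 0.0
    return (math.pi / 4.0) * math.cos(math.pi * u / 2.0)


def kernel_density(x, sample, h):
    """Kernel density estimate at x with smoothing coefficient (bandwidth) h."""
    n = len(sample)
    return sum(cosine_kernel((x - xi) / h) for xi in sample) / (n * h)


# Toy incomes in thousands of CZK per year (made-up values).
incomes = [95.0, 110.0, 120.0, 150.0, 155.0, 170.0, 240.0, 260.0]
for x in (0.0, 100.0, 150.0, 250.0):
    print(x, kernel_density(x, incomes, h=40.0))
```

Because the kernel vanishes outside [-1, 1], the estimate is exactly zero at any point farther than h from every observation, which is what keeps mass away from negative incomes when h is smaller than the lowest income.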
The cosine kernel was used for the construction of the estimate. It was chosen because the income distribution is bounded on the left, whereas the commonly used normal kernel has unbounded support. The decision was motivated by the effort to restrict undesirable nonzero density estimates for negative values of incomes (see Bartošová, 2006a).
In 1996 (2002), a sample Mikrocensus survey was conducted in approximately 1% (0.3%) of Czech households, which represented about 28,000 (8,000) households at that time. In 2005, a sample SILC survey was made in 0.15% of Czech households, which represented about 4,000 households. The existence of a complete non-aggregated sample set enabled us to obtain good-quality estimates of the parameters of our distribution models. For the purposes of this research, the following data were chosen:
• the social class of the head of the household,
• the net income of the household (CZK per year).
In connection with the economic transformation in progress, new sources of income have arisen and the social structure of the sample sets has been changing accordingly. Prior to the Velvet Revolution, households were divided into classes of workers, co-operative farmers, employees, and pensioners. The social structure of the sample sets of household incomes from the years 1996-2005 is shown in Table 1 and Figures 1 and 2. It reflects the changes in the sources of income as well as in the percentage representation of the classes. It is apparent from Table 1 (Figures 1 and 2) that the two largest social classes, i.e., employees and pensioners without economically active members, constituted 87.45% (83.12%, 82.12%, respectively) of all the households in the Czech Republic in 1996 (2002, 2005, respectively). New social classes that emerged due to the post-Velvet Revolution transformation, i.e., households of the self-employed, the unemployed, and other households, constituted only about 8.44% (13.39%, 13.79%, respectively) altogether. That is why the new classes cannot significantly influence the total income distribution of households. Consequently, the character of the total income distribution of households will mainly be determined by the income distribution in the three major classes.

.1 The results obtained
For the sake of clarity, the behaviour of the log-likelihood function belonging to the estimates is also shown (see Figures 3 and 4). Although the shapes of the curves differ slightly, the uniqueness of the function maximum can be seen in all of them, which facilitates the identification of exactly one solution to the maximisation task on the given interval. The examples show the estimation of the log-normal model parameters for the income distribution of all households in the years 1996 and 2005. Figures 3 and 4 illustrate that the designed procedure converges effectively and faithfully to a single result.
Table 2 shows the values of the likelihood ratio LR for the constructed log-normal models with two and three parameters. Greater agreement of the empirical distribution with the model was achieved in all social groups for the three-parameter log-normal models. The use of the three-parameter log-normal models with the above described iterative estimation of the parameter γ led to an improvement in the model validity. The method of likelihood ratio minimisation was used in the iterative procedure in order to estimate the value of the parameter γ.
The estimates of the three parameters of the log-normal models of incomes in Czech households in the years 1996-2005 are listed in Tables 3-5.
We can infer from Tables 3-5 that the values of LR and $\chi^2_{0.95}(m-4)$ are comparable in most cases. A strong discrepancy between the empirical distribution and the model only appears for pensioners without economically active members and all households (see Bartošová, 2006c). In both the above cases, the LR statistics significantly exceed the value of the corresponding χ² quantile (see Figure 5). Those income sets have a bimodal distribution (see Figures 6-8) and could not be modelled using simple parametric models. Modelling of such mixtures is the topic of our previous paper (Bartošová and Bína, 2007).
Figures 6-8 show the three-parameter log-normal models, the empirical densities and the kernel estimates of the theoretical density of the income distribution of pensioners without economically active members and all households in the years 1996-2005. The proposed iterative procedure was used for the estimation of the log-normal model parameters; it allows the theoretical minimum to be estimated by likelihood ratio minimisation. The cosine kernel was used for the construction of the kernel estimates of density. The figures show that the kernel estimates of the theoretical density for the households of pensioners without economically active members are obviously bimodal, while this effect is less significant for all households.
One advantage of approximating the empirical distribution by a suitable parametric model is the possibility of prompt calculation of the basic characteristic estimates. Table 6 lists the values of mean and median annual incomes of Czech households in the years 1996-2005, and Table 7 contains the corresponding values of the standard deviation and the interquartile range (in Czech crowns). Those values were estimated using the above mentioned three-parameter log-normal models.
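Once the three parameters are known, such characteristics follow from closed-form expressions for the three-parameter log-normal distribution. The sketch below uses the standard formulas (mean γ + exp(μ + σ²/2), median γ + exp(μ), and so on); the function name and the parameter values are illustrative assumptions and are not taken from Tables 3-7.

```python
import math
from statistics import NormalDist


def lognormal3_summary(mu, sigma2, gamma):
    """Mean, median, standard deviation and interquartile range of the model."""
    sigma = math.sqrt(sigma2)
    z75 = NormalDist().inv_cdf(0.75)  # upper quartile of the standard normal
    mean = gamma + math.exp(mu + sigma2 / 2.0)
    median = gamma + math.exp(mu)
    sd = math.sqrt((math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2))
    # The shift gamma cancels in the difference of the two quartiles.
    iqr = math.exp(mu + sigma * z75) - math.exp(mu - sigma * z75)
    return {"mean": mean, "median": median, "sd": sd, "iqr": iqr}


# Made-up parameter values, used only to show the call.
print(lognormal3_summary(mu=12.0, sigma2=0.25, gamma=-20000.0))
```

Note that γ shifts the mean and the median but leaves the standard deviation and the interquartile range unchanged.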

Conclusions
In most cases the presented results of modelling using the three-parameter model show agreement with the empirical distribution. Only in the cases of pensioners without economically active members and of all households is the distribution bimodal and a significant discrepancy apparent. In all surveyed cases, the three-parameter models were better than the two-parameter ones. The use of the three-parameter models always led to an improvement in the validity of the constructed models. The likelihood ratio minimisation was used for estimating γ in the iterative procedure. Similar results could be achieved using the log-likelihood function maximisation.
The obtained results show that the estimates of the theoretical minimum in the three-parameter log-normal models are often negative. However, since the parameters μ and γ are negatively correlated, the effects of the negative estimates of the parameter γ are compensated for by higher values of the estimates of the parameter μ. Thus, all the constructed log-normal models can approximate the given empirical distribution well in spite of the fact that their theoretical minima differ.
The analysis of the structure of households yields three facts. Firstly, it can be seen that the three social classes that were the most numerous before the Velvet Revolution (workers, employees, and pensioners without economically active (EA) members) have kept their dominance, and the income distribution in the newly emerged social classes has a minimal influence on the total income distribution of households. Secondly, it can be seen that for the majority of the social classes the log-normal distribution can be considered a suitable model of household income distribution. Thirdly, the analysis shows the ageing of the population and the increasing influence of the income distribution of pensioners without economically active members. In this class, there is a significant discrepancy between the empirical distribution and the model. The same is true of all households, whose distribution is mainly influenced by this class. These income sets have a bimodal distribution and could not be modelled using simple parametric models. In both cases, the bimodal nature of the income distribution could be overcome by splitting the files into two subgroups: households with one member and households with multiple members.
2. Modelling of Income Distribution of Czech Households in 1996-2005
Data files from the years 1996-2005 were used for the modelling. Newer data were unavailable at the time of the study. Samples of household incomes used for the modelling were collected in two different surveys: the Mikrocensus and the SILC. The Mikrocensus is a periodical sample survey conducted once every 3-5 years since 1957. After the Czech Republic's accession to the EU, the former Mikrocensus was replaced by the SILC survey.

Figure 1 Structure of data sets in the years 1996-2002

Figure 2 Structure of data sets in the year 2005.
For the consideration of the model validity, the estimates of μ, σ² and γ are supplemented by the values of the LR statistic and the 95% quantiles $\chi^2_{0.95}(m-k-1)$, where k = 3 is the number of estimated parameters.

Figure 3 Behaviour of the reduced form of the log-likelihood function -all households (Incomes per household in 1996-2002)

Figure 4 Behaviour of the reduced form of the log-likelihood function -all households (Incomes per household in 2005)

Figure 5 Comparison of conformity of empirical distribution with three-parameter log-normal income models (Incomes per household in 1996-2005)

Figure 7 Three-parameter model, empirical density and kernel estimate (Pensioners without EA members and all households in 2002)

Figure 8 Three-parameter model, empirical density and kernel estimate (Pensioners without EA members and all households in 2005)

Table 4
Estimates of the parameters for the models (Incomes per household in 2002)

Table 7
Estimates of the chosen characteristics of variability (Incomes per household in the years 1996-2005)