Gini Index Algorithm
(2010-03-07 22:00:15)
Calculation
The Gini index is defined as a ratio of the areas on the Lorenz curve diagram. If the area between the line of perfect equality and the Lorenz curve is A, and the area under the Lorenz curve is B, then the Gini index is A/(A+B). Since A+B = 0.5, the Gini index, G = A/(0.5) = 2A = 1-2B. If the Lorenz curve is represented by the function Y = L(X), the value of B can be found with integration and:
G = 1 - 2\,\int_0^1 L(X)\,dX.
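As a quick numerical check of the integral formula, the sketch below approximates B with a midpoint rule and returns 1 - 2B. The function name `gini_from_lorenz`, the step count, and the test curve L(X) = X^2 are illustrative assumptions, not part of the original text; for that curve the exact value is 1 - 2/3 = 1/3.

```python
def gini_from_lorenz(lorenz, steps=10_000):
    """Approximate G = 1 - 2 * integral_0^1 L(X) dX with a midpoint rule.

    `lorenz` is any callable L(X) on [0, 1] with L(0) = 0 and L(1) = 1.
    """
    h = 1.0 / steps
    # Area B under the Lorenz curve, sampled at the midpoint of each slice.
    area_b = sum(lorenz((k + 0.5) * h) for k in range(steps)) * h
    return 1.0 - 2.0 * area_b


if __name__ == "__main__":
    # Hypothetical Lorenz curve L(X) = X^2 (an assumption for illustration).
    print(gini_from_lorenz(lambda x: x * x))
    # Line of perfect equality L(X) = X, for which G should be zero.
    print(gini_from_lorenz(lambda x: x))
```

The midpoint rule is used only because it needs no libraries; any quadrature routine would serve the same purpose.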
In some cases, this equation can be applied to calculate the Gini coefficient without direct reference to the Lorenz curve. For example:
* For a population uniform on the values y_i, i = 1 to n, indexed in non-decreasing order (y_i ≤ y_{i+1}):
G = \frac{1}{n}\left( n + 1 - 2 \left( \frac{\sum_{i=1}^n (n+1-i)\,y_i}{\sum_{i=1}^n y_i} \right) \right)
This may be simplified to:
G = \frac{2 \sum_{i=1}^n i\,y_i}{n \sum_{i=1}^n y_i} - \frac{n+1}{n}
* For a discrete probability function f(y), where y_i, i = 1 to n, are the points with nonzero probabilities, indexed in increasing order (y_i < y_{i+1}):
G = 1 - \frac{\sum_{i=1}^n f(y_i)\,(S_{i-1} + S_i)}{S_n}
where
S_i = \sum_{j=1}^i f(y_j)\,y_j and S_0 = 0.
* For a cumulative distribution function F(y) that is piecewise differentiable, has a mean μ, and is zero for all negative values of y:
G = 1 - \frac{1}{\mu}\int_0^\infty (1 - F(y))^2\,dy = \frac{1}{\mu}\int_0^\infty F(y)\,(1 - F(y))\,dy
* Since the Gini coefficient is half the relative mean difference, it can also be calculated using formulas for the relative mean difference. For a random sample S consisting of values y_i, i = 1 to n, that are indexed in non-decreasing order (y_i ≤ y_{i+1}), the statistic:
G(S) = \frac{1}{n-1}\left( n + 1 - 2 \left( \frac{\sum_{i=1}^n (n+1-i)\,y_i}{\sum_{i=1}^n y_i} \right) \right)
is a consistent estimator of the population Gini coefficient, but is not, in general, unbiased. Like G, G(S) has a simpler form:
G(S) = 1 - \frac{2}{n-1}\left( n - \frac{\sum_{i=1}^n i\,y_i}{\sum_{i=1}^n y_i} \right).
As is also the case for the relative mean difference, there does not exist a sample statistic that is in general an unbiased estimator of the population Gini coefficient.
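The four formulas in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the midpoint-rule integration, the truncation point `upper`, and the example data are all hypothetical choices, not something the original text prescribes. The inputs are assumed to be non-negative with a positive total.

```python
import math


def gini_population(values):
    """Finite population: G = 2*sum(i*y_i) / (n*sum(y_i)) - (n+1)/n."""
    y = sorted(values)  # non-decreasing order, ranks i = 1..n
    n = len(y)
    ranked = sum(i * v for i, v in enumerate(y, start=1))
    return 2.0 * ranked / (n * sum(y)) - (n + 1) / n


def gini_discrete(points, probs):
    """Discrete distribution: G = 1 - sum(f(y_i)*(S_{i-1} + S_i)) / S_n."""
    s_prev = 0.0  # S_0 = 0
    acc = 0.0
    for y, f in sorted(zip(points, probs)):  # increasing y_i
        s_curr = s_prev + f * y              # S_i = S_{i-1} + f(y_i)*y_i
        acc += f * (s_prev + s_curr)
        s_prev = s_curr
    return 1.0 - acc / s_prev                # s_prev now holds S_n


def gini_from_cdf(cdf, mean, upper, steps=100_000):
    """Continuous case: G = (1/mu) * integral_0^inf F(y)*(1 - F(y)) dy.

    The integral is truncated at `upper` and evaluated with a midpoint rule.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        f_val = cdf((k + 0.5) * h)
        total += f_val * (1.0 - f_val)
    return total * h / mean


def gini_sample(values):
    """Sample statistic: G(S) = 1 - 2/(n-1) * (n - sum(i*y_i)/sum(y_i))."""
    y = sorted(values)
    n = len(y)
    ranked = sum(i * v for i, v in enumerate(y, start=1))
    return 1.0 - 2.0 / (n - 1) * (n - ranked / sum(y))


if __name__ == "__main__":
    data = [0.0, 0.0, 0.0, 1.0]  # hypothetical incomes
    print(gini_population(data), gini_sample(data))
    print(gini_discrete([1.0, 3.0], [0.5, 0.5]))
    # Exponential distribution with rate 1 (mean 1); exact Gini is 1/2.
    print(gini_from_cdf(lambda t: 1.0 - math.exp(-t), mean=1.0, upper=50.0))
```

Comparing the two simplified forms shows that G(S) = n/(n-1) · G for the same data, so the two differ noticeably only for small n.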
Sometimes the entire Lorenz curve is not known, and only values at certain intervals are given. In that case, the Gini coefficient can be approximated by using various techniques for interpolating the missing values of the Lorenz curve. If (X_k, Y_k) are the known points on the Lorenz curve, with the X_k indexed in increasing order (X_{k-1} < X_k), so that:
* X_k is the cumulative proportion of the population variable, for k = 0,...,n, with X_0 = 0, X_n = 1.
* Y_k is the cumulative proportion of the income variable, for k = 0,...,n, with Y_0 = 0, Y_n = 1.
If the Lorenz curve is approximated on each interval as a line between consecutive points, then the area B can be approximated with trapezoids and:
G_1 = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})\,(Y_k + Y_{k-1})
is the resulting approximation for G. More accurate results can be obtained using other methods to approximate the area B, such as approximating the Lorenz curve with a quadratic function across pairs of intervals, or building an appropriately smooth approximation to the underlying distribution function that matches the known data. If the population mean and boundary values for each interval are also known, these can also often be used to improve the accuracy of the approximation.
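The trapezoid approximation G_1 is short enough to write out directly. The sketch below assumes the known Lorenz points are supplied as two equal-length lists starting at (0, 0) and ending at (1, 1); the function name `gini_trapezoid` and the sample points are made up for illustration.

```python
def gini_trapezoid(x, y):
    """G_1 = 1 - sum_{k=1..n} (X_k - X_{k-1}) * (Y_k + Y_{k-1}).

    x: cumulative population shares, increasing, x[0] = 0 and x[-1] = 1.
    y: cumulative income shares at the same points, y[0] = 0 and y[-1] = 1.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Each term is twice the area of one trapezoid under the Lorenz curve.
    twice_area_b = sum(
        (x[k] - x[k - 1]) * (y[k] + y[k - 1]) for k in range(1, len(x))
    )
    return 1.0 - twice_area_b


if __name__ == "__main__":
    # Hypothetical data: the poorer half of the population holds 25% of income.
    print(gini_trapezoid([0.0, 0.5, 1.0], [0.0, 0.25, 1.0]))
```

Because the true Lorenz curve is convex and lies below the chords between known points, this straight-line interpolation overestimates B and therefore tends to understate G, which is why the smoother approximations mentioned above can be more accurate.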
The Gini coefficient calculated from a sample is a statistic, and its standard error, or confidence intervals for the population Gini coefficient, should be reported. These can be calculated using bootstrap techniques, but those proposed have been mathematically complicated and computationally onerous even in an era of fast computers. Ogwang (2000) made the process more efficient by setting up a “trick regression model” in which the incomes in the sample are ranked, with the lowest income being allocated rank 1. The model then expresses the rank k (the dependent variable) as the sum of a constant A and a normal error term whose variance is inversely proportional to y_k:
k = A + N(0, s^{2}/y_k)
Ogwang showed that G can be expressed as a function of the weighted least squares estimate of the constant A and that this can be used to speed up the calculation of the jackknife estimate for the standard error. Giles (2004) argued that the standard error of the estimate of A can be used to derive that of the estimate of G directly without using a jackknife at all. This method only requires the use of ordinary least squares regression after ordering the sample data. The results compare favorably with the estimates from the jackknife with agreement improving with increasing sample size. The paper describing this method can be found here: http://web.uvic.ca/econ/ewp0202.pdf
However, it has since been argued that this is dependent on the model’s assumptions about the error distributions (Ogwang 2004) and the independence of error terms (Reza & Gastwirth 2006), and that these assumptions are often not valid for real data sets. It may therefore be better to stick with jackknife methods such as those proposed by Yitzhaki (1991) and Karagiannis and Kovacevic (2000). The debate continues.
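To make the regression idea concrete: with error variance inversely proportional to y_k, the weighted least squares weights are proportional to y_k, so the estimate of A is sum(k·y_k)/sum(y_k). Substituting this into the simplified population formula given earlier yields G = 2A/n - (n+1)/n. The sketch below shows that relation together with a plain leave-one-out jackknife standard error. It is a naive O(n²) illustration, not the faster procedures of Ogwang, Yitzhaki, or Karagiannis and Kovacevic; the function names are hypothetical and strictly positive incomes are assumed.

```python
import math


def wls_constant(values):
    """WLS estimate of A in k = A + e_k with Var(e_k) proportional to 1/y_k.

    Minimising sum(y_k * (k - A)^2) gives A = sum(k*y_k) / sum(y_k).
    """
    y = sorted(values)  # lowest income gets rank 1
    return sum(k * v for k, v in enumerate(y, start=1)) / sum(y)


def gini_from_wls(values):
    """G expressed through the WLS constant: G = 2*A/n - (n+1)/n."""
    n = len(values)
    return 2.0 * wls_constant(values) / n - (n + 1) / n


def jackknife_se(values):
    """Plain leave-one-out jackknife standard error of the Gini estimate."""
    values = list(values)
    n = len(values)
    # Recompute G with each observation removed in turn.
    loo = [gini_from_wls(values[:i] + values[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((g - mean_loo) ** 2 for g in loo))


if __name__ == "__main__":
    incomes = [1.0, 2.0, 3.0, 10.0]  # hypothetical sample
    print(gini_from_wls(incomes), jackknife_se(incomes))
```

Recomputing the full estimate n times is wasteful for large samples, which is the cost the regression formulation was meant to reduce.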
The Gini coefficient can be calculated if you know the mean of a distribution, the number of people (or percentiles), and the income of each person (or percentile). Princeton development economist Angus Deaton (1997, 139) simplified the Gini calculation to one easy formula:
G = \frac{N+1}{N-1} - \frac{2}{N(N-1)\,u}\left( \sum_{i=1}^N P_i X_i \right)
where u is the mean income of the population and P_i is the income rank of person i, whose income is X_i, such that the richest person receives a rank of 1 and the poorest a rank of N. This effectively gives higher weight to poorer people in the income distribution, which allows the Gini to meet the Transfer Principle.
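Deaton's formula translates almost line for line into code. The following is a minimal sketch, assuming individual incomes are given as a list with at least two entries and a positive mean; the name `gini_deaton` and the example incomes are illustrative only.

```python
def gini_deaton(incomes):
    """G = (N+1)/(N-1) - 2/(N*(N-1)*u) * sum(P_i * X_i).

    P_i is the income rank, with the richest person ranked 1 and the
    poorest ranked N; u is the mean income.
    """
    x = sorted(incomes, reverse=True)  # richest first, so rank 1 comes first
    n = len(x)
    u = sum(x) / n
    weighted = sum(rank * xi for rank, xi in enumerate(x, start=1))
    return (n + 1) / (n - 1) - 2.0 * weighted / (n * (n - 1) * u)


if __name__ == "__main__":
    print(gini_deaton([5.0, 5.0, 5.0]))       # equal incomes
    print(gini_deaton([0.0, 0.0, 0.0, 1.0]))  # one person holds everything
```

Writing P_i = N + 1 - i for incomes sorted in ascending order shows that this expression is algebraically the same as the sample statistic G(S) given earlier, so the two give identical values on the same data.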