Elementary Data Description
Consider \(N\) numerical observations, \(x_1, x_2,\dots,x_N\), from some quantifiable outcome of interest (e.g., agent wealth, for \(N\) agents). For compactness, refer to the whole sequence as \(x\). We often desire to provide useful summaries of these observations.
Algebra of Counting
Permutations
Consider the numbers in the integer interval \([1\,..\,n]\). How many sequences of length \(k \le n\) can we create from these? That is, how many \(k\)-permutations can we create? The product rule of counting says that if we can do one thing \(n\) ways and independently do another thing \(m\) ways, then we can do the two things in \(n \times m\) ways. Here we can pick the first number \(n\) ways, the second number \((n-1)\) ways, and so on. So the number of \(k\)-permutations is \(n \times (n-1) \times \dots \times (n-k+1)\). Using factorial notation, this is \(n!/(n-k)!\).
Combinations
Sequences are ordered; sets are not. A set of size \(k\) drawn from a set of size \(n\) is called a \(k\)-subset or \(k\)-combination. A sequence of length \(k\) has \(k!\) permutations, so we can compute the number of \(k\)-combinations by deflating the number of \(k\)-permutations by \(k!\). Using factorial notation, this is \(n!/\big((n-k)!\,k!\big)\).
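These counting formulas translate directly into code. Here is a minimal NetLogo sketch (the reporter names are our own, not builtins):

to-report count-permutations [#n #k]
  ;compute n * (n - 1) * ... * (n - #k + 1)
  report reduce [[?acc ?i] -> ?acc * (#n - ?i)] fput 1 (n-values #k [? -> ?])
end

to-report count-combinations [#n #k]
  ;deflate the #k-permutation count by #k!
  report (count-permutations #n #k) / (count-permutations #k #k)
end

For example, count-permutations 5 2 reports \(20\), and count-combinations 5 2 reports \(10\).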
Describing Univariate Data
Sample Mean
Consider \(N\) observations of a single variable. The sample mean is the arithmetic average of the observations.

\[\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N}\]
A capital sigma (\(\sum\)) is often used to indicate a sum over values. So the following expression is equivalent.

\[\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i\]
The mean difference of the observations from their mean value is \(0\). For this reason, the mean is often used as a measure of the central tendency of the distribution producing the observations. However, for the same reason, the mean is sensitive to the presence of a few large values in the data.
Sample Standard Deviation
The sample variance (\(s^2\) or \({\hat\sigma}^2\)) is the mean squared deviation of the observations from the sample mean. For example, given observations \(x_1,\dots,x_N\), we may use the following formula.

\[{\hat\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2\]
The sample variance is often a good measure of how dispersed the data are: a low variance means that the data are more tightly clustered around the mean. The sample standard deviation (\({\hat\sigma}\)) is the square root of the sample variance.
We usually present the standard deviation rather than the variance, because it has the same units as the mean. This makes it easier to interpret.
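As a minimal sketch of these definitions (NetLogo also provides builtin variance and standard-deviation reporters, which may use an \(N-1\) divisor, so here we compute the mean squared deviation directly; the reporter names are our own):

to-report msd-variance [#xs]
  ;mean squared deviation of the observations from the sample mean
  let _xbar mean #xs
  report mean map [? -> (? - _xbar) ^ 2] #xs
end

to-report msd-stdev [#xs]
  ;square root of the mean squared deviation
  report sqrt msd-variance #xs
end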
Sorting the Data: Minimum, Maximum, and Range
It is common to sort numerical data from smallest to largest, which is an ascending sort. After sorting, we have \(x_{(1)}, x_{(2)},\dots,x_{(N)}\), where \(x_{(k)}\) is called the \(k\)-th order statistic. Note that \(x_{(1)}\) is the smallest observation and \(x_{(N)}\) is the largest observation. The smallest observation is called the minimum. The largest observation is called the maximum. Then the range is the difference between the maximum and the minimum: \(x_{(N)} - x_{(1)}\).
Fractiles
Consider a collection of numbers \(x=x_{1},\dots,x_{n}\). For any positive fraction \(f\), let an \(f\)-fractile for the collection be the smallest of these numbers that is greater than or equal to at least \(f n\) of these numbers. In standard mathematical notation, the \(f\)-fractile is \(x_{(\lceil f n \rceil)}\), where \(\lceil x \rceil\) denotes the smallest integer that is at least \(x\). (I.e., non-integer values are rounded up.) So for example, if the values are \(1,\dots,10\), the \(0.90\) fractile is \(9\) but the \(0.91\) fractile is \(10\). In this course, a fractile is always an element of the collection of values.
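The definition translates directly into code. Here is one possible NetLogo sketch (the reporter name fractile is our own):

to-report fractile [#f #xs]
  ;report the #f-fractile: the item with (1-based) rank ceiling (#f * n)
  let _sorted sort #xs
  let _k ceiling (#f * length #xs)
  report item (_k - 1) _sorted
end

For example, fractile 0.90 (n-values 10 [? -> ? + 1]) reports \(9\), while fractile 0.91 of the same list reports \(10\).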
Quantiles
For the rest of this section, to simplify notation, assume the \(x_i\) are already sorted. Now try to partition the sorted numerical data into \(q\) equal-sized groups, such as the first half and the second half (\(2\) groups), or the first, second, and last thirds (\(3\) groups). Numbers that can be used as cutpoints to produce these groups are called \(q\)-quantiles. Note that a quantile may lie between values of the data set.
Common numbers of groups are \(2\) (separated at the median), \(4\) (separated at quartiles), \(5\) (separated at quintiles), and \(10\) (separated at deciles). Unfortunately these cutpoints are not unique, but we deal with that by adopting simple rules. To illustrate, consider the median.
Median
There is only one \(2\)-quantile, which is called the median. The core idea is that the median is the middle value of the sorted data: half of the data lies on either side of the median.
If N
is an odd number,
we cannot partition the dataset into two equal-sized parts.
In this case, define the median to be a point in the dataset.
When there are no duplicates in the dataset,
this means that the number of data points larger than the median
equals the number of data points that are smaller.
For example, if the data points are the first five nonnegative integers,
the median is \(2\).
As another example, if the data points are the first five positive integers,
the median is \(3\).
If there are an even number of data points, define the median to lie halfway between the two middle-most points. For example, if the data points are the first six nonnegative integers, the median is \(2.5\). As another example, if the data points are the first six positive integers, the median is \(3.5\). As a more elaborate example, suppose the observations are \([95, 7, 55, 33, 52, 46, 1, 21, 65, 58]\). Then the sorted observations are \([1, 7, 21, 33, 46, 52, 55, 58, 65, 95]\). So the two middle-most observations are \(46\) and \(52\), and the median (\(49\)) lies halfway between these.
Hint
Find the median of a NetLogo list with the builtin median
reporter.
Quartiles
It is particularly common to divide a collection of numerical data into four groups, whose three cutpoints are called quartiles. The quartiles mark quarter points in the sorted data. Note that the second quartile is the median.
We will use Tukey's method to produce the lower and upper quartiles. This means that, from the sorted data, collect the data up to the median and then find its median; this is called the first quartile (or lower quartile). Then from the sorted data, collect the data from the median onwards and then find its median; this is called the third quartile (or upper quartile). Sometimes the maximum is called the fourth quartile, in which case one may call the minimum the zero-th quartile. The five numbers together are called a five-number summary of the data.
Write a fivenum
function that takes a list of numerical data
and returns (as a list) the five-number summary.
Keep it simple: rely on the definition,
and do not worry about computational efficiency.
Five-Number Summary
Here is a particularly simple implementation of a five-number summary in NetLogo.
Making it more efficient is a good additional exercise.
(However, if efficiency becomes a real concern,
consider using the quantiles
function in
Charles Staelin’s free stats
extension for NetLogo.)
to-report fivenum [#xs]
  let _q2 (median #xs)
  let _q1 median (filter [? -> ? <= _q2] #xs)
  let _q3 median (filter [? -> ? >= _q2] #xs)
  let _q0 (min #xs)
  let _q4 (max #xs)
  report (list _q0 _q1 _q2 _q3 _q4)
end
Box Plot
The five-number summary is often displayed graphically as a box plot, an extremely popular chart for summarizing univariate data. A basic box plot includes a box that designates the quartiles, with a line within the box that marks the median. Box plots typically include whiskers that encompass the full range of the data (possibly excluding outliers), which is why they are also known as box-and-whisker charts. There are many variations on the basic box plot.
As a simple example, suppose the data are the following: \(95, 7, 55, 33, 52, 46, 1, 21, 65, 58\). Then the minimum is \(1\), the maximum is \(95\), and the quartiles are \(21, 49, 58\). The following figure provides a box plot of this information.
Hint
It is fairly standard for spreadsheets to exclude outliers from the whisker range, representing them as separate dots. Outliers typically lie more than 1.5 times the interquartile range past the box limits.
Distribution Tables and Charts for Univariate Data
Another window on univariate data is provided by distribution tables and charts. Distribution tables and charts provide information about the distribution of the values we observe for a variable. The simplest distribution table is the absolute frequency table, which when applied to univariate data is also known as a one-way table. Absolute frequencies are often called counts. The simplest distribution chart, the absolute frequency plot, provides the same information as a plot. These simple devices are relevant when the data can take on only a finite number of values.
One-Way Tables
One-way absolute frequency tables simply list data categories and associated frequencies of occurrence. For example, suppose that in a party-allegiance survey, 100 voters reported their allegiance as \(a_1\) (37), \(a_2\) (26), or \(a_3\) (37). Then a simple one-way table of absolute frequencies could look like the following:
| \(a_1\) | \(a_2\) | \(a_3\) |
|---------|---------|---------|
| 37      | 26      | 37      |
Hint
NetLogo’s table
extension provides a counts
function.
When applied to a list of categorical outcomes,
it returns a table that maps categories to their frequencies of occurrence.
See the discussion in the NetLogo Programming supplement.
Univariate Frequency Charts
A frequency plot displays the number of occurrences of each value in a dataset. One-way tables are frequently illustrated with frequency plots.
For categorical data, the three most common choices are pie charts, bar charts, and pin plots. Figure categoricalFrequencyCharts illustrates our one-way table in these three ways. While the first two are the most popular due to excellent spreadsheet support, the last is generally considered to be most communicative. This is because it contains the fewest irrelevant visual cues, such as color or bar width.
Pearson’s Chi-square Test Statistic
Based on this data, the best guess is that voters are a bit more likely to report themselves as \(a_1\) or \(a_3\) than as \(a_2\). How sure are we that they are not equally likely to report themselves in any of the categories? An answer to this may be provided in terms of the following statistic.

\[z^2 = \sum_i \frac{(c_i - e_i)^2}{e_i}\]
Here \(i\) indexes the categories, the \(c_i\) are the observed counts, and the \(e_i\) are the expected counts. In the present example, \(i\) is the reported allegiance (\(a_1\), \(a_2\), or \(a_3\)), and the observed counts are \((37,26,37)\). When every category is equally likely to be reported, the expected counts are \((100/3,100/3,100/3)\). So in this case we get a Pearson statistic of \(2.42\).
Hint
Pearson Chi-square Statistic
to-report z2pearson [
  #ofreqs ;list, the observed absolute frequencies by category
  #ps     ;the assumed probabilities for each category
  ]
  let _n sum #ofreqs
  let _efreqs map [? -> ? * _n] #ps
  report sum (map [[?1 ?2] -> (?1 - ?2) ^ 2 / ?2] #ofreqs _efreqs)
end
This value is the Pearson chi-square (\(\chi^2\)) test statistic. The statistic is always nonnegative: it is small when the observed frequencies are close to the expected frequencies, and it grows as the observed frequencies fall further from the expected frequencies. In order to turn this statistic into a test, we need to determine how likely we would be to see a value this large.
Suppose the expected counts come from the true distribution, so that \(e_i=p_i N\), where the \(p_i\) are the probabilities and \(N\) is the total number of observations. It turns out that the distribution of this Pearson statistic is well approximated by a widely used distribution, called the \(\chi^2\) distribution. The chi-square distribution depends on a degrees of freedom parameter, which is one less than the number of categories in our Pearson test.
In the present example, the degrees of freedom is \(2\), and the test statistic is \(2.42\). We can look in a chi-square table, such as the online table provided by [NIST.SEMATECH-2013-NIST], to determine how unlikely it is to see a value as big as \(2.42\). Such a table provides a few points from the cumulative distribution function (CDF) for the distribution. For each value \(x\), the CDF shows the probability that the distribution will produce a value no greater than \(x\). Figure basicStatZ2 illustrates the CDF for a chi-square distribution with \(2\) degrees of freedom. It also illustrates our test value on the \(x\)-axis and the associated CDF value on the \(y\)-axis. The distribution produces values lower than ours about 70% of the time and correspondingly produces higher values about 30% of the time.
Relative Rank
Data is called ordinal when it can be unambiguously sorted. Consider sorting an ordinal dataset \(X=(x_1,\dots,x_N)\) (from smallest to largest), producing the ordered collection \((x_{k_1},\dots,x_{k_N})\). Then \(r\) is the simple ascending rank of \(x_{k_r}\). For example, if \(X=(x_1,x_2,x_3)=(6,4,5)\), then the sorted collection is \((x_2,x_3,x_1)=(4,5,6)\). That is, \(x_2\) has rank \(1\) (it is the minimum), \(x_3\) has rank \(2\), and \(x_1\) has rank \(3\) (it is the maximum). The simple ascending rank of each item is its position in the sorted collection.
This notion of simple ranking gives every item a unique rank, so if the dataset contains duplicates, items with equal value have different ranks. We may address this issue in various ways. For example, popular spreadsheets replace the sequential ranks of duplicates with a single value (the smallest).
The approach in this course is to focus on a concept of relative rank that effectively assigns the largest possible rank to all equal-valued items. The relative rank of \(x_i\) is \({}^{\#}\{x \in X \mid x \le x_i\}/N\). (The octothorpe prepended to the set braces produces the size of the collection, which is the total number of items in the collection.) That is, count up the observations that are not larger than \(x_i\), and divide by the total number of observations. So for example, if the values are \(1,\dots,10\), the relative rank of \(9\) is \(9/10\). When the data are numerical, this course uses the notation \(\mathrm{ecdf}[x_i]\) to denote this relative rank.
Two common alternative measures of relative rank are \(i/(N+1)\) and \((i-1)/(N-1)\). (These must be refined when there are duplicate observations.) The first, sometimes called the exclusive definition, lies in the open interval \((0\,..\,1)\). The second, sometimes called the inclusive definition, lies in the closed interval \([0\,..\,1]\). So for example, if the values are \(1,\dots,10\), the exclusive relative rank of \(9\) is \(9/11\), while the inclusive relative rank of \(9\) is \(8/9\). Both of these measures are in common use (e.g., in spreadsheets). They are not used in this course, except as possible byproducts of spreadsheet exercises.
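Returning to this course's definition, relative rank is simple to compute. A minimal NetLogo sketch (the reporter name is our own):

to-report relativeRank [#x #xs]
  ;proportion of the dataset no bigger than #x
  report (length filter [? -> ? <= #x] #xs) / (length #xs)
end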
Cumulative Frequency
The cumulative frequency of a number is the count of dataset members no bigger than that number. In mathematical notation, the cumulative frequency of \(x_i\) is \({}^{\#}\{x \in X \mid x \le x_i\}\).
Hint
A particularly simple approach to determining the cumulative frequency of a number is to count how many data points are left after filtering out those that are bigger than the number.
Cumulative Frequency
to-report cumfreq [#list #x]
  report length filter [? -> ? <= #x] #list
end
CDF
For any real number \(t\), the cumulative distribution function (CDF) of a probability distribution reports the probability that the realized value \(T\) will be no larger than \(t\). That is, \(F[t] = P[T \le t]\).
Empirical CDF
The empirical cumulative distribution function (ECDF) is closely related, but it is built from the relative ranks of items in a numerical dataset. Recall that the relative rank of \(x_i\) in the dataset \(X\) is \({}^{\#}\{x \in X \mid x \le x_i\}/N\). That is, count up all the values in the dataset that are no bigger than \(x_i\), and then scale the result by dividing by \(N\).
Relative rank is defined for every value in the dataset. The empirical CDF extends the computation of relative rank to all numbers, even those not in the dataset, while remaining based entirely on collected data. Given a data sample \(x=(x_1, \dots, x_N)\), define the associated empirical cumulative distribution function (ECDF), \(F[t]\), to be the proportion of sample elements with a value not greater than \(t\). The following function of the data describes the ECDF.

\[F[t] = \frac{{}^{\#}\{x \in X \mid x \le t\}}{N}\]
Recall that the prepended octothorpe produces the cardinality of the collection (i.e., the number of elements). The number of elements that satisfy the inequality is deflated by the total number of elements (\(N\)) to produce the desired proportion. Note that \(F[t]=0\) whenever \(t\) is below the minimum observation, and \(F[t]=1\) whenever \(t\) is above the maximum observation. Additionally, for any number \(x_i\) in the dataset, \(F[x_i]\) is just the relative rank.
Step Plot for Empirical CDF
The algebra of the ECDF, described above, comports with the following method for producing a line chart for the ECDF of the wealths in our model. First, sort the observed values. Next, decide where to start plotting: somewhere below the minimum value. When the sample contains only positive values, it may be appropriate to start plotting with the point \((0,0)\). This is typically the case with income. However, measured wealth may fall to zero, so starting a bit below \(0\) may be more appropriate. For each value of wealth (starting with the smallest), draw a horizontal line to that value at the current height, and then a vertical line of length \(1/N\).
Note that an ECDF is not continuous. There is a “jump” in the function at each sample value. A step plot can illustrate these jumps.
Create a procedure to plot an ECDF, given a list of numbers.
Hint
NetLogo’s sort
function sorts a numerical list in ascending order.
The first element of the result is the smallest element;
it can be retrieved with NetLogo’s first
function.
The last element of the result is the largest element;
it can be retrieved with NetLogo’s last
function.
Recall that the difference of the maximum and the minimum value
is the range of the data.
When drawing an ECDF, the range of the horizontal axis is
conventionally extended slightly beyond the data range.
It may be useful to introduce two local variables,
say x
and y
,
to store the next values to be plotted.
Use foreach
to loop over the data values,
and use plotxy
for plotting.
Here is one possible implementation.
ECDF Plot
to plotECDF [
  #lst ;Real[1..*], the numbers for the ECDF
  ]
  set #lst (sort #lst)
  let _n length #lst
  let _dy (1 / _n)
  ;determine the data range
  let _min (first #lst)
  let _max (last #lst)
  ;extend xrange a bit past the data range
  let _xmin (_min - _dy)
  let _xmax (_max + _dy)
  ;create the plot
  clear-plot
  set-plot-x-range _xmin _xmax
  set-plot-y-range 0 1
  ;draw uniform distribution (for reference)
  create-temporary-plot-pen "pen01"
  set-plot-pen-color gray
  plotxy _min 0
  plotxy _max 1
  ;draw step plot (ECDF)
  create-temporary-plot-pen "pen02"
  let _x _xmin
  let _y 0
  plotxy _x _y
  foreach #lst [? ->
    if (? > _x) [set _x ? plotxy _x _y] ;horizontal segment
    set _y (_y + _dy)
    plotxy _x _y ;vertical segment
  ]
  plotxy _xmax 1
end
Empirical Survival Function
A probability distribution is characterized by its CDF. A probability distribution is equally well characterized by its survival function. If the CDF is \(F[t]\), the survival function is \(S[t] = P[T > t] = 1 - F[t]\). Similarly, if the ECDF of a dataset is \(F[t]\), the empirical survival function is \(S[t] = 1 - F[t]\). Social scientists are equally likely to use a CDF or a survival function to characterize a probability distribution.
Example of an Empirical Survival Function
Consider a famous example: remission times, in weeks, for 21 leukemia patients not receiving any treatment. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/gehan.html.)
1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23
We often interpret the actual relative frequencies of survival as estimates of an underlying distribution. For example, the estimated probability that an individual survives more than 6 weeks is \(S[6] = 12/21\). And, the estimated probability that an individual survives more than 8 weeks is \(S[8] = 8/21\). (These are sometimes called Kaplan-Meier estimates.)
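Computing such estimates from the data is straightforward. Here is a minimal NetLogo sketch (the reporter name is our own) of the empirical survival function:

to-report esf [#t #xs]
  ;proportion of the dataset strictly greater than #t
  report (length filter [? -> ? > #t] #xs) / (length #xs)
end

With the remission-time data in a list, esf 6 reports \(12/21\) and esf 8 reports \(8/21\).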
Lorenz Curve
[Lorenz-1905-JASA] proposed a graphical representation of inequality, now known as the Lorenz curve. When applied to disaggregated wealth data, a Lorenz curve shows the relationship between cumulative population shares and cumulative wealth shares. A Lorenz curve for a population is closely related to the empirical CDF. See the Gift World lecture for an introductory application. This section shows how to build up to a Lorenz curve from simple concepts.
Consider an ECDF of agent wealths. (The application to income or expenditure is exactly the same.) For any level of wealth, the ECDF reports the proportion of the population with that wealth or less. It is more common to ask the following related question: what proportion of total wealth is claimed by the poorest 20% of the households?
The ECDF can be transformed to represent this by taking two steps. First, accumulate the wealths on the horizontal axis as partial sums, and then normalize by total wealth. This axis now represents the cumulative wealth of the poorest, as a fraction of total wealth in the economy, so the horizontal axis is rescaled to \([0\,..\,1]\). The second step is conventional but less important: plot the inverse of the resulting relation by swapping the axes. The result is called a Lorenz curve: it has proportions of the total population on the horizontal axis, with proportions of total wealth on the vertical axis. Points on this curve easily answer questions such as the following: what (poorest) proportion of the population owns 20% of total wealth?
From this description, the core computation can be decomposed into two pieces: producing the partial sums of a sorted list, and deflating them by the list total. As an applied example, using 1993 data for Vietnam, [Haughton.Khandker-2009-Handbook] report total per capita expenditures (in thousands of dong per year) for each quintile to be \((518,756,984,1338,2540)\). The following table converts this data into cumulative proportional expenditures. It also includes a final line stating the inequality gap, which is the gap between the cumulative proportion expected under perfect equality and that seen in the data.
|                        | 1st     | 2nd      | 3rd      | 4th     | 5th   |
|------------------------|---------|----------|----------|---------|-------|
| Quintile Expenditure   | 518     | 756      | 984      | 1338    | 2540  |
| Cumulative Expenditure | 518     | 1274     | 2258     | 3596    | 6136  |
| Cumulative Proportion  | 0.084   | 0.208    | 0.368    | 0.586   | 1.000 |
| Inequality Gap         | 0.11558 | 0.192373 | 0.232008 | 0.21395 |       |
[Lorenz-1905-JASA] proposed a graphical representation of this information.
A Lorenz curve is essentially a plot of the cumulative shares.
Plot the \(N\) cumulative shares as follows.
First, plot \((0,0)\).
Then increment the x
coordinate by \(1/N\)
and plot the first cumulative share.
Next, again increment the x
coordinate by \(1/N\)
and plot the next cumulative share.
Proceed in this manner until all are plotted.
Note that the last point plotted will be \((1,1)\).
Plot along one axis cumulated per cents. of the population from poorest to richest, and along the other the per cent. of the total wealth held by these per cents. of the population.
—Lorenz (1905, p. 217)
Lorenz curves are widely used in many research fields. Economists use them to represent inequality in income or wealth; ecologists use them to measure inequality in size or weight. For example, Figure hk2009lorenz01 displays a Lorenz curve based on the data above, where successive points are connected by straight line segments. The figure adds a triangle, which always contains the Lorenz curve. The Lorenz curve partitions the triangle into two areas, labeled \(A\) and \(B\). When a Lorenz curve lies near the line of equality, there is little inequality, and the area \(A\) is small. When the Lorenz curve lies far below the equality line, there is much inequality, and the area \(A\) is large. We return to this below.
For this exercise, write a function that
produces the Lorenz-curve points from a numerical list.
The computation of partial sums is generally useful,
so it is worth a separate function.
Write a partialSums
function
that produces a list of partial sums from a list of numbers.
Write a cumulativeShares
function
that produces a list of cumulative shares from a list of numbers.
The cumulativeShares
function should depend on the partialSums
function.
Write a pointsLorenz
function
that produces a list of the points on the Lorenz curve.
The pointsLorenz
function should depend on the cumulativeShares
function.
Hint
Perhaps surprisingly,
the conventional approach to partial sums uses reduce.
See the explanation and code in the NetLogo Programming supplement.
The cumulativeShares
function just needs to
produce the partial sums of the sorted list and then
deflate by the list total (or last partial sum).
The pointsLorenz
function must pair these shares
with corresponding \(x\) coordinates.
This can be nicely handled by map.
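For completeness, here is one possible partialSums sketch using reduce; the version in the NetLogo Programming supplement may differ in details.

to-report partialSums [#xs]
  ;accumulate partial sums, seeding reduce with the list [0]
  report butfirst reduce [[?acc ?x] -> lput (?x + last ?acc) ?acc] fput [0] #xs
end

For example, partialSums [518 756 984 1338 2540] reports [518 1274 2258 3596 6136].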
to-report cumulativeShares [
  #ws ;list, nonnegative numbers
  ] ;-> list, the cumulative shares
  let _cumsumws (partialSums sort #ws)
  let _totalw last _cumsumws
  report map [? -> ? / _totalw] _cumsumws
end
to-report pointsLorenz [
  #ws ;list of nonnegative numbers
  ] ;-> list of lists, 2-d coordinates (including [0 0])
  let _n length #ws
  let _xs n-values _n [? -> (? + 1) / _n]
  let _ys cumulativeShares #ws
  let _pts (map [[?1 ?2] -> (list ?1 ?2)] _xs _ys)
  report fput [0 0] _pts
end
Gini Coefficient
The Gini coefficient of inequality is the most commonly reported summary measure of inequality. In terms of the labeled regions in Figure hk2009lorenz01, define the Gini coefficient to be \(A/(A+B)\). By construction, area \(A+B\) is \(1/2\), so the Gini coefficient may also be written as \(2A\) or equivalently as \(1-2B\). Given a Lorenz curve constructed like this one, with straight line segments, the area \(B\) is a collection of \(N\) adjacent trapezoids. Recalling the formula for the area of a trapezoid, write the area \(B\) in terms of the points \((x_i,y_i)\) produced for the Lorenz curve. (Here \((x_0,y_0)=(0,0)\), the extra point at the origin.)
This yields the following formula for the Gini.

\[\mathrm{Gini} = 1 - 2B = 1 - \sum_{i=1}^{N} (x_i - x_{i-1})\,(y_{i-1} + y_i)\]
In the current example, the distance between successive \(x_i\) is \(1/N\), allowing the following simplification of the formula. (Recall that \(y_0=0\) and \(y_N=1\).)

\[\mathrm{Gini} = 1 - \frac{1}{N}\sum_{i=1}^{N} (y_{i-1} + y_i) = 1 + \frac{1}{N} - \frac{2}{N}\sum_{i=1}^{N} y_i = \frac{2}{N}\sum_{i=1}^{N}\left(\frac{i}{N} - y_i\right)\]
The last expression follows from the fact that the sum of the first \(N\) positive integers is \(N(N+1)/2\). Carefully compare it to the Lorenz curve in Figure hk2009lorenz01 to see that it is twice the area \(A\), as computed by the method of trapezoids. Better yet, it may be given an intuitive alternative statement. Define the Lorenz curve inequality gaps as the distances between the cumulative proportions (i.e., the \(y_i\)) and the line of perfect equality. Then this final expression is twice the mean value of the gaps. When applied to the example data above, this produces a Gini coefficient of about \(0.30\).
Create a shares2gini
function that
produces a Gini coefficient from a list of cumulative proportional shares.
Produce a Gini coefficient for the data above
by using this function in combination with the cumulativeShares
function.
Correspondingly, write a gini
function
that produces a Gini coefficient from a numerical list.
This function should depend on the cumulativeShares
function
and the shares2gini
function.
Hint
The n-values
command may be helpful for this exercise.
Also, NetLogo provides a builtin mean
function.
to-report shares2gini [
  #cumShares ;list of *sorted* numbers in unit interval
  ] ;-> number, the Gini coefficient of inequality
  let _n length #cumShares
  let _gaps n-values _n [? -> ((? + 1) / _n) - item ? #cumShares]
  report 2 * mean _gaps
end
to-report gini01 [#xs]
  report shares2gini cumulativeShares #xs
end
The following figure illustrates the evolution of the Lorenz curve and the Gini coefficient over time in the United States.
Sequential Observations
Sometimes data observations have an inherent order, as with time-series observations. Sequential univariate data may be usefully displayed with a run-sequence plot.
Run-Sequence Plot
Chambers et al (1983) describe the run-sequence plot as a convenient summary of a univariate data set. (Univariate data records the outcomes of a single variable.) Outliers are easily detected. If the sequence order has inherent meaning (e.g., successive points in time or space), substantial shifts in mean or variability may be evident. When the run sequence represents observations at successive points in time, we often use the term time-series plot.
References:
Engineering Statistics Handbook http://www.itl.nist.gov/div898/handbook/eda/section3/runseqpl.htm
Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey, (1983), Graphical Methods for Data Analysis, Wadsworth.
Line Chart (Wikipedia: https://en.wikipedia.org/wiki/Line_chart)
Run Chart (Wikipedia: https://en.wikipedia.org/wiki/Run_chart)
Time-Series Plot
This course uses the term time-series plot for any run-sequence plot where the abscissa values somehow indicate the passage of time (e.g., dates). (For an elementary overview of the concept of a time-series plot, see Chapter 4 of [Feldman-2012-OxfordUP].) These time values may be no more than the series indexes. Often we connect the plotted points with line segments, producing a line chart. This course uses the term time-series chart for a chart that contains one or more time-series plots, no matter whether the points are separately plotted or joined with line segments.
Probability Distributions
Pseudo-random numbers are algorithmically generated sequences of numbers that appear to be truly random. Following a common convention, this course simply refers to these as random numbers. Roughly speaking, a probability function for a distribution characterizes the likelihood of specific types of outcomes. Generating an actual outcome from a distribution is called drawing from the distribution. A number representing this draw is a random variate.
Most high-level programming languages provide easy ways to generate random variates from a standard uniform distribution. Depending on the language, users may be expected to generate other distributions from this underlying distribution.
Cumulative Distribution Function
The behavior of a random variable is fully characterized by its cumulative distribution function (CDF). The CDF of a random variable \(X\) is the function \(F_X\) that for any real value \(x\) gives the probability that \(X\) will not be bigger than \(x\): \(F_X[x] = P[X \le x]\).
Bernoulli Distribution
A Bernoulli distribution has two possible outcomes:
0
(failure) or 1
(success).
A Bernoulli probability distribution has one parameter,
the probability of success,
which is traditionally represented as \(p\).
A simple way to summarize this is with the probability mass function (PMF),
which associates a probability with each possible outcome:
either \(0\) or \(1\).
A mathematical representation of a Bernoulli PMF follows.

\[P[1] = p, \qquad P[0] = 1 - p\]
Here, \(P[1]\) means the probability of drawing a \(1\), and \(P[0]\) means the probability of drawing a \(0\). This is the usual algebraic representation of the probability mass function (PMF) of the Bernoulli distribution. It says that there is a probability \(p\) of drawing a \(1\) and a probability \(1-p\) of drawing a \(0\). Implicitly, the probability of drawing any other value is \(0\).
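Drawing Bernoulli variates is simple with a standard uniform generator. A minimal NetLogo sketch (the reporter name is our own):

to-report random-bernoulli [#p]
  ;report 1 with probability #p, else 0
  report ifelse-value (random-float 1 < #p) [1] [0]
end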
Geometric Distribution
The geometric distribution is a discrete distribution on the nonnegative integers. It has a single parameter, the probability of a success, which is usually denoted by \(p\). The probability mass function (PMF) is the following. [1]

\[P[k] = p\,(1-p)^{k} \quad\text{for } k = 0, 1, 2, \dots\]
This course distinguishes between the PMF of a discrete distribution and the analogous probability density function (PDF) of a continuous distribution. Some authors refer to a probability density function (or sometimes, a probability distribution function) in both cases.
This means that adjacent values have a constant ratio (\(1-p\)). We can take advantage of this to compute the distribution mean. To do so, first recall the following argument. Suppose \(f\) is a fraction, lying between \(0\) and \(1\). Let

\[S = 1 + f + f^2 + f^3 + \dots\]

Then

\[f\,S = f + f^2 + f^3 + \dots = S - 1\]

So that

\[S = \frac{1}{1-f}\]

Let us put this observation to work. Recall that the distribution mean (\(\mu\)) is the probability weighted sum of possible outcomes.

\[\mu = \sum_{k=0}^{\infty} k\,p\,(1-p)^k\]

Assuming \(1>p>0\), multiply both sides by \((1-p)\).

\[(1-p)\,\mu = \sum_{k=0}^{\infty} k\,p\,(1-p)^{k+1}\]

Subtracting the two gives

\[p\,\mu = \sum_{k=1}^{\infty} p\,(1-p)^k = p\left(\frac{1}{1-(1-p)} - 1\right) = 1 - p\]

so that \(\mu = (1-p)/p\). (The middle step applies the geometric-series result above with \(f = 1-p\).)
The shifted geometric distribution shifts the entire probability mass function to the right by \(1\): \(P[k] = p\,(1-p)^{k-1}\) for \(k = 1, 2, 3, \dots\). This naturally increases the mean by \(1\) as well, so the mean of the shifted geometric distribution is \(1/p\).
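A shifted-geometric variate can be drawn by counting Bernoulli trials up to and including the first success. A minimal NetLogo sketch (the reporter name is our own):

to-report random-shifted-geometric [#p]
  ;count the Bernoulli trials up to and including the first success
  let _k 1
  while [random-float 1 >= #p] [set _k _k + 1]
  report _k
end

(Subtracting \(1\) from the result yields a draw from the unshifted geometric distribution.)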
Standard Uniform Distribution
The standard uniform distribution draws only from the unit interval (\([0\,..\,1]\)). It is defined as follows. A random variable \(U\) has the standard uniform distribution if

\[F_U[u] = P[U \le u] = \begin{cases} 0 & u < 0 \\ u & 0 \le u \le 1 \\ 1 & u > 1 \end{cases}\]
That is, on the unit interval, \(F_U[u] = u\). We often write this as \(U \sim U[0,1]\).
Functions of Random Variables
Suppose \(X\) is a random variable with CDF \(F_X\). If \(f\) is a real-valued function, then \(Y=f[X]\) is also a random variable.
If \(f\) is strictly increasing, it has a strictly increasing inverse. So we can write

\[F_Y[y] = P[Y \le y] = P[f[X] \le y] = P[X \le f^{-1}[y]] = F_X[f^{-1}[y]]\]
For example, if \(Y=e^{X}\), then

\[F_Y[y] = F_X[\ln y] \quad\text{for } y > 0\]
Probability Transform
Suppose a continuous random variable \(X\) has a strictly increasing CDF \(F_X\). (This means it has a continuous inverse, \(F^{-1}_X\).) Create a new random variable \(U = F_X[X]\), which takes on values in \([0\,..\,1]\). Then for any real number \(u \in [0\,..\,1]\) we have:

\[P[U \le u] = P[F_X[X] \le u] = P[X \le F^{-1}_X[u]] = F_X[F^{-1}_X[u]] = u\]

This means that \(U\) has the standard uniform distribution. This observation provides the basis for transforming standard uniform variates into draws from other distributions, making use of their CDFs. Given a standard uniform random variable \(U\), define \(X = F^{-1}_X[U]\). Then

\[P[X \le x] = P[F^{-1}_X[U] \le x] = P[U \le F_X[x]] = F_X[x]\]

That is, defining \(X = F^{-1}_X[U]\) produces a random variable with the distribution characterized by the CDF \(F_X\).
Continuous Uniform Distribution
Under the continuous uniform distribution \(U[a,b]\), values are equally likely to be drawn from anywhere in the interval \([a\,..\,b]\). The continuous uniform distribution on the interval \([a\,..\,b]\) is characterized by the impossibility of seeing values less than \(a\), the certainty of seeing values less than or equal to \(b\), and

\[F_X[x] = \frac{x-a}{b-a} \quad\text{for } a \le x \le b\]
The inverse of the CDF is

\[F^{-1}_X[u] = a + (b-a)\,u \quad\text{for } u \in [0\,..\,1]\]
Therefore, given a random variable \(U\) with the standard uniform distribution, \(F^{-1}_X[U]\) is uniform in the interval \([a\,..\,b]\).
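In NetLogo, random-float 1 produces a standard uniform variate, so the inverse-CDF transform gives a one-line sketch (the reporter name is our own):

to-report random-uniform [#a #b]
  ;inverse-transform draw from U[a,b]: a + (b - a) * u
  report #a + (#b - #a) * (random-float 1)
end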
Exponential Distribution
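The probability transform applies directly to the exponential distribution, whose CDF is \(F_X[x] = 1 - e^{-\lambda x}\) for \(x \ge 0\), so that \(F^{-1}_X[u] = -\ln(1-u)/\lambda\). A minimal NetLogo sketch follows (the reporter name is our own; note that NetLogo's builtin random-exponential reporter instead takes the mean, \(1/\lambda\), as its argument):

to-report random-exponential-variate [#lambda]
  ;inverse-transform draw: - ln(1 - u) / lambda
  report (0 - ln (1 - random-float 1)) / #lambda
end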
Distributional Differences
Pearson Chi-Square Test for One-Way Goodness of Fit
Recall that the Pearson goodness-of-fit test is based on the difference between the observed frequency (\(c_i\)) and the expected frequency (\(e_i\)) in each of the categories (\(i\in [1\,..\,N]\)).
Suppose you receive the following count data and are told it comes from a shifted geometric distribution with parameter \(p=0.6\). You want to check whether that is plausible by applying a Pearson test.
One problem is that a geometric distribution will have many unobserved possible outcomes. Another problem is that, as a rule of thumb, categories should include at least five observations when applying this test. Address both problems by recoding the data, pooling the last \(4\) categories.
With this data, we produce a Pearson statistic of \(2.93\). Since we are working with \(5\) categories, there are \(4\) degrees of freedom. (Roughly, once you set the counts in four of the categories, the fifth is fully determined.) Under the null hypothesis that the data are drawn from the claimed distribution, the distribution of this Pearson statistic is roughly \(\chi^2_4\). Examining the CDF for the \(\mathrm{ChiSquare}[4]\) distribution, as shown in Figure f:x2cdf4pearson01, we find that a statistic at least this big occurs quite often. So we do not reject the suitability of the claimed distribution for this data.
Mann-Whitney \(U\) Test
The Mann-Whitney \(U\) test, also called the Wilcoxon rank-sum test, is a non-parametric test for the significance of differences in average outcomes. It is a common alternative to a two-sample \(t\) test that tests for a difference in population means. The \(t\) test makes three assumptions.
independence of the two samples
equal population variances
samples are drawn from normal distributions
With large samples, the \(t\) test can provide good guidance even when the assumptions of equal variances or even normality are violated. When the sample is small and the data are skewed, the \(t\) test can mislead.
As long as the samples are from continuous distributions, we can drop the assumption of normality and apply the non-parametric Wilcoxon rank-sum test for independent samples to test for a difference of the medians. The null hypothesis is that the distributions underlying the two samples have equal medians. If there is enough evidence that the medians differ, we reject the null hypothesis.
Consider the following data from Hogg and Tanis (2006, example 8.4-6). Two companies, A and B, sell the same product. We have 8 package-weight observations from each company.
sampleA = {117.1, 121.3, 127.8, 121.9, 117.4, 124.5, 119.5, 115.1}
sampleB = {123.5, 125.3, 126.5, 127.9, 122.1, 125.6, 129.8, 117.2}
Determine if the distribution of weights is the same at each company. Start with a box plot. The data have a similar range but look skewed and thus possibly non-normal. A standard \(t\)-test produces a \(p\)-value of about \(0.058\).
Now compare that to a Wilcoxon test. Here is the idea: if the two samples are from the same distribution, then for one independent draw from each company, each draw has equal probability of being the larger. (The difference should be distributed symmetrically around zero.) To illustrate this, make a list plot of all the possible pairings of the two samples, and add a \(45^\circ\) line. We should find about half the points on each side of the dividing line. However, there are only \(13\) instances where the A value is greater than the B value.
Create a test statistic based on this intuition. Define the following function:

\[h[x,y] = \begin{cases} 1 & \text{if } x > y \\ 1/2 & \text{if } x = y \\ 0 & \text{if } x < y \end{cases}\]
This produces a value of \(1\) when \(x>y\), a value of \(0\) when \(x<y\), and a value of \(1/2\) when they are equal. Sum over the possible pairs to produce the \(U\) statistic for samples \(X\) and \(Y\).

\[U = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} h[x_i, y_j]\]
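As a minimal NetLogo sketch of this double sum (the reporter name is our own):

to-report mann-whitney-u [#xs #ys]
  ;sum h[x, y] over all pairs: 1 if x > y, 1/2 if tied, 0 otherwise
  report sum map [?x ->
    sum map [?y ->
      ifelse-value (?x > ?y) [1] [ifelse-value (?x = ?y) [0.5] [0]]
    ] #ys
  ] #xs
end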
The maximum value of \(U\) is the product of the sample sizes. Under the null hypothesis that \(X\) and \(Y\) are drawn from the same distribution, \(U\) should be about half of that. In the current example, there are 64 pairings and \(U\) is only \(13\). So we can ask, how likely is it to see \(13/64\) when we expect \(32/64\)? Ordinarily we rely on standard statistical software for this computation. In this case, the \(p\)-value is about \(0.041\). So at the \(5\%\) level, we would not reject the null with a \(t\) test but would with a \(U\) test.
Let \(n_x\) be the number of observations in \(X\), and let \(n_y\) be the number of observations in \(Y\). The total number of pairings is \(n_x n_y\), so the expected value of \(U\) is \(\mu_U = n_x n_y / 2\). Computing the standard deviation requires more work, but in the absence of ties,

\[\sigma_U = \sqrt{\frac{n_x n_y (n_x + n_y + 1)}{12}}\]
As the sample sizes become larger, the distribution of \(U\) approaches a normal distribution. If the samples contain more than 20 values, we typically use a normal approximation to the exact \(U\) distribution. We can then look up \((U-\mu_U)/\sigma_U\) in a \(Z\) table. Doing so for the current example produces a (two-sided) \(p\)-value of about \(0.046\). The normal approximation is very good for large samples, and it is computationally faster as well.
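As a sketch, the normal approximation computes the \(z\) score from the formulas above, assuming no ties (the reporter name is our own):

to-report u-z-score [#u #nx #ny]
  ;(U - mu_U) / sigma_U under the null hypothesis
  let _mu (#nx * #ny / 2)
  let _sigma sqrt (#nx * #ny * (#nx + #ny + 1) / 12)
  report (#u - _mu) / _sigma
end

For the example above, u-z-score 13 8 8 reports about \(-2.0\).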
Note that this \(U\) test does not provide information about how far apart the two distributions are. Statistical software will often provide an estimate (with confidence interval) for the difference in location. The estimate is not just the difference in the sample medians, but rather the median of all pairwise differences between the two samples. Expect the confidence interval to be fairly wide when the sample size is small.
References
Hogg, R.V., and E.A. Tanis. (2006) Probability and Statistical Inference, 7th ed. Prentice Hall.
R Core Team. (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
The Wilcoxon Rank Sum Test https://data.library.virginia.edu/the-wilcoxon-rank-sum-test/
Describing Bivariate Data
Scatter Plot
A scatter plot is the most common visual display of the relationship between two variables. When one variable is considered to depend on another, the dependent variable gets the vertical axis.
Covariance and Correlation
See https://onlinestatbook.com/2/describing_bivariate_data/calculation.html
Least-Squares Fit
The relationship between two variables is often summarized with a least-squares fit.
See https://onlinestatbook.com/2/regression/intro.html for a good introduction.
Regression Standard Error
Scatter Plot with Regression Line
Often, the resulting regression line is plotted over a scatter plot.
Conditional Expectation Function (CEF)
Assume one random variable (\(Y\)) is considered to depend on another (\(X\)). Then if we knew \(X_i=x\), we should know something about \(Y_i\). This something is summarized by the conditional expectation function:

\[h[x] = E[Y \mid X = x]\]
- CEF Theorem: The linear least-squares fit provides the best linear approximation of \(h[x]\). (I.e., it minimizes mean squared error.)
Binned Scatter Plot
Scatter plots are less useful for large datasets. In this case, binned scatter plots provide a useful alternative. For a very simple binned scatter plot, assume one variable (\(y\)) is considered to depend on another (\(x\)), and create equal-sized bins based on the values of the independent variable. For each bin, create one point to plot: typically the point whose coordinates are the means of the \(x\) and \(y\) values in the bin.
A binned scatter plot provides a non-parametric estimate of the CEF. (See above.)
Binned Scatter Plot with Regression Line
We have that the regression line is the best linear approximation of the CEF, and that the binned scatter plot provides a non-parametric estimate of the CEF. So it is natural to combine them: plot the regression line for all the data over a binned scatter plot.
Dispersion of the binscatter around the regression line signals a high regression standard error. However, low dispersion of the binscatter around the regression line does not indicate low dispersion of the data around the regression line.
Functional Form
Systematic deviation of the binscatter from the linear regression line can suggest that a nonlinear functional form is more appropriate.
Resources and Explorations
Explorations
Consider a six-sided die that is not quite fair: it is twice as likely to roll \(1\) as to roll any other value. Roll this die 50 times and perform a Pearson chi-square test to determine whether the data allows you to reject the null hypothesis of a fair die.
Write a procedure that takes one argument, a list of wealths or incomes, and plots the corresponding Lorenz curve.
Explain how the pointsLorenz function could be used in the computation of a Gini coefficient of inequality.

Use the income-share data at the Bureau of the Census to plot a Lorenz curve and to compute the associated Gini coefficient of inequality.
Resources
For an extended but accessible discussion of inequality measures, see chapter 6 of [Haughton.Khandker-2009-Handbook]. Note that if one is calculating the Gini index of a sample for the purpose of estimating the value for a larger population, a small correction factor to the method used here may be needed for small samples [Deltas-2003-REStat].
References
Deltas, George. (2003) The Small-Sample Bias of the Gini Coefficient: Results and Implications for Empirical Research. The Review of Economics and Statistics 85, 226--234.
Farris, Frank A. (2010) The Gini Index and Measures of Inequality. The American Mathematical Monthly 117, 851--864. http://www.jstor.org/stable/10.4169/000298910x523344
Feldman, David P. (2012) Chaos and Fractals: An Elementary Introduction. Oxford University Press.
Haughton, Jonathan, and Shahidur R. Khandker. (2009) Handbook on Poverty and Inequality. Washington, DC: World Bank. http://documents.worldbank.org/curated/en/488081468157174849/Handbook-on-poverty-and-inequality
Lorenz, M.O. (1905) Methods of Measuring the Concentration of Wealth. Publications of the American Statistical Association 9, 209--219.
McGill, Robert, John W. Tukey, and Wayne A. Larsen. (1978) Variations of Box Plots. The American Statistician 32, 12--16. http://www.jstor.org/stable/2683468
Wilensky, Uri. (2017) NetLogo 6.01 User Manual.
National Institute of Standards and Technology, and SEMATECH. (2013) NIST/SEMATECH e-Handbook of Statistical Methods. U.S. Department of Commerce. http://www.itl.nist.gov/div898/handbook
Copyright © 2016–2023 Alan G. Isaac. All rights reserved.