Elementary Data Description

Consider \(N\) numerical observations, \(x_1, x_2,\dots,x_N\), from some quantifiable outcome of interest (e.g., agent wealth, for \(N\) agents). For compactness, refer to the whole sequence as \(x\). We often desire to provide useful summaries of these observations.

Algebra of Counting

Permutations

Consider the numbers in the integer interval \([1\,..\,n]\). How many sequences of length \(k \le n\) can we create from these? That is, how many \(k\)-permutations can we create? The product rule of counting says that if we can do one thing \(n\) ways and independently do another thing \(m\) ways, then we can do the two things in \(n \times m\) ways. Here we can pick the first number \(n\) ways, the second number \((n-1)\) ways, and so on. So the number of \(k\)-permutations is \(n \times (n-1) \times \dots \times (n-k+1)\). Using factorial notation, this is \(n!/(n-k)!\).

Combinations

Sequences are ordered; sets are not. A set of size \(k\) drawn from a set of size \(n\) is called a \(k\)-subset or \(k\)-combination. A sequence of length \(k\) has \(k!\) permutations, so we can compute the number of \(k\)-combinations by deflating the number of \(k\)-permutations by \(k!\). Using factorial notation, this is \(n!/\big((n-k)!k!\big)\).
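
These two counting formulas are easy to check computationally. Here is a minimal sketch in Python (a language choice made for illustration only); math.perm and math.comb require Python 3.8 or later.

    from math import comb, factorial, perm

    n, k = 10, 3

    # k-permutations: n!/(n-k)!
    assert perm(n, k) == factorial(n) // factorial(n - k) == 720

    # k-combinations: deflate the permutation count by k!
    assert comb(n, k) == perm(n, k) // factorial(k) == 120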

Describing Univariate Data

Sample Mean

Consider \(N\) observations of a single variable. The sample mean is the arithmetic average of the observations.

\begin{equation*} \bar{x} = \frac{1}{N} (x_{1} + x_{2} + \dots + x_{N}) \end{equation*}

A capital sigma (\(\sum\)) is often used to indicate a sum over values. So the following expression is equivalent.

\begin{equation*} \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_{i} \end{equation*}

The mean difference of the observations from their mean value is \(0\). For this reason, the mean is often used as a measure of the central tendency of the distribution producing the observations. However, for the same reason, the mean is sensitive to the presence of a few large values in the data.

Sample Standard Deviation

The sample variance (\(s^2\) or \({\hat\sigma}^2\)) is the mean squared deviation of the observations from the sample mean. For example, given observations \(x_1,\dots,x_N\), we may use the following formula.

\begin{align*} {\hat\sigma}^{2}_{x} &= \frac{1}{N} \sum_{i=1}^{N} (x_{i}-\bar{x})^2 \\ &= \frac{1}{N} \sum_{i=1}^{N} (x_{i}^2 - 2 x_{i} \bar{x}+\bar{x}^2) \\ &= \frac{1}{N} \sum_{i=1}^{N} x_{i}^2 - \bar{x}^2 \end{align*}

The sample variance is often a good measure of how dispersed the data are: a low variance means that the data are more tightly clustered around the mean. The sample standard deviation (\({\hat\sigma}\)) is the square root of the sample variance.

\begin{equation*} {\hat\sigma}_{x} = \sqrt{{\hat\sigma}^{2}_{x}} \end{equation*}

We usually present the standard deviation rather than the variance, because it has the same units as the mean. This makes it easier to interpret.
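
As a quick illustration, here is a minimal Python sketch that computes the sample mean, the sample variance (by the definition and by the shortcut formula derived above, both using the \(1/N\) convention of this section), and the sample standard deviation. The variable names are illustrative.

    import math

    x = [95, 7, 55, 33, 52, 46, 1, 21, 65, 58]
    N = len(x)

    xbar = sum(x) / N                                    # sample mean
    var = sum((xi - xbar) ** 2 for xi in x) / N          # mean squared deviation
    var_shortcut = sum(xi ** 2 for xi in x) / N - xbar ** 2
    sd = math.sqrt(var)                                  # sample standard deviation

    assert abs(var - var_shortcut) < 1e-9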

Sorting the Data: Minimum, Maximum, and Range

It is common to sort numerical data from smallest to largest, which is an ascending sort. After sorting, we have \(x_{(1)}, x_{(2)},\dots,x_{(N)}\), where \(x_{(k)}\) is called the \(k\)-th order statistic. Note that \(x_{(1)}\) is the smallest observation and \(x_{(N)}\) is the largest observation. The smallest observation is called the minimum. The largest observation is called the maximum. Then the range is the difference between the maximum and the minimum: \(x_{(N)} - x_{(1)}\).

Fractiles

Consider a collection of numbers \(x=x_{1},\dots,x_{n}\). For any positive fraction \(f\), let an \(f\)-fractile for the collection be the smallest of these numbers that is greater than or equal to at least \(f n\) of these numbers. In standard mathematical notation, the \(f\)-fractile is \(x_{(\lceil f n \rceil)}\), where \(\lceil x \rceil\) denotes the smallest integer that is at least \(x\). (I.e., non-integer values are rounded up.) So for example, if the values are \(1,\dots,10\), the \(0.90\) fractile is \(9\) but the \(0.91\) fractile is \(10\). In this course, a fractile is always an element of the collection of values.
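
A minimal Python sketch of this definition, checked against the example above (the function name fractile is illustrative):

    import math

    def fractile(xs, f):
        """The f-fractile: the ceil(f*n)-th order statistic (1-based)."""
        xs = sorted(xs)
        return xs[math.ceil(f * len(xs)) - 1]   # subtract 1 for 0-based indexing

    assert fractile(range(1, 11), 0.90) == 9
    assert fractile(range(1, 11), 0.91) == 10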

Quantiles

For the rest of this section, to simplify notation, assume the \(x_i\) are already sorted. Now try to partition the sorted numerical data into \(q\) equal-sized groups, such as the first half and the second half (\(2\) groups), or the first, second, and last thirds (\(3\) groups). Numbers that can be used as cutpoints to produce these groups are called \(q\)-quantiles. Note that a quantile may lie between values of the data set.

Common numbers of groups are \(2\) (separated at the median), \(4\) (separated at quartiles), \(5\) (separated at quintiles), and \(10\) (separated at deciles). Unfortunately these numbers are not unique, but we deal with that by adopting simple rules. To illustrate, consider the median.

Median

There is only one \(2\)-quantile, which is called the median. The core idea is that the median is the middle value of the sorted data: half of the data lies on either side of the median.

If \(N\) is an odd number, we cannot partition the dataset into two equal-sized parts. In this case, define the median to be a point in the dataset. When there are no duplicates in the dataset, this means that the number of data points larger than the median equals the number of data points that are smaller. For example, if the data points are the first five nonnegative integers, the median is \(2\). As another example, if the data points are the first five positive integers, the median is \(3\).

If there is an even number of data points, define the median to lie halfway between the two middle-most points. For example, if the data points are the first six nonnegative integers, the median is \(2.5\). As another example, if the data points are the first six positive integers, the median is \(3.5\). As a more elaborate example, suppose the observations are \([95, 7, 55, 33, 52, 46, 1, 21, 65, 58]\). Then the sorted observations are \([1, 7, 21, 33, 46, 52, 55, 58, 65, 95]\). So the two middle-most observations are \(46\) and \(52\), and the median (\(49\)) lies halfway between these.

Figure (Median on a Number Line): the median is \(49\), based on the artificial data \(95, 7, 55, 33, 52, 46, 1, 21, 65, 58\).

Quartiles

It is particularly common to divide a collection of numerical data into four groups, whose three cutpoints are called quartiles. The quartiles mark quarter points in the sorted data. Note that the second quartile is the median.

We will use Tukey's method to produce the lower and upper quartiles. This means that, from the sorted data, collect the data up to the median and then find its median; this is called the first quartile (or lower quartile). Then from the sorted data, collect the data from the median onwards and then find its median; this is called the third quartile (or upper quartile). Sometimes the maximum is called the fourth quartile, in which case one may call the minimum the zero-th quartile. The five numbers together are called a five-number summary of the data.

Write a fivenum function that takes a list of numerical data and returns (as a list) the five-number summary. Keep it simple: rely on the definition, and do not worry about computational efficiency.
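
One possible sketch in Python, applying Tukey's method as described above (for odd \(N\), the median belongs to both halves):

    def median(xs):
        """Middle value of the sorted data (mean of the two middle values if N is even)."""
        xs = sorted(xs)
        n, mid = len(xs), len(xs) // 2
        return xs[mid] if n % 2 == 1 else (xs[mid - 1] + xs[mid]) / 2

    def fivenum(xs):
        """Five-number summary: min, lower quartile, median, upper quartile, max."""
        xs = sorted(xs)
        n = len(xs)
        half = (n + 1) // 2   # each half includes the median when n is odd
        lower, upper = xs[:half], xs[n - half:]
        return [xs[0], median(lower), median(xs), median(upper), xs[-1]]

    print(fivenum([95, 7, 55, 33, 52, 46, 1, 21, 65, 58]))   # [1, 21, 49.0, 58, 95]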

Box Plot

The five-number summary is often displayed graphically as a box plot. This is an extremely popular chart for summarizing univariate data. A basic box plot includes a box that designates the quartiles, with a line within the box that marks the median. Box plots typically include whiskers that encompass the full range of the data (possibly excluding outliers), which is why they are also known as box-and-whiskers charts. There are many variations on the basic box plot.

As a simple example, suppose the data is the following: \(95, 7, 55, 33, 52, 46, 1, 21, 65, 58\). Then the minimum is \(1\), the maximum is \(95\), and the quartiles are \(21, 49, 58\). The following figure provides a box plot of this information.

Figure (Box Plot): based on the artificial data \(95, 7, 55, 33, 52, 46, 1, 21, 65, 58\).

Distribution Tables and Charts for Univariate Data

Another window on univariate data is provided by distribution tables and charts. Distribution tables and charts provide information about the distribution of the values we observe for a variable. The simplest distribution table is the absolute frequency table, which when applied to univariate data is also known as a one-way table. Absolute frequencies are often called counts. The simplest distribution chart, the absolute frequency plot, provides the same information in graphical form. These simple devices are relevant when the data take on a finite number of values.

One-Way Tables

One-way absolute frequency tables simply list data categories and associated frequencies of occurrence. For example, suppose that in a party-allegiance survey, 100 voters reported their allegiance as \(a_1\) (37), \(a_2\) (26), or \(a_3\) (37). Then a simple one-way table of absolute frequencies could look like the following:

Party Allegiance (Counts)

    Allegiance    \(a_1\)    \(a_2\)    \(a_3\)
    Count         37         26         37

Univariate Frequency Charts

A frequency plot displays the number of occurrences of each value in a dataset. One-way tables are frequently illustrated with frequency plots.

For categorical data, the three most common choices are pie charts, bar charts, and pin plots. Figure categoricalFrequencyCharts illustrates our one-way table in these three ways. While the first two are the most popular due to excellent spreadsheet support, the last is generally considered to be most communicative. This is because it contains the fewest irrelevant visual cues, such as color or bar width.

Figure (categoricalFrequencyCharts): Pie Chart, Bar Chart, and Pin Plot. (images/categoricalFrequencyCharts.svg)

Pearson’s Chi-square Test Statistic

Based on this data, the best guess is that voters are a bit more likely to report themselves as \(a_1\) or \(a_3\) than as \(a_2\). How sure are we that they are not equally likely to report themselves in any of the categories? An answer to this may be provided in terms of the following statistic.

\begin{equation*} P = \sum_i \frac{(c_i - e_i)^2}{e_i} \end{equation*}

Here \(i\) indexes the categories, the \(c_i\) are the observed counts, and the \(e_i\) are the expected counts. In the present example, \(i\) is the reported allegiance (\(a_1\), \(a_2\), or \(a_3\)), and the observed counts are \((37,26,37)\). When every category is equally likely to be reported, the expected counts are \((100/3,100/3,100/3)\). So in this case we get a Pearson statistic of \(2.42\).

\begin{equation*} P = \frac{(37-100/3)^2}{100/3} + \frac{(26-100/3)^2}{100/3} + \frac{(37-100/3)^2}{100/3} = 2.42 \end{equation*}

This value is the Pearson chi-square (\(\chi^2\)) test statistic. The statistic is always positive: it is small when the observed frequencies are close to the expected frequencies, and it grows as the observed frequencies fall further from the expected frequencies. In order to turn this statistic into a test, we need to determine how likely it is that we would see such a big value.

Suppose the expected counts come from the true distribution, so that \(e_i=p_i N\), where the \(p_i\) are the probabilities and \(N\) is the total number of observations. It turns out that the distribution of this Pearson statistic is well approximated by a widely used distribution, called the \(\chi^2\) distribution. The chi-square distribution depends on a degrees of freedom parameter, which is one less than the number of categories in our Pearson test.

In the present example, the degrees of freedom is \(2\), and the test statistic is \(2.42\). We can look in a chi-square table, such as the online table provided by [NIST.SEMATECH-2013-NIST], to determine how unlikely it is to see a value as big as \(2.42\). Such a table provides a few points from the cumulative distribution function (CDF) for the distribution. For each value \(x\), the CDF shows the probability that the distribution will produce a value no greater than \(x\). Figure basicStatZ2 illustrates the CDF for a chi-square distribution with \(2\) degrees of freedom. It also illustrates our test value on the \(x\)-axis and the associated CDF value on the \(y\)-axis. The distribution produces values lower than ours about 70% of the time and correspondingly produces higher values about 30% of the time.

Figure (basicStatZ2): CDF of the chi-square distribution with \(2\) degrees of freedom, with the test statistic marked.
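
Statistical libraries automate this computation. A minimal sketch using Python with scipy (both assumptions of this sketch, not course requirements):

    from scipy import stats

    counts = [37, 26, 37]
    expected = [100 / 3] * 3    # equal likelihood across the three categories

    statistic, pvalue = stats.chisquare(counts, f_exp=expected)
    print(statistic)   # about 2.42
    print(pvalue)      # about 0.30, matching the CDF reading above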

Relative Rank

Data is called ordinal when it can be unambiguously sorted. Consider sorting an ordinal dataset \(X=(x_1,\dots,x_N)\) (from smallest to largest), producing the new ordered collection \((x_{k_1},\dots,x_{k_N})\). Then \(r\) is the simple ascending rank of \(x_{k_r}\). For example, if \(X=(x_1,x_2,x_3)=(6,4,5)\), then the sorted collection is \((x_2,x_3,x_1)=(4,5,6)\). That is, \(x_2\) has rank \(1\) (it is the minimum), \(x_3\) has rank \(2\), and \(x_1\) has rank \(3\) (it is the maximum). The simple ascending rank of each item is its position in the sorted collection.

This notion of simple ranking gives every item a unique rank, so if the dataset contains duplicates, items with equal value have different ranks. We may address this issue in various ways. For example, popular spreadsheets replace the sequential ranks of duplicates with a single value (the smallest).

The approach in this course is to focus on a concept of relative rank that effectively assigns the largest possible rank to all equal-valued items. The relative rank of \(x_i\) is \({}^\#\{x \in X | x \le x_i\}/N\). (The octothorpe prepended to the set braces produces the size of the collection, which is the total number of items in the collection.) That is, count up the observations that are not larger than \(x_i\), and divide by the total number of observations. So for example, if the values are \(1,\dots,10\), the relative rank of \(9\) is \(9/10\). When the data are numerical, this course uses the notation \(\mathrm{ecdf}[x_i]\) to denote this relative rank.

Two common alternative measures of relative rank are \(i/(N+1)\) and \((i-1)/(N-1)\). (These must be refined when there are duplicate observations.) The first, sometimes called the exclusive definition, lies in the open interval \((0\,..\,1)\). The second, sometimes called the inclusive definition, lies in the closed interval \([0\,..\,1]\). So for example, if the values are \(1,\dots,10\), the exclusive relative rank of \(9\) is \(9/11\), while the inclusive relative rank of \(9\) is \(8/9\). Both of these measures are in common use (e.g., in spreadsheets). They are not used in this course, except as possible byproducts of spreadsheet exercises.

Cumulative Frequency

The cumulative frequency of a number is the count of dataset members no bigger than that number. In mathematical notation, the cumulative frequency of \(x_i\) is \({}^\#\{x \in X | x \le x_i\}\).

CDF

For any real number \(t\), the cumulative distribution function (CDF) of a probability distribution reports the probability that the realized value \(T\) will be no larger than \(t\). That is, \(F[t] = P[T \le t]\).

Empirical CDF

The empirical cumulative distribution function (ECDF) is closely related, but it is built from the relative ranks of items in a numerical dataset. Recall that the relative rank of \(x_i\) in the dataset \(X\) is \({}^\#\{x \in X | x \le x_i\}/N\). That is, count up all the values in the dataset that are no bigger than \(x_i\), and then scale the result by dividing by \(N\).

Relative rank is defined only for values in the dataset. The empirical CDF extends the computation of relative rank to a consideration of all numbers, even those not in the dataset, while remaining based entirely on collected data. That is, given a data sample \(x=(x_1, \dots, x_N)\), define the associated empirical cumulative distribution function \(F[t]\) to be the proportion of sample elements with a value not greater than \(t\). The following function of the data describes the ECDF.

\begin{equation*} F[t] = \frac{{}^{\#}\{x_n | x_n \le t\}}{N} \end{equation*}

Recall that the prepended octothorpe produces the cardinality of the collection (i.e., the number of elements). The number of elements that satisfy the inequality is deflated by the total number of elements (\(N\)) to produce the desired proportion. Note that \(F[t]=0\) whenever \(t\) is below the minimum observation, and \(F[t]=1\) whenever \(t\) is above the maximum observation. Additionally, for any number \(x_i\) in the dataset, \(F[x_i]\) is just the relative rank.

Step Plot for Empirical CDF

The algebra of the ECDF, described above, comports with the following method for producing a line chart for the ECDF of the wealths in our model. First, sort the observed values. Next, decide where to start plotting: somewhere below the minimum value. When the sample contains only positive values, it may be appropriate to start plotting at the point \((0,0)\). This is typically the case with income. However, measured wealth may fall to zero, so starting a bit below \(0\) may be more appropriate. For each value of wealth (starting with the smallest), draw a horizontal line to that value at the current height, and then a vertical line of length \(1/N\).

Note that an ECDF is not continuous. There is a “jump” in the function at each sample value. A step plot can illustrate these jumps.

Create a procedure to plot an ECDF, given a list of numbers.
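
A minimal sketch in Python with matplotlib (both assumptions here; any plotting library with a step-plot primitive will do):

    import matplotlib.pyplot as plt

    def plot_ecdf(xs):
        """Step plot of the ECDF: the height rises by 1/N at each sorted value."""
        xs = sorted(xs)
        n = len(xs)
        heights = [(i + 1) / n for i in range(n)]
        plt.step(xs, heights, where="post")   # hold each height until the next jump
        plt.xlabel("value")
        plt.ylabel("ECDF")
        plt.show()

    plot_ecdf([95, 7, 55, 33, 52, 46, 1, 21, 65, 58])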

Empirical Survival Function

A probability distribution is characterized by its CDF. It is equally well characterized by its survival function. If the CDF is \(F[t]\), the survival function is \(S[t] = P[T > t] = 1 - F[t]\). Similarly, if the ECDF of a dataset is \(F[t]\), the empirical survival function is \(S[t] = 1 - F[t]\). Social scientists are equally likely to use a CDF or a survival function to characterize a probability distribution.

Example of an Empirical Survival Function

Consider a famous example: remission times, in weeks, for 21 leukemia patients not receiving any treatment. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/gehan.html.)

1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23

We often interpret the actual relative frequencies of survival as estimates of an underlying distribution. For example, the estimated probability that an individual survives more than 6 weeks is \(S[6] = 12/21\). And, the estimated probability that an individual survives more than 8 weeks is \(S[8] = 8/21\). (These are sometimes called Kaplan-Meier estimates.)
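
A minimal Python sketch of the empirical survival function, checked against the two estimates above (the function name esf is illustrative):

    def esf(xs, t):
        """Empirical survival function: the proportion of observations greater than t."""
        return sum(1 for x in xs if x > t) / len(xs)

    remission = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
    assert esf(remission, 6) == 12 / 21
    assert esf(remission, 8) == 8 / 21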

Lorenz Curve

[Lorenz-1905-JASA] proposed a graphical representation of inequality, now known as the Lorenz curve. When applied to disaggregated wealth data, a Lorenz curve shows the relationship between cumulative population shares and cumulative wealth shares. A Lorenz curve for a population is closely related to the empirical CDF. See the Gift World lecture for an introductory application. This section shows how to build up to a Lorenz curve from simple concepts.

Consider an ECDF of agent wealths. (The application to income or expenditure is exactly the same.) For any level of wealth, the ECDF reports the proportion of the population with that wealth or less. It is more common to ask the following related question: what proportion of total wealth is claimed by the poorest 20% of the households?

The ECDF can be transformed to represent this in two steps. First, accumulate the wealths on the horizontal axis as partial sums, and then normalize by total wealth. This axis now represents the cumulative wealth of the poorest, as a fraction of total wealth in the economy, so the horizontal axis is rescaled to \([0\,..\,1]\). The second step is conventional but less important: plot the inverse of the resulting relation by swapping the axes. The result is called a Lorenz curve: it has proportions of the total population on the horizontal axis, with proportions of total wealth on the vertical axis. Points on this curve easily answer questions such as the following: what (poorest) proportion of the population owns 20% of total wealth?

From this description, the core computation can be decomposed into two pieces: producing the partial sums of a sorted list, and deflating them by the list total. As an applied example, using 1993 data for Vietnam, [Haughton.Khandker-2009-Handbook] report total per capita expenditures (in thousands of dong per year) for each quintile to be \((518,756,984,1338,2540)\). The following table converts this data into cumulative proportional expenditures. It also includes a final line stating the inequality gap, which is the gap between the cumulative proportion expected under perfect equality and that seen in the data.

Expenditure by Population Quintile

    Quintile                  1st      2nd      3rd      4th      5th
    Quintile Expenditure      518      756      984      1338     2540
    Cumulative Expenditure    518      1274     2258     3596     6136
    Cumulative Proportion     0.084    0.208    0.368    0.586    1.000
    Inequality Gap            0.116    0.192    0.232    0.214    0.000

[Lorenz-1905-JASA] proposed a graphical representation of this information. A Lorenz curve is essentially a plot of the cumulative shares. Plot the \(N\) cumulative shares as follows. First, plot \((0,0)\). Then increment the \(x\) coordinate by \(1/N\) and plot the first cumulative share. Next, again increment the \(x\) coordinate by \(1/N\) and plot the next cumulative share. Proceed in this manner until all are plotted. Note that the last point plotted will be \((1,1)\).

Plot along one axis cumulated per cents. of the population from poorest to richest, and along the other the per cent. of the total wealth held by these per cents. of the population.

—Lorenz (1905, p. 217)

Lorenz curves are widely used in many research fields. Economists use them to represent inequality in income or wealth; ecologists use them to measure inequality in size or weight. For example, Figure hk2009lorenz01 displays a Lorenz curve based on the data above, where successive points are connected by straight line segments. The figure adds a triangle, which always contains the Lorenz curve. The Lorenz curve partitions the triangle into two areas, labeled \(A\) and \(B\). When a Lorenz curve lies near the line of equality, there is little inequality, and the area \(A\) is small. When the Lorenz curve lies far below the equality line, there is much inequality, and the area \(A\) is large. We return to this below.

Figure (hk2009lorenz01): Lorenz curve for the quintile expenditures \((518,756,984,1338,2540)\), Vietnam, 1993. Data source: [Haughton.Khandker-2009-Handbook], p. 103.

For this exercise, write a function that produces the Lorenz-curve points from a numerical list. The computation of partial sums is generally useful, so it is worth a separate function. Write a partialSums function that produces a list of partial sums from a list of numbers. Write a cumulativeShares function that produces a list of cumulative shares from a list of numbers. The cumulativeShares function should depend on the partialSums function. Write a pointsLorenz function that produces a list of the points on the Lorenz curve. The pointsLorenz function should depend on the cumulativeShares function.
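
Here is one possible sketch in Python, following the decomposition in the exercise:

    def partialSums(xs):
        """Running totals of a list of numbers."""
        total, sums = 0, []
        for x in xs:
            total += x
            sums.append(total)
        return sums

    def cumulativeShares(xs):
        """Partial sums of the sorted data, deflated by the total."""
        sums = partialSums(sorted(xs))
        total = sums[-1]
        return [s / total for s in sums]

    def pointsLorenz(xs):
        """Lorenz-curve points, starting from the origin."""
        shares = cumulativeShares(xs)
        n = len(shares)
        return [(0, 0)] + [((i + 1) / n, s) for i, s in enumerate(shares)]

    # Vietnam quintile expenditures (1993): x steps by 1/5; y gives cumulative shares.
    print(pointsLorenz([518, 756, 984, 1338, 2540]))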

Gini Coefficient

The Gini coefficient of inequality is the most commonly reported summary measure of inequality. In terms of the labeled regions in Figure hk2009lorenz01, define the Gini coefficient to be \(A/(A+B)\). By construction, area \(A+B\) is \(1/2\), so the Gini coefficient may also be written as \(2A\) or equivalently as \(1-2B\). Given a Lorenz curve constructed like this one, with straight line segments, the area \(B\) is a collection of \(N\) adjacent trapezoids. Recalling the formula for the area of a trapezoid, write the area \(B\) in terms of the points \((x_i,y_i)\) produced for the Lorenz curve. (Here \((x_0,y_0)=(0,0)\), the extra point at the origin.)

\begin{equation*} B = \sum_{i=1}^{N} \frac{1}{2} (x_i - x_{i-1})(y_i + y_{i-1}) \end{equation*}

This yields the following formula for the Gini coefficient.

\begin{equation*} \textrm{gini} = 1 - \sum_{i=1}^{N} (x_i - x_{i-1})(y_i + y_{i-1}) \end{equation*}

In the current example, the distance between successive \(x_i\) is \(1/N\), allowing the following simplification of the formula. (Recall that \(y_0=0\) and \(y_N=1\).)

\begin{align*} \textrm{gini} &= 1 - \frac{1}{N} \sum_{i=1}^{N} (y_i + y_{i-1}) \\ &= 1 - \frac{1}{N} \big(-1 + 2 \sum_{i=1}^{N} y_i \big) \\ &= \frac{1}{N} \big( N + 1 - 2 \sum_{i=1}^{N} y_i \big) \\ &= \frac{1}{N} \big(2\frac{N(N + 1)/2}{N} - 2 \sum_{i=1}^{N} y_i \big) \\ &= 2 \frac{1}{N} \big( \sum_{i=1}^{N}(i/N - y_i) \big) \end{align*}

The last expression follows from the fact that the sum of the first \(N\) positive integers is \(N(N+1)/2\). Carefully compare it to the Lorenz curve in Figure hk2009lorenz01 to see that it is twice the area \(A\), as computed by the method of trapezoids. Better yet, it may be given an intuitive alternative statement. Define the Lorenz curve inequality gaps as the distances between the cumulative proportions (i.e., the \(y_i\)) and the line of perfect equality. Then this final expression is twice the mean value of the gaps. When applied to the example data above, this produces a Gini coefficient of about \(0.30\).

Create a shares2gini function that produces a Gini coefficient from a list of cumulative proportional shares. Produce a Gini coefficient for the data above by using this function in combination with the cumulativeShares function. Correspondingly, write a gini function that produces a Gini coefficient from a numerical list. This function should depend on the cumulativeShares function and the shares2gini function.
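
A possible sketch, reusing the cumulativeShares function from the previous sketch:

    def shares2gini(shares):
        """Gini coefficient: twice the mean inequality gap."""
        n = len(shares)
        gaps = [(i + 1) / n - y for i, y in enumerate(shares)]
        return 2 * sum(gaps) / n

    def gini(xs):
        """Gini coefficient of a numerical list."""
        return shares2gini(cumulativeShares(xs))

    print(gini([518, 756, 984, 1338, 2540]))   # about 0.30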

The following figure illustrates the evolution of the Lorenz curve and the Gini coefficient over time in the United States.

Figure: Lorenz curves for household income (U.S.), based on Census data.

Sequential Observations

Sometimes data observations have an inherent order, as with time-series observations. Sequential univariate data may be usefully displayed with a run-sequence plot.

Run-Sequence Plot

Chambers et al (1983) describe the run-sequence plot as a convenient summary of a univariate data set. (Univariate data records the outcomes of a single variable.) Outliers are easily detected. If the sequence order has inherent meaning (e.g., successive points in time or space), substantial shifts in mean or variability may be evident. When the run sequence represents observations at successive points in time, we often use the term time-series plot.


Time-Series Plot

This course uses the term time-series plot for any run-sequence plot where the abscissa values somehow indicate the passage of time (e.g., dates). (For an elementary overview of the concept of a time-series plot, see Chapter 4 of [Feldman-2012-OxfordUP].) These time values may be no more than the series indexes. Often we connect the plotted points with line segments, producing a line chart. This course uses the term time-series chart for a chart that contains one or more time-series plots, no matter whether the points are separately plotted or joined with line segments.

Probability Distributions

Pseudo-random numbers are algorithmically generated sequences of numbers that appear to be truly random. Following a common convention, this course simply refers to these as random numbers. Roughly speaking, a probability function for a distribution characterizes the likelihood of specific types of outcomes. Generating an actual outcome from a distribution is called drawing from the distribution. A number representing this draw is a random variate.

Most high-level programming languages provide easy ways to generate random variates from a standard uniform distribution. Depending on the language, users may be expected to generate other distributions from this underlying distribution.

Cumulative Distribution Function

The behavior of a random variable is fully characterized by its cumulative distribution function (CDF). The CDF of a random variable \(X\) is the function \(F_X\) that for any real value \(x\) gives the probability that \(X\) will not be bigger than \(x\).

\begin{equation*} F_X[x] = P[X \le x] \end{equation*}

Bernoulli Distribution

A Bernoulli distribution has two possible outcomes: 0 (failure) or 1 (success). A Bernoulli probability distribution has one parameter, the probability of success, which is traditionally represented as \(p\). A simple way to summarize this is with the probability mass function (PMF), which associates a probability with each possible outcome: either \(0\) or \(1\). A mathematical representation of a Bernoulli PMF follows.

\begin{align*} P[1] &= p \\ P[0] &= (1 - p) \end{align*}

Here, \(P[1]\) means the probability of drawing a \(1\), and \(P[0]\) means the probability of drawing a \(0\): there is a probability \(p\) of drawing a \(1\) and a probability \(1-p\) of drawing a \(0\). Implicitly, the probability of drawing any other value is \(0\).
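
A Bernoulli draw is easy to produce from a standard uniform variate, a construction made precise later in this section. A minimal Python sketch:

    import random

    def bernoulli(p):
        """Bernoulli draw: a standard uniform variate falls below p with probability p."""
        return 1 if random.random() < p else 0

    draws = [bernoulli(0.3) for _ in range(100_000)]
    print(sum(draws) / len(draws))   # close to 0.3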

Geometric Distribution

The geometric distribution is a discrete distribution on the nonnegative integers. It has a single parameter, the probability of a success, which is usually denoted by \(p\). The probability mass function (PMF) is the following.

\begin{equation*} P_K[k] = (1-p)^{k} p \qquad\qquad k=0,1,2,\dots \end{equation*}

This means that adjacent values have a constant ratio (\((1-p)\)). We can take advantage of this to compute the distribution mean. To do so, first recall the following argument. Suppose \(f\) is a fraction, lying between \(0\) and \(1\). Let

\begin{equation*} S = \sum_{i=0}^{\infty}f^{i} \end{equation*}

Then

\begin{align*} S-fS &= \sum_{i=0}^{\infty}f^{i} - f \sum_{i=0}^{\infty}f^{i} \\ &= 1 + \sum_{i=1}^{\infty}f^{i} - \sum_{i=1}^{\infty}f^{i} \\ &= 1 \end{align*}

So that

\begin{equation*} S=1/(1-f) \end{equation*}

Let us put this observation to work. Recall that the distribution mean (\(\mu\)) is the probability weighted sum of possible outcomes.

\begin{equation*} \mu = \sum_{k=0}^{\infty} k (1-p)^{k} p \end{equation*}

Assuming \(1>p>0\), multiply both sides by \((1-p)\).

\begin{align*} (1-p) \mu &= \sum_{k=0}^{\infty} k (1-p)^{k+1} p \\ &= \sum_{k=1}^{\infty} (k-1) (1-p)^{k} p \end{align*}

Subtracting the two gives

\begin{align*} \mu - (1-p)\mu &= \sum_{k=1}^{\infty} (1-p)^{k} p \\ p\mu &= p(1-p)\sum_{k=0}^{\infty} (1-p)^{k} \\ \mu &= (1-p)/p \end{align*}

The shifted geometric distribution shifts the entire probability mass function to the right by \(1\).

\begin{equation*} P_K[k] = (1-p)^{k-1} p \qquad\qquad k=1,2,3,\dots \end{equation*}

This naturally increases the mean by \(1\) as well, so the mean of the shifted geometric distribution is \(1/p\).
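
The mean formulas are easy to check by simulation. A minimal Python sketch (the function name is illustrative):

    import random

    def geometric(p):
        """Geometric draw: the number of failures before the first success."""
        k = 0
        while random.random() >= p:   # a failure occurs with probability 1 - p
            k += 1
        return k

    p = 0.25
    draws = [geometric(p) for _ in range(100_000)]
    print(sum(draws) / len(draws))   # close to (1 - p)/p = 3.0
    # A shifted geometric draw is geometric(p) + 1, with mean 1/p.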

Standard Uniform Distribution

The standard uniform distribution draws only from the unit interval (\([0\,..\,1]\)). It is defined as follows: a random variable \(U\) has the standard uniform distribution if

\begin{equation*} P[U \le u] = u \qquad \forall u \in [0\,..\,1] \end{equation*}

That is, on the unit interval, \(F_U[u] = u\). We often write this as \(U \sim U[0,1]\).

Functions of Random Variables

Suppose \(X\) is a random variable with CDF \(F_X\). If \(f\) is a real-valued function, then \(Y=f[X]\) is also a random variable, with

\begin{equation*} F_Y[y] = P[Y \le y] = P[f[X] \le y] \end{equation*}

If \(f\) is strictly increasing, it has a strictly increasing inverse. So we can write

\begin{equation*} F_Y[y] = P[X \le f^{-1}[y]] = F_X[f^{-1}[y]] \end{equation*}

For example, if \(Y=e^{X}\), then

\begin{equation*} F_Y[y] = F_X[\ln[y]] \end{equation*}

Probability Transform

Suppose a continuous random variable \(X\) has a strictly increasing CDF \(F_X\). (This means it has a continuous inverse, \(F^{-1}_X\).) Create a new random variable \(U = F_X[X]\), which takes on values in \([0\,..\,1]\). Then for any real number \(u \in [0\,..\,1]\) we have:

\begin{equation*} P[U \le u] = P[F_X[X] \le u] = P[X \le F^{-1}_X[u]] = F_X[F^{-1}_X[u]] = u \end{equation*}

This means that \(U\) has the standard uniform distribution. This observation provides the basis for transforming standard uniform variates into draws from other distributions, making use of their CDFs. Given a standard uniform random variable \(U\), define \(X = F^{-1}_X[U]\).

\begin{equation*} P[X \le x] = P[F^{-1}_X[U] \le x] = P[U \le F_X[x]] = F_X[x] \end{equation*}

That is, defining \(X = F^{-1}_X[U]\) produces a random variable with the distribution characterized by the CDF \(F_X\).

Continuous Uniform Distribution

Under the continuous uniform distribution \(U[a,b]\), values are equally likely to be drawn from anywhere in the interval \([a\,..\,b]\). The distribution is characterized by the impossibility of seeing values less than \(a\), the certainty of seeing values no greater than \(b\), and

\begin{equation*} F_X[x] = P[X \le x] = \frac{x - a}{b - a} \qquad a \le x \le b \end{equation*}

The inverse of the CDF is

\begin{equation*} F^{-1}_X[u] = a + (b - a) u \end{equation*}

Therefore, given a random variable \(U\) with the standard uniform distribution, \(F^{-1}_X[U]\) is uniform in the interval \([a\,..\,b]\).
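
A minimal Python sketch of this construction (the function name is illustrative):

    import random

    def drawUniform(a, b):
        """Draw from U[a, b] by applying the inverse CDF to a standard uniform variate."""
        u = random.random()       # U ~ U[0, 1]
        return a + (b - a) * u    # F^{-1}[U]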

Exponential Distribution
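
The exponential distribution is a continuous distribution on the nonnegative reals with rate parameter \(\lambda > 0\) and CDF \(F_X[x] = 1 - e^{-\lambda x}\) for \(x \ge 0\). Solving \(u = F_X[x]\) for \(x\) gives the inverse CDF \(F^{-1}_X[u] = -\ln[1-u]/\lambda\), so the probability transform described above turns standard uniform variates into exponential variates. A minimal Python sketch:

    import math
    import random

    def drawExponential(lam):
        """Exponential draw via the probability transform: x = -ln(1 - u)/lam."""
        u = random.random()
        return -math.log(1 - u) / lam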

Distributional Differences

Pearson Chi-Square Test for One-Way Goodness of Fit

Recall that the Pearson goodness-of-fit test is based on the difference between the observed frequency (\(c_i\)) and the expected frequency (\(e_i\)) in each of the categories (\(i\in [1\,..\,N]\)).

\begin{equation*} P = \sum_{i=1}^{N} \frac{(c_i - e_i)^2}{e_i} \end{equation*}

Suppose you receive the following count data and are told it comes from a shifted geometric distribution with parameter \(p=0.6\). You want to check whether that is plausible by applying a Pearson test.

\begin{equation*} \begin{array}{l|cccccccc} \text{Category} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \text{Count} & 613 & 233 & 84 & 44 & 17 & 4 & 2 & 3 \end{array} \end{equation*}

One problem is that a geometric distribution will have many unobserved possible outcomes. Another problem is that, as a rule of thumb, categories should include at least five observations when applying this test. Address both problems by recoding the data, pooling the last \(4\) categories.

\begin{equation*} \begin{array}{l|ccccc} \text{Category} & 1 & 2 & 3 & 4 & 5+ \\ \text{Count} & 613 & 233 & 84 & 44 & 26 \\ \text{Expected Count} & 600 & 240 & 96 & 38 & 26 \end{array} \end{equation*}

With this data, we produce a Pearson statistic of \(2.93\). Since we are working with \(5\) categories, there are \(4\) degrees of freedom. (Roughly, once you set the counts in four of the categories, the fifth is fully determined.) Under the null hypothesis that the data are drawn from the claimed distribution, the distribution of this Pearson statistic is roughly \(\chi^2_4\). Examining the CDF for the \(\mathrm{ChiSquare}[4]\) distribution, as shown in Figure f:x2cdf4pearson01, we find that a statistic at least this big occurs quite often. So we do not reject the suitability of the claimed distribution for this data.

Figure (f:x2cdf4pearson01): CDF for \(\mathrm{ChiSquare}[4]\).
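
A sketch of the same computation using Python with scipy (an assumption of this sketch); the expected counts are the rounded values from the table above:

    from scipy import stats

    counts = [613, 233, 84, 44, 26]
    expected = [600, 240, 96, 38, 26]   # shifted geometric, p = 0.6, N = 1000

    statistic, pvalue = stats.chisquare(counts, f_exp=expected)
    print(statistic)   # about 2.93
    print(pvalue)      # about 0.57: values at least this big occur quite often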

Mann-Whitney \(U\) Test

The Mann-Whitney \(U\) test, also called the Wilcoxon rank-sum test, is a non-parametric test for the significance of differences in average outcomes. It is a common alternative to a two-sample \(t\) test that tests for a difference in population means. The \(t\) test makes three assumptions.

  1. independence of the two samples

  2. equal population variances

  3. samples are drawn from normal distributions

With large samples, the \(t\) test can provide good guidance even when the assumptions of equal variances or normality are violated. When the sample is small and the data are skewed, the \(t\) test can mislead.

As long as the samples are from continuous distributions, we can drop the assumption of normality and apply the non-parametric Wilcoxon rank-sum test for independent samples to test for a difference of the medians. The null hypothesis is that the distributions underlying the two samples have equal medians. If there is enough evidence that the medians differ, we reject the null hypothesis.

Consider the following data from Hogg and Tanis (2006, example 8.4-6). Two companies, A and B, sell the same product. There are 8 package-weight observations from each company.

sampleA = {117.1, 121.3, 127.8, 121.9, 117.4, 124.5, 119.5, 115.1}
sampleB = {123.5, 125.3, 126.5, 127.9, 122.1, 125.6, 129.8, 117.2}

Determine if the distribution of weights is the same at each company. Start with a box plot. The data have a similar range but look skewed and thus possibly non-normal. A standard \(t\) test produces a \(p\)-value of about \(0.058\).

Now compare that to a Wilcoxon test. Here is the idea: if the two samples are from the same distribution, then for one independent draw from each company, each draw is equally likely to be the larger. (The difference should be distributed symmetrically around zero.) To illustrate this, make a list plot of all the possible pairings of the two, and add a \(45^\circ\) line. We should find about half the points on each side of the dividing line. However, there are only \(13\) instances where the A value is greater than the B value.

Create a test statistic based on this intuition. Define the following function:

\begin{equation*} s[x,y] = (1 + \mathrm{sign}[x-y]) / 2 \end{equation*}

This produces a value of \(1\) when \(x>y\), a value of \(0\) when \(x<y\), and a value of \(1/2\) when they are equal. Sum up over the possible pairs to produce the \(U\) statistic for samples \(X\) and \(Y\).

\begin{equation*} U[X,Y] = \sum_{X_i \in X} \sum_{Y_j \in Y} s[X_i,Y_j] \end{equation*}

The maximum value of \(U\) is the product of the sample sizes. Under the null hypothesis that \(X\) and \(Y\) are drawn from the same distribution, \(U\) should be about half of that. In the current example, there are 64 pairings and \(U\) is only \(13\). So we can ask, how likely is it to see 13/64 when we expect 32/64? Ordinarily we rely on standard statistical software for this computation. In this case, the \(p\)-value is about \(0.041\). So at the \(5\%\) level, we would not reject the null with a \(t\) test but would with a \(U\) test.

Let \(n_x\) be the number of observations in \(X\), and let \(n_y\) be the number of observations in \(Y\). The total number of pairings is \(n_x n_y\), so the expected value of \(U\) is \(\mu_U = n_x n_y / 2\). Computing the standard deviation requires more work, but in the absence of ties,

\begin{equation*} \sigma_U = \sqrt{\frac{(1+n_x+n_y) n_x n_y}{12}} \end{equation*}

As the sample sizes become larger, the distribution of \(U\) approaches a normal distribution. If the samples contain more than 20 values, we typically use a normal approximation to the exact \(U\) distribution. We can then look up \((U-\mu_U)/\sigma_U\) in a \(Z\) table. Doing so for the current example produces a (two-sided) \(p\)-value of about \(0.046\). The normal approximation is very good for large samples, and it is computationally faster as well.
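
A sketch of both computations in Python, assuming scipy is available (a recent version computes the exact \(p\)-value for these small, tie-free samples):

    import math
    from scipy import stats

    sampleA = [117.1, 121.3, 127.8, 121.9, 117.4, 124.5, 119.5, 115.1]
    sampleB = [123.5, 125.3, 126.5, 127.9, 122.1, 125.6, 129.8, 117.2]

    # Exact test (small samples, no ties):
    res = stats.mannwhitneyu(sampleA, sampleB, alternative="two-sided")
    print(res.statistic, res.pvalue)   # U = 13, p about 0.041

    # Normal approximation, without a continuity correction:
    nx, ny = len(sampleA), len(sampleB)
    mu = nx * ny / 2
    sigma = math.sqrt((1 + nx + ny) * nx * ny / 12)
    z = (res.statistic - mu) / sigma
    print(2 * stats.norm.cdf(z))       # two-sided p, about 0.046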

Note that this \(U\) test does not provide information about how far apart the distributions are. Statistical software will often provide an estimate (with confidence interval) for the difference in location. The estimate is not just the difference in the sample medians, but rather the median of all of the pairwise differences between the two samples. Expect the confidence interval to be fairly wide when the sample size is small.

\begin{equation*} \mathrm{median}\{X_i - Y_j \,|\, X_i \in X,\, Y_j \in Y\} \end{equation*}

References

Hogg, Robert V., and Elliot A. Tanis. (2006) Probability and Statistical Inference, 7th ed. Prentice Hall.

R Core Team. (2016) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/

The Wilcoxon Rank Sum Test https://data.library.virginia.edu/the-wilcoxon-rank-sum-test/

Describing Bivariate Data

Scatter Plot

A scatter plot is the most common visual display of the relationship between two variables. When one variable is considered to depend on another, the dependent variable gets the vertical axis.

Covariance and Correlation

See https://onlinestatbook.com/2/describing_bivariate_data/calculation.html

Least-Squares Fit

The relationship between two variables is often summarized with a least-squares fit.

See https://onlinestatbook.com/2/regression/intro.html for a good introduction.


Regression Standard Error

Scatter Plot with Regression Line

Often, the resulting regression line is plotted over a scatter plot.

Conditional Expectation Function (CEF)

Assume one random variable (\(Y\)) is considered to depend on another (\(X\)). Then if we knew \(X_i=x\), we should know something about \(Y_i\). This something is summarized by the conditional expectation function:

\begin{equation*} h[x] = \mathcal{E}[Y_i | X_i=x] \end{equation*}
CEF Theorem:

The linear least squares fit provides the best linear approximation of \(h[x]\). (I.e., it minimizes mean squared error.)

Binned Scatter Plot

Scatter plots are less useful for large datasets. In this case, binned scatter plots provide a useful alternative. For a very simple binned scatter plot, assume one variable (\(y\)) is considered to depend on another (\(x\)), and create equal-sized bins based on the values of the independent variable. For each bin, create one point to plot: typically the mean of the \(x\) values and the mean of the \(y\) values in the bin.

A binned scatter plot provides a non-parametric estimate of the CEF. (See above.)
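
A minimal sketch of this construction in Python with numpy (an assumption of this sketch; the function name is illustrative):

    import numpy as np

    def binscatter(x, y, nbins=20):
        """Means of x and y within equal-count bins of x: a crude CEF estimate."""
        order = np.argsort(x)
        x_bins = np.array_split(np.asarray(x)[order], nbins)
        y_bins = np.array_split(np.asarray(y)[order], nbins)
        return [(xb.mean(), yb.mean()) for xb, yb in zip(x_bins, y_bins)]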

Binned Scatter Plot with Regression Line

We have that the regression line is the best linear approximation of the CEF, and that the binned scatter plot provides a non-parametric estimate of the CEF. So it is natural to combine them: plot the regression line for all the data over a binned scatter plot.

Dispersion of the binscatter around the regression line signals a high regression standard error. However, low dispersion of the binscatter around the regression line does not indicate low dispersion of the data around the regression line.

Functional Form

Systematic deviation of the binscatter from the linear regression line can suggest that a nonlinear functional form is more appropriate.

Resources and Explorations

Explorations

  1. Consider a six-sided die that is not quite fair: it is twice as likely to roll \(1\) as to roll any other value. Roll this die 50 times and perform a Pearson chi-square test to determine whether the data allows you to reject the null hypothesis of a fair die.

  2. Write a procedure that takes one argument, a list of wealths or incomes, and plots the corresponding Lorenz curve.

  3. Explain how the pointsLorenz function could be used in the computation of a Gini coefficient of inequality.

  4. Use the income-share data at the Bureau of the Census to plot a Lorenz curve and to compute the associated Gini coefficient of inequality.

Resources

For an extended but accessible discussion of inequality measures, see chapter 6 of [Haughton.Khandker-2009-Handbook]. Note that if one is calculating the Gini index of a sample for the purpose of estimating the value for a larger population, a small correction factor to the method used here may be needed for small samples [Deltas-2003-REStat].

References

[Deltas-2003-REStat]

Deltas, George. (2003) The Small-Sample Bias of the Gini Coefficient: Results and Implications for Empirical Research. The Review of Economics and Statistics 85, 226--234.

[Farris-2010-AMM]

Farris, Frank A. (2010) The Gini Index and Measures of Inequality. The American Mathematical Monthly 117, 851--864. http://www.jstor.org/stable/10.4169/000298910x523344

[Feldman-2012-OxfordUP]

Feldman, David P. (2012) Chaos and Fractals: An Elementary Introduction. Oxford University Press.

[Haughton.Khandker-2009-Handbook]

Haughton, Jonathan, and Shahidur R. Khandker. (2009) Handbook on Poverty and Inequality. Washington, DC: World Bank. http://documents.worldbank.org/curated/en/488081468157174849/Handbook-on-poverty-and-inequality

[Lorenz-1905-JASA]

Lorenz, M.O. (1905) Methods of Measuring the Concentration of Wealth. Publications of the American Statistical Association 9, 209--219.

[McGill.Tukey.Larson-1978-AmStat]

McGill, Robert, John W. Tukey, and Wayne A. Larsen. (1978) Variations of Box Plots. The American Statistician 32, 12--16. http://www.jstor.org/stable/2683468

[Wilensky-2017-CCL]

Wilensky, Uri. (2017) NetLogo 6.01 User Manual.

[NIST.SEMATECH-2013-NIST]

National Institute of Standards and Technology, and SEMATECH. (2013) NIST/SEMATECH e-Handbook of Statistical Methods. U.S. Department of Commerce. http://www.itl.nist.gov/div898/handbook