GenietVanHetLeven!: Statistical Analysis Overview

Statistical Analysis Overview¹

Major Entities

There are two major entities in Applied Statistics:

• Data

• Metadata

Data

Data comprise the raw sample information we collect as well as the results of analyzing the samples.

Canonical notation presents these as multivariate variables² in an array, e.g.:

A sequence or other collection of random variables is independent and identically distributed ("i.i.d.") if each random variable has the same probability distribution as the others and all are mutually independent.³

The results of a function of a variable are also data:

R = F(X)

These are our test data for the R Project for Statistical Computing⁴ program that we are using:

We will use these for demonstration in the remainder of this paper.

Metadata

Metadata present the overlying processes and methods that interrelate the data.

Processes, functions, and analytic parameters are metadata.

The processes of statistical analysis comprise models and tests on both the model and its results.⁵

Statistics

Descriptive Statistics

Descriptive Statistics comprise calculations that describe the sample set:
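As an illustration of such calculations, here is a minimal Python sketch using only the standard library. The sample values are hypothetical and are not the test data used elsewhere in this paper.

```python
import statistics

# Hypothetical sample values, for illustration only
sample = [4.1, 5.0, 5.6, 6.2, 6.8, 7.4, 8.3]

n = len(sample)                          # sample size
mean = statistics.mean(sample)           # arithmetic mean
median = statistics.median(sample)       # middle value
variance = statistics.variance(sample)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(sample)       # sample standard deviation
value_range = max(sample) - min(sample)  # spread between extremes

print(f"n={n} mean={mean:.3f} median={median:.3f}")
print(f"variance={variance:.3f} std_dev={std_dev:.3f} range={value_range:.3f}")
```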

Models

We build a mathematical model (a “regression”) to describe the relationships between input variables and the observed results. Most frequently the model is a linear regression of the form:

[R] = [A] × [X] + [B]

in standard matrix algebra notation.
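For the simplest case of one input variable, the model reduces to r = a·x + b, and the coefficients can be estimated by ordinary least squares. The sketch below is a minimal Python illustration under that assumption; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, r):
    """Ordinary least-squares fit of r = a*x + b for one input variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_r = sum(r) / n
    # Sum of squared deviations of x, and cross-deviations of x and r
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxr = sum((xi - mean_x) * (ri - mean_r) for xi, ri in zip(x, r))
    a = sxr / sxx            # slope
    b = mean_r - a * mean_x  # intercept
    return a, b


# Hypothetical observations, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(x, r)
print(f"r = {a:.3f} * x + {b:.3f}")
```

The multivariate case replaces the scalar slope with the coefficient matrix [A] and is usually solved with a linear algebra library rather than by hand.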

Correlation analysis⁶

The first step is to validate the model. We do this with correlation analysis to determine how closely the model matches the observed samples. The sample data are used to compute r, the correlation coefficient for the sample. The symbol for the population correlation coefficient is ρ, the Greek letter "rho":

• ρ = population correlation coefficient (unknown)

• r = sample correlation coefficient (known; calculated from sample data)

If the test concludes that the correlation coefficient is significantly different from 0, we say that the correlation coefficient is "significant".

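Significance is judged against a significance level α that the analyst chooses in advance; α is not computed from ρ.

α = 0.05 corresponds to a 95% confidence level and is the conventional threshold: if the p-value of the test on r is less than or equal to α, the correlation is considered “significant”.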
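The sample correlation coefficient r and the statistic commonly used to test it against ρ = 0 can be sketched in Python as follows. The test statistic is t = r·√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. The function names and data are hypothetical, and the conversion of t to a p-value (which needs Student's t distribution) is left to a statistics package.

```python
import math


def pearson_r(x, y):
    """Sample (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical observations, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.5, 5.1, 8.2, 9.0, 12.5]
r = pearson_r(x, y)
print(f"r = {r:.4f}, t = {t_statistic(r, len(x)):.4f}, df = {len(x) - 2}")
```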

Factor Analysis

The analyst must seek the causes if the model does not adequately match the samples. Factor Analysis is one tool for this purpose.

Statistical Control

The process of statistical quality control⁷ is one of determining whether a process and its results are “under control” or “out of control”.

A process that is operating with only chance causes of variation present is said to be in statistical control.

A process that is operating in the presence of assignable causes is said to be an out-of-control process. A process is considered to be out of control when its results fall above the Upper Specification Limit (USL) or below the Lower Specification Limit (LSL):

Control Charts

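In a control chart, the limits analogous to the USL/LSL are the Upper Control Limit (UCL) and the Lower Control Limit (LCL). Unlike specification limits, which are set externally, control limits are computed from the process data; they typically are taken to be three standard deviations (3σ) above and below the process mean: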

The latter chart is an example of the R program output.
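The 3σ limits can be sketched in a few lines of Python. This is a simplified illustration: it estimates σ with the plain sample standard deviation of a baseline run, whereas production control charts usually estimate it from subgroup ranges or moving ranges. The function names and data are hypothetical.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (LCL, center line, UCL) as mean -/+ k sample standard deviations."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return center - k * sigma, center, center + k * sigma


def out_of_control(values, lcl, ucl):
    """Indices of observations falling outside the control limits."""
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


# Hypothetical in-control baseline and new observations
baseline = [9, 10, 11, 10, 9, 11, 10, 10]
lcl, center, ucl = control_limits(baseline)
new_points = [10, 12.5, 7.0, 11]
print(f"LCL={lcl:.3f} center={center:.3f} UCL={ucl:.3f}")
print("out-of-control indices:", out_of_control(new_points, lcl, ucl))
```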

Hypothesis Testing⁸

A hypothesis test examines two opposing hypotheses about a population: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no difference". The alternative hypothesis is the statement you want to be able to conclude is true.

Based on the sample data, the test determines whether to reject the null hypothesis in favor of the alternative hypothesis. You use a “p-value” to make the determination: if the p-value is less than or equal to the level of significance α, you can reject the null hypothesis.

p-value

The p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true. In layman's terms, a small p-value means the observed data would be unlikely if the null hypothesis were true; it is not the probability that rejecting the null hypothesis is a mistake. So we reject the null hypothesis when the p-value is sufficiently small, that is, less than or equal to the significance level α, which is a cut-off point that you define.

The p-value is computed from a test statistic, and there are a number of different tests for obtaining it, depending on the known characteristics of the sample sets at hand:⁹

t-test¹⁰

The t-test computes a test statistic t, which is converted to a p-value using Student's t distribution. A t-test is used for testing the mean of one population against a standard, or for comparing the means of two populations, when you do not know the populations’ standard deviations and have a limited sample (n < 30).

R returns the following paired t-test result:

data: y and V2

t = -4.2636, df = 5, p-value = 0.007987

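Because the p-value (0.007987) is below α = 0.05, we reject the null hypothesis: the mean of the paired differences between y and V2 differs significantly from zero. A paired t-test compares means; it does not measure how closely y and V2 are correlated.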
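The statistic behind a paired t-test can be sketched in Python: take the pairwise differences, then divide their mean by its standard error. The data below are hypothetical (six pairs, giving df = 5 as in the R output, but not the same values), and `paired_t` is an illustrative name. The critical value 2.571 is the standard two-sided table value for α = 0.05 and df = 5.

```python
import math
import statistics


def paired_t(a, b):
    """Return (t, df) for a paired t-test of H0: mean difference = 0."""
    diffs = [ai - bi for ai, bi in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample std. dev. of the differences
    t = mean_d / (sd_d / math.sqrt(n))  # mean difference over its standard error
    return t, n - 1


# Hypothetical paired measurements, for illustration only
before = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3]
after = [13.0, 12.9, 13.8, 14.1, 12.5, 13.6]
t, df = paired_t(before, after)

T_CRIT_DF5 = 2.571  # two-sided critical value, alpha = 0.05, df = 5
print(f"t = {t:.4f}, df = {df}, reject H0: {abs(t) > T_CRIT_DF5}")
```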

z-test

A z-test is used for testing the mean of a population versus a standard, or comparing the means of two populations, with large (n ≥ 30) samples whether you know the population standard deviation or not.
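A one-sample z-test can be sketched in Python with the standard library's `statistics.NormalDist`. The function name `z_test` and the data are hypothetical. When the population standard deviation is not supplied, the sketch substitutes the sample standard deviation, which is the large-sample approximation described above.

```python
import math
from statistics import NormalDist, mean, stdev


def z_test(sample, mu0, sigma=None):
    """Two-sided one-sample z-test of H0: population mean = mu0.

    Uses the sample standard deviation when sigma is not given.
    Returns (z, p_value).
    """
    n = len(sample)
    s = sigma if sigma is not None else stdev(sample)
    z = (mean(sample) - mu0) / (s / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical sample of 35 observations centered on 50
sample = [50 + (i % 7) - 3 for i in range(35)]
z, p = z_test(sample, mu0=49)
print(f"z = {z:.4f}, p-value = {p:.6f}, reject H0 at alpha=0.05: {p <= 0.05}")
```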

Other tests

We may wish to compare other statistics (characteristics) of different sample sets.

F-test

An F-test is used to decide if two populations’ variances are the same, assuming both populations are normally distributed. The samples can be any size.
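The F statistic itself is simply the ratio of the two sample variances, as in this minimal Python sketch. The function name and data are hypothetical. The resulting value is compared against the F distribution with (n₁ − 1, n₂ − 1) degrees of freedom, a step left to a statistics package here.

```python
import statistics


def f_statistic(a, b):
    """Return (F, df1, df2), where F is the ratio of the sample variances."""
    f = statistics.variance(a) / statistics.variance(b)
    return f, len(a) - 1, len(b) - 1


# Hypothetical samples, for illustration only
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
f, df1, df2 = f_statistic(a, b)
print(f"F = {f:.4f} with ({df1}, {df2}) degrees of freedom")
```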

Levene's test

Levene's test is used to decide if two populations’ variances are the same, assuming both populations are continuous but not normally distributed.
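Levene's statistic W is a one-way analysis of variance on the absolute deviations of each value from its group center. The Python sketch below uses the group mean as the center (the original form of the test; a common variant uses the median instead). The function name `levene_w` and the data are hypothetical. W is compared against the F distribution with (k − 1, N − k) degrees of freedom, which is left to a statistics package.

```python
import statistics


def levene_w(*groups):
    """Levene's W statistic (mean-centered) for k groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each value from its own group mean
    z = [[abs(v - statistics.mean(g)) for v in g] for g in groups]
    z_means = [statistics.mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # Between-group and within-group sums of squares of the deviations
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical samples: the second group is far more spread out
tight = [1, 2, 3]
wide = [0, 10, 20]
print(f"W = {levene_w(tight, wide):.4f}")
```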

Anderson–Darling test¹¹

The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.

Confidence interval

A confidence interval is a range of likely values for a population parameter (such as the mean μ) that is based on sample data.

Use a confidence interval to make inferences about one or more populations from sample data, or to quantify the precision of your estimate of a population parameter, such as μ.
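A confidence interval for the mean μ can be sketched in Python as the sample mean plus or minus a critical value times the standard error. This sketch assumes a large sample and therefore uses the normal critical value (about 1.96 for 95%); for small samples a t critical value would be used instead. The function name and data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    # Critical value leaving (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(sample) / math.sqrt(n)
    center = mean(sample)
    return center - half_width, center + half_width


# Hypothetical sample of 36 observations
sample = [4, 6] * 18
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.4f}, {high:.4f})")
```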

Test and CI for Two Variances

This calculates the ratio of the variances (σ²) of two sample sets, together with a confidence interval for that ratio.

This summarizes the various tests:¹²

21 Richard A. Johnson and Dean W. Wichern, Applied Multivariate Statistical Analysis, 6th Ed., ISBN 0-13-187715-1

3 http://tuvalu.santafe.edu/~aaronc/courses/7000/csci7000-001_2011_L0.pdf

4 https://www.r-project.org/

5 Douglas C. Montgomery, Introduction to Statistical Quality Control, 6th Edition, ISBN 978-0-470-16992-6

6 Barbara Illowsky. “Testing the Significance of the Correlation Coefficient.” Collaborative Statistics, Boundless, 26 May 2016. Retrieved from https://www.boundless.com/users/235422/textbooks/collaborative-statistics/linear-regression-and-correlation-13/testing-the-significance-of-the-correlation-coefficient-181/testing-the-significance-of-the-correlation-coefficient-424-15972/

7 Montgomery, Section 5.2

8 http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/basics/what-is-a-hypothesis-test/

9 https://brandalyzer.wordpress.com/2010/12/05/difference-between-z-test-f-test-and-t-test/

10 http://www.real-statistics.com/students-t-distribution/two-sample-t-test-equal-variances/

Investopedia http://www.investopedia.com/terms/t/t-test.asp#ixzz4WsbyPq27

11 https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test

12 http://www.minitab.com/uploadedFiles/Documents/sample-materials/TrainingTTest16EN.pdf