Statistical Tests Selection

Selecting the proper statistical test is essential for analysing the data retrieved from an experiment. Depending on the type of data points, different tests can be more or less misleading or truthful. The overview presented here is intended to make the selection process easier and more accurate.

Type of Data

| Goal | Measurement (from Gaussian population) | Rank, score, or measurement (from non-Gaussian population) | Binomial (two possible outcomes) | Survival time |
| --- | --- | --- | --- | --- |
| Describe one group | mean, SD | median, interquartile range | proportion | Kaplan-Meier survival curve |
| Compare one group to a hypothetical value | one-sample t-test | Wilcoxon test | chi-square or binomial test ** | |
| Compare two unpaired groups | unpaired t-test | Mann-Whitney test | Fisher's test (chi-square for large samples) | log-rank test or Mantel-Haenszel * |
| Compare two paired groups | paired t-test | Wilcoxon test | McNemar's test | conditional proportional hazards regression * |
| Compare three or more unmatched groups | one-way ANOVA | Kruskal-Wallis test | chi-square test | Cox proportional hazards regression ** |
| Compare three or more matched groups | repeated-measures ANOVA | Friedman test | Cochran's Q ** | conditional proportional hazards regression ** |
| Quantify association between two variables | Pearson correlation | Spearman correlation | contingency coefficients ** | |
| Predict value from another measured variable | simple linear regression or nonlinear regression | nonparametric regression ** | simple logistic regression * | Cox proportional hazards regression * |
| Predict value from several measured or binomial variables | multiple linear regression * or multiple nonlinear regression ** | | multiple logistic regression * | Cox proportional hazards regression * |

Statistical Tests

One-way ANOVA

The one-way analysis of variance (ANOVA) is a statistical method used for testing the null hypothesis that three or more independent groups have the same population mean.

Assumptions

  1. The samples are drawn randomly and the observations are independent of one another.
  2. The dependent variable should be approximately normally distributed within each group.
  3. The groups should have equal variance.

Implementations

  • Implementation in R: summary(aov(y ~ g))
  • Implementation in Python: scipy.stats.f_oneway(*args)
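
A minimal sketch of the Python call, assuming SciPy is installed; the group measurements below are made-up illustrative values:

```python
from scipy import stats

# Hypothetical measurements from three independent groups.
g1 = [6.1, 5.8, 6.4, 6.0]
g2 = [7.2, 6.9, 7.5, 7.1]
g3 = [5.5, 5.9, 5.4, 5.8]

f, p = stats.f_oneway(g1, g2, g3)
print(f, p)  # a small p suggests at least one group mean differs
```
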
Average value

Mean

The mean and expected value are used synonymously to refer to one measure of the central tendency either of a probability distribution or of the random variable characterized by that distribution.

  • Implementation in R: mean(x)
  • Implementation in Python: numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)

Median

The median is the value separating the higher half of a data sample, a population, or a probability distribution, from the lower half.

  • Implementation in R: median(x)
  • Implementation in Python: numpy.median(a, axis=None, out=None, overwrite_input=False, keepdims=False)

Standard Deviation

The standard deviation is a measure used to quantify the amount of variation or dispersion of a set of data values.

  • Implementation in R: sd(x)
  • Implementation in Python: numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
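
A quick illustration of the three NumPy routines above, applied to a small made-up sample:

```python
import numpy as np

x = np.array([2.1, 3.4, 1.8, 5.0, 2.7, 3.9])

print(np.mean(x))         # arithmetic mean
print(np.median(x))       # middle value of the sorted sample
print(np.std(x, ddof=1))  # sample standard deviation; ddof=0 (the default) gives the population SD
```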

Kaplan-Meier estimator

The Kaplan–Meier estimator is a non-parametric statistical method used to estimate the survival function from lifetime data.

Assumptions

  1. The event status should consist of two mutually exclusive and collectively exhaustive states: "censored" or "event".
  2. The time to an event or censorship (known as the "survival time") should be clearly defined and precisely measured.
  3. Where possible, left-censoring should be minimized or avoided.
  4. There should be independence of censoring and the event.
  5. There should be no secular trends (also known as secular changes).
  6. There should be a similar amount and pattern of censorship per group.

Implementations

  • Implementation in R (survival package): survfit(Surv(time, status) ~ 1)
  • Implementation in Python (lifelines package): KaplanMeierFitter().fit(T, event_observed=E)
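
A minimal sketch of the Python call above, assuming the lifelines package is installed; the durations and event flags are made-up illustrative values:

```python
import numpy as np
from lifelines import KaplanMeierFitter

# Hypothetical survival times and event indicators
# (1 = event observed, 0 = right-censored).
T = np.array([5, 6, 6, 2, 4, 4, 9, 12, 3, 7])
E = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 1])

kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)
print(kmf.survival_function_)  # estimated S(t) at each observed time
```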

Kolmogorov-Smirnov test

The Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions; it can be used to compare a sample with a reference probability distribution (one-sample test) or to compare two samples (two-sample test).

Assumptions

  1. The sample is a random sample.
  2. The theoretical distribution must be fully specified.
  3. The theoretical distribution is assumed to be continuous.
  4. The sample distribution is assumed to have no ties.

Implementations

  • Implementation in R: ks.test(x, y)
  • Implementation in Python: scipy.stats.kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx')
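
A short illustration of both variants, assuming SciPy and NumPy are available; the samples are randomly generated for demonstration:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
x = np.random.normal(loc=0.0, scale=1.0, size=200)

# One-sample test against a fully specified reference distribution.
stat, p = stats.kstest(x, 'norm', args=(0.0, 1.0))

# Two-sample variant: compare two independent samples directly.
y = np.random.normal(loc=0.5, scale=1.0, size=200)
stat2, p2 = stats.ks_2samp(x, y)
```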

Kruskal-Wallis test

The Kruskal-Wallis test is a non-parametric statistical method for testing whether samples originate from the same distribution.

Assumptions

  1. The samples drawn from the population are random.
  2. The observations are independent of each other.
  3. The measurement scale for the dependent variable should be at least ordinal.

Implementations

  • Implementation in R: kruskal.test(list(g1=a, g2=b, g3=c, g4=d))
  • Implementation in Python: scipy.stats.kruskal(*args, **kwargs)
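
For example, with three made-up independent groups (assuming SciPy):

```python
from scipy import stats

g1 = [2.9, 3.0, 2.5, 2.6, 3.2]
g2 = [3.8, 2.7, 4.0, 2.4]
g3 = [2.8, 3.4, 3.7, 2.2, 2.0]

h, p = stats.kruskal(g1, g2, g3)
print(h, p)  # a small p rejects the null that all groups share one distribution
```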

Pearson correlation

The Pearson correlation is a statistical method used to measure the linear correlation between two variables X and Y.

Assumptions

  1. Each variable should be approximately normally distributed.
  2. The data are homoscedastic: the points are spread equally about the line of best fit.
  3. The relationship between the variables is linear.
  4. The data are continuous.
  5. The data are paired and come from the same population.
  6. The data should not contain significant outliers.

Implementations

  • Implementation in R: cor(df,method="pearson")
  • Implementation in Python: scipy.stats.pearsonr(x, y)
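
A minimal example, assuming SciPy; the values are made up to show a strong positive linear relationship:

```python
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

r, p = stats.pearsonr(x, y)  # r is in [-1, 1]; p tests the null of no correlation
```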

Spearman correlation

The Spearman correlation is a nonparametric statistical method used for testing the rank correlation (statistical dependence between the rankings of two variables).

Assumptions

  1. The data are paired and come from the same population.
  2. Both variables are measured on at least an ordinal scale.
  3. The relationship between the variables is monotonic; unlike the Pearson correlation, neither normality nor linearity is required.

Implementations

  • Implementation in R: cor(df,method="spearman")
  • Implementation in Python: scipy.stats.spearmanr(a, b=None, axis=0)
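
A short sketch of the difference from Pearson, assuming SciPy: a perfectly monotonic but nonlinear relationship still yields rho = 1.0:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]  # y = x**2: monotonic, but not linear

rho, p = stats.spearmanr(x, y)
print(rho)  # 1.0, because the rankings of x and y agree exactly
```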

One-sample t-test

The one-sample t-test is a statistical method used for testing the null hypothesis that the population mean is equal to a specified value mu_0.

Assumptions

  1. The dependent variable must be continuous (interval/ratio).
  2. The observations are independent of one another.
  3. The dependent variable should be approximately normally distributed.
  4. The dependent variable should not contain any outliers.

Implementations

  • Implementation in R: t.test(a, mu=mu_0)
  • Implementation in Python: scipy.stats.ttest_1samp(a, popmean, axis=0)
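
For instance, testing a small made-up sample against a hypothetical mean of 5.0 (assuming SciPy):

```python
from scipy import stats

sample = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.4]

t, p = stats.ttest_1samp(sample, popmean=5.0)
print(t, p)  # a small p rejects the null that the population mean is 5.0
```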

Two-sample t-test

The two-sample t-test is a statistical method used for testing the null hypothesis that the means of two populations are equal.

Assumptions

  1. The populations from which the samples are drawn should be normal; appropriate statistical methods exist for testing this assumption (e.g. the Kolmogorov-Smirnov non-parametric test).
  2. The population variances are unknown but assumed equal; the equality of variances can be tested with an F-test.
  3. The samples have to be drawn randomly and independently of each other.

Implementations

  • Implementation in R: t.test(a,b, var.equal=TRUE, paired=FALSE)
  • Implementation in Python: scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')
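
A minimal sketch with two made-up independent samples, using the pooled-variance form to match the R call above:

```python
from scipy import stats

a = [12.1, 11.8, 12.5, 12.0, 11.9]
b = [12.4, 12.8, 12.2, 12.9, 12.6]

# equal_var=True selects the classical pooled-variance (Student's) test.
t, p = stats.ttest_ind(a, b, equal_var=True)
```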

Paired t-test

The paired t-test is a statistical method used for testing the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero.

Assumptions

  1. The dependent variable must be continuous (interval/ratio).
  2. The observations are independent of one another.
  3. The dependent variable should be approximately normally distributed.
  4. The dependent variable should not contain any outliers.

Implementations

  • Implementation in R: t.test(a,b, paired=TRUE)
  • Implementation in Python: scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate')
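
For example, with hypothetical before/after measurements on the same units (assuming SciPy):

```python
from scipy import stats

before = [220, 240, 210, 260, 230]
after = [200, 232, 208, 245, 221]

# Tests whether the mean within-pair difference is zero.
t, p = stats.ttest_rel(before, after)
```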

Unpaired t-test

The unpaired t-test is a statistical method used for testing the null hypothesis that the means of two independent populations are equal.

Assumptions

  1. The observations are independent of one another.
  2. The dependent variable should be approximately normally distributed.
  3. The dependent variable should not contain any outliers.
  4. The data is continuous.
  5. The groups should have equal variance; Welch's variant of the test (used in the implementations below) relaxes this assumption.

Implementations

  • Implementation in R: t.test(x, y, alternative="two.sided", var.equal=FALSE)
  • Implementation in Python: scipy.stats.ttest_ind(a, b, axis=0, equal_var=False, nan_policy='propagate')
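
A short sketch of the Welch variant, with equal_var=False mirroring var.equal=FALSE in the R call; the samples are made up and deliberately differ in spread:

```python
from scipy import stats

a = [14.2, 15.1, 13.8, 14.9, 15.3, 14.4]
b = [16.0, 18.2, 15.5, 19.1]

# equal_var=False selects Welch's test, which does not pool the variances.
t, p = stats.ttest_ind(a, b, equal_var=False)
```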

Wilcoxon test

The Wilcoxon test is a non-parametric statistical method used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ.

Assumptions

  1. Data are paired and come from the same population.
  2. Each pair is chosen randomly and independently.
  3. The data are measured on at least an interval scale when, as is usual, within-pair differences are calculated to perform the test (though it does suffice that within-pair comparisons are on an ordinal scale).

Implementations

  • Implementation in R: wilcox.test(a,b, paired=TRUE)
  • Implementation in Python: scipy.stats.wilcoxon(x, y=None, zero_method='wilcox', correction=False)
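
A minimal paired example, assuming SciPy; the made-up values include one zero difference, which zero_method='wilcox' discards:

```python
from scipy import stats

before = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
after = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]

stat, p = stats.wilcoxon(before, after, zero_method='wilcox')
```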

[1] The table summary is adapted from https://www.graphpad.com/support/faqid/1790/
