Basic Statistics in Python
Let’s create a dataset to work with and plot a histogram to visualise it:
import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
np.random.seed(1)
data = np.round(np.random.normal(5, 2, 100))
plt.hist(data, bins=10, range=(0,10), edgecolor='black')
plt.show()
Measures of Central Tendency
Our measures of central tendency are the mean, the median and the mode.
The mean is calculated as
$$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$$
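The median is the middle value of the sorted data. When N is odd, it is the value at position
$$\dfrac{N+1}{2}$$
of the sorted data. When N is even, that position falls between two values and the median is their average.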
The mode is the most frequent value.
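To make the three definitions concrete, here is a minimal sketch that computes them by hand on a small made-up list. The values and variable names are illustrative only and are not the dataset generated above.

```python
from collections import Counter

# Small illustrative sample (hypothetical values, not the dataset above)
values = [3, 5, 5, 6, 8, 9, 5]

# Mean: sum of the values divided by how many there are
mean_value = sum(values) / len(values)

# Median: middle of the sorted data; with an even count, average the two middle values
ordered = sorted(values)
n = len(ordered)
mid = n // 2
median_value = ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the most frequent value and how often it occurs
mode_value, mode_count = Counter(values).most_common(1)[0]

print(mean_value, median_value, mode_value, mode_count)
```

For this list the median and the mode are both 5, while the mean is pulled slightly higher by the larger values.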
Mean
Numpy implements a mean function, np.mean, for calculating the mean:
mean = np.mean(data)
mean
Median
Numpy also implements a median function, np.median, for calculating the median:
np.median(data)
Mode
We can see from our histogram already that 5 is the modal value.
There is no in-built numpy mode function, but there is one from the scipy stats module we can use.
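mode = stats.mode(data)
# Older SciPy versions return length-1 arrays here, newer ones return scalars; np.atleast_1d handles both
modal_value, modal_count = np.atleast_1d(mode.mode)[0], np.atleast_1d(mode.count)[0]
print("The modal value is {} with a count of {}".format(modal_value, modal_count))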
Range
The range gives a measure of how spread apart the values are.
The range is simply the maximum value minus the minimum value:
$$ \max(x_i) - \min(x_i) $$
Numpy implements this as the peak-to-peak function np.ptp.
np.ptp(data)
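As a quick check of the definition, the sketch below compares the range computed directly as maximum minus minimum with the result of np.ptp. The array is a made-up example, not the dataset above.

```python
import numpy as np

# Small illustrative sample (hypothetical values)
sample = np.array([2.0, 7.0, 4.0, 9.0, 5.0])

# Range computed directly from the definition
manual_range = sample.max() - sample.min()

# np.ptp ("peak to peak") should give the same number
ptp_range = np.ptp(sample)

print(manual_range, ptp_range)
```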
Variance
Variance is a measure of how variable the data is. It is calculated as:
$$ \sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} $$
Numpy implements the variance as a function np.var()
np.var(data)
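The formula above divides by N, which matches the default behaviour of np.var (ddof=0). The sketch below, on a made-up array, computes the variance by hand and also shows the ddof=1 option, which divides by N - 1 and is the usual choice when estimating the variance of a population from a sample.

```python
import numpy as np

# Small illustrative sample (hypothetical values)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance from the formula: mean of the squared deviations from the mean
mu = sample.mean()
manual_var = np.sum((sample - mu) ** 2) / sample.size

# np.var with its default ddof=0 divides by N, like the formula
population_var = np.var(sample)

# ddof=1 divides by N - 1 instead (sample variance)
sample_var = np.var(sample, ddof=1)

print(manual_var, population_var, sample_var)
```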
Standard Deviation
The variance is expressed in squared units of the data, which makes it hard to interpret and can make it very large, so we will often use the standard deviation, which is the square root of the variance:
$$ \sigma = \sqrt{\sigma^2} $$
For normally distributed data, about 68.3% of the values fall within 1 standard deviation of the mean, 95.4% within 2 standard deviations, and 99.7% within 3 standard deviations.
The standard deviation is implemented in Numpy as np.std()
np.std(data)
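The percentages quoted above describe normally distributed data. One way to see them is to simulate a large normal sample and count the fraction of values within 1, 2 and 3 standard deviations of the mean. This is a rough sketch: the sample size, the seed and the variable names are arbitrary choices, and the fractions will only be close to the theoretical values, not equal to them.

```python
import numpy as np

# Large simulated normal sample (seed and size are arbitrary choices)
rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=100_000)

mu = sample.mean()
sigma = sample.std()

# Fraction of values within k standard deviations of the mean
fractions = {}
for k in (1, 2, 3):
    fractions[k] = np.mean(np.abs(sample - mu) <= k * sigma)
    print("within {} sd: {:.3f}".format(k, fractions[k]))
```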
Standard Error
The standard error of the mean (SE of the mean) estimates the variability between sample means that you would obtain if you took multiple samples from the same population. The standard error of the mean estimates the variability between samples whereas the standard deviation measures the variability within a single sample.
It is calculated as:
$$ SE = \dfrac{s}{\sqrt{n}} $$
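where s is the sample standard deviation and n is the sample size. Again, Numpy doesn’t have an implementation of this (though it is easy to calculate), but we can use scipy’s stats module instead. Note that stats.sem uses the sample standard deviation (ddof=1) by default, whereas np.std defaults to ddof=0: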
stats.sem(data)
So, had we taken multiple samples of this size from the same population, we would expect the standard deviation of the means of those samples to be about 0.176.
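The sketch below illustrates both halves of that statement on simulated data. It first computes the standard error from the formula and compares it with stats.sem, then draws many samples of the same size and measures the spread of their means directly. The seed, the number of repeated samples and the variable names are arbitrary choices for illustration, and the sample here is freshly simulated rather than the dataset above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One sample of 100 values from a normal population with mean 5 and sd 2
sample = rng.normal(loc=5, scale=2, size=100)

# Standard error from the formula, using the sample standard deviation (ddof=1)
manual_sem = sample.std(ddof=1) / np.sqrt(sample.size)

# stats.sem uses ddof=1 by default, so the two should agree
scipy_sem = stats.sem(sample)

# Draw 10,000 samples of the same size and look at the spread of their means
sample_means = rng.normal(loc=5, scale=2, size=(10_000, 100)).mean(axis=1)
spread_of_means = sample_means.std(ddof=1)

print(manual_sem, scipy_sem, spread_of_means)
```

Because the population standard deviation is 2 and each sample has 100 values, the spread of the sample means should come out close to 2 / 10 = 0.2, and the standard error estimated from a single sample should be in the same region.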