Introduction
This is the second half of my final assessment for the MSc module Programming for Data Science. The brief was to investigate a dataset of indicators for 50 regions, identify a pattern or trend, produce at least three data visualisations, identify and deal with any outliers, and run a hypothesis test.
My grades for this assignment are:
- Task 6 - 80%
- Task 7 - 85%
- Task 8 - 75%
Task 6 - Explore the dataset to identify an “interesting” pattern or trend
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
#read in data and observe
part_2a = pd.read_csv('part_2a.csv')
part_2b = pd.read_csv('part_2b.csv')

part_2a.info()
part_2b.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 region 51 non-null int64
1 median_household_income 51 non-null int64
2 share_unemployed_seasonal 51 non-null float64
3 share_population_in_metro_areas 51 non-null float64
4 share_population_with_high_school_degree 51 non-null float64
5 share_non_citizen 48 non-null float64
6 share_white_poverty 51 non-null float64
7 gini_index 51 non-null float64
8 share_non_white 51 non-null float64
9 share_voters_voted_to_the_right 51 non-null float64
dtypes: float64(8), int64(2)
memory usage: 4.1 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 region 51 non-null int64
1 avg_biascrimes_per_100k 50 non-null float64
dtypes: float64(1), int64(1)
memory usage: 948.0 bytes
#combine datasets
regional_data_combined = pd.merge(left=part_2a, right=part_2b)
#rename some columns for simplicity
regional_data_renamed = (
    regional_data_combined.rename(columns={
        'median_household_income':'med_income',
        'share_unemployed_seasonal':'unemp_seasonal',
        'share_population_in_metro_areas':'metro_pop',
        'share_population_with_high_school_degree':'hs_degree',
        'share_voters_voted_to_the_right':'votes_to_right',
        'avg_biascrimes_per_100k':'bias_crimes',
        'share_non_citizen':'non_citizen',
        'share_white_poverty':'white_poverty',
        'share_non_white':'non_white'
    })
)
#make copies of data for exploration
rd = regional_data_renamed.copy()
#view tidied dataframe
rd.head()
|   | region | med_income | unemp_seasonal | metro_pop | hs_degree | non_citizen | white_poverty | gini_index | non_white | votes_to_right | bias_crimes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 42278 | 0.060 | 0.64 | 0.821 | 0.02 | 0.12 | 0.472 | 0.35 | 0.63 | 1.806410 |
| 1 | 2 | 67629 | 0.064 | 0.63 | 0.914 | 0.04 | 0.06 | 0.422 | 0.42 | 0.53 | 1.656700 |
| 2 | 3 | 49254 | 0.063 | 0.90 | 0.842 | 0.10 | 0.09 | 0.455 | 0.49 | 0.50 | 3.413928 |
| 3 | 4 | 44922 | 0.052 | 0.69 | 0.824 | 0.04 | 0.12 | 0.458 | 0.26 | 0.60 | 0.869209 |
| 4 | 5 | 60487 | 0.059 | 0.97 | 0.806 | 0.13 | 0.09 | 0.471 | 0.61 | 0.33 | 2.397986 |
Initial exploration - visualisation 1 - correlation matrix heatmap
To begin, I produced a heatmap based on a correlation matrix. To build the correlation matrix, I used Pearson's Correlation Coefficient to quantify the relationships between all of the indicators in the dataset. This produced a matrix of coefficients between -1 (strongest negative correlation) and 1 (strongest positive correlation).
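For reference, for a pair of indicators \(x\) and \(y\) measured across the \(n\) regions, the coefficient is defined as:

\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]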
Next, I rendered these coefficients into a heatmap, which visually represents the strength of the relationships. I used a diverging colour palette centred on 0; the darker the colour of the cell, the closer the coefficient to either end of the spectrum (-1 red to 1 blue) and, thus, the stronger the relationship between the relevant indicators.
The heatmap will enable me to identify a selection of indicators where strong and/or interesting linear relationships are likely to be evident.
#get matrix of pearson coefficients
corr = rd.corr(method='pearson')

#render heatmap
plt.figure(figsize=(9, 4))
plt.title("Relationships between regional indicators", y=1.05, x=0.07, fontweight="bold", fontsize=12)
hm = sns.heatmap(corr, annot=True, fmt='.2f', linewidths=0.5, cmap='RdBu', center=0)
hm.set_xticklabels(hm.get_xticklabels(), rotation=45, horizontalalignment='right') #code from Bobbitt (2022b)
plt.show()
Initial exploration - visualisation 2 - scatter and histogram pairplot
From the heatmap, I identified seven of the most highly correlated indicators that stood out to me as potentially interesting relationships to explore further. These do not necessarily represent the top seven strongest relationships: for example, I excluded non_citizen vs non_white (r = 0.73) as I felt that it would be difficult to conduct meaningful analysis on this relationship without context about the regions and demographics represented in the dataset.
The seven indicators that I chose to explore further are:
- med_income and white_poverty: strongest correlation of the dataset (r = -0.82)
- metro_pop and non_citizen: strong correlation (r = 0.75)
- votes_to_right and hs_degree: negligible correlation with each other, but coefficients of circa ±0.6 with multiple other indicators
- gini_index: some moderate correlations and may add an interesting additional dimension
The next step that I took was to create a pairplot, comprising scatterplots (visualising the relationships between each pair of indicators) and histograms (visualising the distribution of each individual indicator). It was important to view the shape of the data, as it could reveal non-linear relationships not indicated by the Pearson's coefficients used above (which assume linearity); it was also useful to view the distribution of the data, as this may reveal non-normality that could impact subsequent stages of the analysis.
#filter to seven indicators
rd_viz2 = rd[['med_income','white_poverty','metro_pop','non_citizen','votes_to_right','hs_degree','gini_index']]

#make a pairplot to view shape of correlations/frequency
sns.set_context("notebook")
sns.pairplot(rd_viz2, corner=True, height=2)
plt.suptitle("Pairplot of seven indicators", y=1.01, x=0.07, fontweight="bold", fontsize=12)
plt.show()
Initial exploration - visualisation 3 - scatterplots
Having scrutinised the pairplot, I identified a potential relationship between hs_degree, med_income and gini_index. I decided to investigate whether the percentage of a region's population who hold a high school degree is associated with the median income and the income equality of that region. When considered in tandem with the Pearson's coefficients, the shapes of the scatterplots in the pairplot further suggest that the relationship between these indicators may be meaningful. I will investigate further, but I suspect that the higher the educational attainment of citizens in a given society, the higher the median income and income equality. I acknowledge that it is not possible to determine causation from this correlation alone, but it will be interesting to explore the relationship.
In order to investigate these three indicators more closely, I made a scatterplot for each relationship with its respective Pearson’s coefficient. I also rendered the trend line for each plot, which begins to visually explore the strength of the relationships; this will be investigated in more detail in Task 8.
The scatterplots indicate that some outliers may be present, which I will explore further in Task 7.
#initialise variables - list of indicator pairs to compare and empty list to store values of pearson's coefficient
rd_viz3 = [
    ('hs_degree', 'gini_index'),
    ('hs_degree', 'med_income'),
    ('med_income', 'gini_index')
]
coefficients = []

#calculate r for each pair and store in list
for i in rd_viz3:
    coefficients.append((i[0], i[1], stats.pearsonr(rd[i[0]], rd[i[1]])[0]))
#make plots with trend lines and pearson's coefficient
sns.set_context("paper")
fig, ax = plt.subplots(1, 3, figsize=(18, 4))
plt.title("Relationships between three indicators with Pearson's Coefficient", y=1.05, x=-1.85, fontweight="bold", fontsize=12)
for i,el in enumerate(coefficients):
    sns.regplot(data=rd, x=rd[el[0]], y=rd[el[1]], ax=ax[i], ci=None)
    ax[i].text(0.1, 0.9, f'r = {el[2]:.2f}', transform=ax[i].transAxes, fontsize=10, fontweight="bold") #code adapted from Chugani (2024)
Task 7 - Detect and remove any outliers in the data used for your “interesting” pattern or trend
The first approach that I took to detecting outliers was to render a boxplot for each indicator. This provided a visualisation of the spread of the data, and quickly highlighted that there appears to be one outlier.
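For reference, with the default settings used here, the boxplots flag as outliers any values falling outside \([Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]\), where \(Q_1\) and \(Q_3\) are the first and third quartiles and \(\mathrm{IQR} = Q_3 - Q_1\).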
Next, I built on the boxplots by calculating the z-score for each value; this provided a way to detect outliers to a higher degree of precision by quantifying how many standard deviations the values are from the mean. I decided to use a threshold of z = 3 to detect outliers, as this is considered best practice for highlighting extreme values without being overly strict (Tharaka, 2025). I calculated the z-scores for all three indicators and printed the result. This process confirmed that there is one outlier in gini_index.
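For reference, the absolute z-score computed below for each value \(x_i\) of an indicator is:

\[ z_i = \frac{\lvert x_i - \bar{x} \rvert}{s} \]

where \(\bar{x}\) is the indicator's mean and \(s\) its sample standard deviation; values with \(z_i > 3\) are flagged as outliers.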
I considered whether to retain the outlier, because it is possible that it is a valid, albeit unusual, value. However, I felt that there was a risk that it could adversely affect the hypothesis test in Task 8, so I chose to remove it to ensure a more reliable result. As such, I concatenated the z-score column for gini_index with the dataframe so that I could safely remove the row containing the outlier.
Lastly, I created a scatterplot to combine and visualise all three indicators following the removal of the outlier. This provided a final opportunity to assess the relationship between all three indicators and its trend. I built my plot with med_income on the y-axis (to reflect its position as the possible dependent variable in my hypothesis test); hs_degree on the x-axis (independent variable); and gini_index represented by both the size and colour of the scatterplot points. I felt that these decisions would contribute to a clear and easily interpretable visualisation of the data.
rd_outliers = rd[['hs_degree','med_income','gini_index']]

#make boxplots
fig, ax = plt.subplots(1, 3, figsize=(18, 4))
plt.title("Boxplots of three indicators - detecting outliers", y=1.05, x=-1.95, fontweight="bold", fontsize=12)
for i,el in enumerate(rd_outliers):
    sns.boxplot(x=rd[el], ax=ax[i])
#calculating z-score to detect outliers and decide whether to remove them
z_scores = np.abs((rd_outliers - rd_outliers.mean()) / rd_outliers.std()) #code from Eichner (2023)

#confirm that only one outlier is present
num_outliers = np.sum(z_scores > 3, axis=0).sum()
print(f"The selected indicators contain {num_outliers} outlier(s).")

#add z-score column for gini_index
rd_outliers = pd.concat([rd_outliers, z_scores['gini_index'].rename('gi_z_score')], axis='columns')
The selected indicators contain 1 outlier(s).
#remove outlier from dataframe
rd_less_outliers = rd_outliers[rd_outliers['gi_z_score'] < 3]

#get final version of data
rdlo = rd_less_outliers.copy()
#visualise selected indicators with outlier removed
plt.figure(figsize=(8, 6))
g = sns.scatterplot(x=rdlo['hs_degree'], y=rdlo['med_income'], hue=rdlo['gini_index'], size=rdlo['gini_index'])
plt.title("Relationship between high school degree rate, median income, and Gini Index", y=1.05, x=0.5, fontweight="bold", fontsize=12)
plt.xlabel('Share of population with high school degree')
plt.ylabel('Median income')
plt.legend(title="Gini Index")
plt.show()
Task 8 - Define a hypothesis to test your "interesting" pattern or trend and test your hypothesis with a statistical significance level of 0.05
Hypothesis test - background
I will use a linear regression model to test my hypothesis.
Before conducting my hypothesis test, I decided to run a Shapiro-Wilk test to check the indicators for normality. The results showed no evidence against normality for med_income (p = 0.95), but some evidence that hs_degree is non-normal (p = 0.01) (Choueiry, n.d.). For this reason, I next checked the skewness of hs_degree; the socioeconomic context of the regions from which the data was collected is unknown but, in the UK and similar countries, it would be fair to expect the proportion of people graduating high school to be higher than the proportion who do not, i.e. a left skew. The skew test shows that there is a negative skew to this data (-0.38), but as it is mild and hs_degree will be the independent variable in my linear regression model, I decided not to apply any transformation (Bobbitt, 2022a).
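For illustration only: had the negative skew been severe enough to warrant it, one common remedy is to reflect the variable and apply a log transform. The snippet below is a minimal sketch of that approach and was not used in the analysis.

#illustrative sketch only - not applied in this analysis
#reflect hs_degree so the long tail points right, then log-transform to compress it
hs_reflected = (rdlo['hs_degree'].max() + 1) - rdlo['hs_degree']
hs_transformed = np.log(hs_reflected)
print(f"Skew after reflect-and-log transform: {hs_transformed.skew():.4f}")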
To complete my exploration of this interesting trend, I decided to exclude the Gini Index from the model. Compared to the correlation between the other two indicators, gini_index has lower respective Pearson's coefficients; as such, I felt that concentrating on the relationship between hs_degree and med_income would be a more impactful analysis.
#dictionary for saving results of shapiro-wilk tests
sw_results = {
    'hs_degree': 0,
    'med_income': 0
}

#test normality
for key in sw_results.keys():
    sw_results[key] = f"p = {stats.shapiro(rdlo[key])[1]:.4f}"

print(f'The results of the Shapiro-Wilk test are: {sw_results}')
The results of the Shapiro-Wilk test are: {'hs_degree': 'p = 0.0101', 'med_income': 'p = 0.9485'}
#calculate skewness of hs_degree
print(f"Skew of hs_degree: {rdlo['hs_degree'].skew():.4f}")
Skew of hs_degree: -0.3841
Stating the hypothesis
\(H_0\): there is no relationship between the share of the population with a high school degree and the region’s median income (\(\beta = 0\))
\(H_1\): there is a relationship between these two indicators (\(\beta \ne 0\))
(Bobbitt, 2021)
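In terms of the model fitted below, \(\beta\) is the slope coefficient of the simple linear regression:

\[ \text{med\_income}_i = \beta_0 + \beta \cdot \text{hs\_degree}_i + \varepsilon_i \]

where \(\beta_0\) is the intercept and \(\varepsilon_i\) is the error term for region \(i\).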
Testing the hypothesis
To test my hypothesis, I will use the formula API from the statsmodels library to build an Ordinary Least Squares (OLS) linear regression model. The API enabled me to use R-like syntax, which I found helpful because I was familiar with it from another module. The purpose of this is to use the data to test whether there is a statistically significant relationship between hs_degree and med_income.
Next I will fit the model and print a summary of the output.
#fit the linear regression model
mod = smf.ols(formula='med_income ~ hs_degree', data=rdlo)
res = mod.fit()
print(res.summary())
#code from Perktold et al. (2024)
OLS Regression Results
==============================================================================
Dep. Variable: med_income R-squared: 0.443
Model: OLS Adj. R-squared: 0.432
Method: Least Squares F-statistic: 38.20
Date: Wed, 07 May 2025 Prob (F-statistic): 1.33e-07
Time: 14:29:24 Log-Likelihood: -511.66
No. Observations: 50 AIC: 1027.
Df Residuals: 48 BIC: 1031.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -9.816e+04 2.48e+04 -3.959 0.000 -1.48e+05 -4.83e+04
hs_degree 1.762e+05 2.85e+04 6.181 0.000 1.19e+05 2.34e+05
==============================================================================
Omnibus: 6.306 Durbin-Watson: 1.923
Prob(Omnibus): 0.043 Jarque-Bera (JB): 5.956
Skew: 0.845 Prob(JB): 0.0509
Kurtosis: 3.021 Cond. No. 51.5
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Assessing the assumptions of my model
The output from the OLS linear regression model suggests that there is a dependency between the two indicators: the model explains approximately 43% of the variance in the dependent variable med_income (adjusted R-squared = 0.432). With a p-value for the F-statistic far smaller than 0.05, we can conclude that this relationship is statistically significant.
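As a quick check of the figures quoted above, the relevant statistics can be read directly from the fitted results object res; a minimal sketch using standard attributes of the statsmodels results object:

#pull the headline statistics directly from the fitted OLS results
print(f"R-squared: {res.rsquared:.3f}")
print(f"Adjusted R-squared: {res.rsquared_adj:.3f}")
print(f"F-statistic p-value: {res.f_pvalue:.2e}")
print(f"Slope for hs_degree: {res.params['hs_degree']:.0f} (p = {res.pvalues['hs_degree']:.2e})")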
I also produced a Fitted vs Residuals plot and a Q-Q plot to explore the reliability of my linear regression model. These both show that my model is reasonably reliable:
- The residuals (left) are generally randomly scattered around the fitted line and are largely homoscedastic (Moreno, 2019). There is some variance apparent, but I feel that this is not so extreme as to diminish the usefulness of my model.
- Whilst there is a slight curve apparent in the Q-Q plot, this is potentially due to the minor non-normality of the hs_degree indicator (Waples, 2024).
In summary, the above tests provide enough evidence to enable me to reject the null hypothesis and conclude that there is a significant relationship between the share of the population with a high school degree and the region’s median income (\(\beta \ne 0\)) for this dataset.
#make fitted vs residuals and q-q plots
fig, ax = plt.subplots(1, 2, figsize=(16, 5))
fig.suptitle("Diagnostic Plots for Linear Regression Model", fontsize=14, fontweight='bold', y=1, x=0.2)
sns.residplot(x=rdlo['hs_degree'], y=rdlo['med_income'], lowess=True, ax=ax[0])
stats.probplot(res.resid, dist="norm", plot=plt)
ax[0].set_title("Fitted vs Residuals Plot")
ax[1].set_title("Q-Q Plot")

plt.show()
#code adapted from Perktold et al. (2024)
Final visualisation
Finally, I produced a scatterplot with a regression line (including confidence bands) to visualise the relationship between the share of the population with a high school degree (independent variable) and median income (dependent variable) for the 50 remaining regions. I also calculated Spearman's Rank Coefficient (rho) for the relationship, both to provide a numerical metric to add context to the visualisation, and to support the findings of the hypothesis test whilst accounting for the fact that hs_degree is slightly non-normal.
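For reference, when there are no tied ranks, Spearman's coefficient can be written in terms of the differences \(d_i\) between the ranks of each region's two values:

\[ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]

More generally (including ties), it is the Pearson correlation computed on the ranked data, which is what scipy.stats.spearmanr returns.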
#calculate rho
rho = stats.spearmanr(rdlo['hs_degree'], rdlo['med_income'])

#make plots to show linear regression model
fig, ax = plt.subplots(figsize=(7, 5))
plt.title("Regression model and Spearman's Rank Coefficient", y=1.05, x=0.3, fontweight="bold", fontsize=12)
sns.regplot(data=rdlo, x=rdlo['hs_degree'], y=rdlo['med_income'])
plt.text(0.1, 0.9, f'Rho = {round((rho[0]),2)}', transform=ax.transAxes, fontsize=10, fontweight="bold") #code adapted from Chugani (2024)
plt.xlabel('Share of population with high school degree')
plt.ylabel('Median income')
plt.show()
Bibliography
Bobbitt, Z. (2021) Understanding the Null Hypothesis for Linear Regression [blog]. 14 May. Available from: https://www.statology.org/null-hypothesis-for-linear-regression/ [Accessed 30 April 2025].
Bobbitt, Z. (2022a) How to Interpret Skewness in Statistics (With Examples) [tutorial]. Available from: https://www.statology.org/skewness-interpretation/ [Accessed 30 April 2025].
Choueiry, G. (no date) How to Report the Shapiro-Wilk Test [tutorial]. Available from: https://quantifyinghealth.com/report-shapiro-wilk-test/ [Accessed 22 April 2025].
Chugani, V. (2024) Skewness Be Gone: Transformative Tricks for Data Scientists [tutorial]. Available from: https://machinelearningmastery.com/skewness-be-gone-transformative-tricks-for-data-scientists/ [Accessed 24 April 2025].
Moreno, A. (2019) Linear Regression Plots: Fitted vs Residuals [blog]. 27 March. Available from: https://boostedml.com/2019/03/linear-regression-plots-fitted-vs-residuals.html [Accessed 1 May 2025].
Perktold, J. et al. (2024) Fitting models using R-style formulas [online]. Available from: https://www.statsmodels.org/stable/example_formulas.html [Accessed 24 April 2025].
Tharaka, S. (2025) Z-Score Outlier Detection Explained: How to Identify Anomalies in Your Data? [tutorial]. Available from: https://www.mlinsightful.com/z-score-outlier-detection-explained/ [Accessed 27 April 2025].
Waples, J. (2024) The Q-Q Plot: What It Means and How to Interpret It [tutorial]. Available from: https://www.datacamp.com/tutorial/qq-plot [Accessed 2 May 2025].