The first link points to the github.io site; the second points to the GitHub repository, where you can find all of our data.
This project is a collaboration between Matt Catalano and Eli Mendels for our Data Science class with Nick Mattei. We will be examining what metrics correspond to national happiness and then trying to predict which counties in the US will be the happiest.
The first few lines of code will install all of the libraries that we will use throughout this project.
!pip install geopandas
!pip install descartes
!pip install PyShp
#Importing all the necessary libraries.
import sqlite3
import pandas as pd
import geopandas as geo
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats
import re
import seaborn as sns
import sklearn
from sklearn import linear_model
#model_selection is needed later for train_test_split and cross_val_score
from sklearn import model_selection
import random
import shapefile
#The code below is for formatting and is taken from past labs written by Nick Mattei
#This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages
matplotlib.style.use('fivethirtyeight')
# These two things are for Pandas; they widen the notebook and let us display data easily.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
#Show all the rows of a dataframe
pd.options.display.max_rows = 10000
pd.options.display.width = 1000
The first step in looking at the country data is to read it into a dataframe. Once the data has been read, we will clean it to prepare it for analysis and add any other data that we deem important.
Most of our country data comes from the CIA World Factbook. We got the data in csv format from https://github.com/thewiremonkey/factbook.csv. The data is a collection of csv files, each of which contains three columns: a rank, the country name, and the value. Each file contains one measurement (population, for example), so we must iterate over the files and join each one on name to the cumulative table.
Information about the dataset is found in categories.csv, which gives a file code and the measurement name for each csv. This is what we will iterate over to build the factbook dataframe.
#Reading in a summary of the csv files
categories_df = pd.read_csv('categories.csv')
categories_df.head()
#A string to store part of the file names
csv_string = 'WorldFactBookData/c'
#Temporary dataframe to store csvs that need to be added to the dataframe.
#It's used here to get the names of each country
df_to_add = pd.read_csv('WorldFactBookData/c2147.csv')
#Creating the dataframe with just the names
factbook_df = pd.DataFrame({'Name' : df_to_add['Name']})
#Iterating through each file code in the categories list and adding the csv to the dataframe
for index, row in categories_df.iterrows():
    file_name = csv_string + str(row['Num']) + '.csv'
    df_to_add = pd.read_csv(file_name)
    df_to_add = df_to_add.drop(columns=['Pos'])
    column_name = row['Name']
    df_to_add = df_to_add.rename(columns={"Value": column_name})
    factbook_df = factbook_df.merge(df_to_add, how = 'left')
factbook_df.head()
Now that we have a dataframe that stores our data about each country, it's time to look at the results of the happiness survey for each country. This process is fairly easy because it is already in one csv file ready to go.
happiness_df = pd.read_csv('2015.csv')
happiness_df.head()
The next step is to combine the two country-level datasets. This will be done with a left join on the happiness dataframe because we are only interested in the countries that have a happiness score.
The first thing we will do is find out which countries are in the happiness survey but not in the World Factbook. Once that is done, we will decide on a case-by-case basis whether the data needs to be manipulated for the join to work or whether those specific rows need to be dropped. Since some countries go by different names in the two datasets, it is likely that some names will need to be changed in one dataset so that the primary keys match up.
#An initial merge to find which countries' names need to be manually changed for the match.
#Since there are only 8 countries they will be addressed manually
prelim_merge = happiness_df.merge(factbook_df, how = 'left', left_on = 'Country', right_on = 'Name')
countries_to_match = prelim_merge[prelim_merge['Area'].isnull()]
countries_to_match
The countries that appear in both dataframes under different names are renamed below, and the regions that are not universally recognized will be dropped. This is the first major decision we make with regard to missing data. In the end, we decided not to look at North Cyprus, Somaliland, or Palestine, as these regions are contested and not widely recognized.
#Renaming Korea, South to South Korea in the factbook_df
factbook_df.at[108, 'Name'] = 'South Korea'
#Renaming Burma to Myanmar in the factbook_df
factbook_df.at[39, 'Name'] = 'Myanmar'
#Renaming Cote d'Ivoire to Ivory Coast in the factbook_df
factbook_df.at[68, 'Name'] = 'Ivory Coast'
#Renaming Congo, Republic of the to Congo (Brazzaville) in the factbook_df
factbook_df.at[63, 'Name'] = 'Congo (Brazzaville)'
#Renaming Congo, Democratic Republic of the to Congo (Kinshasa) in the factbook_df
factbook_df.at[10, 'Name'] = 'Congo (Kinshasa)'
#Creating a final dataframe to store all of our country data
countries_df = happiness_df.merge(factbook_df, how = 'left', left_on = 'Country', right_on = 'Name')
countries_df.head()
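As a side note, an equivalent and more index-independent way to apply these renames would be a name-to-name mapping. The sketch below assumes the original factbook spellings given in the comments above match the csv exactly:
#Hypothetical alternative to the row-index edits above: map old factbook names to the survey's names
name_fixes = {'Korea, South': 'South Korea',
              'Burma': 'Myanmar',
              "Cote d'Ivoire": 'Ivory Coast',
              'Congo, Republic of the': 'Congo (Brazzaville)',
              'Congo, Democratic Republic of the': 'Congo (Kinshasa)'}
factbook_df['Name'] = factbook_df['Name'].replace(name_fixes)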
Now that we have the country dataframe, it is time to clean it. This means dropping any unnecessary columns, ensuring each column is of the right type, and performing operations on some columns to make them better suited for manipulation.
The first step will be to remove all the columns that we won't be looking at. We will start by removing the columns that came with the happiness data, as they were already selected as indicators of happiness and could not exist on a county level. Then, we will remove a variety of other columns that we know won't exist at a county level. Most columns removed here have to do with national finances, which won't be relevant when we begin to look at county data.
#Creating a list of columns to be removed
columns_to_be_removed = ['Reserves of foreign exchange and gold',
'Standard Error', 'Dystopia Residual', 'Generosity', 'Trust (Government Corruption)', 'Area', 'Gross national saving',
'Inflation rate (consumer prices)', 'Central bank discount rate',
'Commercial bank prime lending rate', 'Stock of narrow money',
'Stock of broad money', 'Stock of domestic credit',
'Market value of publicly traded shares', 'Current account balance',
'Stock of direct foreign investment - at home',
'Stock of direct foreign investment - abroad','Economy (GDP per Capita)',
'Family', 'Health (Life Expectancy)', 'Freedom','Name','Railways', 'Waterways']
countries_df = countries_df.drop(columns_to_be_removed, axis = 1)
Once those columns are dropped, we will make sure each column is of the correct dtype. To do this, we will use the pd.DataFrame.dtypes attribute to look at the dtype of each column and determine which columns are not correct. We will then convert them to the correct ones. In this case, this involves removing dollar signs and commas from the numbers and then converting them to float64.
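For example, a quick (illustrative) way to see which columns pandas read in as strings rather than numbers:
#List the dtype of every column, then just the ones stored as generic objects (strings)
print(countries_df.dtypes)
print(countries_df.select_dtypes(include='object').columns.tolist())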
#Creating a list of all the columns with dollar signs and commas that need to be converted to float64
object_cols = ['GDP (purchasing power parity)','GDP - per capita (PPP)','Exports', 'Imports', 'Debt - external']
#Iterating through the list, removing any non-numerical symbols and then casting as float64
#regex=False ensures '$' is treated as a literal character rather than a regex anchor
for i in object_cols:
    countries_df[i] = countries_df[i].str.replace(',', "", regex=False)
    countries_df[i] = countries_df[i].str.replace('$', "", regex=False)
    countries_df[i] = countries_df[i].astype('float64')
The next data conversion that needs to be done is converting gross metrics into per capita metrics. This is done by creating a list of all the gross metrics, iterating over them to create a column with the per capita version of the metric, and then removing the redundant gross metric column.
gross_measurements = ['Health expenditures', "HIV/AIDS - people living with HIV/AIDS", "HIV/AIDS - deaths",
'Children under the age of 5 years underweight', "Education expenditures",
'Telephones - fixed lines', 'Telephones - mobile cellular', "Internet users",
'Military expenditures', 'Labor force', 'Public debt', 'Electricity - production',
'Electricity - consumption','Electricity - exports', 'Electricity - imports',
'Electricity - installed generating capacity', 'Electricity - from fossil fuels',
'Electricity - from nuclear fuels','Electricity - from hydroelectric plants',
'Electricity - from other renewable sources',"Airports","Roadways"]
for i in gross_measurements:
    countries_df[i + ' per capita'] = countries_df[i] / countries_df["Population"]
    countries_df = countries_df.drop([i], axis=1)
Now that we have a clean dataset, we will create a standardized version of it. This will allow us to compare across datasets, which will be necessary when we start to look at the county data. It is easy to standardize a dataframe of only numerical data, so we will extract the string columns, standardize the numerical dataframe, and then add the string columns back.
#Creating the numerical dataframe
numerical_country_df = countries_df.drop(['Country', 'Region'], axis = 1)
#Standardizing it
standardized_country_df = (numerical_country_df-numerical_country_df.mean())/numerical_country_df.std()
#Adding back the names and countries
standardized_country_df.insert(0, 'Region', countries_df['Region'])
standardized_country_df.insert(0, 'Country', countries_df['Country'])
standardized_country_df.head()
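As a quick sanity check (not part of the original analysis), the standardized columns should now have a mean of roughly 0 and a standard deviation of roughly 1:
#Check the first few column means and standard deviations after standardization
numeric_part = standardized_country_df.drop(['Country', 'Region'], axis = 1)
print(numeric_part.mean().round(2).head())
print(numeric_part.std().round(2).head())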
With the cleaned and organized data we can finally start to look at what it says. This will be done in a couple ways. The first will be to look at the distribution of happiness. Then we will begin to look at what variables correlate to happiness.
To look at the distribution of happiness, we will use both a box plot and a histogram.
fig, ax = plt.subplots(1, 2, figsize=(20,5))
fig1 = countries_df['Happiness Score'].plot.box(ax = ax[0], title="Box Plot of the World's Happiness Scores")
fig2 = countries_df['Happiness Score'].plot.hist(bins = 15, ax = ax[1], title="Distribution of the World's Happiness Scores")
From these graphs, it's clear that the happiness scores follow a somewhat normal distribution. The mean value is about 5.2 with a range of 2.8 to 7.8. There doesn't appear to be much of a skew and there are not many outliers.
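These summary numbers can also be read directly off the data; a quick check we might run:
#Summary statistics for the happiness score (count, mean, quartiles, min, max)
countries_df['Happiness Score'].describe()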
Below, we will begin to look at the relationship between certain regions and their happiness scores.
countries_df.groupby('Region')['Happiness Score'].mean().sort_values().plot.bar(title = 'Happiness Score by Region')
It's clear that there is a relationship between region and mean happiness score. The happiest regions also happen to be the wealthiest regions, a relationship that will become more obvious as we do more analysis.
Below, we take a look at the correlation between each variable and happiness and then select the ones with high correlation values. Since a correlation can be negative, we look at large positive or negative correlation values and then plot them below to show which variables have the strongest correlation.
happiness_corr_df = countries_df.corr()
happiness_corr_df = happiness_corr_df[(happiness_corr_df['Happiness Score'] >= .4) | (happiness_corr_df['Happiness Score'] <= -.4)]["Happiness Score"]
happiness_corr_df = happiness_corr_df.drop('Happiness Score')
happiness_corr_df = happiness_corr_df.sort_values()
happiness_corr_df.plot.bar(title = 'Largest Correlation Values for Happiness Score')
This graph shows us that high values for life expectancy at birth, internet users per capita, and GDP per capita are good indicators of a happy country. Good indicators of an unhappy country are a high infant mortality rate, total fertility rate, and unemployment rate. It is interesting to see birth rate and total fertility rate as indicators of an unhappy country rather than a category like children under the age of 5 underweight. Internet users per capita also surprised us as one of the strongest indicators; we figured other categories would be more telling.
Additionally, happiness rank sits at -1 because as the happiness score gets higher, the rank number gets lower: the #1 ranked country has the highest happiness score. This confused us for a little bit when we were looking at this graph, so we figured it would be best to explain it here.
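We can confirm this directly. Assuming the rank column kept its name from the 2015 survey csv ('Happiness Rank'), its correlation with the score should be almost exactly -1:
#Rank and score should be (nearly) perfectly negatively correlated
countries_df['Happiness Rank'].corr(countries_df['Happiness Score'])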
The following code will begin to prepare our data frame to display a choropleth map of happiness scores for the world.
world_map_df = geo.read_file('Countries_WGS84.shp')
world_map_key_df = pd.read_csv('Countries_WGS84.csv')
world_map_df = world_map_df.merge(world_map_key_df, how = 'left', left_index = True, right_index = True )
world_map_df.head()
#Countries to rename
world_map_df.at[25, 'Name'] = 'Myanmar'
world_map_df.at[27, 'Name'] = 'Belarus'
#Adding Greenland as part of Denmark
world_map_df.at[86, 'Name'] = 'Denmark'
world_map_df = world_map_df.merge(countries_df, left_on = 'CNTRY_NAME', right_on = 'Country', how = 'left')
world_map_df = world_map_df[world_map_df["CNTRY_NAME"] != "Antarctica"]
# set the range for the choropleth (note: vmin and vmax are defined here but not passed to the plot below)
vmin, vmax = 120, 220
# create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(20, 15))
world_map_df.plot(column='Happiness Score', cmap='YlGnBu', linewidth=0.8, ax=ax, edgecolor = '0.8', legend=True, legend_kwds = {'label': "Happiness Score by Country", 'orientation': "horizontal"})
ax.set_axis_off()
plt.show()
Wealthy countries like the United States, Canada, Australia, and much of northern Europe are very happy. Africa is clearly the saddest continent, which makes sense given how many poverty-stricken areas it contains. Asia is also not very happy.
Note that in the map above Greenland does not have a happiness score. Although it looks like a bleak place in this graph, that is only because its value is NaN.
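If we wanted the countries with no data to stand out explicitly instead of blending into the colormap, geopandas can style them separately. A sketch, assuming a reasonably recent geopandas version that supports missing_kwds:
#Re-draw the world map with countries that have a NaN happiness score shaded grey
fig, ax = plt.subplots(1, figsize=(20, 15))
world_map_df.plot(column='Happiness Score', cmap='YlGnBu', linewidth=0.8, ax=ax, edgecolor='0.8',
                  legend=True, legend_kwds={'label': "Happiness Score by Country", 'orientation': "horizontal"},
                  missing_kwds={'color': 'lightgrey'})
ax.set_axis_off()
plt.show()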
The county data was acquired from these sources:
https://www.ers.usda.gov/data-products/county-level-data-sets/
https://policyinformatics.asu.edu/broadband-data-portal/dataaccess/countydata
There is a lot of county data that we need, and we will start loading and tidying the csv files here. We will only use data from the most recent year we can find and then standardize it, so that we can compare the most recent county data to the standardized 2015 country data from the World Factbook. First, we will load in a dataset that contains population estimates for counties.
state_pop_df = pd.read_csv('PopulationEstimates.csv')
state_pop_df.head()
There are some columns that are empty due to the way that the data was loaded in. First, we will drop those columns as well as the row containing the United States data (as that is irrelevant here).
After we filter these columns, we will get rid of all data except for the most recent year.
state_pop_df = state_pop_df[state_pop_df.columns.drop(list(state_pop_df.filter(regex='Unnamed')))]
state_pop_df = state_pop_df.drop(0)
lst = [i + 2010 for i in range(8)]
for i in lst:
    i = str(i)
    if i == "2010":
        state_pop_df = state_pop_df.drop(['Pop Estimate ' + i], axis=1)
        state_pop_df = state_pop_df.drop(['Estimates Base ' + i], axis=1)
        state_pop_df = state_pop_df.drop(['Census Pop ' + i], axis=1)
    else:
        state_pop_df = state_pop_df.drop(['Pop Estimate ' + i], axis=1)
        state_pop_df = state_pop_df.drop(['Net migration rate ' + i], axis=1)
        state_pop_df = state_pop_df.drop(['Birth rate ' + i], axis=1)
        state_pop_df = state_pop_df.drop(['Death rate ' + i], axis=1)
        state_pop_df = state_pop_df.drop(['Natural increase rate ' + i], axis=1)
state_pop_df.head()
Now, we will look at a dataframe containing county-wide unemployment data. We again need to drop the United States row as well as all of the unnamed columns, and we will remove all but the most recent data. Area_name in this dataset is not initially conducive to later joins, so we will split it into two columns containing the county name and the state name.
state_unemployment_df = pd.read_csv('Unemployment.csv')
state_unemployment_df = state_unemployment_df[state_unemployment_df.columns.drop(list(state_unemployment_df.filter(regex='Unnamed')))]
state_unemployment_df = state_unemployment_df.drop(0)
lst = [i + 2007 for i in range(11)]
for i in lst:
    i = str(i)
    state_unemployment_df = state_unemployment_df.drop(['Unemployment_rate_' + i], axis=1)
state_unemployment_df[['Area_Name', 'State_Name']] = state_unemployment_df["Area_name"].str.split(", ", expand=True)
state_unemployment_df = state_unemployment_df.drop(["State", "Area_name"], axis=1)
state_unemployment_df.head()
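To illustrate what the split with expand=True does, here is a toy example on a sample value in the same "County, State" format (the value itself is just for illustration):
#The split produces two columns: the part before the comma and the part after it
pd.Series(['Autauga County, AL']).str.split(', ', expand=True)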
We will now load in a dataset that contains county-wide life expectancy data. We again need to perform similar tasks: drop the United States row, split the location into two columns, and drop all but the most recent data.
county_life_exp_df = pd.read_csv('County Life Expectancy.csv')
county_life_exp_df = county_life_exp_df.drop(0)
county_life_exp_df[['Area_Name', 'State_Name']] = county_life_exp_df["Location"].str.split(", ", expand=True)
county_life_exp_df = county_life_exp_df.drop(['Location'], axis=1)
lst = [i + 1980 for i in range(7)]
#Dropping the life expectancy columns for 1980, 1985, ..., 2010 in five-year steps, leaving only 2014
c = 0
for i in lst:
    i = c + i
    i = str(i)
    county_life_exp_df = county_life_exp_df.drop(['Life expectancy, ' + i], axis=1)
    c += 4
county_life_exp_df.head()
We will now load in a dataset that contains county-wide internet users per capita. We will only take the data from its most recent year (2016), since we will be standardizing everything anyway. Unfortunately, it gives the state as its full name, so to perform our first merge we will replace the full names with abbreviations. (Don't worry, we found this dictionary online, so we did not need to type it out ourselves.)
county_internet_df = pd.read_csv('county_internet.csv')
county_internet_df = county_internet_df[county_internet_df["year"] ==2016]
county_internet_df = county_internet_df.rename(columns={'county': 'Area_Name', 'state': 'State'})
#Don't worry, we found this dictionary online, we didn't go through the pain of typing it ourselves
d = {
'Alabama': 'AL',
'Alaska': 'AK',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Palau': 'PW',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY',
}
county_internet_df = county_internet_df.replace(d)
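Note that DataFrame.replace looks for matching values in every column. A slightly more targeted (and, for this data, equivalent) version would convert only the State column; a sketch:
#Only convert the State column from full names to abbreviations
county_internet_df['State'] = county_internet_df['State'].replace(d)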
Now we can start merging the data. First, we will merge the internet users per capita data with the county population data. We merge on both the area name and the state so that counties with the same name in different states are not matched to each other.
county_merge_df = pd.merge(county_internet_df, state_pop_df, on=["Area_Name", "State"], how="outer")
county_merge_df = county_merge_df.drop(['year'], axis=1)
county_merge_df = county_merge_df.rename(columns={'broadband_home': 'Internet users per capita'})
county_merge_df.head()
Next, we will merge the county life expectancy data with the merged dataframe above. We will merge on the FIPS column, which is a unique identifier for every county in America.
county_merge1_df = pd.merge(county_life_exp_df, county_merge_df, on="FIPS", how="inner")
county_merge1_df = county_merge1_df.drop(['State_Name', 'Area_Name_y'], axis=1)
county_merge1_df = county_merge1_df.rename(columns={'Area_Name_x': 'Area_Name'})
county_merge1_df.head()
We can now merge the unemployment data with the previous result and tidy it up in the same way.
county_merge2_df = pd.merge(county_merge1_df, state_unemployment_df, on="FIPS", how="inner")
county_merge2_df = county_merge2_df.drop(['Area_Name_y', 'State_Name'], axis=1)
county_merge2_df = county_merge2_df.rename(columns = {'Area_Name_x': 'Area_Name'})
county_merge2_df['Life expectancy, 2014'] = county_merge2_df['Life expectancy, 2014'].str[:5]
county_merge2_df.head()
We need to turn all the columns that we want to standardize into floats with no commas or dollar signs. This will be done in the cell below.
object_cols = ['Pop Estimate 2018','Median_Household_Income_2017', 'Life expectancy, 2014']
for i in object_cols:
    county_merge2_df[i] = county_merge2_df[i].str.replace(',', "", regex=False)
    county_merge2_df[i] = county_merge2_df[i].str.replace('$', "", regex=False)
    county_merge2_df[i] = county_merge2_df[i].astype('float64')
The following code will standardize all the numerical columns in the dataframe so they can be compared later with the country data from the World Factbook. We also rename the columns here, as the previous column names would no longer be accurate.
standardized_county_df = county_merge2_df.drop(['Area_Name', 'State', 'FIPS'], axis = 1)
standardized_county_df = (standardized_county_df-standardized_county_df.mean())/standardized_county_df.std()
standardized_county_df.insert(0, 'Area_Name', county_merge2_df['Area_Name'])
standardized_county_df.insert(0, 'State', county_merge2_df['State'])
standardized_county_df.insert(0, 'FIPS', county_merge2_df['FIPS'])
standardized_county_df = standardized_county_df.rename(columns = {"Life expectancy, 2014": "Life expectancy", "Pop Estimate 2018": "Pop Estimate", "Birth rate 2018": "Birth rate", "Death rate 2018": "Death rate", "Natural increase rate 2018": "Natural increase rate", "Net migration rate 2018": "Net migration rate", "Unemployment_rate_2018": "Unemployment rate", "Median_Household_Income_2017": "Median household income", "Med_HH_Income_Percent_of_State_Total_2017": "Median household income percent of state total"})
standardized_county_df.head()
Upon closer examination, we found that Washington, DC was left out of several of the dataframes that we merged together. Due to this lack of data, we will remove all rows containing values for DC so as not to skew the rest of our data.
We also noticed that there were some NaN values in the internet users per capita column. To rectify this, we mean-imputed the NaN values. However, we did not simply take the mean of the entire column. In hopes of being slightly more accurate, we took the mean within each state and imputed that mean for each NaN-containing county in that state. This is a more regional way to mean-impute and hopefully gives us more realistic data.
standardized_county_df = standardized_county_df[standardized_county_df["State"] != "DC"]
standardized_county_df["Internet users per capita"] = standardized_county_df.groupby("State")["Internet users per capita"].transform(lambda x: x.fillna(x.mean()))
standardized_county_df.head()
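A quick check we might run (not part of the original notebook) to confirm the group-wise imputation filled everything; note it would still leave NaN for any state whose counties were all missing:
#Count the remaining NaN values in the imputed column (should be 0 unless an entire state was missing)
standardized_county_df['Internet users per capita'].isnull().sum()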
There is not much data analysis we can do with the county data alone (before we connect it to the happiness scores), but here is a little to start.
To see how the county variables relate to one another, we can plot a heatmap of their correlations. Interestingly, life expectancy and median household income have a correlation of 0.65, death rate and median household income are fairly negatively correlated at -0.54, and life expectancy and internet users per capita have a relatively high correlation of 0.49. Beyond these, the remaining pairs are either weakly correlated or unsurprisingly correlated (like birth rate and natural increase rate).
corr = standardized_county_df.corr()
fig, ax = plt.subplots(figsize=(20,20))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=colormap, annot=True, fmt = ".2f")
plt.xticks(range(len(corr.columns)), corr.columns)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("County data correlations")
plt.show()
We can also make some bar charts with some of the more interesting data collected. Shown below are median household income, life expectancy, and unemployment rate by state, each computed by averaging the counties in that state.
fig, ax = plt.subplots(figsize = (20,10))
ax1 = county_merge2_df.groupby('State')['Median_Household_Income_2017'].mean().sort_values().plot.bar(title = 'Median household income by state')
fig, ax = plt.subplots(figsize = (20,10))
ax2 = county_merge2_df.groupby('State')['Life expectancy, 2014'].mean().sort_values().plot.bar(title = 'Life expectancy by state')
fig, ax = plt.subplots(figsize = (20,10))
ax2 = county_merge2_df.groupby('State')['Unemployment_rate_2018'].mean().sort_values().plot.bar(title = 'Unemployment rate by state')
Life expectancy does not vary much from state to state, but household income and unemployment rate yield more interesting figures. We find it interesting that DC is towards the top of both household income and unemployment.
The following code will set up different models to test. We will choose whatever model ends up giving us the smallest root mean squared error upon cross validation.
#The relevant columns for the model
parameters = ['Happiness Score','Life expectancy at birth', 'Birth rate',
'Death rate', 'Population', 'Unemployment rate', 'Net migration rate',
'GDP - per capita (PPP)', 'Internet users per capita']
#The columns to be used as features in the model
x_cols = ['Life expectancy at birth', 'Birth rate', 'Death rate',
'Population', 'Unemployment rate', 'Net migration rate', 'GDP - per capita (PPP)', 'Internet users per capita']
#Creating a dataframe that just contains the necessary columns and dropping the na values
model_df = standardized_country_df[parameters].dropna()
#Isolating the X_values and y_values
X_values = model_df[x_cols].values
y_values = model_df['Happiness Score'].values
#Splitting the data into training and testing data
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_values, y_values, test_size=0.2, random_state=0)
#Generating the three types of models to compare: linear regression, lasso, and ridge
#For both lasso and ridge, various values of alpha were used
#Named linreg so we don't shadow the linear_model module imported above
linreg = sklearn.linear_model.LinearRegression()
lasso = sklearn.linear_model.Lasso()
lasso01 = sklearn.linear_model.Lasso(alpha = .01)
lasso0001 = sklearn.linear_model.Lasso(alpha = .0001)
ridge = sklearn.linear_model.Ridge()
ridge01 = sklearn.linear_model.Ridge(alpha = .01)
ridge100 = sklearn.linear_model.Ridge(alpha = 100)
#Function to get the rmse of a model based on 5-fold cross val analysis
def get_rmse_5_fold(model):
    scores = sklearn.model_selection.cross_val_score(model, X_values, y_values, cv = 5, scoring="neg_mean_squared_error")
    mse = np.mean(-scores)
    rmse = np.sqrt(mse)
    return rmse
#Getting the rmse for each model from the 5-fold cross val testing
linear_model_rmse = get_rmse_5_fold(linreg)
lasso_rmse = get_rmse_5_fold(lasso)
lasso01_rmse = get_rmse_5_fold(lasso01)
lasso0001_rmse = get_rmse_5_fold(lasso0001)
ridge_rmse = get_rmse_5_fold(ridge)
ridge01_rmse = get_rmse_5_fold(ridge01)
ridge100_rmse = get_rmse_5_fold(ridge100)
#Displaying the rmse cross_val_scores of each model
print('Linear Model RMSE: ', linear_model_rmse)
print('Lasso RMSE: ', lasso_rmse)
print('Lasso with alpha = .01 RMSE: ', lasso01_rmse)
print('Lasso with alpha = .0001 RMSE: ', lasso0001_rmse)
print('Ridge RMSE: ', ridge_rmse)
print('Ridge with alpha = .01 RMSE: ', ridge01_rmse)
print('Ridge with alpha = 100 RMSE: ', ridge100_rmse)
The model that gives us the lowest root mean squared error is the lasso model with alpha = .01. This is the model we will eventually fit to all of the countries.
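If we wanted to pick the best model programmatically rather than by reading the printout, a small sketch using the RMSE variables computed above would do it:
#Collect the cross-validated RMSEs and report the model with the smallest one
rmse_results = {'Linear': linear_model_rmse,
                'Lasso': lasso_rmse,
                'Lasso (alpha = .01)': lasso01_rmse,
                'Lasso (alpha = .0001)': lasso0001_rmse,
                'Ridge': ridge_rmse,
                'Ridge (alpha = .01)': ridge01_rmse,
                'Ridge (alpha = 100)': ridge100_rmse}
best_model_name = min(rmse_results, key=rmse_results.get)
print('Best model:', best_model_name, 'with RMSE', rmse_results[best_model_name])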
#Fitting the lasso (alpha = .01) model to the training data and predicting the held-out test data
lasso01.fit(X_train, y_train)
y_predicted = lasso01.predict(X_test)
y_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predicted})
y_df.plot.bar(title = "actual vs predicted value for happiness score")
plt.legend(loc=1)
country_rmse = np.sqrt(((y_test - y_predicted) ** 2).mean())
country_rmse
The graph above shows the actual versus the predicted happiness score for each of the countries in the held-out test set. The root mean squared error for our actual versus predicted scores is around 0.5, which is not fantastic. However, because we are only able to train on a small number of countries, it is not too bad. If there were thousands of countries in the world to train and test on, our predictions would most likely be more accurate.
We will now refit the lasso on all of the countries in the dataframe (instead of fitting it to the 80% of countries in the training split, we will fit it to 100% of them). We will use this re-fitted lasso to predict the happiness of individual counties in the United States. Since the counties do not have exact equivalents for every country-level feature, we rename the county columns to match the model's feature names, treating the 2018 population estimate as the population, the 2014 life expectancy as life expectancy at birth, and median household income as a stand-in for GDP per capita.
lasso01.fit(X_values, y_values)
counties_test = standardized_county_df.rename(columns={"Life expectancy": 'Life expectancy at birth',
'Pop Estimate': "Population",
'Median household income': 'GDP - per capita (PPP)'})
counties_test = counties_test.dropna()
counties_test['Predicted Happiness'] = lasso01.predict(counties_test[x_cols])
counties_test.head()
We have now added a new column to our counties dataframe: the happiness that our lasso model predicts for each county. Now we can create a choropleth map of the county-wide data similar to the map we created for the world.
counties_map_df = geo.read_file('./UScounties.shp')
counties_map_df = counties_map_df.drop(['STATE_FIPS', 'CNTY_FIPS'], axis = 1)
counties_map_df = counties_map_df.astype({'FIPS': 'float64'})
We do not live in Alaska or Hawaii and they make the map look very stretched out and awkward, so we are going to drop these from the graphic.
counties_map_df = counties_map_df.merge(counties_test, how = 'left', on = 'FIPS')
counties_map_df = counties_map_df[counties_map_df.STATE_NAME != 'Alaska']
counties_map_df = counties_map_df[counties_map_df.STATE_NAME != 'Hawaii']
# set the range for the choropleth (note: vmin and vmax are defined here but not passed to the plot below)
vmin, vmax = 120, 220
# create figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(20, 20))
counties_map_df.plot(column='Predicted Happiness', cmap='YlGnBu', linewidth=0.8, ax=ax, edgecolor = '0.6', legend = True, legend_kwds = {"label": "Predicted Happiness Score by County", "orientation": "horizontal"})
ax.set_axis_off()
plt.show()
This graph represents the conclusion of our project. We have successfully predicted the happiness of each and every county in the United States using a lasso model fitted to country happiness. As we discussed earlier, the model has the inherent flaw of only being able to train on a small number of countries, but we believe the insights to be gained from this graph are still valuable. As you can see, the Northeast, coastal California, and areas near major cities are generally the happiest. There are low points in the Southeast (towards Louisiana) and around West Virginia. We now have a predicted happiness for every county in the United States.
Since a list of the top 5 happiest counties is not very meaningful to the general public, below are two dataframes displaying the top 5 happiest states and the top 5 saddest states. We determined these by taking the mean of all of a state's counties.
happy_states = counties_test.groupby("State")["Predicted Happiness"].mean().sort_values(ascending=False)
pd.DataFrame(happy_states).iloc[:5]
pd.DataFrame(happy_states).iloc[-5:]
According to our prediction model, the top 5 happiest states (in order) are Colorado, Massachusetts, New Hampshire, Connecticut, and Rhode Island. The top 5 saddest states (saddest first) are Mississippi, Louisiana, New Mexico, Arkansas, and Alabama. Four of the five happiest states are located in the Northeast, and all five of the saddest states are located in the South.
counties_test.corr()['Predicted Happiness']
Shown above is the correlation between everything in our counties dataframe and the county's predicted happiness according to our model. We did not include FIPS or median household income percent of state total in our predictive model, but everything else was utilized. As we can see, life expectancy at birth, internet users per capita, and GDP per capita (our median household income stand-in) were weighted heavily in predicting happiness, as they correlate strongly with it. In the negative direction, death rate and unemployment rate show strong negative correlations with happiness.
Linked below are some articles related to this project that might be interesting reads after you have finished this tutorial. The first is the World Happiness Report from 2019. The second is an article that claims to rank the happiness of U.S. states. The last is a very brief article from Fast Company on where Americans are happiest and why. We hope you enjoyed this tutorial!