#load dslabs package
library("dslabs")
#load ggplot2
library(ggplot2)
#look at help file for gapminder data
help(gapminder)
#get an overview of data structure
str(gapminder)
'data.frame': 10545 obs. of 9 variables:
$ country : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
$ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ infant_mortality: num 115.4 148.2 208 NA 59.9 ...
$ life_expectancy : num 62.9 47.5 36 63 65.4 ...
$ fertility : num 6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
$ population : num 1636054 11124892 5270844 54681 20619075 ...
$ gdp : num NA 1.38e+10 NA NA 1.08e+11 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
$ region : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
#get a summary of data
summary(gapminder)
country year infant_mortality life_expectancy
Albania : 57 Min. :1960 Min. : 1.50 Min. :13.20
Algeria : 57 1st Qu.:1974 1st Qu.: 16.00 1st Qu.:57.50
Angola : 57 Median :1988 Median : 41.50 Median :67.54
Antigua and Barbuda: 57 Mean :1988 Mean : 55.31 Mean :64.81
Argentina : 57 3rd Qu.:2002 3rd Qu.: 85.10 3rd Qu.:73.00
Armenia : 57 Max. :2016 Max. :276.90 Max. :83.90
(Other) :10203 NA's :1453
fertility population gdp continent
Min. :0.840 Min. :3.124e+04 Min. :4.040e+07 Africa :2907
1st Qu.:2.200 1st Qu.:1.333e+06 1st Qu.:1.846e+09 Americas:2052
Median :3.750 Median :5.009e+06 Median :7.794e+09 Asia :2679
Mean :4.084 Mean :2.701e+07 Mean :1.480e+11 Europe :2223
3rd Qu.:6.000 3rd Qu.:1.523e+07 3rd Qu.:5.540e+10 Oceania : 684
Max. :9.220 Max. :1.376e+09 Max. :1.174e+13
NA's :187 NA's :185 NA's :2972
region
Western Asia :1026
Eastern Africa : 912
Western Africa : 912
Caribbean : 741
South America : 684
Southern Europe: 684
(Other) :5586
#determine the type of object gapminder is
class(gapminder)
[1] "data.frame"
R code for data exploration and cleaning
#Write code that assigns only the African countries to a new object/variable called africadata.
africadata = subset(gapminder, continent == 'Africa')
#Run str and summary on the new object you created.
str(africadata)
'data.frame': 2907 obs. of 9 variables:
$ country : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
$ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ infant_mortality: num 148 208 187 116 161 ...
$ life_expectancy : num 47.5 36 38.3 50.3 35.2 ...
$ fertility : num 7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
$ population : num 11124892 5270844 2431620 524029 4829291 ...
$ gdp : num 1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
$ region : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(africadata)
country year infant_mortality life_expectancy
Algeria : 57 Min. :1960 Min. : 11.40 Min. :13.20
Angola : 57 1st Qu.:1974 1st Qu.: 62.20 1st Qu.:48.23
Benin : 57 Median :1988 Median : 93.40 Median :53.98
Botswana : 57 Mean :1988 Mean : 95.12 Mean :54.38
Burkina Faso: 57 3rd Qu.:2002 3rd Qu.:124.70 3rd Qu.:60.10
Burundi : 57 Max. :2016 Max. :237.40 Max. :77.60
(Other) :2565 NA's :226
fertility population gdp continent
Min. :1.500 Min. : 41538 Min. :4.659e+07 Africa :2907
1st Qu.:5.160 1st Qu.: 1605232 1st Qu.:8.373e+08 Americas: 0
Median :6.160 Median : 5570982 Median :2.448e+09 Asia : 0
Mean :5.851 Mean : 12235961 Mean :9.346e+09 Europe : 0
3rd Qu.:6.860 3rd Qu.: 13888152 3rd Qu.:6.552e+09 Oceania : 0
Max. :8.450 Max. :182201962 Max. :1.935e+11
NA's :51 NA's :51 NA's :637
region
Eastern Africa :912
Western Africa :912
Middle Africa :456
Northern Africa :342
Southern Africa :285
Australia and New Zealand: 0
(Other) : 0
#We now have 2907 observations, down from 10545. Depending on how you do this, you might also notice that all the different categories are still kept in the continent (and other) variables, but show 0.
#Take the africadata object and create two new objects (name them whatever you want)
#Object 1 contains only infant_mortality and life_expectancy
myvars1 <- c("infant_mortality", "life_expectancy")
object1 <- africadata[myvars1]
str(object1)
'data.frame': 2907 obs. of 2 variables:
$ infant_mortality: num 148 208 187 116 161 ...
$ life_expectancy : num 47.5 36 38.3 50.3 35.2 ...
summary(object1)
infant_mortality life_expectancy
Min. : 11.40 Min. :13.20
1st Qu.: 62.20 1st Qu.:48.23
Median : 93.40 Median :53.98
Mean : 95.12 Mean :54.38
3rd Qu.:124.70 3rd Qu.:60.10
Max. :237.40 Max. :77.60
NA's :226
#Object 2 contains only population and life_expectancy
myvars2 <- c("population", "life_expectancy")
object2 <- africadata[myvars2]
str(object2)
'data.frame': 2907 obs. of 2 variables:
$ population : num 11124892 5270844 2431620 524029 4829291 ...
$ life_expectancy: num 47.5 36 38.3 50.3 35.2 ...
summary(object2)
population life_expectancy
Min. : 41538 Min. :13.20
1st Qu.: 1605232 1st Qu.:48.23
Median : 5570982 Median :53.98
Mean : 12235961 Mean :54.38
3rd Qu.: 13888152 3rd Qu.:60.10
Max. :182201962 Max. :77.60
NA's :51
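The earlier note about categories that are "still kept ... but show 0" describes unused factor levels. A minimal sketch of how they could be dropped with base R's `droplevels()`, using a small made-up data frame (`toy` and its values are hypothetical, not taken from gapminder):

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  country   = factor(c("Algeria", "Angola", "Albania", "Argentina")),
  continent = factor(c("Africa", "Africa", "Europe", "Americas"))
)

# subset() keeps every original factor level, even those with no rows left
africa_toy <- subset(toy, continent == "Africa")
levels(africa_toy$continent)

# droplevels() removes the levels that no longer occur in the data
africa_clean <- droplevels(africa_toy)
levels(africa_clean$continent)
```

Applied to this analysis, `droplevels(africadata)` would be expected to leave only the African levels in `country`, `continent`, and `region`, which should make the `summary()` output shorter.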
#Plot the data as points.
#Object 1
ggplot(object1, aes(x = infant_mortality, y = life_expectancy)) + geom_point()
Plotting and Analyzing Data from African Countries in 2000
#New Object just for Year=2000
africadata_2000 = subset(africadata, year == 2000)
str(africadata_2000)
'data.frame': 51 obs. of 9 variables:
$ country : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
$ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ infant_mortality: num 33.9 128.3 89.3 52.4 96.2 ...
$ life_expectancy : num 73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
$ fertility : num 2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
$ population : num 31183658 15058638 6949366 1736579 11607944 ...
$ gdp : num 5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
$ region : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...
summary(africadata_2000)
country year infant_mortality life_expectancy
Algeria : 1 Min. :2000 Min. : 12.30 Min. :37.60
Angola : 1 1st Qu.:2000 1st Qu.: 60.80 1st Qu.:51.75
Benin : 1 Median :2000 Median : 80.30 Median :54.30
Botswana : 1 Mean :2000 Mean : 78.93 Mean :56.36
Burkina Faso: 1 3rd Qu.:2000 3rd Qu.:103.30 3rd Qu.:60.00
Burundi : 1 Max. :2000 Max. :143.30 Max. :75.00
(Other) :45
fertility population gdp continent
Min. :1.990 Min. : 81154 Min. :2.019e+08 Africa :51
1st Qu.:4.150 1st Qu.: 2304687 1st Qu.:1.274e+09 Americas: 0
Median :5.550 Median : 8799165 Median :3.238e+09 Asia : 0
Mean :5.156 Mean : 15659800 Mean :1.155e+10 Europe : 0
3rd Qu.:5.960 3rd Qu.: 17391242 3rd Qu.:8.654e+09 Oceania : 0
Max. :7.730 Max. :122876723 Max. :1.329e+11
region
Eastern Africa :16
Western Africa :16
Middle Africa : 8
Northern Africa : 6
Southern Africa : 5
Australia and New Zealand: 0
(Other) : 0
#More Plotting
#Infant Mortality and Life Expectancy in 2000
ggplot(africadata_2000, aes(x = infant_mortality, y = life_expectancy)) + geom_point()
#Population and Life Expectancy in 2000
p2 <- ggplot(africadata_2000, aes(x = population, y = life_expectancy)) + geom_point()
p2 + scale_x_continuous(trans = 'log10')
#Statistics
#fit linear regression model using 'x' as predictor and 'y' as response variable
fit1 = lm(life_expectancy ~ infant_mortality, data = africadata_2000)
summary(fit1)
Call:
lm(formula = life_expectancy ~ infant_mortality, data = africadata_2000)
Residuals:
Min 1Q Median 3Q Max
-22.6651 -3.7087 0.9914 4.0408 8.6817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.29331 2.42611 29.386 < 2e-16 ***
infant_mortality -0.18916 0.02869 -6.594 2.83e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared: 0.4701, Adjusted R-squared: 0.4593
F-statistic: 43.48 on 1 and 49 DF, p-value: 2.826e-08
Call:
lm(formula = life_expectancy ~ population, data = africadata_2000)
Residuals:
Min 1Q Median 3Q Max
-18.429 -4.602 -2.568 3.800 18.802
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.593e+01 1.468e+00 38.097 <2e-16 ***
population 2.756e-08 5.459e-08 0.505 0.616
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared: 0.005176, Adjusted R-squared: -0.01513
F-statistic: 0.2549 on 1 and 49 DF, p-value: 0.6159
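The call that produced the second model is not shown; judging from its `Call:` line, it was presumably `lm(life_expectancy ~ population, data = africadata_2000)`. As a rough sketch, both single-predictor fits could go through one small helper that returns only the slope row of the coefficient table. The helper name `slope_summary` and the toy data are assumptions made for illustration.

```r
# Hypothetical helper: fit a one-predictor linear model and return
# the slope's row of the coefficient table (estimate, SE, t, p-value)
slope_summary <- function(data, response, predictor) {
  fit <- lm(reformulate(predictor, response = response), data = data)
  coef(summary(fit))[predictor, ]
}

# Toy data (made-up values) so the sketch runs without dslabs
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
toy_slope <- slope_summary(toy, "y", "x")
toy_slope[["Estimate"]]

# With the report's data this would be called as, for example:
# slope_summary(africadata_2000, "life_expectancy", "population")
```

Returning a named vector rather than the full `summary()` printout makes it easier to compare slopes and p-values across models side by side.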
Conclusions
In a simple linear regression, higher life expectancy is associated with lower infant mortality for African countries in the year 2000 (slope of about -0.19 years of life expectancy per additional infant death per 1,000 live births, p < 0.001). However, population size does not appear to have a linear relationship with life expectancy (p = 0.62).
AK Edits
This section is added by Abbie Klinker to expand on Kelly’s findings.
I am interested in looking at infant mortality as it relates to fertility and population. The fertility measure is the number of children per woman. Infant mortality is recorded as the number of deaths of children under one year old per 1,000 live births.
I would predict that infant mortality is inversely related to fertility and population, while fertility and population are positively related to one another. This would mean that countries with higher rates of infant mortality have fewer babies per woman and therefore a lower population. If this proves untrue, it would raise questions about access to resources like healthcare and about quality of life.
Preparing the Data
First I’m going to check whether the data for these variables are complete for the year 2000.
#dplyr provides %>% and filter()
library(dplyr)
#no data is missing for fertility in 2000, so we can still use this data.
africadata %>% filter(year == 2000, is.na(fertility))
[1] country year infant_mortality life_expectancy
[5] fertility population gdp continent
[9] region
<0 rows> (or 0-length row.names)
#Kelly's created Africa 2000 data
head(africadata_2000)
country year infant_mortality life_expectancy fertility population
7402 Algeria 2000 33.9 73.3 2.51 31183658
7403 Angola 2000 128.3 52.3 6.84 15058638
7418 Benin 2000 89.3 57.2 5.98 6949366
7422 Botswana 2000 52.4 47.6 3.41 1736579
7426 Burkina Faso 2000 96.2 52.6 6.59 11607944
7427 Burundi 2000 93.4 46.7 7.06 6767073
gdp continent region
7402 54790058957 Africa Northern Africa
7403 9129180361 Africa Middle Africa
7418 2254838685 Africa Western Africa
7422 5632391130 Africa Southern Africa
7426 2610945549 Africa Western Africa
7427 835334807 Africa Eastern Africa
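The `filter()` check above looks only at fertility. A minimal sketch of counting missing values for several variables at once with base R follows. The helper name `count_missing` and the toy values are hypothetical.

```r
# Hypothetical helper: number of NA values in each of the chosen columns
count_missing <- function(data, vars) {
  colSums(is.na(data[vars]))
}

# Toy data (made-up values) with one NA in two of the three columns
toy <- data.frame(
  fertility        = c(2.5, NA, 6.0),
  infant_mortality = c(30, 90, NA),
  population       = c(1e6, 2e6, 3e6)
)
missing_counts <- count_missing(
  toy, c("fertility", "infant_mortality", "population")
)
missing_counts

# With the report's data this would be called as:
# count_missing(africadata_2000,
#               c("fertility", "infant_mortality", "population"))
```

For `africadata_2000`, all three counts would be expected to be zero, since the `summary()` output for that object lists no `NA's` rows.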
Since no values are missing, I first want to look at how fertility may affect population.
Call:
lm(formula = population ~ fertility, data = africadata_2000)
Residuals:
Min 1Q Median 3Q Max
-15760443 -12563579 -7667052 1295355 106245914
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10355760 11577157 0.894 0.375
fertility 1028697 2162449 0.476 0.636
Residual standard error: 22260000 on 49 degrees of freedom
Multiple R-squared: 0.004597, Adjusted R-squared: -0.01572
F-statistic: 0.2263 on 1 and 49 DF, p-value: 0.6364
Based on this regression model, as supported by the plot above, we don’t have evidence to support an association between fertility alone and population. This is a bit surprising to me, because I would think that as the number of children per woman increased, so would the population.
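The residuals in the output above run from about -1.6e7 to 1.1e8, so a few very large countries dominate the fit. One option, which was not part of the original analysis, is to model population on a log10 scale, mirroring the log10 axis used in the earlier population plot. A sketch under that assumption, with a hypothetical helper name and made-up toy values:

```r
# Hypothetical helper: regress log10(population) on fertility
fit_log_population <- function(data) {
  lm(log10(population) ~ fertility, data = data)
}

# Toy data (made-up values) spanning several orders of magnitude
toy <- data.frame(
  fertility  = c(2, 3, 4, 5, 6),
  population = c(2e5, 9e5, 1.2e7, 8e7, 1.1e9)
)
toy_fit <- fit_log_population(toy)
coef(toy_fit)

# With the report's data this would be called as:
# summary(fit_log_population(africadata_2000))
```

On this scale the slope is read as the change in log10(population) per additional child per woman, so it is not directly comparable to the coefficients printed above.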
Now I want to look at infant mortality as a determinant of population.
Call:
lm(formula = population ~ infant_mortality, data = africadata_2000)
Residuals:
Min 1Q Median 3Q Max
-16307667 -12769228 -7828854 733380 105710100
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12063474 8682734 1.389 0.171
infant_mortality 45564 102671 0.444 0.659
Residual standard error: 22260000 on 49 degrees of freedom
Multiple R-squared: 0.004003, Adjusted R-squared: -0.01632
F-statistic: 0.1969 on 1 and 49 DF, p-value: 0.6592
The regression model likewise shows no evidence of an association between infant mortality and population. This is also surprising, as I would guess that a country with higher rates of infant mortality would have a more stagnant population, while countries with lower rates would have a growing population.
Combined Variable Interactions
However, I want to see if infant mortality and fertility have an interaction, which together may impact the population.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The two variables appear to be positively correlated, rather than negatively correlated as I had previously assumed. As the Gapminder documentation describes, this may be an indication of “many children and short lives.”
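The visual impression of a positive relationship could be quantified with a Pearson correlation. The sketch below uses `cor.test()` from base R on made-up toy values; the numbers are hypothetical and are not the gapminder data.

```r
# Toy data (made-up values) with a clear positive trend
toy <- data.frame(
  infant_mortality = c(20, 45, 60, 85, 110, 130),
  fertility        = c(2.1, 3.4, 4.0, 5.2, 6.1, 6.8)
)

# Pearson correlation with a test of the null hypothesis rho = 0
ct <- cor.test(toy$infant_mortality, toy$fertility)
r_value <- unname(ct$estimate)  # correlation coefficient
p_value <- ct$p.value           # p-value of the test

# With the report's data this would be called as:
# cor.test(africadata_2000$infant_mortality, africadata_2000$fertility)
```

A correlation coefficient is symmetric in the two variables, so it describes the strength of the association without implying which variable drives the other.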
Call:
lm(formula = population ~ infant_mortality * fertility, data = africadata_2000)
Residuals:
Min 1Q Median 3Q Max
-16611567 -11852919 -8981133 1442415 105324669
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21941250 24539351 0.894 0.376
infant_mortality -194248 437131 -0.444 0.659
fertility -1845199 6236486 -0.296 0.769
infant_mortality:fertility 41894 78706 0.532 0.597
Residual standard error: 22660000 on 47 degrees of freedom
Multiple R-squared: 0.01073, Adjusted R-squared: -0.05242
F-statistic: 0.1699 on 3 and 47 DF, p-value: 0.9162
The combined effect still does not have an association with population, nor is the interaction between the two variables significant. Since they are correlated, this may be a case of multicollinearity, which I will double check just to be sure:
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
vif(fit6)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
A common rule-of-thumb cutoff for VIF is 10. Since the VIF for the interaction between infant mortality and fertility is a whopping 34, we can treat it as a redundant term that should not be used in the model.
Call:
lm(formula = population ~ infant_mortality + fertility, data = africadata_2000)
Residuals:
Min 1Q Median 3Q Max
-16010329 -12607659 -7840265 950724 105972070
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10533816 11864479 0.888 0.379
infant_mortality 16457 184056 0.089 0.929
fertility 742243 3877752 0.191 0.849
Residual standard error: 22490000 on 48 degrees of freedom
Multiple R-squared: 0.004763, Adjusted R-squared: -0.03671
F-statistic: 0.1149 on 2 and 48 DF, p-value: 0.8917
vif(fit62)
infant_mortality fertility
3.150547 3.150547
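The two VIF values above are identical, which is expected when a model has exactly two predictors: each VIF is 1 / (1 - R²), where R² comes from regressing one predictor on the other, and that R² is the same in both directions. A value of 3.15 corresponds to an R² of roughly 0.68 between infant mortality and fertility. Below is a sketch of the calculation by hand. The helper name `manual_vif` and the toy values are hypothetical, and this is not how `car::vif()` is implemented internally.

```r
# Hypothetical helper: VIF for each predictor, computed as 1 / (1 - R^2)
# from a regression of that predictor on all the other predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Toy data (made-up values) with correlated predictors
toy <- data.frame(
  infant_mortality = c(20, 45, 60, 85, 110, 130, 70),
  fertility        = c(2.1, 3.9, 3.6, 5.6, 5.9, 7.0, 5.1)
)
toy_vif <- manual_vif(toy, c("infant_mortality", "fertility"))
toy_vif
```

With only two predictors the result also equals `1 / (1 - cor(x1, x2)^2)`, which gives a quick sanity check on the output.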
Now that we’ve removed the interaction, the model still does not show any significant associations between population, infant mortality, and fertility. With both VIFs near 3, well below the cutoff, we can be more confident that multicollinearity is not distorting the estimates.
Conclusions
Based on this analysis, we found no evidence that either infant mortality or the number of children per woman is associated with a country’s population. However, countries with higher rates of infant mortality also have more children per woman. Based on these data, for every additional infant death per 1,000 live births, the average woman is likely to have 0.04 more children. This is not a very interpretable number. Put another way, an increase in infant mortality of 1 in 40 live births (25 per 1,000, or 2.5%) corresponds to about 1 additional child per woman. While this may seem insanely high, 16 countries in the data have infant mortality rates over 100, or 1 death in 10 births, and in these countries the average number of children per woman is over 6; this average drops to around 4 in the countries with lower infant mortality rates. This may be an indication of a lack of access to resources like reproductive healthcare, education, and necessities like food and water.
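The unit conversion behind the "1 in 40" figure can be checked with a few lines of arithmetic. The slope of 0.04 is taken from the text; the regression that produced it is not shown in this report, so the value is used here as given.

```r
# Slope quoted in the text: additional children per woman for each
# additional infant death per 1,000 live births
slope <- 0.04

# 1 death in 40 live births, expressed per 1,000 and as a percentage
deaths_per_1000 <- 1000 / 40   # 25 deaths per 1,000 live births
rate_percent    <- 100 / 40    # 2.5 percent

# Change in children per woman associated with that difference
extra_children <- slope * deaths_per_1000
extra_children
```

The result is one additional child per woman for a difference of 25 deaths per 1,000 live births, which is a difference between countries rather than a prediction of total family size.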
Thanks
Thanks to Abbie Klinker for her awesome additional analyses!