The idea came up when I watched the news about population growth in Vietnam and started wondering: the Vietnamese population in 2017 was 95.54 million people, so what will that figure be in the next couple of years?

I started looking online for Vietnamese population data and found that the World Bank Group provides population data for various countries from 1960 until now. Intuitively, one can say that GDP and Life Expectancy can help predict the Population of a country. Based on that, I decided to analyze the data for Vietnam and try predicting the Vietnamese population from GDP and Life Expectancy with the help of simple machine learning models.

Preparing Vietnamese Data

First, GDP, Life Expectancy and Population data are downloaded as separate datasets from the World Bank Group.

import pandas as pd
from matplotlib import pyplot as plt
# GDP per capita
gdp = pd.read_csv('GDP/GDP.csv', header=2)
# Female population
popfe = pd.read_csv('POP.FE/POP.FE.csv', header=2)
# Male population
popma = pd.read_csv('POP.MA/POP.MA.csv', header=2)
# Life expectancy
le = pd.read_csv('LE/LE.csv', header=2)

A sample of GDP data:

gdp.head()

GDP

This dataset contains the GDP of different countries from 1960 to 2017. The female/male population and life expectancy datasets have the same layout.

The next step is putting all the information above (male/female population, GDP, life expectancy) into one tabular dataset. This can be done with a dataframe that has the following columns:

  • country
  • countryCode
  • year
  • lifeExpectancy
  • population
  • populationFemale
  • populationMale
  • gdpPerCapita
# Pre-defined columns of the merged dataframe
columns = ['country', 'countryCode', 'year', 'lifeExpectancy', 'population', 'populationFemale', 'populationMale', 'gdpPerCapita']

countries = gdp['Country Name'].values
countriesCode = gdp['Country Code'].values
# Skip the four metadata columns at the front and the last column
years = gdp.columns[4:-1].values

# Collect one dict per (country, year) and build the dataframe once at the end.
# DataFrame.append was removed in pandas 2.0 and is slow when called in a loop.
rows = []
for index, country in enumerate(countries):
    for year in years:
        gdpValue = gdp[gdp['Country Name'] == country][year].values[0]
        popfeValue = popfe[popfe['Country Name'] == country][year].values[0]
        popmaValue = popma[popma['Country Name'] == country][year].values[0]
        population = popfeValue + popmaValue
        leValue = le[le['Country Name'] == country][year].values[0]
        rows.append({'country': country, 'countryCode': countriesCode[index],
                     'year': int(year), 'lifeExpectancy': leValue, 'population': population,
                     'populationFemale': popfeValue, 'populationMale': popmaValue, 'gdpPerCapita': gdpValue})

data = pd.DataFrame(rows, columns=columns)

This will result in the following dataframe:

data.head()

MergedData
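As a side note, the same reshaping can be sketched without nested loops by using `melt` and `merge` from pandas. The tiny tables below are made-up stand-ins for the World Bank files, and the helpers `make_wide` and `to_long` are hypothetical names used only for this illustration:

```python
import pandas as pd

def make_wide(indicator, values):
    """Build a tiny wide-format table (one column per year) with made-up values."""
    return pd.DataFrame({
        'Country Name': ['Vietnam', 'Thailand'],
        'Country Code': ['VNM', 'THA'],
        'Indicator Name': [indicator] * 2,
        'Indicator Code': ['X'] * 2,
        '1960': values[0],
        '1961': values[1],
    })

def to_long(wide, value_name):
    """Turn one wide table into one row per country and year."""
    id_cols = ['Country Name', 'Country Code']
    year_cols = [c for c in wide.columns if c.isdigit()]
    long_table = wide.melt(id_vars=id_cols, value_vars=year_cols,
                           var_name='year', value_name=value_name)
    long_table['year'] = long_table['year'].astype(int)
    return long_table

gdp_wide = make_wide('GDP per capita', [[100.0, 200.0], [110.0, 210.0]])
le_wide = make_wide('Life expectancy', [[59.0, 55.0], [60.0, 56.0]])

# Join the long tables on country and year
merged = to_long(gdp_wide, 'gdpPerCapita').merge(
    to_long(le_wide, 'lifeExpectancy'),
    on=['Country Name', 'Country Code', 'year'])
print(merged)
```

On the full datasets this should be considerably faster than filtering the dataframe once per country and year, although the loop version is easier to follow step by step.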

In this work, I only use the data for Vietnam, so:

vietnam = data[data['country'] == 'Vietnam']
vietnam.head()

VietnamData

That's it. At this point, I have the data related to the Vietnamese population. Next, I visualize these features to see whether they contain any interesting patterns; some of them may be explained by historical events.

Visualizing Data

First, let's plot the GDP data of Vietnam:

plt.plot(vietnam['year'], vietnam['gdpPerCapita'])
plt.legend(['Vietnam'])
plt.xlabel('year')
plt.ylabel('GDP per capita')
plt.grid()
plt.show()

VietnamGDP

The first thing to notice in this plot is that it shows GDP only from 1985 to 2016, whereas the dataset covers years from 1960. This means that Vietnamese GDP is not available from 1960 to 1984.
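The gap can also be checked directly instead of being read off the chart (matplotlib simply skips missing values when drawing a line). A minimal sketch, using a made-up `sample` dataframe in place of the real `vietnam` data:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the `vietnam` dataframe: GDP is missing for the early years
sample = pd.DataFrame({
    'year': [1983, 1984, 1985, 1986],
    'gdpPerCapita': [np.nan, np.nan, 231.0, 422.0],
})

has_gdp = sample['gdpPerCapita'].notna()
first_year = int(sample.loc[has_gdp, 'year'].min())  # first year with a GDP value
n_missing = int((~has_gdp).sum())                    # number of years without one
print(first_year, n_missing)
```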

Interestingly, there is a peak between 1986 and 1987. This can be explained by the fact that Vietnam initiated its economic reform campaign (Doi Moi) at that time, which encouraged the establishment of private businesses and foreign investment and set off the rise of Vietnamese GDP.

Next, let's check the Life Expectancy in Vietnam:

plt.plot(vietnam['year'], vietnam['lifeExpectancy'])
plt.legend(['Vietnam'])
plt.xlabel('Year')
plt.ylabel('Life Expectancy')
plt.grid()
plt.show()

VietnamLife

As the graph shows, life expectancy in Vietnam dropped significantly from 1965 to 1972, which corresponds to the Vietnam War. From 1972, the figure started rising again as the war drew towards its end; the Paris Peace Accords were signed in January 1973.

Let's see if the Vietnamese population contains any interesting patterns:

plt.plot(vietnam['year'], vietnam['population'])
plt.plot(vietnam['year'], vietnam['populationFemale'])
plt.plot(vietnam['year'], vietnam['populationMale'])
plt.legend(['Total', 'Female', 'Male'])
plt.xlabel('year')
plt.ylabel('Population')
plt.grid()
plt.show()

VietnamLife

This chart shows that the female population is slightly higher than the male population (by approximately 1 million). Positive news for Vietnamese guys here.

Applying Machine Learning Models

Recall that the main goal is to predict the future population of Vietnam based on GDP and Life Expectancy. However, neither GDP nor Life Expectancy is known for future years. Therefore, the most obvious solution is to build models that predict GDP and Life Expectancy, then use these predicted values to predict the population.

Based on the shapes of the line charts for GDP, Life Expectancy and Population, I decided to use the following models:

  • Build Polynomial Regression for GDP
  • Build Linear Regression for Life Expectancy
  • Build Linear Regression for Population

How to validate the models?
An interesting way to validate the models in this work is to train them on data up to 2007, use them to predict the population of 2016, and compare the result with the true population provided by the World Bank Group.

Remember the missing values of GDP?
Recall that from 1960 to 1984, Vietnamese GDP is not available. Also, Life Expectancy drops and rises sharply within this period, which can make the model worse. A simple solution is to train the models only on data from 1985 to 2007.
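The time-based split described above could be sketched as follows. The `frame` dataframe and its values are synthetic stand-ins for the real `vietnam` data; only the year boundaries come from the text:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the `vietnam` dataframe (values are made up)
years = np.arange(1960, 2017)
frame = pd.DataFrame({
    'year': years,
    'gdpPerCapita': np.where(years < 1985, np.nan, 100.0 + 50.0 * (years - 1985)),
})

# Train on 1985 to 2007 inclusive, keep 2016 aside for validation
train_mask = (frame['year'] >= 1985) & (frame['year'] <= 2007)
train = frame[train_mask]
holdout = frame[frame['year'] == 2016]

# The training window should contain no missing GDP values
assert train['gdpPerCapita'].notna().all()
print(len(train), len(holdout))
```

Keeping the validation year out of the training window matters here: a model that has already seen 2016 would look better than it really is.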

Preparing the models:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

#init 3 models
lr_life = LinearRegression()
pr_gdp = LinearRegression()
lr_population = LinearRegression()

poly_reg = PolynomialFeatures(degree = 10)
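One thing to keep in mind with polynomial features on raw years: a value around 2000 raised to the 10th power is on the order of 1e33, which can make the least-squares fit numerically fragile. A common remedy is to scale the year before expanding it. The sketch below is not what the code in this post does; it uses a scikit-learn pipeline, synthetic noise-free data, and degree 2 purely as illustration choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, noise-free quadratic trend standing in for GDP per capita
years = np.arange(1985, 2008).reshape(-1, 1)
gdp_values = 200.0 + 3.0 * (years.ravel() - 1985) ** 2

# Scale the year first so the polynomial terms stay small
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
model.fit(years, gdp_values)

# Extrapolate to a year outside the training window
prediction = model.predict([[2016]])
print(prediction)
```

A lower degree also tends to extrapolate more calmly than a high one, which is relevant when the target year lies well outside the training window.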

Let's obtain Vietnamese data after 1985 and before 2007:

# Training window: 1985 to 2007 inclusive (2016 is kept aside for validation)
year_after_1985_before_2007 = (vietnam['year'] >= 1985) & (vietnam['year'] <= 2007)

Now here come the models. Let's start with GDP:

X_train_gdp = vietnam[year_after_1985_before_2007][['year']]
X_train_poly_gdp = poly_reg.fit_transform(X_train_gdp)
y_train_gdp = vietnam[year_after_1985_before_2007]['gdpPerCapita']
pr_gdp.fit(X_train_poly_gdp, y_train_gdp)

predicted_train_gdp = pr_gdp.predict (X_train_poly_gdp)

plt.figure(figsize=(8, 5))
plt.plot(X_train_gdp, predicted_train_gdp)
plt.plot(X_train_gdp, y_train_gdp)
plt.legend(['Predicted values','True values'])
plt.xlabel('year')
plt.ylabel('GDP')
plt.grid()
plt.show()

VietnamGDPFit

This illustrates how well the model fits the training data. Next, let's do the same for Life Expectancy:

X_train_life = vietnam[year_after_1985_before_2007][['year']]
y_train_life = vietnam[year_after_1985_before_2007]['lifeExpectancy']
lr_life.fit( X_train_life, y_train_life)

predicted_train_life = lr_life.predict (X_train_life)

plt.figure(figsize=(8, 5))
plt.plot(X_train_life, predicted_train_life)
plt.plot(X_train_life, y_train_life)
plt.legend(['Predicted values','True values'])
plt.xlabel('year')
plt.ylabel('Life Expectancy')
plt.grid()
plt.show()

VietnamLifeFit

For population, the model is trained with year, GDP and Life Expectancy as features:

population_features = ['year','gdpPerCapita','lifeExpectancy']
X_train_population = vietnam[year_after_1985_before_2007][population_features]
y_train_population = vietnam[year_after_1985_before_2007]['population']
lr_population.fit(X_train_population, y_train_population)

predicted_train_population = lr_population.predict (X_train_population)

plt.figure(figsize=(8, 5))
plt.plot(vietnam[year_after_1985_before_2007]['year'], predicted_train_population)
plt.plot(vietnam[year_after_1985_before_2007]['year'], y_train_population)
plt.legend(['Predicted values','True values'])
plt.xlabel('year')
plt.ylabel('Population')
plt.grid()
plt.show()

VietnamPopulationFit

Try making a prediction

At this point, the models are ready, so let's try predicting the population and comparing it with the true value.
Recall that the predicted values of GDP and Life Expectancy are used here, since these are what one would have to rely on when predicting the future. The target year used here is 2016:

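target_year = 2016

# Reuse the already-fitted transformer; transform() is enough for new data
poly_target_year = poly_reg.transform([[target_year]])

# predict() returns an array, so take the single predicted value
predicted_gdp = pr_gdp.predict(poly_target_year)[0]
predicted_life = lr_life.predict([[target_year]])[0]

predicted_population = lr_population.predict([[target_year, predicted_gdp, predicted_life]])[0]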

Results:

                   Predicted Values    True Values
GDP                2306                2170
Life Expectancy    76                  76
Population         96,605,786          94,569,072

The predicted population is not very close to the real one (it is about 2 million people, or roughly 2%, too high), but this is an acceptable beginning given that I only used simple models. Another factor is that I used predicted values of GDP and Life Expectancy, which may make the predicted Population worse than it would be with the true 2016 values of GDP and Life Expectancy.
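To put numbers on the gap, the relative error can be computed from the figures in the results table. The helper name `relative_error_pct` is just an illustrative choice:

```python
# Values copied from the results table: (predicted, true)
results = {
    'GDP': (2306, 2170),
    'Population': (96_605_786, 94_569_072),
}

def relative_error_pct(predicted, actual):
    """Signed error as a percentage of the true value."""
    return (predicted - actual) / actual * 100

errors = {name: relative_error_pct(p, a) for name, (p, a) in results.items()}
for name, err in errors.items():
    print(f"{name}: {err:+.1f}%")
```

Both errors are positive, meaning the models overestimate GDP and Population for 2016.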

What's next?

The work illustrated here is just a simple approach using basic machine learning techniques to satisfy my curiosity. Further exploration and improvements could enhance the solution:

  • Visualizing and comparing GDP, Life Expectancy, Population of Vietnam with other countries (compared to US, or SEA countries...).
  • Adding more features such as Fertility Rate to see if it can help improve the prediction.
  • Using more advanced machine learning models.
  • ...