How to collect data from Statistics Canada using Python
A step by step tutorial on Statistics Canada API and python library
Statistics Canada provides a wide range of data for public use, including data for Canada’s economy, society and environment. In addition to manually browsing the webpage to access the data, Statistics Canada also provides Web Data Service, i.e. API access to their database. What’s even better, there is a python library stats_can that wraps up the API commands and allows users to write simple codes to read data automatically. This article will provide a step by step tutorial on how to achieve that.
Identify desired data item and its vectorID
Go to Statistics Canada subpage Data, input keywords, for instance, GDP, and click on the search icon.
Locate the desired item in returned results.
On the GDP page, click on Add/Remove data
.
Click on the tab Customize layout
.
Scroll down and tick the option Display Vector identifier and coordinate
, then click the button Apply
.
Now the table will be reorganized and you will find a new column Vector
, here we will select GDP for all industries, so the targeted vectorID
is v65201210
. This will be used later in the Python codes.
Install stats_can in Python
The library can either be installed with conda or pip:
conda install -c conda-forge stats_can
Or:
pip install stats-can
Collect the data
We will use the function stats_can.sc.vector_to_df
to read data from Statistics Canada. The syntax of the function is as below:
df = stats_can.sc.vectors_to_df(vectors, periods = n, start_release_date = None, end_release_date = None)
It returns DataFrame of vectors with n periods data or over range of release dates. The function wraps up two API commands, which are get_bulk_vector_data_by_range
and get_data_from_vectors_and_latest_n_periods
to turn the resulting list of JSONs into a DataFrame.
For instance, with the vectorID v65201210
for GDP, you can run following codes to collect GDP for the last 6 months:
import stats_candf = stats_can.sc.vectors_to_df('v65201210', periods = 6)
You can rename the index and column to make the table look nicer:
df.columns = ['GDP']
df.index.names = ['Date']
the returned table looks like below:
Date GDP
2020-05-01 1726385.0
2020-06-01 1833032.0
2020-07-01 1879281.0
2020-08-01 1896592.0
2020-09-01 1912085.0
2020-10-01 1920056.0
An example that collects and visualizes multiple economic indexes
In a recent project on forecasting sales with synthetic data, I collected multiple eco indexes such as GDP, CPI, unemployment rate, etc. to build a multi-variant regression model for the purpose of forecasting financial transaction counts.
Firstly, I figured out vectorIDs for all eco indexes I plan to use, including GDP, CPI, TSX, unemployment rate and exchange rate (USD to CAD), and put them into a dictionary:
eco_vec_map = {'CPI':'v41690973',
'Exchange_Rate_USD':'v111666275',
'GDP':'v65201210',
'Unemployment_Rate':'v91506256',
'TSX':'v122620'}
Secondly, I converted values from the dictionary into a list and run the function, with an input of periods = 36 (i.e. the last 3 years), the function automatically recognized the vector list and returned tables with multiple columns:
vectors = list(eco_vec_map.values())
df = stats_can.sc.vectors_to_df(vectors, periods = 36)
I then renamed the columns with their eco index name, note that a mapping trick is used here, so that you won’t have to worry about the order of of the columns:
inv_map = {v: k for k, v in eco_vec_map.items()}
df.columns = df.columns.to_series().map(inv_map)
df.index.names = ['Date']
The returned table looked like below:
Date TSX CPI GDP Unemployment_Rate Exchange_Rate_USD
2017-09-01 NaN NaN 1905358.0 NaN NaN
2017-10-01 16025.59 130.9 1907384.0 NaN NaN
2017-11-01 16067.48 131.3 1917258.0 5.4 1.2769
2017-12-01 16209.13 130.8 1922552.0 5.3 1.2771
...
2020-08-01 16514.44 137.0 1904249.0 11.0 1.3222
2020-09-01 16121.38 136.9 NaN 8.4 1.3228
2020-10-01 NaN NaN NaN 8.1 1.3215
Finally, I visualized all columns of the table:
df.plot(subplots = True, figsize = (14,8), layout = (3,2));
It showed that except CPI, other eco indexes experienced a sharp decline or increase near March-April 2020, which is likely impacted by the pandemic.
Summary
This article provided detailed instructions on how to collect data using Statistics Canada’s library in Python. The major steps are:
- Identify
vectorIDs
for targeted data items. - Install
stats_can
library and call functionstats_can.sc.vectors_to_df
to collect data and store as a DataFrame table.
I had a difficult time finding good tutorials online so I wrote this article hoping to help others. For any questions or suggestions, don’t hesitate to reach out to me via Linkedin. I’d be happy to connect :)
References:
[1] Web Data Service User Guide, Statistics Canada. https://www.statcan.gc.ca/eng/developers/wds/user-guide#a12-3
[2] A python library for reading data from Statistics Canada.
https://stats-can.readthedocs.io/en/latest/index.html#