A Deep Dive into Colombian Demographics Using ColOpenData

ColOpenData can be used to access open demographic data from Colombia. This demographic data is retrieved from the National Administrative Department of Statistics (DANE). The demographic module allows you to consult demographic data from the National Population and Dwelling Census (CNPV) of 2018 and Population Projections.

The available CNPV information is divided into four categories: households, persons demographic, persons social and dwellings. The population projections cover 1950 to 2070 at the national level, 1985 to 2050 at the departmental level and 1985 to 2035 at the municipal level. All data documentation can be accessed as explained in Documentation and Dictionaries.

In this vignette you will learn:

  1. How to download demographic data using ColOpenData.
  2. How to filter, group, mutate and aggregate demographic data.
  3. How to visualize data using ggplot2.

Since the goal of this vignette is to show examples of how to use the data, we will load a few additional libraries. They are used for these examples only and are not required to work with the data in general.

To access the documentation of the demographic module, we use the function list_datasets() and pass the module we are interested in as a parameter. It is worth reviewing this documentation carefully to understand what is available before starting to work with the data. First, we load all the necessary libraries.

library(ColOpenData)
library(dplyr)
library(ggplot2)

Disclaimer: all data is loaded into the environment of the user’s R session, but it is not downloaded to the user’s computer.

Initial Exploration: Basic Data Handling with ColOpenData

Documentation access

First, we access the demographic documentation to check the available datasets.

datasets_dem <- list_datasets(module = "demographic", language = "EN")

head(datasets_dem)
#> # A tibble: 6 × 7
#>   name                group       source year  level        category description
#>   <chr>               <chr>       <chr>  <chr> <chr>        <chr>    <chr>      
#> 1 DANE_CNPVH_2018_1HD demographic DANE   2018  department   househo… Number of …
#> 2 DANE_CNPVH_2018_1HM demographic DANE   2018  municipality househo… Number of …
#> 3 DANE_CNPVH_2018_2HD demographic DANE   2018  department   househo… Number of …
#> 4 DANE_CNPVH_2018_2HM demographic DANE   2018  municipality househo… Number of …
#> 5 DANE_CNPVH_2018_3HD demographic DANE   2018  department   househo… Households…
#> 6 DANE_CNPVH_2018_3HM demographic DANE   2018  municipality househo… Households…
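
Because the catalogue is a regular tibble, it can be narrowed down with dplyr before choosing a dataset name. The following is a minimal sketch on a toy catalogue: the column names follow the output above, but every row (names and descriptions) is made up for illustration, and the category labels are assumed to match the four categories listed earlier.

```r
library(dplyr)

# Toy stand-in for the catalogue returned by list_datasets(). Column names
# follow the output above; every row is hypothetical.
toy_catalogue <- tibble(
  name = c("TOY_H_D", "TOY_H_M", "TOY_V_D", "TOY_V_M"),
  level = c("department", "municipality", "department", "municipality"),
  category = c("households", "households", "dwellings", "dwellings"),
  description = c(
    "Toy households table", "Toy households table",
    "Toy dwellings table", "Toy dwellings table"
  )
)

# Keep only department-level dwelling datasets and show the columns needed
# to pick a dataset name.
dwellings_dept <- toy_catalogue %>%
  filter(level == "department", category == "dwellings") %>%
  select(name, level, description)

print(dwellings_dept)
```

The same filter() call should work on datasets_dem, provided the labels in level and category are spelled as they appear in the real catalogue.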

After checking the documentation, we can load the data we want to work with. To do this, we will use the download_demographic() function, which takes as a parameter the dataset name listed in the documentation. For this first example we will focus on a CNPV dataset.

Data load

public_services_d <- download_demographic(dataset = "DANE_CNPVV_2018_8VD")
#> Original data is retrieved from the National Administrative Department
#> of Statistics (Departamento Administrativo Nacional de Estadística -
#> DANE).
#> Reformatted by package authors.
#> Stored by Universidad de Los Andes under the Epiverse TRACE iniative.

head(public_services_d)
#> # A tibble: 6 × 6
#>   codigo_departamento departamento   area     servicio_publico disponible  total
#>   <chr>               <chr>          <chr>    <chr>            <chr>       <int>
#> 1 00                  Total nacional total    energia_electri… si         1.30e7
#> 2 00                  Total nacional total    energia_electri… no         4.97e5
#> 3 00                  Total nacional total    energia_electri… sin_infor… 0     
#> 4 00                  Total nacional cabecera energia_electri… si         1.05e7
#> 5 00                  Total nacional cabecera energia_electri… no         8.16e4
#> 6 00                  Total nacional cabecera energia_electri… sin_infor… 0

As seen above, public_services_d presents information on the availability of public services in the country at the department level. With this data we could, for example, find the proportion of dwellings that have access to a water supply system (WSS) by department and plot it.
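
Before filtering, it can help to list the exact labels used in the categorical columns, since filter() matches them literally. The sketch below uses a toy table with the same column names as public_services_d; the departments, counts and the "toy_service" label are hypothetical, and only "acueducto", "si", "no" and "total_departamental" are taken from this vignette.

```r
library(dplyr)

# Toy table with the same columns as public_services_d. All values are
# made up for illustration.
toy_services <- tibble(
  departamento = rep(c("Toy A", "Toy B"), each = 4),
  area = rep("total_departamental", 8),
  servicio_publico = rep(rep(c("acueducto", "toy_service"), each = 2), 2),
  disponible = rep(c("si", "no"), 4),
  total = c(80L, 20L, 95L, 5L, 30L, 70L, 60L, 40L)
)

# List the distinct service labels and count the rows per area label.
service_labels <- toy_services %>% distinct(servicio_publico)
area_counts <- toy_services %>% count(area)

print(service_labels)
print(area_counts)
```

Running distinct() and count() on the real data in the same way shows which values are available to use in the filter.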

Data filter and plot

First, we subset the data to keep only the WSS information (labelled “acueducto” in the data) by department.

wss <- public_services_d %>%
  filter(
    area == "total_departamental",
    servicio_publico == "acueducto"
  ) %>%
  select(departamento, disponible, total)

With the subset, we can calculate the total counts by department.

total_counts <- wss %>%
  group_by(departamento) %>%
  summarise(total_all = sum(total)) %>%
  ungroup()

Then, we can calculate the proportions of “yes” (“si”) by department.

proportions_wss <- wss %>%
  filter(disponible == "si") %>%
  left_join(total_counts, by = "departamento") %>%
  mutate(proportion_si = total / total_all)
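
As an alternative to the separate summary and join, the same proportion can be computed in a single grouped pipeline. This is only a sketch on toy data: the column names mirror wss, while the departments and counts are invented.

```r
library(dplyr)

# Toy subset shaped like wss. Departments and counts are hypothetical.
toy_wss <- tibble(
  departamento = rep(c("Toy A", "Toy B"), each = 2),
  disponible = rep(c("si", "no"), 2),
  total = c(80L, 20L, 30L, 70L)
)

# Add the per-department total to every row, compute the share of each
# answer, then keep only the "si" rows.
toy_proportions <- toy_wss %>%
  group_by(departamento) %>%
  mutate(
    total_all = sum(total),
    proportion_si = total / total_all
  ) %>%
  ungroup() %>%
  filter(disponible == "si")

print(toy_proportions)
```

The grouped mutate() avoids the intermediate table, while the join-based version above keeps the department totals available for reuse.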

For plotting purposes, we will shorten the name of “San Andrés” (row 28 of proportions_wss) to “SAPSC”, since the complete name is too long.

proportions_wss[28, "departamento"] <- "SAPSC"
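
Replacing the name by row position depends on the current row order. A name-based replacement is a possible alternative, sketched here on toy data. It assumes the department name contains the text "San Andr"; the exact spelling in the real data should be checked first.

```r
library(dplyr)

# Toy table. The long name below is a placeholder, not the value stored
# in the real data.
toy_names <- tibble(
  departamento = c("Toy A", "Toy San Andres long name", "Toy B"),
  proportion_si = c(0.8, 0.5, 0.3)
)

# Replace any name containing "San Andr" with the short label, wherever
# the row happens to be.
toy_renamed <- toy_names %>%
  mutate(departamento = if_else(
    grepl("San Andr", departamento, fixed = TRUE),
    "SAPSC",
    departamento
  ))

print(toy_renamed)
```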

Finally, we can plot the results.

ggplot(proportions_wss, aes(
  x = reorder(departamento, -proportion_si),
  y = proportion_si
)) +
  geom_bar(stat = "identity", fill = "#10bed2", color = "black", width = 0.6) +
  labs(
    title = "Proportion of dwellings with access to WSS by department",
    x = "Department",
    y = "Proportion"
  ) +
  theme_minimal() +
  theme(
    plot.background = element_rect(fill = "white", colour = "white"),
    panel.background = element_rect(fill = "white", colour = "white"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )
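
As an optional variation, the y axis can be shown as percentages. The sketch below uses toy data with hypothetical departments and values. It relies on geom_col(), which is equivalent to geom_bar(stat = "identity"), and on scales::percent from the scales package, which is installed together with ggplot2.

```r
library(ggplot2)

# Toy proportions. Departments and values are made up for illustration.
toy_plot_data <- data.frame(
  departamento = c("Toy A", "Toy B", "Toy C"),
  proportion_si = c(0.8, 0.3, 0.55)
)

# Same bar chart idea as above, with the y axis formatted as percentages
# and fixed to the 0-100% range.
toy_plot <- ggplot(toy_plot_data, aes(
  x = reorder(departamento, -proportion_si),
  y = proportion_si
)) +
  geom_col(fill = "#10bed2", color = "black", width = 0.6) +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(x = "Department", y = "Dwellings with access") +
  theme_minimal()

print(toy_plot)
```

Fixing the axis to the full 0 to 1 range makes departments with low coverage easier to compare against full coverage.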