diff --git a/index.ipynb b/index.ipynb index 285346f..5ca74c8 100644 --- a/index.ipynb +++ b/index.ipynb @@ -51,7 +51,9 @@ "1. [Getting started with FlowClient](01-getting-started-with-flowclient.ipynb) - this tutorial will teach you how to connect to a Flowkit server, \n", " 1a. [Getting an access token](01a-getting-an-access-token.ipynb) - instructions for getting an access token from FlowAuth, which you will use in tutorial 1, \n", "2. [Running a query](02-running-a-query.ipynb) - this tutorial will teach you how to use FlowClient to get aggregated data from FlowKit, \n", - "3. [Geography](03-geography.ipynb) - this tutorial will teach you how to join query results to geographic boundaries. " + "3. [Geography](03-geography.ipynb) - this tutorial will teach you how to join query results to geographic boundaries. \n", + "\n", + "Once you have worked through these, you could take a look at the [\"Mobility indicators\" worked example](mobility-indicators.ipynb) for a demonstration of some analysis using FlowKit aggregates." ] } ], diff --git a/mobility-indicators.ipynb b/mobility-indicators.ipynb new file mode 100644 index 0000000..02dd7bc --- /dev/null +++ b/mobility-indicators.ipynb @@ -0,0 +1,1647 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mobility indicators\n", + "## Assessing the impact of mobility restrictions during a virus outbreak\n", + "\n", + "In this worked example we will assume the role of an analyst in Ghana monitoring the effects of government restrictions on mobility patterns during a virus outbreak. This is based on material produced during the COVID-19 pandemic, which can be found on Flowminder's [COVID-19 website](https://covid19.flowminder.org/) and [GitHub repository](https://github.com/Flowminder/COVID-19)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Introduction\n", + "\n", + "This example is written assuming that the reader has worked through the following tutorials, to gain familiarity with the use of FlowKit to produce CDR aggregates:\n", + "\n", + "- [Getting started with FlowClient](01-getting-started-with-flowclient.ipynb)\n", + "- [Running a query](02-running-a-query.ipynb)\n", + "- [Geography](03-geography.ipynb)\n", + "\n", + "In this example, we will first use FlowKit to produce some aggregates from CDR data. We will then perform some checks on these aggregates, to identify any potential issues with the underlying data. Finally, we will use the aggregates to produce some mobility indicators, which will give us insights into changes in population density and movements around the country over time." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to FlowKit server\n", + "\n", + "We'll start by creating a connection to the Ghana FlowCloud FlowKit server, following the instructions in the [\"Getting started with FlowClient\" tutorial](01-getting-started-with-flowclient.ipynb). First we import the FlowClient library, along with pandas, geopandas and matplotlib which we'll use later for further analysis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import flowclient as fc\n", + "import geopandas as gpd\n", + "import matplotlib.pyplot as plt\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can get a token from [FlowAuth](https://auth.flowcloud-ghana.flowminder.org/) and use it to create a FlowKit connection:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "token =" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "conn = fc.connect(url=\"https://api.flowcloud-ghana.flowminder.org\", token=token)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Run aggregates\n", + "\n", + "The first step is to run the aggregate queries that we will need for this analysis. The aggregates we will need are:\n", + "\n", + "1. Total subscriber count per day, for the whole country (admin0). This is the total number of unique subscribers who used their phone anywhere in the country, each day.\n", + "2. Subscriber count per region per day (admin1). This is similar to \"subscriber count per day\", but with a separate unique subscriber count for each region.\n", + "3. Subscriber count per district per day (admin2). This is similar to \"subscriber count per region per day\", but will count subscribers per district (admin2) instead of per region (admin1).\n", + "4. District-level OD matrix per day. For each pair of districts A and B each day, the OD matrix contains the count of subscribers who were active in district A and then active in district B later the same day.\n", + "5. Count of CDR events per day, for the whole country (admin0). This will not be used in the final analysis in this example, but will be useful when we check the aggregate results to ensure there are no issues with the underlying CDR data.\n", + "6. Count of CDR events per district per day (admin2). This is similar to \"count of CDR events per day\", but with a separate event count for each district. Again, this will be useful when we perform some quality checks on the query results.\n", + "\n", + "We need to specify the date range for which we want to run the queries. The mobility restrictions started on 1 March 2016, so we'll get results from the beginning of February (1 month before restrictions began) until the end of May (3 months after restrictions began). In a real-world scenario we could update this analysis every few days using the latest available CDR data, to get a near-real-time picture of mobility in Ghana. We'll make a list of all dates in our date range, using the pandas `date_range` function.\n", + "\n", + "**Note:** We will use the FlowKit convention that date intervals include the lower bound (start date) and exclude the upper bound (end date)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define date range\n", + "start_date = \"2016-02-01\"\n", + "end_date = \"2016-06-01\"\n", + "all_dates = pd.date_range(start_date, end_date, closed=\"left\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now define and run our FlowKit queries, following instructions in the [\"Running a query\" tutorial](02-running-a-query.ipynb).\n", + "\n", + "#### 1. Total subscriber count per day\n", + "\n", + "For \"total subscriber count per day\", we can use `unique_subscriber_counts` queries and set `aggregation_unit=\"admin0\"` - this will count the total number of unique active subscribers in the country in a specified time period. We will need a separate query for each day in our date range. For example, to count the total number of active subscribers on 2 February 2016 we can use the following query:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_subscribers_20160201_query = fc.unique_subscriber_counts(\n", + " connection=conn,\n", + " start_date=\"2016-02-01\",\n", + " end_date=\"2016-02-02\",\n", + " aggregation_unit=\"admin0\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's create a dictionary of \"total subscribers per day\" queries for all the days in our date range. To specify a 1-day time interval for each query, we need the `end_date` to be the day _after_ the `start_date` - we can do this by looping over the dates in `all_dates`, setting each date as the `start_date` of a query, and adding `pd.Timedelta(\"1 day\")` to that date to get the `end_date`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create an empty dictionary\n", + "total_subscribers_per_day_queries = {}\n", + "\n", + "# Loop over the days in our date range\n", + "for day in all_dates:\n", + " # Create a query object for this day, and add it to the dictionary\n", + " total_subscribers_per_day_queries[day] = fc.unique_subscriber_counts(\n", + " connection=conn,\n", + " start_date=str(day),\n", + " end_date=str(day + pd.Timedelta(\"1 day\")),\n", + " aggregation_unit=\"admin0\",\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have defined our \"total subscribers per day\" queries, we can ask the FlowKit server to run them by calling the `run()` method of each query object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for query in total_subscribers_per_day_queries.values():\n", + " query.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These queries may take a while to run, so we can continue to define and run the other queries we need while waiting for the results to become available.\n", + "\n", + "#### 2. Subscriber count per region per day\n", + "\n", + "For \"subscriber count per region per day\", we can use `unique_subscriber_counts` queries again, but this time set `aggregation_unit=admin1` (the admin1 divisions are the regions of Ghana). These queries will count the number of active subscribers in each region during the specified time interval.\n", + "\n", + "Let's create a dictionary of \"subscribers per region per day\" query objects, like we did for \"total subscribers per day\". We can use a dictionary comprehension to express this in a more compact form:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subscribers_per_region_per_day_queries = {\n", + " day: fc.unique_subscriber_counts(\n", + " connection=conn,\n", + " start_date=str(day),\n", + " end_date=str(day + pd.Timedelta(\"1 day\")),\n", + " aggregation_unit=\"admin1\",\n", + " )\n", + " for day in all_dates\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And again, we can now use the `run()` method to ask FlowKit to run these queries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for query in subscribers_per_region_per_day_queries.values():\n", + " query.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 3. Subscriber count per district per day\n", + "\n", + "Similar to \"subscriber count per region per day\", we can calculate the subscriber count per district per day using `unique_subscriber_counts` queries with `aggregation_unit=admin2` (district level). We can define and run these queries as we did for \"subscriber count per region per day\":" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subscribers_per_district_per_day_queries = {\n", + " day: fc.unique_subscriber_counts(\n", + " connection=conn,\n", + " start_date=str(day),\n", + " end_date=str(day + pd.Timedelta(\"1 day\")),\n", + " aggregation_unit=\"admin2\",\n", + " )\n", + " for day in all_dates\n", + "}\n", + "\n", + "for query in subscribers_per_district_per_day_queries.values():\n", + " query.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 4. District-level OD matrix per day\n", + "\n", + "For the OD (origin-destination) matrix, we can use `trips_od_matrix` queries. For each pair of locations A and B, a `trips_od_matrix` query will count the number of unique subscribers who were active in A and then later active in B during the specified time period.\n", + "\n", + "The parameters of `trips_od_matrix` are the same as the parameters of `unique_subscriber_counts`: a `start_date` and `end_date` specifying a time interval, and an `aggregation_unit`. We'll set `aggregation_unit=\"admin2\"`, to get counts of subscribers moving between districts. We will create a dictionary of query objects and set them all running, as we did with the previous queries." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "od_matrix_district_per_day_queries = {\n", + " day: fc.trips_od_matrix(\n", + " connection=conn,\n", + " start_date=str(day),\n", + " end_date=str(day + pd.Timedelta(\"1 day\")),\n", + " aggregation_unit=\"admin2\",\n", + " )\n", + " for day in all_dates\n", + "}\n", + "\n", + "for query in od_matrix_district_per_day_queries.values():\n", + " query.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 5. Total event count per day\n", + "\n", + "The total number of CDR events (calls, SMS, data sessions, etc.) per day will be useful information for checking whether there are any issues with the underlying CDR data within FlowKit. For example, this would help to identify any periods with missing data. We can use a `location_event_counts` query to count the number of events per day. This query has the following parameters:\n", + "\n", + "- `start_date`: the first date in the time period\n", + "- `end_date`: the day _after_ the end of the time period\n", + "- `count_interval`: the length of time intervals within which events should be counted (e.g. `\"minute\"`, `\"hour\"`, `\"day\"`)\n", + "- `aggregation_unit`: the level of spatial aggregation unit for which events will be counted\n", + "\n", + "By setting `count_interval=\"day\"`, we can get the event count for each day using a single query (unlike the `unique_subscriber_counts` and `trips_od_matrix` queries we defined earlier, where a separate query was required for each day). We will set `start_date` and `end_date` to be the start and end dates of the full date range, as we defined them at the beginning of this notebook, and we will set `aggregation_unit=\"admin0\"` to get the total event count for the entire country." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_events_per_day_query = fc.location_event_counts(\n", + " connection=conn,\n", + " start_date=start_date,\n", + " end_date=end_date,\n", + " count_interval=\"day\",\n", + " aggregation_unit=\"admin0\",\n", + ")\n", + "\n", + "total_events_per_day_query.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 6. Event count per district per day\n", + "\n", + "We can also use a `location_event_counts` query to get a separate event count per day for each district, which will be useful to identify whether there are any data issues that only affect certain geographic areas. This query is identical to the \"total events per day\" query above, except that we've changed the aggregation unit to `\"admin2\"` to get an event count for each district." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "events_per_district_per_day_query = fc.location_event_counts(\n", + " connection=conn,\n", + " start_date=start_date,\n", + " end_date=end_date,\n", + " count_interval=\"day\",\n", + " aggregation_unit=\"admin2\",\n", + ")\n", + "\n", + "events_per_district_per_day_query.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Check status of queries\n", + "\n", + "We can check the status of a query by looking at the value of its `status` property. Once the result of a query is ready, its status will be `'completed'`. For example, we can check the status of `total_events_per_day_query` like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Total event counts:\", total_events_per_day_query.status)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the subscriber counts and OD matrix queries we have a separate query for each day, and each query could have a different status. A convenient way to check the status of all the queries in the `total_subscribers_per_day_queries` dictionary is to use a [Counter](https://docs.python.org/2/library/collections.html#collections.Counter):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from collections import Counter\n", + "\n", + "print(\n", + " \"Total subscriber counts:\",\n", + " Counter([query.status for query in total_subscribers_per_day_queries.values()]),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's check the status of all our queries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for label, query_group in [\n", + " (\"Total subscriber counts\", total_subscribers_per_day_queries),\n", + " (\"Subscriber counts per region\", subscribers_per_region_per_day_queries),\n", + " (\"Subscriber counts per district\", subscribers_per_district_per_day_queries),\n", + " (\"OD matrix (district level)\", od_matrix_district_per_day_queries),\n", + "]:\n", + " print(label, Counter([query.status for query in query_group.values()]))\n", + "\n", + "print(f\"Total event counts: {total_events_per_day_query.status}\")\n", + "print(f\"Event counts per district: {events_per_district_per_day_query.status}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Get query results\n", + "\n", + "#### 1. Total subscriber count per day\n", + "\n", + "Once our queries have finished running, we can get the results using the `get_result` method. For example, to get the result of the \"total subscribers per day\" query for the first day in our date range:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "first_result = total_subscribers_per_day_queries[all_dates[0]].get_result()\n", + "first_result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This result has two columns:\n", + "\n", + "- `pcod` is the P-code. Since this result is at admin0 level, there is only one P-code (\"GH\" is the P-code for the whole of Ghana).\n", + "- `value` is the unique subscriber count.\n", + "\n", + "This is the result for a single day. It will be convenient for us to combine the \"total subscribers per day\" results for all days into a single DataFrame - so that we know which result corresponds to which date, we can add a 'date' column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Add date column\n", + "first_result[\"date\"] = all_dates[0]\n", + "first_result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The most convenient way to combine the \"total subscribers per day\" results for all dates into a single DataFrame is to use the pandas `concat` function. First, we can use a list comprehension to get a list of individual result DataFrames (`.assign(date=day)` adds the value of `day` in a new `date` column):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "all_total_subscribers_results = [\n", + " query.get_result().assign(date=day)\n", + " for day, query in total_subscribers_per_day_queries.items()\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can use `pd.concat` to concatenate all the DataFrames in the list `all_total_subscribers_results` into a single DataFrame (the `ignore_index=True` option ensures that each row in the final DataFrame has a unique index, rather than keeping the indexes from the individual DataFrames):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_subscribers_per_day_results = pd.concat(\n", + " all_total_subscribers_results, ignore_index=True\n", + ")\n", + "\n", + "total_subscribers_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `total_subscribers_per_day_results` DataFrame contains the total daily subscriber count for each day in our date range. The `pcod` column is not useful in this case - it has the same value for every row, so we can drop it:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_subscribers_per_day_results = total_subscribers_per_day_results.drop(\n", + " columns=\"pcod\"\n", + ")\n", + "total_subscribers_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 2. Subscriber count per region per day\n", + "\n", + "We can get the \"subscribers per region per day\" results in the same way. This time we're getting the list of individual query results within the call to `pd.concat()` instead of assigning it to a new variable first, but the process here is the same." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subscribers_per_region_per_day_results = pd.concat(\n", + " [\n", + " query.get_result().assign(date=day)\n", + " for day, query in subscribers_per_region_per_day_queries.items()\n", + " ],\n", + " ignore_index=True,\n", + ")\n", + "\n", + "subscribers_per_region_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This time the `pcod` column contains the level 1 P-code that identifies each region. There are 10 regions, so we get 10 subscriber counts per day. In this case the `pcod` column contains useful information, so we'll keep it.\n", + "\n", + "#### 3. Subscriber count per district per day\n", + "\n", + "We can get the \"subscribers per district per day\" results in the same way. This DataFrame is larger than `subscribers_per_region_per_day_results`, because there are 137 districts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subscribers_per_district_per_day_results = pd.concat(\n", + " [\n", + " query.get_result().assign(date=day)\n", + " for day, query in subscribers_per_district_per_day_queries.items()\n", + " ],\n", + " ignore_index=True,\n", + ")\n", + "\n", + "subscribers_per_district_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 4. District-level OD matrix per day\n", + "\n", + "The process for getting the OD matrix results is the same:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "od_matrix_district_per_day_results = pd.concat(\n", + " [\n", + " query.get_result().assign(date=day)\n", + " for day, query in od_matrix_district_per_day_queries.items()\n", + " ],\n", + " ignore_index=True,\n", + ")\n", + "\n", + "od_matrix_district_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This DataFrame has four columns:\n", + "\n", + "- `pcod_from` is the P-code of the first district,\n", + "- `pcod_to` is the P-code of the second district,\n", + "- `value` is the number of unique subscribers who were active in `pcod_from` and later `pcod_to`,\n", + "- `date` is the date on which subscribers were active.\n", + "\n", + "#### 5. Total event count per day\n", + "\n", + "\"Total events per day\" is a single query, so we can get the full result using a single call to `get_result()`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_events_per_day_results = total_events_per_day_query.get_result()\n", + "total_events_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This has three columns: `pcod` (the P-code for Ghana - this is always \"GH\"), `date` (the date on which events were counted) and `value` (the count of CDR events on that date). At first glance it looks like this is in the format we need, but let's check the information about this DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_events_per_day_results.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The datatype of the `date` column is 'object' (i.e. string). For later analysis, it will be useful to convert this column to 'datetime' data type, which we can do using the pandas `to_datetime` function. Let's also drop the \"pcod\" column (as we did for \"total subscribers per day\"), becuse this doesn't contain useful information." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_events_per_day_results[\"date\"] = pd.to_datetime(\n", + " total_events_per_day_results[\"date\"]\n", + ")\n", + "\n", + "total_events_per_day_results = total_events_per_day_results.drop(columns=[\"pcod\"])\n", + "\n", + "total_events_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 6. Event count per district per day\n", + "\n", + "\"Events per district per day\" is a single query, like \"total events per day\", so we can get the result in the same way. Again, we'll convert the `date` column to 'datetime' type. This time the `pcod` column is important (it identifies the districts), so we'll keep it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "events_per_district_per_day_results = events_per_district_per_day_query.get_result()\n", + "events_per_district_per_day_results[\"date\"] = pd.to_datetime(\n", + " events_per_district_per_day_results[\"date\"]\n", + ")\n", + "\n", + "events_per_district_per_day_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### QA checks\n", + "\n", + "It is possible that some of the CDR data available in FlowKit may be incomplete or incorrect. If this is the case, conclusions we draw from our aggregates may be incorrect or misleading, so it is important to perform some quality checks on our query results before we use them to investigate mobility patterns. In this section we'll go through some of these checks." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Time-series plots\n", + "\n", + "A simple check we can perform is to plot the subscriber counts over time, and look for any unexpected changes. For example, we can plot the total subscriber count per day:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_subscribers_per_day_results.plot(x=\"date\", y=\"value\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the total subscriber count per day remains stable over time, except for some very small fluctuations. If there were any large changes, this could indicate either missing data or changes in calling behaviour, either of which would make the insights from these results unreliable.\n", + "\n", + "**Note:** In real data, we would expect to see a repeating weekly pattern of variation in these results - for example, the number of active subscribers on a Sunday may typically be lower than on a weekday. The data we're using here are synthetic, and there is no such pattern present.\n", + "\n", + "We can make a similar plot of subscriber counts per region, using the pandas `pivot` method to transform the DataFrame into a structure that can easily be plotted as multiple lines:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subscribers_per_region_per_day_results.pivot(\n", + " index=\"date\", columns=\"pcod\", values=\"value\"\n", + ").plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we can see that there is a very large drop in the subscriber counts in 'GHA.1_1' (Ashanti region) on 1st March, followed by a smaller increase on 10th March. Subscriber counts in all other regions increase when the subscriber counts in Ashanti decrease. This is due to the modelled mobility pattern in the synthetic dataset, in which all subscribers were forced to leave the Ashanti region on 1st March and were allowed to begin returning from 10th March. So in this case, the changes seen here are due to \"real\" mobility changes, rather than problems with the data.\n", + "\n", + "**Note:** In a real-life scenario we would not usually expect changes due to mobility restrictions to be as large or as sudden as those seen here. However, such changes would typically be visible in these plots, so care should be taken when trying to determine whether a change is a \"real\" effect or a data issue.\n", + "\n", + "We can make a similar plot of district-level subscriber counts:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subscribers_per_district_per_day_results.pivot(\n", + " index=\"date\", columns=\"pcod\", values=\"value\"\n", + ").plot(legend=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again we can see the sharp changes on 1st and 10th March. Due to the large number of districts, it may be more insightful to plot the sum of subscriber counts across all districts. Let's include the \"total subscribers per day\" data in the same plot, for comparison - we can do this by creating a matplotlib figure, and specifying `ax=plt.gca()` (\"get current axes\") to plot multiple results in the same plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create a matplotlib figure\n", + "plt.figure()\n", + "\n", + "# Show total subscriber count for comparison\n", + "total_subscribers_per_day_results.plot(\n", + " ax=plt.gca(), x=\"date\", y=\"value\", label=\"total subscribers\"\n", + ")\n", + "\n", + "# Plot the sum of per-district subscriber counts across all districts\n", + "subscribers_per_district_per_day_results.groupby(\"date\").sum().plot(\n", + " ax=plt.gca(), y=\"value\", label=\"sum(subscribers per district)\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the sum of subscriber counts across all districts is also stable over time (again, if we were using real data we'd expect to see weekly variation, and changes due to mobility restrictions). Also, the sum of district-level subscriber counts is always larger than the total subscriber count - this is because some subscribers will be active in multiple districts on the same day, so they will be counted multiple times in the sum, whereas each subscriber is counted only once in the country-level total subscriber count.\n", + "\n", + "Let's also plot the sum of OD matrix counts:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure()\n", + "\n", + "# Show total subscriber count for comparison\n", + "total_subscribers_per_day_results.plot(\n", + " ax=plt.gca(), x=\"date\", y=\"value\", label=\"total subscribers\"\n", + ")\n", + "\n", + "# Plot the sum of per-district subscriber counts across all districts\n", + "subscribers_per_district_per_day_results.groupby(\"date\").sum().plot(\n", + " ax=plt.gca(), y=\"value\", label=\"sum(subscribers per district)\"\n", + ")\n", + "\n", + "# Similar for OD matrix\n", + "od_matrix_district_per_day_results.groupby(\"date\").sum().plot(\n", + " ax=plt.gca(), y=\"value\", label=\"sum(OD matrix)\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The sum of OD matrix counts is smaller than the total subscriber count - this is not unexpected, because some subscribers will only be active in a single district so will not be included in the OD matrix.\n", + "\n", + "There is an effect from the mobility changes on 1st and 10th March, but again there is no obvious sign of data issues." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Spatial distribution of subscriber counts\n", + "\n", + "Another thing we can check is the spatial distribution of subscriber counts. WE eould expect subscriber counts in urban districts to be larger than those in rural districts. To check this, let's calculate the median subscriber count per district:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "median_subscribers_per_district = subscribers_per_district_per_day_results.groupby(\n", + " \"pcod\"\n", + ").median()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To plot the median subscriber counts on a map, we need to get the geographic boundaries of the districts. We can do this using the flowclient `get_geography` function, as described in the [\"Geography\" tutorial](03-geography.ipynb):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "districts_geojson = fc.get_geography(connection=conn, aggregation_unit=\"admin2\")\n", + "districts_gdf = gpd.GeoDataFrame.from_features(districts_geojson)\n", + "districts_gdf" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now join the district boundaries to the median subscriber counts, and display them on a map:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "median_subscribers_per_district_with_geo = districts_gdf.merge(\n", + " median_subscribers_per_district, left_on=\"pcod\", right_index=True\n", + ")\n", + "\n", + "median_subscribers_per_district_with_geo.plot(column=\"value\", legend=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the subscriber count in Accra is much larger than in other districts, as we would expect.\n", + "\n", + "It is difficult to compare the subscriber counts in other districts on this colour scale. It would be clearer to us a log scale - we can do this using `matplotlib.colors.LogNorm`. First, we import the `matplotlib.colors` library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.colors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can plot the median subscriber counts in the same way as we did before, but additionally specifying the `norm` parameter. We use `matplotlib.colors.LogNorm()`, which takes two arguments `vmin` and `vmax` (the minimum and maximum values in the colour range):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "median_subscribers_per_district_with_geo.plot(\n", + " column=\"value\",\n", + " legend=True,\n", + " norm=matplotlib.colors.LogNorm(\n", + " vmin=median_subscribers_per_district_with_geo.value.min(),\n", + " vmax=median_subscribers_per_district_with_geo.value.max(),\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the cities of Accra, Kumasi, Tamale and Sekondi-Takoradi all have higher subscriber counts than their surrounding districts, which aligns with what we would expect. Subscriber counts in the Ashanti region are smaller - this is in line with the time-series plot we made earlier of subscriber counts per region." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Events per subscriber\n", + "\n", + "We can also check the average number of events per subscriber per day. We should always have at least 1 event per subscriber (becuase a subscriber with no events would not be present in the dataset), but this value should not be too large - it would be hard to believe that subscribers make 100 calls per day on average, for example. A \"reasonable\" value here will depend on which CDR data types are included - there would typically be more mobile data sessions than phone calls per subscriber per day.\n", + "\n", + "The number of events per subscriber should be fairly stable over time. If this is not the case, it could indicate that there have been substantial changes in calling behaviour, which could make our results unreliable.\n", + "\n", + "We can calculate the average events per subscriber for the whole country by dividing the total event count per day (from `total_events_per_day_results`) by the total subscriber count per day (from `total_subscribers_per_day_results`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "average_events_per_subscriber = (\n", + " total_events_per_day_results.set_index(\"date\").value\n", + " / total_subscribers_per_day_results.set_index(\"date\").value\n", + ")\n", + "\n", + "average_events_per_subscriber.plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the average number of events per subscriber is around 2.3, which is reasonable, and there are no large changes over time.\n", + "\n", + "Similarly, we can calculate the average number of events per subscriber in each district by dividing event counts in `events_per_district_per_day_results` by subscriber counts in `subscribers_per_district_per_day_results`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "events_per_subscriber_per_district = (\n", + " events_per_district_per_day_results.set_index([\"date\", \"pcod\"])\n", + " / subscribers_per_district_per_day_results.set_index([\"date\", \"pcod\"])\n", + ").reset_index()\n", + "\n", + "events_per_subscriber_per_district.pivot(\n", + " index=\"date\", columns=\"pcod\", values=\"value\"\n", + ").plot(legend=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The changes on 1st and 10th March correspond to a large decrease in the number of events per subscriber in some districts. This could be because the residents of these districts have moved away, so the subscribers we see just made brief visits to the district, or could be because subscribers in these districts are making fewer calls than usual. In either case, we should be cautious when interpreting a change in subscriber count as a corresponding change in population size, as some of the apparent change may be due to different phone usage." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Missing data\n", + "\n", + "It is possible that some of the CDR data ingested into FlowKit's database may be incomplete - there may be missing data on certain days, or for certain geographic areas. To check for missing data we can plot the total event count per day:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "total_events_per_day_results.plot(x=\"date\", y=\"value\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this case, we can see that the event count is quite stable over time, and there are no days with a much smaller event count than usual.\n", + "\n", + "If we noticed that the event count on one day was significantly smaller than usual, we could look at the district-level subscriber counts to see whether there are areas with missing data. For example, to plot the event counts per district on 21st February (using a log colour scale, as we did previously):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "events_per_district_20160221 = events_per_district_per_day_results[\n", + " events_per_district_per_day_results.date == \"2016-02-21\"\n", + "]\n", + "\n", + "districts_gdf.merge(events_per_district_20160221, on=\"pcod\",).plot(\n", + " column=\"value\",\n", + " legend=True,\n", + " norm=matplotlib.colors.LogNorm(\n", + " vmin=events_per_district_20160221.value.min(),\n", + " vmax=events_per_district_20160221.value.max(),\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If there were parts of the country with significantly lower event counts than the rest, this could indicate missing data.\n", + "\n", + "If the aggregates are re-calculated regularly from new available dates of CDR data, these QA checks should be updated each time to check for issues in the new data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Mobility indicators\n", + "\n", + "The aggregates we have calculated do not reveal any personal information about subscribers. However, the raw aggregates are not appropriate to share with a wider audience, for two reasons:\n", + "\n", + "1. The aggregates may reveal commercially sensitive information (for example, the size or geographic distribution of an MNO's subscriber base),\n", + "2. The aggregates could be misleading (for example, a subscriber count of 10000 could be wrongly interpreted as a population count of 10000 people, whereas the true relationship between subscriber count and population count would depend on the MNO's market share, subscribers' calling behaviour, and other factors).\n", + "\n", + "In this section we will use the aggregates to calculate some \"mobility indicators\", which are scaled so that they do not reveal the absolute size or distribution of subscriber counts, while aiming to represent relevant mobility information.\n", + "\n", + "We will calculate three mobility indicators:\n", + "\n", + "1. Percentage change in subscriber presence per district\n", + "2. Average number of districts visited per subscriber within each region\n", + "3. Percentage change in subscriber movements between districts\n", + "\n", + "Details of other mobility indicators can be found on our [COVID-19 website](https://covid19.flowminder.org/).\n", + "\n", + "We need to choose a \"baseline period\" - this is a period before mobility restrictions began, which is assumed to display normal mobility. The baseline period should be at least 4 weeks long, so that baseline averages are not overly skewed by unusual events such as public holidays, and should be a whole number of weeks so that baseline averages are not skewed by including a larger number of Sundays than weekdays (for example). The first mobility restrictions in our example started on 1 March 2016, so we will choose the 4 weeks immediately before this as our baseline period:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "baseline_start = \"2016-02-02\"\n", + "baseline_end = \"2016-03-01\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is also useful, if possible, to identify a stable period during the period of mobility restrictions. In our example, the most recent change in mobility restrictions occurred on 10 March 2016, so we will take all dates from 10 March onwards as our stable period:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "stable_period_start = \"2016-03-10\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 1. Percentage change in subscriber presence per district\n", + "\n", + "This indicator is the percentage change in the number of subscribers in a district, relative to the median number of subscribers during the baseline period.\n", + "\n", + "The subscriber counts we have calculated could be affected by calling behaviour - if fewer subscribers make calls on a particular day, the subscriber count will be lower although those subscribers may still be present in a district, without making a call. To correct for this effect, we can divide the subscriber counts per district by the total subcriber count for the country each day. Assuming that the total number of people in the country doesn't significantly change during the period we're investigating, this will mitigate the effects of changes in calling behaviour.\n", + "\n", + "To calculate the \"scaled subscriber count\" per district (i.e. the subscriber count per district divided by the total subscriber count for the country), we first need to merge the district-level subscriber counts with the total subscriber counts, joining on the 'date' column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "merged_subscriber_counts = subscribers_per_district_per_day_results.merge(\n", + " total_subscribers_per_day_results.rename(columns={\"value\": \"total_subscribers\"}),\n", + " on=\"date\",\n", + ")\n", + "\n", + "merged_subscriber_counts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now add a new `\"scaled_subscriber_count\"` column to this DataFrame, calculated as the district-level subscriber count divided by the total subscriber count:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "merged_subscriber_counts[\"scaled_subscriber_count\"] = (\n", + " merged_subscriber_counts.value / merged_subscriber_counts.total_subscribers\n", + ")\n", + "\n", + "merged_subscriber_counts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We want to calculate the percentage change in the scaled subscriber count relative to the baseline period, so we need to define a single \"baseline\" scaled subscriber count for each district. We define this as the median of the daily scaled subscriber counts over all days in the baseline period. To calculate this, we filter the `merged_subscriber_counts` DataFrame to include only dates within the baseline period, then group by district (identified by the \"pcod\" column) and find the median of the \"scaled_subscriber_count\" column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "scaled_subscriber_count_baseline = (\n", + " merged_subscriber_counts[\n", + " (merged_subscriber_counts.date >= baseline_start)\n", + " & (merged_subscriber_counts.date < baseline_end)\n", + " ]\n", + " .groupby(\"pcod\")\n", + " .median()\n", + " .scaled_subscriber_count\n", + ")\n", + "\n", + "scaled_subscriber_count_baseline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can merge these baseline values into the `merged_subscriber_counts` DataFrame as a new \"scaled_subscriber_count_baseline\" column, joining on the \"pcod\" column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "merged_subscriber_counts = merged_subscriber_counts.merge(\n", + " scaled_subscriber_count_baseline,\n", + " left_on=\"pcod\",\n", + " right_index=True,\n", + " suffixes=(\"\", \"_baseline\"),\n", + ")\n", + "\n", + "merged_subscriber_counts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now calculate the percentage change as\n", + "$$\n", + "100 \\times \\left(\\frac{\\mathrm{scaled\\ subscriber\\ count}}{\\mathrm{scaled\\ subscriber\\ count\\ baseline}} - 1\\right)\n", + "$$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "merged_subscriber_counts[\"percent_change\"] = (\n", + " merged_subscriber_counts.scaled_subscriber_count\n", + " / merged_subscriber_counts.scaled_subscriber_count_baseline\n", + " - 1\n", + ") * 100\n", + "\n", + "merged_subscriber_counts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's choose a few key districts, and plot the percentage change in subscriber presence over time:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display_districts = {\n", + " \"Accra\": \"GHA.5.1_1\",\n", + " \"Kumasi\": \"GHA.1.16_1\",\n", + " \"Tamale\": \"GHA.6.13_1\",\n", + "}\n", + "\n", + "plt.figure()\n", + "for name, pcod in display_districts.items():\n", + " merged_subscriber_counts[merged_subscriber_counts.pcod == pcod].plot(\n", + " ax=plt.gca(), x=\"date\", y=\"percent_change\", marker=\".\", ls=\"\", label=name\n", + " )\n", + "plt.axhline(0, ls=\":\", c=\"k\")\n", + "plt.ylim(bottom=-100)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that the subscriber presence in Kumasi decreased by more than 80% from early March, compared to the baseline period. In fact, there are no data for 1-10 March, suggesting that there were no active subscribers at all in Kumasi during this time. Meanwhile, the subscriber presence in Accra and Tamale increased by 20%. It appears that almost everybody in Kumasi has moved away to other districts.\n", + "\n", + "We can also look at the percentage changes in all districts, on a map. Since the percentage change will be different each day, let's find the median percentage change per district over the \"stable period\" from 10 March onwards:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "median_subscriber_count_percent_change = (\n", + " merged_subscriber_counts[merged_subscriber_counts.date >= stable_period_start]\n", + " .groupby(\"pcod\")\n", + " .median()\n", + " .percent_change\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now join the district boundaries to the median percentage changes and show these on a map (we'll use the \"Spectral\" colour map with a colour scale from -100% to 100%):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "districts_gdf.merge(\n", + " median_subscriber_count_percent_change, left_on=\"pcod\", right_index=True\n", + ").plot(\n", + " column=\"percent_change\",\n", + " legend=True,\n", + " cmap=\"Spectral\",\n", + " norm=plt.Normalize(vmin=-100, vmax=100),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that subscriber presence in the Ashanti region decreased by 80% following mobility restrictions, and subscriber presence across the rest of the country increased by around 20%." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 2. Average number of districts visited per subscriber within each region\n", + "\n", + "We can also look at the average number of districts visited per subscriber within each region. This is an indicator of the amount of movement in each region - if people move around less, they will be seen in fewer districts.\n", + "\n", + "The average number of districts per subscriber within each region is the sum of subscriber counts for all districts within a region, divided by the overall subscriber count for that region.\n", + "\n", + "We will start by mapping each district to a region. We can do this by looking at the P-codes - district \"GHA.X.Y_1\" is within region \"GHA.X_1\", so we can find the region P-code from the district P-code by taking everything before the final `\".\"`, and appending `\"_1\"`. Let's add this as a new \"region_pcod\" column in the `subscribers_per_district_per_day` DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subscribers_per_district_per_day_results[\n", + " \"region_pcod\"\n", + "] = subscribers_per_district_per_day_results.pcod.apply(\n", + " lambda x: x.rsplit(\".\", 1)[0] + \"_1\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now group by date and region, and sum the subscriber counts:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "districts_per_region_stats = (\n", + " subscribers_per_district_per_day_results.groupby([\"date\", \"region_pcod\"])\n", + " .sum()\n", + " .reset_index()\n", + ")\n", + "\n", + "# Rename column to be more explicit\n", + "districts_per_region_stats = districts_per_region_stats.rename(\n", + " columns={\"value\": \"sum_district_subscriber_counts\"}\n", + ")\n", + "\n", + "districts_per_region_stats" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can merge the summed district-level subscriber counts with the region-level subscriber counts:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Merge with region counts\n", + "districts_per_region_stats = districts_per_region_stats.merge(\n", + " subscribers_per_region_per_day_results,\n", + " left_on=[\"date\", \"region_pcod\"],\n", + " right_on=[\"date\", \"pcod\"],\n", + ")\n", + "\n", + "# Rename value column\n", + "districts_per_region_stats = districts_per_region_stats.rename(\n", + " columns={\"value\": \"region_subscriber_count\"}\n", + ")\n", + "\n", + "districts_per_region_stats" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can calculate the average number of districts visited per subscriber per day within each region, by dividing the summed district-level subscrber counts by the region-level subscriber count:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "districts_per_region_stats[\"average_districts_per_subscriber\"] = (\n", + " districts_per_region_stats.sum_district_subscriber_counts\n", + " / districts_per_region_stats.region_subscriber_count\n", + ")\n", + "\n", + "districts_per_region_stats" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's plot the average districts per subscriber over time for all regions:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "districts_per_region_stats.pivot(\n", + " index=\"date\", columns=\"region_pcod\", values=\"average_districts_per_subscriber\"\n", + ").plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that in all cases the average districts per subscriber is very close to 1, so levels of movement were already quite low before the mobility restrictions. After the mobility restrictions, the value in GHA.1_1 (Ashanti) drops lower (so the subscribers remaining in Ashanti are not moving around the region as much) while the values for all other regions show only a small increase." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 3. Percentage change in subscriber movements between districts\n", + "\n", + "Finally, we can look at the changes in subscriber movements between districts. To do this, we will use the \"district-level OD matrix per day\" aggregate, which counts the number of subscribers seen in both of each pair of regions per day. We will calculate the percentage change in subscriber movements, using a process very similar to the one we used for the \"percentage change in subscriber presence per day\" indicator:\n", + "\n", + "1. Scale the OD matrix subscriber count for each pair of districts by the total subscriber count for that day, to correct for the effects of changes in calling behaviour.\n", + "2. Calculate the median \"scaled subscriber count\" for each pair of districts over the baseline period,\n", + "3. For each pair of districts each day, calculate the percentage change in scaled subscriber count relative to the baseline.\n", + "\n", + "We start by merging the OD matrix DataFrame with the \"total subscribers per day\" DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "od_merged_subscriber_counts = od_matrix_district_per_day_results.merge(\n", + " total_subscribers_per_day_results.rename(columns={\"value\": \"total_subscribers\"}),\n", + " on=\"date\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we divide the subscriber count for each pair of districts by the total subscriber count for that day, to get the \"scaled subscriber count\":" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "od_merged_subscriber_counts[\"scaled_subscriber_count\"] = (\n", + " od_merged_subscriber_counts.value / od_merged_subscriber_counts.total_subscribers\n", + ")\n", + "\n", + "od_merged_subscriber_counts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now calculate the median \"scaled subscriber count\" for each pair of district over the baseline period:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "od_scaled_subscriber_count_baseline = (\n", + " od_merged_subscriber_counts[\n", + " (od_merged_subscriber_counts.date >= baseline_start)\n", + " & (od_merged_subscriber_counts.date < baseline_end)\n", + " ]\n", + " .groupby([\"pcod_from\", \"pcod_to\"])\n", + " .median()\n", + " .scaled_subscriber_count\n", + ")\n", + "od_scaled_subscriber_count_baseline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now merge the baseline values back into the `od_merged_subscriber_counts` DataFrame, joining on the \"pcod_from\" and \"pcod_to\" columns:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "od_merged_subscriber_counts = od_merged_subscriber_counts.merge(\n", + " od_scaled_subscriber_count_baseline,\n", + " left_on=[\"pcod_from\", \"pcod_to\"],\n", + " right_index=True,\n", + " suffixes=(\"\", \"_baseline\"),\n", + ")\n", + "\n", + "od_merged_subscriber_counts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, calculate the percentage change from the baseline value, for each pair of districts each day:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "od_merged_subscriber_counts[\"percent_change\"] = (\n", + " od_merged_subscriber_counts.scaled_subscriber_count\n", + " / od_merged_subscriber_counts.scaled_subscriber_count_baseline\n", + " - 1\n", + ") * 100\n", + "od_merged_subscriber_counts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To show the percentage changes on a map, let's calculate the median percent changes over the \"stable period\":" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "median_od_subscriber_count_percent_change = (\n", + " od_merged_subscriber_counts[od_merged_subscriber_counts.date >= stable_period_start]\n", + " .groupby([\"pcod_from\", \"pcod_to\"])\n", + " .median()\n", + " .percent_change.reset_index()\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We could choose a district, and show the median percentage changes in movements from that district to all other districts. For this, we need to join the percentage changes data to the district boundaries, using the \"pcod_to\" column.\n", + "\n", + "**Note:** If we wanted to show movements in the other direction (i.e. movements from all districts to a chosen district), we would join to the district boundaries using the \"pcod_from\" column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Join to geo (on the \"to\" location)\n", + "median_od_percent_change_with_geo = districts_gdf.merge(\n", + " median_od_subscriber_count_percent_change, left_on=\"pcod\", right_on=\"pcod_to\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's plot the percent change in movements from each of Accra, Kumasi and Tamale:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display_districts = {\n", + " \"Accra\": \"GHA.5.1_1\",\n", + " \"Kumasi\": \"GHA.1.16_1\",\n", + " \"Tamale\": \"GHA.6.13_1\",\n", + "}\n", + "\n", + "for name, pcod in display_districts.items():\n", + " median_od_percent_change_with_geo[\n", + " median_od_percent_change_with_geo.pcod_from == pcod\n", + " ].plot(\n", + " column=\"percent_change\",\n", + " legend=True,\n", + " cmap=\"Spectral\",\n", + " norm=plt.Normalize(vmin=-100, vmax=100),\n", + " legend_kwds={\"label\": name},\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From Accra and Tamale, movements to the Ashanti region have decreased, while movements to other districts have increased slightly. For Kumasi, movements to all other districts have decreased, especially to the rest of Ashanti (reflecting the smaller number of subscribers remaining in the region)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Summary\n", + "\n", + "In this example, we used FlowKit to extract aggregates from CDR data, and used this to produce mobility indicators. Before producing the indicators, we performed some QA checks on the aggregates to check that there were no issues with the underlying CDR data that could affect the validity of our mobility insights.\n", + "\n", + "We discovered that there was a large decrease in subscriber presence in the Ashanti region from 1 March 2016, followed by a small recovery on 10 March 2016, and a corresponding increase in subscriber presence in the rest of the country (so it appears that there was a large displacement of people from the Ashanti region to the rest of the country).\n", + "\n", + "**Note:** This example is built on a synthetic dataset, which is not intended to represent the effects of real-life mobility restrictions.\n", + "\n", + "In the real-world scenario of an ongoing disease epidemic, the effects of mobility restrictions could be monitored in near-real-time by updating these indicators every few days, provided regular updates of CDR data are being added to the FlowKit server. For more infirmation on the use of CDR data to monitor mobility during a disease epidemic, visit Flowminder's [COVID-19 website](https://covid19.flowminder.org/)." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}