Commit f5ceed9

add plotly output
1 parent 7578f69 commit f5ceed9

5 files changed

Lines changed: 72 additions & 52 deletions

config.yaml

Lines changed: 2 additions & 1 deletion
@@ -70,8 +70,9 @@ episodes:
7070
- conditionals.md
7171
- writing-functions.md
7272
- tidy.md
73-
- wrap.md
7473
- data-visualisation.md
74+
- wrap.md
75+
7576

7677
# Information for Learners
7778
learners:

episodes/data-visualisation.md

Lines changed: 42 additions & 36 deletions
@@ -1,7 +1,7 @@
11
---
22
title: 'Data Visualisation'
3-
teaching: 10
4-
exercises: 2
3+
teaching: 20
4+
exercises: 10
55
---
66

77
:::::::::::::::::::::::::::::::::::::: questions
@@ -49,20 +49,20 @@ df_long.head()
4949
4 Austin-Irving 6100 W. Irving Park Rd. ... 12593 2011-01-01
5050
```
5151

52+
## Convert a column to datetime
5253

53-
In order to plot this data over time we need to do two things to prepare it first. First, we need to tell Python that the data column is a datetime object using the Pandas `to_dateime` function. Second, we assign the date column as our index for the data. These two steps will set up our data for plotting.
54+
To plot this data over time, we first need to prepare it in two steps. First, we tell Python that the date column holds [datetime objects](https://docs.python.org/3/library/datetime.html) by using the Pandas `to_datetime` function. Second, we assign the date column as the index of our DataFrame. These two steps set up our data for plotting.
5455

5556
``` python
5657
df_long['date']= pd.to_datetime(df_long['date'])
5758
```
5859

59-
The above converts our date column to a data time object. Let’s confirm it worked.
60+
The above converts our date column to a datetime object. Let’s confirm it worked.
6061

6162
``` python
6263
df_long.info()
6364
```
6465

65-
6666
``` output
6767
<class 'pandas.core.frame.DataFrame'>
6868
RangeIndex: 11556 entries, 0 to 11555
@@ -82,7 +82,7 @@ dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
8282
memory usage: 812.7+ KB
8383
```
8484

85-
That worked! Now, we can make this datetime object our dataframe index.
85+
That worked! Now we can make the datetime column the index of our DataFrame. In the Pandas episode we looked at the default numerical index that Pandas creates, but we can also use `.set_index()` to declare a specific column as the index. Using a datetime index will make it easier to plot the DataFrame over time. The first parameter of `.set_index()` is the column name, and the `inplace=True` parameter lets us modify the DataFrame without assigning the result to a new variable.
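As a minimal sketch of the `.set_index()` call described above, here is the same pattern on a small made-up DataFrame. The `toy` name and its values are illustrative assumptions, not the lesson's data.

```python
import pandas as pd

# Toy data standing in for df_long; the values are invented for illustration.
toy = pd.DataFrame({
    'branch': ['Albany Park', 'Albany Park', 'Austin'],
    'date': ['2011-01-01', '2011-02-01', '2011-01-01'],
    'circulation': [8427, 7023, 1172],
})

# Convert the text column to datetime, then make it the index in place.
toy['date'] = pd.to_datetime(toy['date'])
toy.set_index('date', inplace=True)

# The index is now a DatetimeIndex and 'date' is no longer a regular column.
print(type(toy.index).__name__)
print(toy.head())
```

With `inplace=True` the call returns `None` and changes `toy` directly. The alternative is to leave `inplace` out and assign the returned DataFrame back to a variable.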
8686

8787

8888
``` python
@@ -106,6 +106,8 @@ df_long.head()
106106
107107
```
108108

109+
## Plotting with Pandas
110+
109111
Ok! We are now ready to plot our data. Since this data is monthly data, we can plot the circulation data over time.
110112

111113
At first, let’s focus on a specific branch. We can select the rows for the Albany Park branch:
@@ -128,22 +130,22 @@ albany.head()
128130
2018-01-01 Albany Park NaN ... january 9381
129131
```
130132

131-
Now we can use the `plot()` function built into pandas. Let’s try it:
133+
Now we can use the `plot()` function that is built in to pandas. Let’s try it:
132134

133135
``` python
134136
albany.plot()
135137
```
136138

137-
![](fig/albany-plot-1.png)
139+
![](fig/albany-plot-1.png){alt="Line plot of zip code, ytd, year, and circulation numbers over time from the albany DataFrame"}
138140

139-
That’s great! By default `plot` use a line plot and will plot all numeric variables of the data frame. This isn’t exactly what we want, so let’s tell `plot` what variable to use by selecting `circulation_count`.
141+
That's interesting, but by default `.plot()` will use a line plot for all numeric variables of the DataFrame. This isn’t exactly what we want, so let’s tell `.plot()` what variable to use by selecting the `circulation` column.
140142

141143

142144
``` python
143145
albany['circulation'].plot()
144146
```
145147

146-
![](fig/albany-circ-3.png)
148+
![](fig/albany-circ-3.png){alt="Line plot of the Albany Park branch circulation showing a big drop from 2013 to 2014."}
147149

148150
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: challenge
149151

@@ -159,8 +161,15 @@ The drop from 2012 through part of 2014 corresponds to the reconstruction period
159161
::::::::::::::::::::::::::::::::::::::::::::::::
160162
::::::::::::::::::::::::::::::::::::::::::::::::
161163

164+
## Use Matplotlib for more detailed charts
165+
162166
What if we want to alter the axis labels and the title of the graph? To do that, we first need to import `matplotlib`, an extensive plotting package in Python that lets us alter all aspects of a graph.
163167

168+
- We can pass parameters to matplotlib's `.plot()` function to assign a plot title, to declare a `figsize` (which accepts a width and height in inches), and to change the color of the line.
169+
- Next we'll add text labels for the x and y axes using `.xlabel()` and `.ylabel()`.
170+
- Finally, we need a separate function `.show()` to display the plot using matplotlib.
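The three steps in the list can be sketched as follows on made-up data. The `toy` Series, the title text, the 10 by 5 inch `figsize`, and the line color are all assumptions chosen for illustration. The plot is drawn through the pandas `.plot()` method, which passes these parameters on to matplotlib.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented monthly values standing in for albany['circulation'].
toy = pd.Series(
    [8427, 7023, 9702, 9344],
    index=pd.date_range('2011-01-01', periods=4, freq='MS'),
    name='circulation',
)

# Step 1: set the title, figure size (width, height in inches), and line color.
ax = toy.plot(title='Circulation Count Over Time', figsize=(10, 5), color='blue')

# Step 2: label the axes.
plt.xlabel('Date')
plt.ylabel('Circulation Count')

# Step 3: display the figure.
plt.show()
```

The `ax` object returned by `.plot()` is a matplotlib Axes, so the same labels could also be set with `ax.set_xlabel()` and `ax.set_ylabel()`.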
171+
172+
164173
``` python
165174
# the plotting package pandas is using under the hood to `plot()`
166175
import matplotlib.pyplot as plt
@@ -171,17 +180,20 @@ plt.xlabel('Date')
171180
plt.ylabel('Circulation Count')
172181
plt.show()
173182
```
174-
![](fig/albany-circ-labeling-5.png)
175183

176-
What if we want to use a different plot type for this graphic? To do so, we can change the `kind` parameters in our `plot` function.
184+
![](fig/albany-circ-labeling-5.png){alt="Line plot of the Albany Park branch circulation with matplotlib styles applied."}
185+
186+
### Changing plot types with Matplotlib
187+
188+
What if we want to use a different plot type for this graphic? To do so, we can change the `kind` parameter in our `.plot()` function.
177189

178190
``` python
179191
albany['circulation'].plot(kind='area', title='Circulation Count Area Plot at Albany Park', alpha=0.5)
180192
plt.xlabel('Date')
181193
plt.ylabel('Circulation Count')
182194
plt.show()
183195
```
184-
![](fig/albany-circ-area-7.png)
196+
![](fig/albany-circ-area-7.png){alt="Area plot of the Albany Park branch circulation."}
185197

186198
We can also look at our circulation data as a histogram.
187199

@@ -191,25 +203,20 @@ plt.xlabel('Circulation Count')
191203
plt.show()
192204
```
193205

194-
![](fig/albany-circ-hist-9.png)
206+
![](fig/albany-circ-hist-9.png){alt="Histogram of the Albany Park branch circulation."}
195207

196-
## Interactive Line Plot for Circulation Over Time
208+
## Use Plotly for interactive plots
197209

198-
Let’s switch back to the full dataframe in `df_long` and use another
199-
plotting package in Python called Plotly. First let’s install and then use
200-
the package.
210+
Let’s switch back to the full DataFrame in `df_long` and use another
211+
plotting package in Python called Plotly. First let’s install and then use the package.
201212

202-
```{python}
213+
```python
203214
# uncomment below to install plotly if the import fails.
204215
# !pip install plotly
205216
import plotly.express as px
206217
```
207-
Now we can visualize how circulation counts have changed over time for selected
208-
branches. This can be especially useful for identifying trends,
209-
seasonality, or anomalies. We will first create a subset of our data and
210-
only look at branches starting with the letter A. Feel free to select
211-
different branches. After subsetting, we will sort our new dataframe by
212-
date and then plot our data by date and ciculation count.
218+
219+
Now we can visualize how circulation counts have changed over time for selected branches. This can be especially useful for identifying trends, seasonality, or data anomalies. We will first create a subset of our data to look at branches starting with the letter 'A'. Feel free to select different branches. After subsetting, we will sort our new DataFrame by date and then plot our data by date and circulation count.
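The subsetting and sorting steps can be sketched on a toy DataFrame before applying them to `df_long`. This is one possible approach, assuming the branch names are strings and the date is the index. The rows and the `toy` and `selected` names are invented for illustration.

```python
import pandas as pd

# Invented rows standing in for df_long, which is indexed by date.
toy = pd.DataFrame(
    {
        'branch': ['Austin', 'Bezazian', 'Albany Park', 'Austin'],
        'circulation': [1200, 900, 8427, 1100],
    },
    index=pd.to_datetime(['2011-02-01', '2011-01-01', '2011-01-01', '2011-01-01']),
)
toy.index.name = 'date'

# Keep only the branches whose name starts with the letter 'A'.
selected = toy[toy['branch'].str.startswith('A')]

# Sort chronologically; sort_values() accepts the name of the index here.
selected = selected.sort_values(by='date')
print(selected)
```

To pick different branches, change the letter passed to `.str.startswith()`, or filter with `.isin()` and a list of branch names.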
213220

214221
``` python
215222
# Creating a line plot for a few selected branches to avoid clutter
@@ -225,18 +232,17 @@ selected_branches = selected_branches.sort_values(by='date')
225232
fig = px.line(selected_branches, x=selected_branches.index, y='circulation', color='branch', title='Circulation Over Time for Selected Branches')
226233
fig.show()
227234
```
228-
TODO: include either a static, gif, or html output of plotly
229-
* https://plotly.com/python/interactive-html-export/
230-
* https://pypi.org/project/plotly-gif/
231235

232-
Plotly provides some nice interactive features out of the box. Hover
233-
over the data and interact witht he plot controls.
236+
Here is a view of the [interactive output of the Plotly line chart](learners/line_plot_int.html).
237+
238+
239+
One advantage that Plotly provides over matplotlib is that it has some interactive features out of the box. Hover your cursor over the lines in the output to find out more granular data about specific branches over time.
234240

235-
#### Barplot to Compare Circulation Distributions Among Branches
241+
242+
### Bar plots with Plotly
236243

237244
Let’s use a barplot to compare the distribution of circulation counts
238-
among branches. We first need to group our data by branch and sum up the
239-
circulation counts. Then we can use the bar plot to show the
245+
among branches. We first need to group our data by branch and sum up the circulation counts. Then we can use the bar plot to show the
240246
distribution of total circulation over branches.
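The group-and-sum step can be tried on a small example first. The values below are made up, and the trailing `.reset_index()` is an assumption about how to turn the result back into a plain DataFrame for plotting.

```python
import pandas as pd

# Invented circulation records for two branches.
toy = pd.DataFrame({
    'branch': ['Albany Park', 'Austin', 'Albany Park', 'Austin'],
    'circulation': [100, 40, 150, 60],
})

# One row per branch with its summed circulation; reset_index() turns the
# 'branch' group labels back into an ordinary column.
totals = toy.groupby('branch')['circulation'].sum().reset_index()
print(totals)
```

The result has one `branch` column and one `circulation` column, which is the shape a bar chart needs for its x and y values.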
241247

242248
``` python
@@ -247,9 +253,9 @@ total_circulation_by_branch = df_long.groupby('branch')['circulation'].sum().res
247253
fig = px.bar(total_circulation_by_branch, x='branch', y='circulation', title='Total Circulation by Branch')
248254
fig.show()
249255
```
250-
TODO: include either a static, gif, or html output of plotly
251-
* https://plotly.com/python/interactive-html-export/
252-
* https://pypi.org/project/plotly-gif/
256+
257+
Here is a view of the [interactive output of the Plotly bar chart](learners/bar_plot_int.html).
258+
253259

254260
::: keypoints
255261
- Explored the use of pandas for basic data manipulation, ensuring correct indexing with DatetimeIndex to enable time-series operations like resampling.

lc-python-intro.Rproj

Lines changed: 0 additions & 15 deletions
This file was deleted.

learners/bar_plot_int.html

Lines changed: 14 additions & 0 deletions
Large diffs are not rendered by default.

learners/line_plot_int.html

Lines changed: 14 additions & 0 deletions
Large diffs are not rendered by default.
