episodes/data-visualisation.md
42 additions & 36 deletions
@@ -1,7 +1,7 @@
1
1
---
2
2
title: 'Data Visualisation'
3
-
teaching: 10
4
-
exercises: 2
3
+
teaching: 20
4
+
exercises: 10
5
5
---
6
6
7
7
:::::::::::::::::::::::::::::::::::::: questions
@@ -49,20 +49,20 @@ df_long.head()
49
49
4 Austin-Irving 6100 W. Irving Park Rd. ... 12593 2011-01-01
50
50
```
51
51
52
+
## Convert a column to datetime
52
53
53
-
In order to plot this data over time we need to do two things to prepare it first. First, we need to tell Python that the data column is a datetime object using the Pandas `to_dateime` function. Second, we assign the date column as our index for the data. These two steps will set up our data for plotting.
54
+
To plot this data over time, we need to prepare it in two steps. First, we need to tell Python that the date column is a [datetime object](https://docs.python.org/3/library/datetime.html) using the Pandas `to_datetime` function. Second, we assign the date column as the index for our data. These two steps will set up our data for plotting.
54
55
55
56
```python
56
57
df_long['date'] = pd.to_datetime(df_long['date'])
57
58
```
58
59
59
-
The above converts our date column to a data time object. Let’s confirm it worked.
60
+
The above converts our date column to a datetime object. Let’s confirm it worked.
That worked! Now, we can make this datetime object our dataframe index.
85
+
That worked! Now, we can make the datetime column the index of our DataFrame. In the Pandas episode we looked at the default numerical index that Pandas creates, but we can also use `.set_index()` to declare a specific column as the index of our DataFrame. Using a datetime index will make it easier for us to plot the DataFrame over time. The first parameter of `.set_index()` is the column name, and the `inplace=True` parameter allows us to modify the DataFrame without assigning it to a new variable.
86
86
87
87
88
88
```python
@@ -106,6 +106,8 @@ df_long.head()
106
106
107
107
```
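
As a minimal sketch of the two preparation steps described above, the example below uses a small made-up DataFrame (the values and the variable name `example_df` are illustrative assumptions, not the lesson data). It converts a text column to datetime and then makes that column the index.

```python
import pandas as pd

# Hypothetical rows standing in for the lesson's circulation data
example_df = pd.DataFrame({
    "branch": ["Albany Park", "Albany Park", "Altgeld"],
    "date": ["2011-01-01", "2011-02-01", "2011-01-01"],
    "circulation": [120, 135, 80],
})

# Step 1: convert the text column to datetime values
example_df["date"] = pd.to_datetime(example_df["date"])

# Step 2: use the datetime column as the index, modifying the DataFrame in place
example_df.set_index("date", inplace=True)

# The index is now a DatetimeIndex rather than the default numerical index
print(type(example_df.index).__name__)
```

After `.set_index()`, `date` is no longer a regular column; it is reached through `example_df.index` instead.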
108
108
109
+
## Plotting with Pandas
110
+
109
111
Ok! We are now ready to plot our data. Since this data is monthly data, we can plot the circulation data over time.
110
112
111
113
At first, let’s focus on a specific branch. We can select the rows for the Albany Park branch:
@@ -128,22 +130,22 @@ albany.head()
128
130
2018-01-01 Albany Park NaN ... january 9381
129
131
```
130
132
131
-
Now we can use the `plot()` function built into pandas. Let’s try it:
133
+
Now we can use the `plot()` function that is built in to pandas. Let’s try it:
132
134
133
135
```python
134
136
albany.plot()
135
137
```
136
138
137
-

139
+
{alt="Line plot of zip code, ytd, year, and circulation numbers over time from the albany DataFrame"}
138
140
139
-
That’s great! By default `plot`use a line plot and will plot all numeric variables of the data frame. This isn’t exactly what we want, so let’s tell `plot` what variable to use by selecting `circulation_count`.
141
+
That's interesting, but by default `.plot()` will use a line plot for all numeric variables of the DataFrame. This isn’t exactly what we want, so let’s tell `.plot()` what variable to use by selecting the `circulation` column.
140
142
141
143
142
144
```python
143
145
albany['circulation'].plot()
144
146
```
145
147
146
-

148
+
{alt="Line plot of the Albany Park branch circulation showing a big drop from 2013 to 2014."}
@@ -159,8 +161,15 @@ The drop from 2012 through part of 2014 corresponds to the reconstruction period
159
161
::::::::::::::::::::::::::::::::::::::::::::::::
160
162
::::::::::::::::::::::::::::::::::::::::::::::::
161
163
164
+
## Use Matplotlib for more detailed charts
165
+
162
166
What if we want to alter the axis labels and the title of the graph? In order to do that, we need to first import `matplotlib`, an extensive plotting package in Python that lets us alter all aspects of a graph.
163
167
168
+
- We can pass parameters to the `.plot()` function, which uses matplotlib under the hood, to assign a plot title, to declare a figsize (a width and height in inches), and to change the color of the line.
169
+
- Next we'll add text labels for the x and y axes using `.xlabel()` and `.ylabel()`.
170
+
- Finally, we need a separate function `.show()` to display the plot using matplotlib.
171
+
172
+
164
173
```python
165
174
# the plotting package pandas is using under the hood to `plot()`
166
175
import matplotlib.pyplot as plt
@@ -171,17 +180,20 @@ plt.xlabel('Date')
171
180
plt.ylabel('Circulation Count')
172
181
plt.show()
173
182
```
174
-

175
183
176
-
What if we want to use a different plot type for this graphic? To do so, we can change the `kind` parameters in our `plot` function.
184
+
{alt="Line plot of the Albany Park branch circulation with matplotlib styles applied."}
185
+
186
+
### Changing plot types with Matplotlib
187
+
188
+
What if we want to use a different plot type for this graphic? To do so, we can change the `kind` parameter in our `.plot()` function.
177
189
178
190
```python
179
191
albany['circulation'].plot(kind='area', title='Circulation Count Area Plot at Albany Park', alpha=0.5)
180
192
plt.xlabel('Date')
181
193
plt.ylabel('Circulation Count')
182
194
plt.show()
183
195
```
184
-

196
+
{alt="Area plot of the Albany Park branch circulation."}
185
197
186
198
We can also look at our circulation data as a histogram.
{alt="histogram of the Albany branch circulation."}
195
207
196
-
## Interactive Line Plot for Circulation Over Time
208
+
## Use Plotly for interactive plots
197
209
198
-
Let’s switch back to the full dataframe in `df_long` and use another
199
-
plotting package in Python called Plotly. First let’s install and then use
200
-
the package.
210
+
Let’s switch back to the full DataFrame in `df_long` and use another
211
+
plotting package in Python called Plotly. First let’s install and then use the package.
201
212
202
-
```{python}
213
+
```python
203
214
# uncomment below to install plotly if the import fails.
204
215
# !pip install plotly
205
216
import plotly.express as px
206
217
```
207
-
Now we can visualize how circulation counts have changed over time for selected
208
-
branches. This can be especially useful for identifying trends,
209
-
seasonality, or anomalies. We will first create a subset of our data and
210
-
only look at branches starting with the letter A. Feel free to select
211
-
different branches. After subsetting, we will sort our new dataframe by
212
-
date and then plot our data by date and ciculation count.
218
+
219
+
Now we can visualize how circulation counts have changed over time for selected branches. This can be especially useful for identifying trends, seasonality, or data anomalies. We will first create a subset of our data to look at branches starting with the letter 'A'. Feel free to select different branches. After subsetting, we will sort our new DataFrame by date and then plot our data by date and circulation count.
213
220
214
221
```python
215
222
# Creating a line plot for a few selected branches to avoid clutter
Plotly provides some nice interactive features out of the box. Hover
233
-
over the data and interact witht he plot controls.
236
+
Here is a view of the [interactive output of the Plotly line chart](learners/line_plot_int.html).
237
+
238
+
239
+
One advantage that Plotly provides over matplotlib is that it has some interactive features out of the box. Hover your cursor over the lines in the output to find out more granular data about specific branches over time.
234
240
235
-
#### Barplot to Compare Circulation Distributions Among Branches
241
+
242
+
### Bar plots with Plotly
236
243
237
244
Let’s use a barplot to compare the distribution of circulation counts
238
-
among branches. We first need to group our data by branch and sum up the
239
-
circulation counts. Then we can use the bar plot to show the
245
+
among branches. We first need to group our data by branch and sum up the circulation counts. Then we can use the bar plot to show the
Here is a view of the [interactive output of the Plotly bar chart](learners/bar_plot_int.html).
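
The grouping step described above (group by branch, then sum the circulation counts) can be sketched with pandas alone. The rows and the names `example_rows` and `branch_totals` are made up for illustration; the lesson's own code may differ.

```python
import pandas as pd

# Hypothetical long-format rows: one circulation count per branch per month
example_rows = pd.DataFrame({
    "branch": ["Albany Park", "Albany Park", "Altgeld", "Altgeld", "Austin"],
    "circulation": [120, 135, 80, 70, 200],
})

# Group by branch and sum the counts; reset_index() turns the result
# back into a DataFrame with 'branch' and 'circulation' columns
branch_totals = example_rows.groupby("branch")["circulation"].sum().reset_index()

print(branch_totals)
```

A table in this shape, with one row per branch, is the kind of input a bar plot typically expects: one column for the categories and one for the bar heights.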
258
+
253
259
254
260
::: keypoints
255
261
- Explored the use of pandas for basic data manipulation, ensuring correct indexing with DatetimeIndex to enable time-series operations like resampling.