episodes/data-visualisation.md
Lines changed: 64 additions & 47 deletions
@@ -18,46 +18,64 @@ exercises: 10
18
18
19
19
::::::::::::::::::::::::::::::::::::::::::::::::
20
20
21
-
For this module, we will use a different version of our circulation data that is in a tidy data format, where each variable forms a column, each observation forms a row, and each type of observation unit forms a row. If your workshop included the Tidy Data episode, you should be set and have an object called `df_long` in your Jupyter environment. If not, we’ll read that dataset in now, as it was provided for this lesson.
21
+
For this module, we will use the tidy (long) version of our circulation data, where each variable forms a column, each observation forms a row, and each type of observational unit forms a table. If your workshop included the Tidy Data episode, you should already have an object called `df_long` in your Jupyter environment. If not, we’ll read in that dataset now, since it is provided with this lesson.
22
22
23
23
24
24
```python
# import pandas if it is not already imported
import pandas as pd
df_long = pd.read_pickle('data/df_long.pkl')
```
28
29
29
-
If `df_long` isn’t loaded already, we can read it in using `read_csv`. Note, if it isn’t present in your local `data/` folder.
We are using a couple of new parameters in our `read_csv` function above. Since the file we want to read in is a tab-separated values (TSV) file, we tell `read_csv` so with the `sep="\t"` parameter. `\t` is the escape sequence for the tab character in Python. Other separators or delimiters could include `|` or `;`, depending on your data file.
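As a minimal sketch of how the `sep` parameter works, the example below reads a small in-memory TSV with `read_csv`. The sample text and the `tsv_text` and `df_sample` names are made up for illustration; they are not part of the lesson's data files.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; columns are separated by "\t"
tsv_text = (
    "branch\tyear\tmonth\tcirculation\n"
    "Albany Park\t2011\tjanuary\t8427\n"
    "Altgeld\t2011\tjanuary\t1258\n"
)

# sep="\t" tells read_csv to split each line on tab characters
df_sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_sample)
```

For a pipe- or semicolon-delimited file, the same call would take `sep="|"` or `sep=";"` instead.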
36
+
| | branch | address | city | zip code | ytd | year | month | circulation |
|---|---|---|---|---|---|---|---|---|
| 0 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 120059 | 2011 | january | 8427 |
| 1 | Altgeld | 13281 S. Corliss Ave. | Chicago | 60827.0 | 9611 | 2011 | january | 1258 |
| 2 | Archer Heights | 5055 S. Archer Ave. | Chicago | 60632.0 | 101951 | 2011 | january | 8104 |
| 3 | Austin | 5615 W. Race Ave. | Chicago | 60644.0 | 25527 | 2011 | january | 1755 |
| 4 | Austin-Irving | 6100 W. Irving Park Rd. | Chicago | 60634.0 | 165634 | 2011 | january | 12593 |
36
43
37
-
Let’s look at the data:
44
+
## Convert year and month to datetime
45
+
46
+
In order to plot this data over time we need to do two things to prepare it first. First, we need to combine the year and month columns into a single [datetime](https://docs.python.org/3/library/datetime.html) column using the Pandas `to_datetime` function. Second, we assign the date column as our index for the data. These two steps will set up our data for plotting.
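A minimal sketch of these two steps on a toy DataFrame follows. The `df_demo` name and its three rows are assumptions made for illustration; in the lesson the same calls would be applied to `df_long`.

```python
import pandas as pd

# Toy stand-in for df_long (hypothetical; only the columns needed here)
df_demo = pd.DataFrame({
    'branch': ['Albany Park', 'Altgeld', 'Archer Heights'],
    'year': [2011, 2011, 2011],
    'month': ['january', 'january', 'january'],
    'circulation': [8427, 1258, 8104],
})

# Step 1: build "2011-january"-style strings and parse them into datetimes
df_demo['date'] = pd.to_datetime(
    df_demo['year'].astype(str) + '-' + df_demo['month'],
    format='%Y-%B',
)

# Step 2: use the new date column as the index
df_demo = df_demo.set_index('date')
print(df_demo.index)
```

Note that `%B` matches month names in the current locale, so this sketch assumes an English locale.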
0 Albany Park 5150 N. Kimball Ave. ... 8427 2011-01-01
46
-
1 Altgeld 13281 S. Corliss Ave. ... 1258 2011-01-01
47
-
2 Archer Heights 5055 S. Archer Ave. ... 8104 2011-01-01
48
-
3 Austin 5615 W. Race Ave. ... 1755 2011-01-01
49
-
4 Austin-Irving 6100 W. Irving Park Rd. ... 12593 2011-01-01
50
-
```
52
+
Let's unpack that code:
51
53
52
-
## Convert a column to datetime
54
+
- `df_long['date']` - First, we create a new `date` column.
- `pd.to_datetime()` - Next, we package everything into a datetime object.
- `df_long['year'].astype(str)` - We use the `.astype(str)` method to convert the year column to a string.
- `+ '-' + df_long['month'],` - We concatenate a `-` to the string as a separator, followed by the month column.
- `format='%Y-%B'` - We pass the `format` parameter to tell pandas to expect a 4-digit year (`%Y`), followed by a dash, followed by the month's full name (`%B`).
53
59
54
-
In order to plot this data over time we need to do two things to prepare it first. First, we need to tell Python that the data column is a [datetime object](https://docs.python.org/3/library/datetime.html) using the Pandas `to_dateime` function. Second, we assign the date column as our index for the data. These two steps will set up our data for plotting.
60
+
If we take a look at the date column, we'll see that datetime automatically adds a day (always `01`) in the absence of any specific day input.
55
61
56
-
```python
57
-
df_long['date']= pd.to_datetime(df_long['date'])
62
+
```python
df_long['date']
```
```output
0       2011-01-01
1       2011-01-01
2       2011-01-01
3       2011-01-01
4       2011-01-01
           ...
11551   2022-12-01
11552   2022-12-01
11553   2022-12-01
11554   2022-12-01
11555   2022-12-01
Name: date, Length: 11556, dtype: datetime64[ns]
```
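One way to confirm that every parsed date falls on the first of the month is the `.dt` accessor. This is a sketch on made-up values (the `dates` name is hypothetical), not a step from the lesson itself.

```python
import pandas as pd

# Hypothetical year-month strings with no day component
dates = pd.to_datetime(pd.Series(['2011-january', '2022-december']), format='%Y-%B')

# .dt.day extracts the day of the month from each datetime
print(dates.dt.day.tolist())  # → [1, 1]
```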
59
-
60
-
The above converts our date column to a datetime object. Let’s confirm it worked.
| 2011-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 120059 | 2011 | january | 8427 |
| 2012-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 83297 | 2012 | january | 10173 |
| 2013-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 572 | 2013 | january | 0 |
| 2014-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 50484 | 2014 | january | 35 |
| 2015-01-01 | Albany Park | NaN | NaN | NaN | 133366 | 2015 | january | 10889 |
132
148
133
149
Now we can use the `plot()` function that is built in to pandas. Let’s try it:
134
150
@@ -165,9 +181,9 @@ The drop from 2012 through part of 2014 corresponds to the reconstruction period
165
181
166
182
What if we want to alter the axis labels and the title of the graph? To do that, we first need to import `matplotlib`, an extensive plotting package in Python that lets us alter all aspects of a graph.
167
183
168
-
- We can pass parameters to matplotlib's `.plot()` function to assign a plot title, to declare a figsize - which accepts a width and height in inches - and to change the color of the line.
184
+
- We can pass parameters to Matplotlib's `.plot()` function to assign a plot title, to declare a figsize - which accepts a width and height in inches - and to change the color of the line.
169
185
- Next we'll add text labels for the x and y axes using `.xlabel()` and `.ylabel()`.
170
-
- Finally, we need a separate function `.show()` to display the plot using matplotlib.
186
+
- Finally, we need a separate function `.show()` to display the plot using Matplotlib.
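
The three steps above might look like the following sketch. The toy `circ` Series, the title, the labels, and the color are all placeholder assumptions rather than the lesson's actual values.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly circulation values indexed by date
circ = pd.Series(
    [8427, 7023, 9702],
    index=pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01']),
)

# Title, figure size (width, height in inches), and line color go to .plot()
ax = circ.plot(title='Circulation by month', figsize=(10, 5), color='blue')

# Axis labels are added with pyplot's xlabel() and ylabel()
plt.xlabel('Date')
plt.ylabel('Circulation')

# show() displays the figure (it draws nothing on screen under the Agg backend)
plt.show()
```

In a Jupyter notebook, the `matplotlib.use('Agg')` line can be dropped so that the figure renders inline.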
171
187
172
188
173
189
```python
@@ -233,10 +249,10 @@ fig = px.line(selected_branches, x=selected_branches.index, y='circulation', col
233
249
fig.show()
234
250
```
235
251
236
-
Here is a view of the [interactive output of the Plotly line chart](learners/line_plot_int.html).
252
+
Here is a view of the <a href='learners/line_plot_int.html' target='_blank'>interactive output of the Plotly line chart</a>.
237
253
238
254
239
-
One advantage that Plotly provides over matplotlib is that it has some interactive features out of the box. Hover your cursor over the lines in the output to find out more granular data about specific branches over time.
255
+
One advantage that Plotly provides over Matplotlib is that it has some interactive features out of the box. Hover your cursor over the lines in the output to find out more granular data about specific branches over time.