Commit 94f8f2f

chennesy authored and jt14den committed
update dataset
1 parent dfe9315 commit 94f8f2f

20 files changed

Lines changed: 1463 additions & 13055 deletions

episodes/data-visualisation.md

Lines changed: 64 additions & 47 deletions
Original file line number | Diff line number | Diff line change
@@ -18,46 +18,64 @@ exercises: 10
1818

1919
::::::::::::::::::::::::::::::::::::::::::::::::
2020

21-
For this module, we will use a different version of our circulation data that is in a tidy data format, where each variable forms a column, each observation forms a row, and each type of observation unit forms a row. If your workshop included the Tidy Data episode, you should be set and have an object called `df_long` in your Jupyter environment. If not, we’ll read that dataset in now, as it was provided for this lesson.
21+
For this module, we will use the tidy (long) version of our circulation data, where each variable forms a column, each observation forms a row, and each type of observational unit forms a table. If your workshop included the Tidy Data episode, you should already have an object called `df_long` in your Jupyter environment. If not, we’ll read that dataset in now, as it was provided for this lesson.
2222

2323

2424
``` python
2525
# import pandas if it is not already imported
2626
import pandas as pd
27+
df_long = pd.read_pickle('data/df_long.pkl')
2728
```
2829

29-
If `df_long` isn’t loaded already, we can read it in using `read_csv`. Note, if it isn’t present in your local `data/` folder.
30+
Let’s look at the data:
3031

3132
``` python
32-
df_long = pd.read_csv('data/circ_long.tsv', sep="\t")
33+
df_long.head()
3334
```
3435

35-
We are using a couple of new parameters in our `read_csv` function above. Since the file we want to read in is a tab-separated values (TSV) file, we tell our `read_csv` function that using the `sep="\t"` parameter. `\t` is encoding for tab character in Python. Other separators or delimeters could include `|` or `;` depending on your data file.
36+
| | branch | address | city | zip code | ytd | year | month | circulation |
37+
|-----|----------------|-------------------------|---------|----------|--------|------|---------|-------------|
38+
| 0 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 120059 | 2011 | january | 8427 |
39+
| 1 | Altgeld | 13281 S. Corliss Ave. | Chicago | 60827.0 | 9611 | 2011 | january | 1258 |
40+
| 2 | Archer Heights | 5055 S. Archer Ave. | Chicago | 60632.0 | 101951 | 2011 | january | 8104 |
41+
| 3 | Austin | 5615 W. Race Ave. | Chicago | 60644.0 | 25527 | 2011 | january | 1755 |
42+
| 4 | Austin-Irving | 6100 W. Irving Park Rd. | Chicago | 60634.0 | 165634 | 2011 | january | 12593 |
3643

37-
Let’s look at the data:
44+
## Convert year and month to datetime
45+
46+
In order to plot this data over time we need to do two things to prepare it first. First, we need to combine the year and month columns into a single [datetime](https://docs.python.org/3/library/datetime.html) column using the Pandas `to_datetime` function. Second, we assign the date column as our index for the data. These two steps will set up our data for plotting.
3847

3948
``` python
40-
df_long.head()
49+
df_long['date'] = pd.to_datetime(df_long['year'].astype(str) + '-' + df_long['month'], format='%Y-%B')
4150
```
4251

43-
``` output
44-
branch address ... circulation date
45-
0 Albany Park 5150 N. Kimball Ave. ... 8427 2011-01-01
46-
1 Altgeld 13281 S. Corliss Ave. ... 1258 2011-01-01
47-
2 Archer Heights 5055 S. Archer Ave. ... 8104 2011-01-01
48-
3 Austin 5615 W. Race Ave. ... 1755 2011-01-01
49-
4 Austin-Irving 6100 W. Irving Park Rd. ... 12593 2011-01-01
50-
```
52+
Let's unpack that code:
5153

52-
## Convert a column to datetime
54+
- `df_long['date']` - First, we create a new `date` column.
55+
- `pd.to_datetime()` - Next, we pass the combined strings to `pd.to_datetime()`, which converts them into datetime values.
56+
- `df_long['year'].astype(str)` - We use the `.astype(str)` method to convert the year column to a string.
57+
- `+ '-' + df_long['month'],` - We concatenate a `-` to the string as a separator, followed by the month column.
58+
- `format='%Y-%B'` - We pass the `format` parameter to tell pandas to expect a 4-digit year (`%Y`), followed by a dash, followed by the month's full name (`%B`).
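
The steps above can be tried on a small scale. The following is a minimal sketch, assuming a made-up three-row table (`toy`) in place of `df_long`; the values are illustrative only and are not taken from the circulation dataset.

```python
import pandas as pd

# Toy data standing in for the year and month columns of df_long (values are made up)
toy = pd.DataFrame({
    'year': [2011, 2011, 2012],
    'month': ['january', 'february', 'december'],
})

# Build strings such as '2011-january', then parse them with an explicit format
date_strings = toy['year'].astype(str) + '-' + toy['month']
toy['date'] = pd.to_datetime(date_strings, format='%Y-%B')

# Every parsed date falls on the first day of its month
print(toy['date'])
```

Because the strings carry no day, each parsed value lands on the first of the month, which matches what the lesson shows for the full dataset.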
5359

54-
In order to plot this data over time we need to do two things to prepare it first. First, we need to tell Python that the data column is a [datetime object](https://docs.python.org/3/library/datetime.html) using the Pandas `to_dateime` function. Second, we assign the date column as our index for the data. These two steps will set up our data for plotting.
60+
If we take a look at the date column, we'll see that pandas automatically fills in a day (always `01`) because our input did not specify one.
5561

56-
``` python
57-
df_long['date']= pd.to_datetime(df_long['date'])
62+
```python
63+
df_long['date']
64+
```
65+
```output
66+
0 2011-01-01
67+
1 2011-01-01
68+
2 2011-01-01
69+
3 2011-01-01
70+
4 2011-01-01
71+
...
72+
11551 2022-12-01
73+
11552 2022-12-01
74+
11553 2022-12-01
75+
11554 2022-12-01
76+
11555 2022-12-01
77+
Name: date, Length: 11556, dtype: datetime64[ns]
5878
```
59-
60-
The above converts our date column to a datetime object. Let’s confirm it worked.
6179

6280
``` python
6381
df_long.info()
@@ -67,18 +85,18 @@ df_long.info()
6785
<class 'pandas.core.frame.DataFrame'>
6886
RangeIndex: 11556 entries, 0 to 11555
6987
Data columns (total 9 columns):
70-
# Column Non-Null Count Dtype
71-
--- ------ -------------- -----
88+
# Column Non-Null Count Dtype
89+
--- ------ -------------- -----
7290
0 branch 11556 non-null object
7391
1 address 7716 non-null object
7492
2 city 7716 non-null object
7593
3 zip code 7716 non-null float64
7694
4 ytd 11556 non-null int64
77-
5 year 11556 non-null int64
95+
5 year 11556 non-null object
7896
6 month 11556 non-null object
7997
7 circulation 11556 non-null int64
8098
8 date 11556 non-null datetime64[ns]
81-
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
99+
dtypes: datetime64[ns](1), float64(1), int64(2), object(5)
82100
memory usage: 812.7+ KB
83101
```
84102

@@ -95,16 +113,15 @@ If we look at the data again, we will see our index will be set to date.
95113
df_long.head()
96114
```
97115

98-
``` output
99-
branch address ... month circulation
100-
date ...
101-
2011-01-01 Albany Park 5150 N. Kimball Ave. ... january 8427
102-
2011-01-01 Altgeld 13281 S. Corliss Ave. ... january 1258
103-
2011-01-01 Archer Heights 5055 S. Archer Ave. ... january 8104
104-
2011-01-01 Austin 5615 W. Race Ave. ... january 1755
105-
2011-01-01 Austin-Irving 6100 W. Irving Park Rd. ... january 12593
116+
| | branch | address | city | zip code | ytd | year | month | circulation |
117+
|------------|----------------|-------------------------|---------|----------|--------|------|---------|-------------|
118+
| date | | | | | | | | |
119+
| 2011-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 120059 | 2011 | january | 8427 |
120+
| 2011-01-01 | Altgeld | 13281 S. Corliss Ave. | Chicago | 60827.0 | 9611 | 2011 | january | 1258 |
121+
| 2011-01-01 | Archer Heights | 5055 S. Archer Ave. | Chicago | 60632.0 | 101951 | 2011 | january | 8104 |
122+
| 2011-01-01 | Austin | 5615 W. Race Ave. | Chicago | 60644.0 | 25527 | 2011 | january | 1755 |
123+
| 2011-01-01 | Austin-Irving | 6100 W. Irving Park Rd. | Chicago | 60634.0 | 165634 | 2011 | january | 12593 |
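
The exact call the lesson uses to set the index is not visible in this part of the diff. As a rough sketch of the idea, assuming a made-up table (`toy`) with hypothetical branch names, `set_index` moves a column into the index, after which rows can be selected by a partial date string:

```python
import pandas as pd

# Made-up rows; 'Branch A' and 'Branch B' are hypothetical names
toy = pd.DataFrame({
    'branch': ['Branch A', 'Branch B', 'Branch A'],
    'date': pd.to_datetime(['2011-01-01', '2011-01-01', '2011-02-01']),
    'circulation': [100, 50, 120],
})

# set_index returns a new DataFrame, so assign the result back
toy = toy.set_index('date')

# With a DatetimeIndex, .loc accepts a partial date string such as a year-month
january = toy.loc['2011-01']
print(january)
```

Once `date` is the index it no longer appears among the regular columns, which is why the table above shows it on its own header row.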
106124

107-
```
108125

109126
## Plotting with Pandas
110127

@@ -120,15 +137,14 @@ albany = df_long[df_long['branch'] == 'Albany Park']
120137
albany.head()
121138
```
122139

123-
``` output
124-
branch address ... month circulation
125-
date ...
126-
2011-01-01 Albany Park 5150 N. Kimball Ave. ... january 8427
127-
2016-01-01 Albany Park NaN ... january 10905
128-
2017-01-01 Albany Park NaN ... january 11031
129-
2022-01-01 Albany Park 3401 W. Foster Ave. ... january 5561
130-
2018-01-01 Albany Park NaN ... january 9381
131-
```
140+
| | branch | address | city | zip code | ytd | year | month | circulation |
141+
|------------|-------------|----------------------|---------|----------|--------|------|---------|-------------|
142+
| date | | | | | | | | |
143+
| 2011-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 120059 | 2011 | january | 8427 |
144+
| 2012-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 83297 | 2012 | january | 10173 |
145+
| 2013-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 572 | 2013 | january | 0 |
146+
| 2014-01-01 | Albany Park | 5150 N. Kimball Ave. | Chicago | 60625.0 | 50484 | 2014 | january | 35 |
147+
| 2015-01-01 | Albany Park | NaN | NaN | NaN | 133366 | 2015 | january | 10889 |
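
The `albany` table above comes from a boolean filter on the `branch` column. Here is a small sketch of how that kind of filter works, using a made-up table (`toy`) and the hypothetical names `is_albany` and `albany_toy`:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'branch': ['Albany Park', 'Altgeld', 'Albany Park'],
    'circulation': [8427, 1258, 10173],
})

# The comparison produces a boolean Series, one True/False value per row
is_albany = toy['branch'] == 'Albany Park'

# Indexing with that Series keeps only the rows marked True
albany_toy = toy[is_albany]
print(albany_toy)
```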
132148

133149
Now we can use the `plot()` function that is built in to pandas. Let’s try it:
134150

@@ -165,9 +181,9 @@ The drop from 2012 through part of 2014 corresponds to the reconstruction period
165181

166182
What if we want to alter the axis labels and the title of the graph? To do that, we first need to import `matplotlib`, an extensive plotting package in Python that lets us alter all aspects of a graph.
167183

168-
- We can pass parameters to matplotlib's `.plot()` function to assign a plot title, to declare a figsize - which accepts a width and height in inches - and to change the color of the line.
184+
- We can pass parameters to Matplotlib's `.plot()` function to assign a plot title, to declare a figsize - which accepts a width and height in inches - and to change the color of the line.
169185
- Next we'll add text labels for the x and y axes using `.xlabel()` and `.ylabel()`.
170-
- Finally, we need a separate function `.show()` to display the plot using matplotlib.
186+
- Finally, we need a separate function `.show()` to display the plot using Matplotlib.
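
The three steps in this list can be sketched on toy data. This is an illustration only, assuming made-up monthly values and arbitrary title, color, and label text; it is not the lesson's own plotting code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly values standing in for one branch's circulation
toy = pd.DataFrame(
    {'circulation': [8427, 7023, 9667, 10173]},
    index=pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01']),
)

# pandas forwards title, figsize and color to Matplotlib and returns the Axes it drew on
ax = toy['circulation'].plot(title='Toy branch circulation', figsize=(10, 5), color='purple')

# Label the axes of the current plot, then display it
plt.xlabel('Date')
plt.ylabel('Circulation count')
plt.show()
```

Keeping the returned `ax` is optional here, but it gives a handle for inspecting or adjusting the plot later.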
171187

172188

173189
``` python
@@ -233,10 +249,10 @@ fig = px.line(selected_branches, x=selected_branches.index, y='circulation', col
233249
fig.show()
234250
```
235251

236-
Here is a view of the [interactive output of the Plotly line chart](learners/line_plot_int.html).
252+
Here is a view of the <a href='learners/line_plot_int.html' target='_blank'>interactive output of the Plotly line chart</a>.
237253

238254

239-
One advantage that Plotly provides over matplotlib is that it has some interactive features out of the box. Hover your cursor over the lines in the output to find out more granular data about specific branches over time.
255+
One advantage that Plotly provides over Matplotlib is that it has some interactive features out of the box. Hover your cursor over the lines in the output to find out more granular data about specific branches over time.
240256

241257

242258
### Bar plots with Plotly
@@ -254,7 +270,8 @@ fig = px.bar(total_circulation_by_branch, x='branch', y='circulation', title='To
254270
fig.show()
255271
```
256272

257-
Here is a view of the [interactive output of the Plotly bar chart](learners/bar_plot_int.html).
273+
Here is a view of the <a href='learners/bar_plot_int.html' target='_blank'>interactive output of the Plotly bar chart</a>.
274+
258275

259276

260277
::: keypoints
