| 1 | +--- |
| 2 | +title: 'Data Visualisation' |
| 3 | +teaching: 20 |
| 4 | +exercises: 10 |
| 5 | +--- |
| 6 | + |
| 7 | +:::::::::::::::::::::::::::::::::::::: questions |
| 8 | + |
| 9 | +- How can I use Python tools like Pandas and Plotly to visualize library circulation data? |
| 10 | + |
| 11 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 12 | + |
| 13 | +::::::::::::::::::::::::::::::::::::: objectives |
| 14 | + |
| 15 | +- Generate plots using Python to interpret and present data on library circulation. |
| 16 | +- Apply data manipulation techniques with pandas to prepare and transform library circulation data into a suitable format for visualization. |
| 17 | +- Analyze and interpret time-series data by identifying key trends and outliers in library circulation data. |
| 18 | + |
| 19 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 20 | + |
| 21 | +For this module, we will use a different version of our circulation data that is in a tidy data format, where each variable forms a column, each observation forms a row, and each type of observational unit forms a table. If your workshop included the Tidy Data episode, you should be set and have an object called `df_long` in your Jupyter environment. If not, we’ll read that dataset in now, as it was provided for this lesson. |
| 22 | + |
| 23 | + |
| 24 | +``` python |
| 25 | +# import pandas if it is not already imported |
| 26 | +import pandas as pd |
| 27 | +``` |
| 28 | + |
| 29 | +If `df_long` isn’t loaded already, we can read it in using `read_csv`. Note that this will only work if the file `circ_long.tsv` is present in your local `data/` folder. |
| 30 | + |
| 31 | +``` python |
| 32 | +df_long = pd.read_csv('data/circ_long.tsv', sep="\t") |
| 33 | +``` |
| 34 | + |
| 35 | +We are using a new parameter in our `read_csv` function above. Since the file we want to read in is a tab-separated values (TSV) file, we tell `read_csv` that using the `sep="\t"` parameter. `\t` is the escape sequence for the tab character in Python. Other separators or delimiters could include `|` or `;` depending on your data file. |
| 36 | + |
| 37 | +Let’s look at the data: |
| 38 | + |
| 39 | +``` python |
| 40 | +df_long.head() |
| 41 | +``` |
| 42 | + |
| 43 | +``` output |
| 44 | + branch address ... circulation date |
| 45 | +0 Albany Park 5150 N. Kimball Ave. ... 8427 2011-01-01 |
| 46 | +1 Altgeld 13281 S. Corliss Ave. ... 1258 2011-01-01 |
| 47 | +2 Archer Heights 5055 S. Archer Ave. ... 8104 2011-01-01 |
| 48 | +3 Austin 5615 W. Race Ave. ... 1755 2011-01-01 |
| 49 | +4 Austin-Irving 6100 W. Irving Park Rd. ... 12593 2011-01-01 |
| 50 | +``` |
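
As an aside on the `sep` parameter used above, here is a minimal sketch of how the same table can be read from text that uses different delimiters. The two small strings are hypothetical stand-ins invented for illustration, not the lesson's data file, and `io.StringIO` is used only so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical miniature data: the same two rows written with two different delimiters
tsv_text = "branch\tcirculation\nAlbany Park\t8427\nAltgeld\t1258\n"
semicolon_text = "branch;circulation\nAlbany Park;8427\nAltgeld;1258\n"

# sep tells read_csv which character separates the columns
tab_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
semi_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# Both calls should produce the same DataFrame
print(tab_df)
print(tab_df.equals(semi_df))
```

If the wrong separator is given, pandas will typically read each line as a single column, which is a quick sign that `sep` needs adjusting.
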
| 51 | + |
| 52 | +## Convert a column to datetime |
| 53 | + |
| 53 | +In order to plot this data over time, we need to do two things to prepare it. First, we need to tell Python that the `date` column is a [datetime object](https://docs.python.org/3/library/datetime.html) using the Pandas `to_datetime` function. Second, we assign the `date` column as the index of our data. These two steps will set up our data for plotting. |
| 55 | + |
| 56 | +``` python |
| 57 | +df_long['date'] = pd.to_datetime(df_long['date']) |
| 58 | +``` |
| 59 | + |
| 60 | +The above converts our date column to a datetime object. Let’s confirm it worked. |
| 61 | + |
| 62 | +``` python |
| 63 | +df_long.info() |
| 64 | +``` |
| 65 | + |
| 66 | +``` output |
| 67 | +<class 'pandas.core.frame.DataFrame'> |
| 68 | +RangeIndex: 11556 entries, 0 to 11555 |
| 69 | +Data columns (total 9 columns): |
| 70 | +# Column Non-Null Count Dtype |
| 71 | + --- ------ -------------- ----- |
| 72 | + 0 branch 11556 non-null object |
| 73 | + 1 address 7716 non-null object |
| 74 | + 2 city 7716 non-null object |
| 75 | + 3 zip code 7716 non-null float64 |
| 76 | + 4 ytd 11556 non-null int64 |
| 77 | + 5 year 11556 non-null int64 |
| 78 | + 6 month 11556 non-null object |
| 79 | + 7 circulation 11556 non-null int64 |
| 80 | + 8 date 11556 non-null datetime64[ns] |
| 81 | +dtypes: datetime64[ns](1), float64(1), int64(3), object(4) |
| 82 | +memory usage: 812.7+ KB |
| 83 | +``` |
| 84 | + |
| 85 | +That worked! Now, we can make the datetime column the index of our DataFrame. In the Pandas episode we looked at Pandas' default numerical index, but we can also use `.set_index()` to declare a specific column as the index of our DataFrame. Using a datetime index will make it easier for us to plot the DataFrame over time. The first parameter of `.set_index()` is the column name, and the `inplace=True` parameter modifies the existing DataFrame directly, so we do not have to assign the result to a new variable. |
| 86 | + |
| 87 | + |
| 88 | +``` python |
| 89 | +df_long.set_index('date', inplace=True) |
| 90 | +``` |
| 91 | + |
| 92 | +If we look at the data again, we will see that the index is now set to `date`. |
| 93 | + |
| 94 | +``` python |
| 95 | +df_long.head() |
| 96 | +``` |
| 97 | + |
| 98 | +``` output |
| 99 | + branch address ... month circulation |
| 100 | + date ... |
| 101 | + 2011-01-01 Albany Park 5150 N. Kimball Ave. ... january 8427 |
| 102 | + 2011-01-01 Altgeld 13281 S. Corliss Ave. ... january 1258 |
| 103 | + 2011-01-01 Archer Heights 5055 S. Archer Ave. ... january 8104 |
| 104 | + 2011-01-01 Austin 5615 W. Race Ave. ... january 1755 |
| 105 | + 2011-01-01 Austin-Irving 6100 W. Irving Park Rd. ... january 12593 |
| 106 | + |
| 107 | +``` |
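
The two preparation steps can also be tried on a tiny example. The sketch below uses a hypothetical three-row DataFrame called `toy` (the values are invented for illustration) and assigns the result of `.set_index()` back to the variable, which is an alternative to `inplace=True`. It then uses the datetime index to select all rows from a single year.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
toy = pd.DataFrame({
    "date": ["2011-01-01", "2011-02-01", "2012-01-01"],
    "circulation": [8427, 7023, 9100],
})

# Step 1: convert the text column to datetime values
toy["date"] = pd.to_datetime(toy["date"])

# Step 2: make the datetime column the index (assigning back instead of inplace=True)
toy = toy.set_index("date")

# With a datetime index, a year string selects every row from that year
rows_2011 = toy.loc["2011"]

print(type(toy.index).__name__)
print(rows_2011)
```

Selecting by a partial date string like this is one of the conveniences a datetime index provides, in addition to making time-based plots easier.
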
| 108 | + |
| 109 | +## Plotting with Pandas |
| 110 | + |
| 111 | +Ok! We are now ready to plot our data. Since this data is monthly data, we can plot the circulation data over time. |
| 112 | + |
| 113 | +First, let’s focus on a specific branch. We can select the rows for the Albany Park branch: |
| 114 | + |
| 115 | +``` python |
| 116 | +albany = df_long[df_long['branch'] == 'Albany Park'] |
| 117 | +``` |
| 118 | + |
| 119 | +``` python |
| 120 | +albany.head() |
| 121 | +``` |
| 122 | + |
| 123 | +``` output |
| 124 | + branch address ... month circulation |
| 125 | + date ... |
| 126 | + 2011-01-01 Albany Park 5150 N. Kimball Ave. ... january 8427 |
| 127 | + 2016-01-01 Albany Park NaN ... january 10905 |
| 128 | + 2017-01-01 Albany Park NaN ... january 11031 |
| 129 | + 2022-01-01 Albany Park 3401 W. Foster Ave. ... january 5561 |
| 130 | + 2018-01-01 Albany Park NaN ... january 9381 |
| 131 | +``` |
| 132 | + |
| 133 | +Now we can use the `plot()` function that is built in to pandas. Let’s try it: |
| 134 | + |
| 135 | +``` python |
| 136 | +albany.plot() |
| 137 | +``` |
| 138 | + |
| 139 | +{alt="Line plot of zip code, ytd, year, and circulation numbers over time from the albany DataFrame"} |
| 140 | + |
| 141 | +That's interesting, but by default `.plot()` will use a line plot for all numeric variables of the DataFrame. This isn’t exactly what we want, so let’s tell `.plot()` what variable to use by selecting the `circulation` column. |
| 142 | + |
| 143 | + |
| 144 | +``` python |
| 145 | +albany['circulation'].plot() |
| 146 | +``` |
| 147 | + |
| 148 | +{alt="Line plot of the Albany Park branch circulation showing a big drop from 2013 to 2014."} |
| 149 | + |
| 150 | +::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: challenge |
| 151 | + |
| 152 | +## Analyze the Circulation Trends |
| 153 | + |
| 154 | +Examine the line graph depicting library circulation data. You will notice two significant periods where circulation drops to zero: a roughly two-year period starting in 2012, and a second drop in March 2020. Evaluate the graph and identify any trends, unusual patterns, or notable changes in the data. |
| 155 | + |
| 156 | +::::::::::::::::::::::::::::::::::::::::::::::::::: solution |
| 157 | + |
| 158 | +The significant drop in circulation in March 2020 is likely due to the COVID-19 pandemic, which caused widespread temporary closures of public spaces, including libraries. |
| 159 | + |
| 160 | +The drop from 2012 through part of 2014 corresponds to the reconstruction period of the Albany Park Branch. The original building at 5150 N. Kimball Avenue was demolished in 2012, and a new, modern building was constructed at the same site. The new Albany Park Branch opened on September 13, 2014, at 3401 W. Foster Avenue in the North Park neighborhood of Chicago. More details about this renovation can be found on the Chicago Public Library webpage: [Chicago Public Library - Albany Park](https://www.chipublib.org/news/stories-we-tell-albany-park-exhibit/). |
| 161 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 162 | +:::::::::::::::::::::::::::::::::::::::::::::::: |
| 163 | + |
| 164 | +## Use Matplotlib for more detailed charts |
| 165 | + |
| 166 | +What if we want to alter the axis labels and the title of the graph? In order to do that, we first need to import `matplotlib`, an extensive plotting package in Python that lets us alter all aspects of a graph. |
| 167 | + |
| 168 | +- We can pass parameters to pandas' `.plot()` method, which uses matplotlib under the hood, to assign a plot title, to declare a `figsize` (a width and height in inches), and to change the color of the line. |
| 169 | +- Next we'll add text labels for the x and y axes using `plt.xlabel()` and `plt.ylabel()`. |
| 170 | +- Finally, we call a separate function, `plt.show()`, to display the plot. |
| 171 | + |
| 172 | + |
| 173 | +``` python |
| 174 | +# matplotlib is the plotting package pandas uses under the hood for `plot()` |
| 175 | +import matplotlib.pyplot as plt |
| 176 | + |
| 177 | +albany['circulation'].plot(title='Circulation Count Over Time', figsize=(10, 5), color='blue') |
| 178 | +# Adding labels and showing the plot |
| 179 | +plt.xlabel('Date') |
| 180 | +plt.ylabel('Circulation Count') |
| 181 | +plt.show() |
| 182 | +``` |
| 183 | + |
| 184 | +{alt="Line plot of the Albany Park branch circulation with matplotlib styles applied."} |
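
As a possible alternative to the `plt.xlabel()` and `plt.ylabel()` calls above, the labels can be set on the Axes object that pandas' `.plot()` returns. The sketch below assumes a small hypothetical Series named `toy` with invented values; it is not the `albany` data from the lesson.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly counts, invented for illustration only
toy = pd.Series(
    [8427, 7023, 9100],
    index=pd.to_datetime(["2011-01-01", "2011-02-01", "2011-03-01"]),
    name="circulation",
)

# pandas' .plot() returns the matplotlib Axes it drew on
ax = toy.plot(title="Circulation Count Over Time", figsize=(10, 5), color="blue")

# Set the labels on that Axes object instead of through plt
ax.set_xlabel("Date")
ax.set_ylabel("Circulation Count")
plt.show()
```

Keeping a reference to the Axes can be handy when a notebook contains several plots, because each label call is tied explicitly to one plot.
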
| 185 | + |
| 186 | +### Changing plot types with Matplotlib |
| 187 | + |
| 188 | +What if we want to use a different plot type for this graphic? To do so, we can change the `kind` parameter in our `.plot()` function. |
| 189 | + |
| 190 | +``` python |
| 191 | +albany['circulation'].plot(kind='area', title='Circulation Count Area Plot at Albany Park', alpha=0.5) |
| 192 | +plt.xlabel('Date') |
| 193 | +plt.ylabel('Circulation Count') |
| 194 | +plt.show() |
| 195 | +``` |
| 196 | +{alt="Area plot of the Albany Park branch circulation."} |
| 197 | + |
| 198 | +We can also look at our circulation data as a histogram. |
| 199 | + |
| 200 | +``` python |
| 201 | +albany['circulation'].plot(kind='hist', bins=20, title='Distribution of Circulation Counts at Albany Park') |
| 202 | +plt.xlabel('Circulation Count') |
| 203 | +plt.show() |
| 204 | +``` |
| 205 | + |
| 206 | +{alt="Histogram of the Albany Park branch circulation."} |
| 207 | + |
| 208 | +## Use Plotly for interactive plots |
| 209 | + |
| 210 | +Let’s switch back to the full DataFrame in `df_long` and use another |
| 211 | +plotting package in Python called Plotly. First, let’s import the package, installing it beforehand if it is not already available. |
| 212 | + |
| 213 | +```python |
| 214 | +# uncomment below to install plotly if the import fails. |
| 215 | +# !pip install plotly |
| 216 | +import plotly.express as px |
| 217 | +``` |
| 218 | + |
| 219 | +Now we can visualize how circulation counts have changed over time for selected branches. This can be especially useful for identifying trends, seasonality, or data anomalies. We will first create a subset of our data to look at branches starting with the letter 'A'. Feel free to select different branches. After subsetting, we will sort our new DataFrame by date and then plot our data by date and circulation count. |
| 220 | + |
| 221 | +``` python |
| 222 | +# Creating a line plot for a few selected branches to avoid clutter |
| 223 | +selected_branches = df_long[df_long['branch'].isin(['Altgeld', |
| 224 | + 'Archer Heights', |
| 225 | + 'Austin', |
| 226 | + 'Austin-Irving', |
| 227 | + 'Avalon'])] |
| 228 | +selected_branches = selected_branches.sort_values(by='date') |
| 229 | +``` |
| 230 | + |
| 231 | +``` python |
| 232 | +fig = px.line(selected_branches, x=selected_branches.index, y='circulation', color='branch', title='Circulation Over Time for Selected Branches') |
| 233 | +fig.show() |
| 234 | +``` |
| 235 | + |
| 236 | +Here is a view of the [interactive output of the Plotly line chart](learners/line_plot_int.html). |
| 237 | + |
| 238 | + |
| 239 | +One advantage that Plotly provides over matplotlib is that it has some interactive features out of the box. Hover your cursor over the lines in the output to find out more granular data about specific branches over time. |
| 240 | + |
| 241 | + |
| 242 | +### Bar plots with Plotly |
| 243 | + |
| 244 | +Let’s use a bar plot to compare the distribution of circulation counts |
| 245 | +among branches. We first need to group our data by branch and sum up the circulation counts. Then we can use the bar plot to show the |
| 246 | +distribution of total circulation over branches. |
| 247 | + |
| 248 | +``` python |
| 249 | +# Aggregate circulation by branch |
| 250 | +total_circulation_by_branch = df_long.groupby('branch')['circulation'].sum().reset_index() |
| 251 | + |
| 252 | +# Create a bar plot |
| 253 | +fig = px.bar(total_circulation_by_branch, x='branch', y='circulation', title='Total Circulation by Branch') |
| 254 | +fig.show() |
| 255 | +``` |
| 256 | + |
| 257 | +Here is a view of the [interactive output of the Plotly bar chart](learners/bar_plot_int.html). |
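
The group-and-sum step can be checked on a small example before plotting. The sketch below uses a hypothetical five-row DataFrame named `toy` with invented values, and adds one optional step that is not part of the lesson's code: sorting the totals so the bars would appear from largest to smallest.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
toy = pd.DataFrame({
    "branch": ["Altgeld", "Austin", "Altgeld", "Austin", "Avalon"],
    "circulation": [1258, 1755, 1300, 1800, 500],
})

# Group by branch and add up the circulation counts for each one
totals = toy.groupby("branch")["circulation"].sum().reset_index()

# Optional extra step: order the branches from highest to lowest total
totals = totals.sort_values(by="circulation", ascending=False)

print(totals)
```

A DataFrame prepared this way could be passed to `px.bar()` in the same manner as `total_circulation_by_branch` above.
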
| 258 | + |
| 259 | + |
| 260 | +::: keypoints |
| 261 | +- Explored the use of pandas for basic data manipulation, converting the date column with `to_datetime` and setting it as a DatetimeIndex to enable plotting over time. |
| 262 | +- Used pandas’ built-in `plot()` for initial visualizations, then selected a single column and used matplotlib to customize titles, axis labels, and plot types such as area plots and histograms. |
| 263 | +- Introduced Plotly for interactive visualizations, such as line and bar plots, with hover details available out of the box. |
| 264 | +::: |