
Commit 32d8d44

travisjneuman and claude committed
feat: add Module 01 — Web Scraping curriculum
Five progressive projects teaching requests, BeautifulSoup, structured data extraction, pagination, and CSV export. All projects target books.toscrape.com as a safe scraping sandbox and follow the alter/break/fix/explain learning pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a19cc7c commit 32d8d44

18 files changed

Lines changed: 1197 additions & 0 deletions


Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@

# Module 01 / Project 01 — Fetch a Webpage

[README](../../../../README.md) · [Module Index](../README.md)

## Focus

- `requests.get()` to fetch a URL
- HTTP status codes (200, 404, etc.)
- Inspecting `response.text`, `response.status_code`, and `response.headers`

## Why this project exists

Before you can scrape data from any website, you need to know how to fetch a page and understand what comes back. This project teaches you the fundamentals of making HTTP requests in Python. You will see the raw HTML that your browser normally renders for you, and you will learn to check whether a request succeeded or failed.
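The whole fetch step the paragraph describes fits in a few lines. This is a minimal sketch, not the project script itself; the helper name `fetch_and_report` and the `timeout` value are additions of mine:

```python
import requests

def fetch_and_report(url):
    """Send an HTTP GET request and report the basics of the response."""
    # timeout is optional but keeps the script from hanging forever
    # if the server never answers.
    response = requests.get(url, timeout=10)
    print(f"Status code: {response.status_code}")
    print(f"First 100 characters: {response.text[:100]}")
    return response
```

With a working connection, `fetch_and_report("http://books.toscrape.com/")` should print `Status code: 200` followed by the opening of the page's HTML.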

## Run

```bash
cd projects/modules/01-web-scraping/01-fetch-a-webpage
python project.py
```

## Expected output

```text
Fetching http://books.toscrape.com/ ...
Status code: 200
Content type: text/html
Content length: XXXXX characters

First 500 characters of the page:
--------------------------------------------------
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" ...
(HTML content continues)
--------------------------------------------------
Done.
```

The exact character count and HTML will vary, but you should see status code 200 and recognizable HTML.

## Alter it

1. Change the URL to `http://books.toscrape.com/catalogue/page-2.html` and run again. What changes? What stays the same?
2. Add a line that prints `response.headers` to see all the HTTP headers the server sent back. Pick two headers and look up what they mean.
3. Add a check: if the status code is not 200, print a warning message instead of the page content.

## Break it

1. Change the URL to `http://books.toscrape.com/this-page-does-not-exist`. What status code do you get?
2. Change the URL to `http://definitely-not-a-real-website-abc123.com`. What error do you get? (Hint: it is not a status code — it is a Python exception.)
3. Remove the `import requests` line and run the script. Read the error message carefully.

## Fix it

1. Wrap the `requests.get()` call in a try/except block that catches `requests.exceptions.RequestException`. Print a friendly error message instead of a traceback.
2. After fetching, check `response.status_code`. If it is 404, print "Page not found" and exit early. If it is anything other than 200, print the status code as a warning.
3. Put the import back if you removed it.
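One possible shape for the finished fixes, combining steps 1 and 2 above. This is a sketch, not the only correct answer; the helper name `safe_fetch` is mine:

```python
import requests

def safe_fetch(url):
    """Fetch a URL, returning the response on success or None on failure."""
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, DNS failures, timeouts, and more,
        # so the user sees a message instead of a traceback.
        print(f"Could not reach {url}: {exc}")
        return None

    if response.status_code == 404:
        print("Page not found")
        return None
    if response.status_code != 200:
        print(f"Warning: unexpected status code {response.status_code}")

    return response
```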

## Explain it

1. What is an HTTP status code and what does 200 mean?
2. What is the difference between `response.text` and `response.content`?
3. Why might `requests.get()` raise an exception instead of returning a response?
4. What does the `Content-Type` header tell you?
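A quick illustration for question 2, using plain Python types and no network: `response.content` holds the raw body as `bytes`, and `response.text` is those bytes decoded into a `str` with the encoding requests detects. The strings below are invented stand-ins for a real response body:

```python
raw = "Price: £51.77".encode("utf-8")   # like response.content: bytes
decoded = raw.decode("utf-8")           # like response.text: str

print(type(raw))      # <class 'bytes'>
print(type(decoded))  # <class 'str'>
```

Bytes matter when you download images or files; decoded text is what you hand to an HTML parser.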

## Mastery check

You can move on when you can:

- Fetch any URL and check whether it succeeded, from memory.
- Explain what a status code is without looking it up.
- Handle both HTTP errors (404) and connection errors (no internet) gracefully.
- Describe what `response.text` contains.

## Next

[Project 02 — Parse HTML](../02-parse-html/)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@

# Notes — Fetch a Webpage

## What I learned

## What confused me

## What I want to explore next
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@

"""
Project 01 — Fetch a Webpage

This script fetches a web page using the requests library and prints
basic information about the response: status code, content type,
content length, and a preview of the HTML.

Target site: http://books.toscrape.com (a safe practice site for scraping)
"""

# The requests library makes HTTP requests simple.
# You installed it with: pip install requests
import requests


def fetch_page(url):
    """
    Fetch a web page and return the response object.

    requests.get() sends an HTTP GET request to the URL — the same kind
    of request your browser sends when you type a URL in the address bar.
    The response object contains the status code, headers, and body.
    """
    print(f"Fetching {url} ...")
    response = requests.get(url)
    return response


def display_response_info(response):
    """
    Print useful information about the HTTP response.

    Every HTTP response has:
    - A status code (200 = success, 404 = not found, 500 = server error)
    - Headers (metadata like content type, server name, etc.)
    - A body (the actual HTML, JSON, or other content)
    """

    # The status code tells you whether the request succeeded.
    # 200 means "OK" — the server found the page and sent it back.
    print(f"Status code: {response.status_code}")

    # Headers are key-value pairs the server sends with the response.
    # Content-Type tells you what kind of content came back (HTML, JSON, etc.)
    content_type = response.headers.get("Content-Type", "unknown")
    print(f"Content type: {content_type}")

    # response.text is the body of the response as a string.
    # For a web page, this is the raw HTML that the browser would render.
    # len() tells us how many characters are in the response.
    print(f"Content length: {len(response.text)} characters")

    # Print the first 500 characters so you can see what HTML looks like.
    # This is the same HTML your browser receives — it just renders it
    # as a pretty page instead of showing the raw tags.
    print()
    print("First 500 characters of the page:")
    print("-" * 50)
    print(response.text[:500])
    print("-" * 50)


def main():
    # books.toscrape.com is a website built specifically for people
    # learning web scraping. It is safe to scrape and will not block you.
    url = "http://books.toscrape.com/"

    # Step 1: Fetch the page
    response = fetch_page(url)

    # Step 2: Check if the request succeeded
    # Any status code in the 200s means success.
    # The most common success code is 200 ("OK").
    if response.status_code == 200:
        display_response_info(response)
    else:
        print(f"Request failed with status code: {response.status_code}")
        print("This means the server could not return the page you asked for.")

    print("\nDone.")


# This pattern means: only run main() when this file is executed directly.
# If someone imports this file, main() will NOT run automatically.
# This is a Python convention you will see in almost every script.
if __name__ == "__main__":
    main()
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@

# Module 01 / Project 02 — Parse HTML

[README](../../../../README.md) · [Module Index](../README.md)

## Focus

- Creating a BeautifulSoup object from HTML
- `find()` and `find_all()` to locate elements
- CSS selectors with `select()`
- Extracting text and attributes from elements

## Why this project exists

Raw HTML is a mess of tags, attributes, and nesting. BeautifulSoup turns that mess into a tree structure you can search. This project teaches you to find specific elements on a page — the single most important skill in web scraping. You will extract book titles and prices from a real webpage.
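The idea in miniature, using a tiny inline HTML string instead of the live page. The snippet only mimics the structure of books.toscrape.com, and `html.parser` stands in for `lxml` so nothing extra needs installing:

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/"
         title="A Light in the Attic">A Light in...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# Attributes come from square brackets, text from .text.
title = soup.find("h3").find("a")["title"]
price = soup.find("p", class_="price_color").text
print(title, price)  # A Light in the Attic £51.77
```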

## Run

```bash
cd projects/modules/01-web-scraping/02-parse-html
python project.py
```

## Expected output

```text
Fetching http://books.toscrape.com/ ...
Parsing HTML with BeautifulSoup...

Found 20 books on the page:

1. A Light in the Attic £51.77
2. Tipping the Velvet £53.74
3. Soumission £50.10
...
20. (last book title) £XX.XX

Done. Extracted 20 books.
```

The exact titles and prices depend on the current page content, but you should see 20 books listed.

## Alter it

1. Instead of printing the price, print the star rating. Each book has a `<p>` tag with a class like `star-rating Three`. Extract and print the rating word (One, Two, Three, etc.).
2. Use `soup.select()` with a CSS selector instead of `find_all()`. For example, `soup.select("article.product_pod h3 a")` selects all title links. Try rewriting the extraction using only CSS selectors.
3. Extract and print the URL of each book's detail page (the `href` attribute on the title link).
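Item 2's CSS-selector rewrite might look like this; the inline HTML below is an invented stand-in shaped like the real page, and `html.parser` is used so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <h3><a href="catalogue/book-one_1/" title="Book One">Book One</a></h3>
</article>
<article class="product_pod">
  <h3><a href="catalogue/book-two_2/" title="Book Two">Book Two</a></h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# One selector replaces the nested find()/find_all() calls:
# every <a> inside an <h3> inside an article with class "product_pod".
for link in soup.select("article.product_pod h3 a"):
    print(link["title"], "->", link["href"])
```

This also covers item 3: the `href` attribute is read the same way as `title`.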

## Break it

1. Change the parser from `"lxml"` to `"html.parser"` (Python's built-in). Does the output change? What if the HTML were malformed — which parser would handle it better?
2. Search for a tag that does not exist: `soup.find("div", class_="nonexistent")`. What does it return? What happens if you try to call `.text` on that result?
3. Remove the `import` for BeautifulSoup and run the script. Read the error.

## Fix it

1. Before calling `.text` on a found element, add a check: `if element is not None`. Print "Not found" if the element is missing.
2. If the page fetch fails (status code is not 200), skip the parsing step entirely and print an error message.
3. Restore any imports you removed.
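Fix 1 in miniature: `find()` returns `None` when nothing matches, so check before touching `.text`. The one-line HTML here is purely illustrative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='price_color'>£10.00</p>", "html.parser")

element = soup.find("div", class_="nonexistent")
if element is not None:
    print(element.text)
else:
    # Without this guard you would get:
    # AttributeError: 'NoneType' object has no attribute 'text'
    print("Not found")
```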

## Explain it

1. What does `BeautifulSoup(html, "lxml")` do? What is the second argument for?
2. What is the difference between `find()` and `find_all()`?
3. How do you get the text content of a tag? How do you get an attribute like `href`?
4. What is a CSS selector and why might you prefer `select()` over `find_all()`?

## Mastery check

You can move on when you can:

- Parse any HTML string with BeautifulSoup without looking up the syntax.
- Find elements by tag name, class, and CSS selector.
- Extract both text content and attributes from elements.
- Handle the case where an element is not found on the page.

## Next

[Project 03 — Extract Structured Data](../03-extract-structured-data/)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@

# Notes — Parse HTML

## What I learned

## What confused me

## What I want to explore next
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@

"""
Project 02 — Parse HTML

This script fetches a page from books.toscrape.com, parses the HTML
with BeautifulSoup, and extracts the title and price of every book
on the page.

You will learn: BeautifulSoup basics, find(), find_all(), and
extracting text and attributes from HTML elements.
"""

import requests

# BeautifulSoup lives in the bs4 package.
# You installed it with: pip install beautifulsoup4
# The import name (bs4) is different from the package name (beautifulsoup4).
from bs4 import BeautifulSoup


def fetch_page(url):
    """Fetch a web page and return the response text, or None on failure."""
    print(f"Fetching {url} ...")
    response = requests.get(url)

    if response.status_code != 200:
        print(f"Failed to fetch page. Status code: {response.status_code}")
        return None

    return response.text


def parse_books(html):
    """
    Parse HTML and extract book titles and prices.

    BeautifulSoup turns raw HTML into a tree of objects you can search.
    Think of it like a map of the page — you can ask "find all the <h3> tags"
    or "find the element with class 'price_color'".

    Returns a list of tuples: [(title, price), ...]
    """

    # Create a BeautifulSoup object from the HTML string.
    # "lxml" is the parser — it reads the HTML and builds the tree.
    # Other options: "html.parser" (built-in, slower) or "html5lib" (very lenient).
    print("Parsing HTML with BeautifulSoup...")
    soup = BeautifulSoup(html, "lxml")

    books = []

    # Each book on the page is inside an <article> tag with class "product_pod".
    # find_all() returns a list of ALL matching elements.
    # find() returns only the FIRST match (or None if nothing matches).
    articles = soup.find_all("article", class_="product_pod")

    for article in articles:
        # The book title is inside an <h3> tag, inside an <a> tag.
        # The title attribute on the <a> tag has the full title text.
        # We use find() here because there is only one <h3> per article.
        title_tag = article.find("h3")
        link_tag = title_tag.find("a")

        # The "title" attribute contains the full title.
        # link_tag["title"] gets an attribute, like href or title.
        # link_tag.text would give us the visible text, which is sometimes truncated.
        title = link_tag["title"]

        # The price is inside a <p> tag with class "price_color".
        # .text gives us the text content of the element, like "£51.77".
        price_tag = article.find("p", class_="price_color")
        price = price_tag.text.strip()

        books.append((title, price))

    return books


def display_books(books):
    """Print the list of books in a formatted table."""
    print(f"\nFound {len(books)} books on the page:\n")

    for i, (title, price) in enumerate(books, start=1):
        # Format each line so titles and prices line up in columns.
        # :<45 means left-align the title in a 45-character-wide column.
        print(f" {i:>3}. {title:<45} {price}")


def main():
    url = "http://books.toscrape.com/"

    # Step 1: Fetch the raw HTML
    html = fetch_page(url)
    if html is None:
        return

    # Step 2: Parse the HTML and extract book data
    books = parse_books(html)

    # Step 3: Display the results
    if books:
        display_books(books)
    else:
        print("No books found. The page structure may have changed.")

    print(f"\nDone. Extracted {len(books)} books.")


if __name__ == "__main__":
    main()
