
Commit 32d8d44

travisjneuman and claude committed
feat: add Module 01 — Web Scraping curriculum
Five progressive projects teaching requests, BeautifulSoup, structured data extraction, pagination, and CSV export. All projects target books.toscrape.com as a safe scraping sandbox and follow the alter/break/fix/explain learning pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a19cc7c commit 32d8d44

18 files changed

Lines changed: 1197 additions & 0 deletions


Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@

# Module 01 / Project 01 — Fetch a Webpage

[README](../../../../README.md) · [Module Index](../README.md)

## Focus

- `requests.get()` to fetch a URL
- HTTP status codes (200, 404, etc.)
- Inspecting `response.text`, `response.status_code`, and `response.headers`

## Why this project exists

Before you can scrape data from any website, you need to know how to fetch a page and understand what comes back. This project teaches you the fundamentals of making HTTP requests in Python. You will see the raw HTML that your browser normally renders for you, and you will learn to check whether a request succeeded or failed.
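The whole fetch step the paragraph describes fits in a few lines. This is a minimal sketch, not the project script itself; the helper name `fetch_and_report` and the `timeout` value are additions of mine:

```python
import requests

def fetch_and_report(url):
    """Send an HTTP GET request and report the basics of the response."""
    # timeout is optional but keeps the script from hanging forever
    # if the server never answers.
    response = requests.get(url, timeout=10)
    print(f"Status code: {response.status_code}")
    print(f"First 100 characters: {response.text[:100]}")
    return response
```

With a working connection, `fetch_and_report("http://books.toscrape.com/")` should print `Status code: 200` followed by the opening of the page's HTML.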

## Run

```bash
cd projects/modules/01-web-scraping/01-fetch-a-webpage
python project.py
```

## Expected output

```text
Fetching http://books.toscrape.com/ ...
Status code: 200
Content type: text/html
Content length: XXXXX characters

First 500 characters of the page:
--------------------------------------------------
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" ...
(HTML content continues)
--------------------------------------------------
Done.
```

The exact character count and HTML will vary, but you should see status code 200 and recognizable HTML.

## Alter it

1. Change the URL to `http://books.toscrape.com/catalogue/page-2.html` and run again. What changes? What stays the same?
2. Add a line that prints `response.headers` to see all the HTTP headers the server sent back. Pick two headers and look up what they mean.
3. Add a check: if the status code is not 200, print a warning message instead of the page content.

## Break it

1. Change the URL to `http://books.toscrape.com/this-page-does-not-exist`. What status code do you get?
2. Change the URL to `http://definitely-not-a-real-website-abc123.com`. What error do you get? (Hint: it is not a status code — it is a Python exception.)
3. Remove the `import requests` line and run the script. Read the error message carefully.

## Fix it

1. Wrap the `requests.get()` call in a try/except block that catches `requests.exceptions.RequestException`. Print a friendly error message instead of a traceback.
2. After fetching, check `response.status_code`. If it is 404, print "Page not found" and exit early. If it is anything other than 200, print the status code as a warning.
3. Put the import back if you removed it.
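One possible shape for the finished fixes, combining steps 1 and 2 above. This is a sketch, not the only correct answer; the helper name `safe_fetch` is mine:

```python
import requests

def safe_fetch(url):
    """Fetch a URL, returning the response on success or None on failure."""
    try:
        response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, DNS failures, timeouts, and more,
        # so the user sees a message instead of a traceback.
        print(f"Could not reach {url}: {exc}")
        return None

    if response.status_code == 404:
        print("Page not found")
        return None
    if response.status_code != 200:
        print(f"Warning: unexpected status code {response.status_code}")

    return response
```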

## Explain it

1. What is an HTTP status code and what does 200 mean?
2. What is the difference between `response.text` and `response.content`?
3. Why might `requests.get()` raise an exception instead of returning a response?
4. What does the `Content-Type` header tell you?
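A quick illustration for question 2, using plain Python types and no network: `response.content` holds the raw body as `bytes`, and `response.text` is those bytes decoded into a `str` with the encoding requests detects. The strings below are invented stand-ins for a real response body:

```python
raw = "Price: £51.77".encode("utf-8")   # like response.content: bytes
decoded = raw.decode("utf-8")           # like response.text: str

print(type(raw))      # <class 'bytes'>
print(type(decoded))  # <class 'str'>
```

Bytes matter when you download images or files; decoded text is what you hand to an HTML parser.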

## Mastery check

You can move on when you can:

- Fetch any URL and check whether it succeeded, from memory.
- Explain what a status code is without looking it up.
- Handle both HTTP errors (404) and connection errors (no internet) gracefully.
- Describe what `response.text` contains.

## Next

[Project 02 — Parse HTML](../02-parse-html/)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@

# Notes — Fetch a Webpage

## What I learned

## What confused me

## What I want to explore next
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@

"""
Project 01 — Fetch a Webpage

This script fetches a web page using the requests library and prints
basic information about the response: status code, content type,
content length, and a preview of the HTML.

Target site: http://books.toscrape.com (a safe practice site for scraping)
"""

# The requests library makes HTTP requests simple.
# You installed it with: pip install requests
import requests


def fetch_page(url):
    """
    Fetch a web page and return the response object.

    requests.get() sends an HTTP GET request to the URL — the same kind
    of request your browser sends when you type a URL in the address bar.
    The response object contains the status code, headers, and body.
    """
    print(f"Fetching {url} ...")
    response = requests.get(url)
    return response


def display_response_info(response):
    """
    Print useful information about the HTTP response.

    Every HTTP response has:
    - A status code (200 = success, 404 = not found, 500 = server error)
    - Headers (metadata like content type, server name, etc.)
    - A body (the actual HTML, JSON, or other content)
    """

    # The status code tells you whether the request succeeded.
    # 200 means "OK" — the server found the page and sent it back.
    print(f"Status code: {response.status_code}")

    # Headers are key-value pairs the server sends with the response.
    # Content-Type tells you what kind of content came back (HTML, JSON, etc.)
    content_type = response.headers.get("Content-Type", "unknown")
    print(f"Content type: {content_type}")

    # response.text is the body of the response as a string.
    # For a web page, this is the raw HTML that the browser would render.
    # len() tells us how many characters are in the response.
    print(f"Content length: {len(response.text)} characters")

    # Print the first 500 characters so you can see what HTML looks like.
    # This is the same HTML your browser receives — it just renders it
    # as a pretty page instead of showing the raw tags.
    print()
    print("First 500 characters of the page:")
    print("-" * 50)
    print(response.text[:500])
    print("-" * 50)


def main():
    # books.toscrape.com is a website built specifically for people
    # learning web scraping. It is safe to scrape and will not block you.
    url = "http://books.toscrape.com/"

    # Step 1: Fetch the page
    response = fetch_page(url)

    # Step 2: Check if the request succeeded
    # Any status code in the 200s means success.
    # The most common success code is 200 ("OK").
    if response.status_code == 200:
        display_response_info(response)
    else:
        print(f"Request failed with status code: {response.status_code}")
        print("This means the server could not return the page you asked for.")

    print("\nDone.")


# This pattern means: only run main() when this file is executed directly.
# If someone imports this file, main() will NOT run automatically.
# This is a Python convention you will see in almost every script.
if __name__ == "__main__":
    main()
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@

# Module 01 / Project 02 — Parse HTML

[README](../../../../README.md) · [Module Index](../README.md)

## Focus

- Creating a BeautifulSoup object from HTML
- `find()` and `find_all()` to locate elements
- CSS selectors with `select()`
- Extracting text and attributes from elements

## Why this project exists

Raw HTML is a mess of tags, attributes, and nesting. BeautifulSoup turns that mess into a tree structure you can search. This project teaches you to find specific elements on a page — the single most important skill in web scraping. You will extract book titles and prices from a real webpage.
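The idea in miniature, using a tiny inline HTML string instead of the live page. The snippet only mimics the structure of books.toscrape.com, and `html.parser` stands in for `lxml` so nothing extra needs installing:

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/"
         title="A Light in the Attic">A Light in...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# Attributes come from square brackets, text from .text.
title = soup.find("h3").find("a")["title"]
price = soup.find("p", class_="price_color").text
print(title, price)  # A Light in the Attic £51.77
```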

## Run

```bash
cd projects/modules/01-web-scraping/02-parse-html
python project.py
```

## Expected output

```text
Fetching http://books.toscrape.com/ ...
Parsing HTML with BeautifulSoup...

Found 20 books on the page:

1. A Light in the Attic £51.77
2. Tipping the Velvet £53.74
3. Soumission £50.10
...
20. (last book title) £XX.XX

Done. Extracted 20 books.
```

The exact titles and prices depend on the current page content, but you should see 20 books listed.

## Alter it

1. Instead of printing the price, print the star rating. Each book has a `<p>` tag with a class like `star-rating Three`. Extract and print the rating word (One, Two, Three, etc.).
2. Use `soup.select()` with a CSS selector instead of `find_all()`. For example, `soup.select("article.product_pod h3 a")` selects all title links. Try rewriting the extraction using only CSS selectors.
3. Extract and print the URL of each book's detail page (the `href` attribute on the title link).
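Item 2's CSS-selector rewrite might look like this; the inline HTML below is an invented stand-in shaped like the real page, and `html.parser` is used so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <h3><a href="catalogue/book-one_1/" title="Book One">Book One</a></h3>
</article>
<article class="product_pod">
  <h3><a href="catalogue/book-two_2/" title="Book Two">Book Two</a></h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
# One selector replaces the nested find()/find_all() calls:
# every <a> inside an <h3> inside an article with class "product_pod".
for link in soup.select("article.product_pod h3 a"):
    print(link["title"], "->", link["href"])
```

This also covers item 3: the `href` attribute is read the same way as `title`.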

## Break it

1. Change the parser from `"lxml"` to `"html.parser"` (Python's built-in). Does the output change? What if the HTML were malformed — which parser would handle it better?
2. Search for a tag that does not exist: `soup.find("div", class_="nonexistent")`. What does it return? What happens if you try to call `.text` on that result?
3. Remove the `import` for BeautifulSoup and run the script. Read the error.

## Fix it

1. Before calling `.text` on a found element, add a check: `if element is not None`. Print "Not found" if the element is missing.
2. If the page fetch fails (status code is not 200), skip the parsing step entirely and print an error message.
3. Restore any imports you removed.
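Fix 1 in miniature: `find()` returns `None` when nothing matches, so check before touching `.text`. The one-line HTML here is purely illustrative:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='price_color'>£10.00</p>", "html.parser")

element = soup.find("div", class_="nonexistent")
if element is not None:
    print(element.text)
else:
    # Without this guard you would get:
    # AttributeError: 'NoneType' object has no attribute 'text'
    print("Not found")
```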

## Explain it

1. What does `BeautifulSoup(html, "lxml")` do? What is the second argument for?
2. What is the difference between `find()` and `find_all()`?
3. How do you get the text content of a tag? How do you get an attribute like `href`?
4. What is a CSS selector and why might you prefer `select()` over `find_all()`?

## Mastery check

You can move on when you can:

- Parse any HTML string with BeautifulSoup without looking up the syntax.
- Find elements by tag name, class, and CSS selector.
- Extract both text content and attributes from elements.
- Handle the case where an element is not found on the page.

## Next

[Project 03 — Extract Structured Data](../03-extract-structured-data/)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@

# Notes — Parse HTML

## What I learned

## What confused me

## What I want to explore next
Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@

"""
Project 02 — Parse HTML

This script fetches a page from books.toscrape.com, parses the HTML
with BeautifulSoup, and extracts the title and price of every book
on the page.

You will learn: BeautifulSoup basics, find(), find_all(), and
extracting text and attributes from HTML elements.
"""

import requests

# BeautifulSoup lives in the bs4 package.
# You installed it with: pip install beautifulsoup4
# The import name (bs4) is different from the package name (beautifulsoup4).
from bs4 import BeautifulSoup


def fetch_page(url):
    """Fetch a web page and return the response text, or None on failure."""
    print(f"Fetching {url} ...")
    response = requests.get(url)

    if response.status_code != 200:
        print(f"Failed to fetch page. Status code: {response.status_code}")
        return None

    return response.text


def parse_books(html):
    """
    Parse HTML and extract book titles and prices.

    BeautifulSoup turns raw HTML into a tree of objects you can search.
    Think of it like a map of the page — you can ask "find all the <h3> tags"
    or "find the element with class 'price_color'".

    Returns a list of tuples: [(title, price), ...]
    """

    # Create a BeautifulSoup object from the HTML string.
    # "lxml" is the parser — it reads the HTML and builds the tree.
    # Other options: "html.parser" (built-in, slower) or "html5lib" (very lenient).
    print("Parsing HTML with BeautifulSoup...")
    soup = BeautifulSoup(html, "lxml")

    books = []

    # Each book on the page is inside an <article> tag with class "product_pod".
    # find_all() returns a list of ALL matching elements.
    # find() returns only the FIRST match (or None if nothing matches).
    articles = soup.find_all("article", class_="product_pod")

    for article in articles:
        # The book title is inside an <h3> tag, inside an <a> tag.
        # The title attribute on the <a> tag has the full title text.
        # We use find() here because there is only one <h3> per article.
        title_tag = article.find("h3")
        link_tag = title_tag.find("a")

        # The "title" attribute contains the full title.
        # link_tag["title"] gets an attribute, like href or title.
        # link_tag.text would give us the visible text, which is sometimes truncated.
        title = link_tag["title"]

        # The price is inside a <p> tag with class "price_color".
        # .text gives us the text content of the element, like "£51.77".
        price_tag = article.find("p", class_="price_color")
        price = price_tag.text.strip()

        books.append((title, price))

    return books


def display_books(books):
    """Print the list of books in a formatted table."""
    print(f"\nFound {len(books)} books on the page:\n")

    for i, (title, price) in enumerate(books, start=1):
        # Format each line so titles and prices line up in columns.
        # :<45 means left-align the title in a 45-character-wide column.
        print(f" {i:>3}. {title:<45} {price}")


def main():
    url = "http://books.toscrape.com/"

    # Step 1: Fetch the raw HTML
    html = fetch_page(url)
    if html is None:
        return

    # Step 2: Parse the HTML and extract book data
    books = parse_books(html)

    # Step 3: Display the results
    if books:
        display_books(books)
    else:
        print("No books found. The page structure may have changed.")

    print(f"\nDone. Extracted {len(books)} books.")


if __name__ == "__main__":
    main()
