A modular Python web scraping project that extracts book data from books.toscrape.com, a sandbox website designed for scraping practice. The project follows a structured pipeline: scrape, clean, validate, and export. It is built for clarity and real-world applicability, making it suitable for entry- to intermediate-level developers.
- Multi-page scraping with configurable page count
- Extraction of book title, price, and rating
- Automatic retry logic for failed HTTP requests
- Request timeout to prevent hanging connections
- Data cleaning: currency symbols removed, ratings converted from text to numeric
- Data validation: missing values dropped, price and rating range enforced
- Output in both CSV and JSON formats
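The cleaning and validation rules listed above could look roughly like this. This is a hedged sketch; the function names are illustrative and not necessarily the project's actual API:

```python
# Map the site's rating words to numeric values
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_price(raw: str) -> float:
    """Strip a leading currency symbol (e.g. '£51.77') and convert to float."""
    return float(raw.lstrip("£$€"))

def clean_rating(word: str) -> int:
    """Convert a rating word (e.g. 'Three') to its integer value (e.g. 3)."""
    return RATING_WORDS[word]

def is_valid(row: dict) -> bool:
    """Enforce the validation rules: no missing values, price > 0, rating in 1-5."""
    if any(value is None for value in row.values()):
        return False
    return row["price"] > 0 and 1 <= row["rating"] <= 5
```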
| Tool | Purpose |
|---|---|
| Python 3.8+ | Core language |
| requests | HTTP requests |
| BeautifulSoup | HTML parsing |
| pandas | Data manipulation and export |
| lxml | Fast HTML parser backend |
```
web-scrapper/
├── main.py             # Entry point — orchestrates the full pipeline
├── extractor.py        # Fetches pages and parses HTML for book data
├── cleaner.py          # Cleans price and rating fields
├── validator.py        # Validates data quality (missing values, ranges)
├── requirements.txt    # Python dependencies
├── README.md
└── output/             # Generated at runtime
    ├── books_raw.csv
    ├── books_cleaned.csv
    └── books_cleaned.json
```
1. **Scraping** — `extractor.py` sends HTTP requests to each catalogue page of books.toscrape.com. It retries up to 3 times on failure and uses a 10-second timeout. BeautifulSoup parses the HTML to extract the title, price, and rating from each book listing.
2. **Raw Export** — The unprocessed data is saved to `output/books_raw.csv` for reference.
3. **Cleaning** — `cleaner.py` converts price strings (e.g., `£51.77`) to floats and rating words (e.g., `Three`) to integers (e.g., `3`).
4. **Validation** — `validator.py` removes rows that have missing values, prices less than or equal to zero, or ratings outside the 1–5 range.
5. **Final Export** — The validated dataset is saved to both `output/books_cleaned.csv` and `output/books_cleaned.json`.
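The scraping step described above can be sketched as follows. The helper names and CSS selectors are illustrative assumptions, not necessarily the project's actual code, though the selectors match the markup books.toscrape.com uses:

```python
from typing import Dict, List, Optional

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

def fetch_page(page: int, retries: int = 3, timeout: int = 10) -> Optional[str]:
    """Fetch one catalogue page, retrying up to `retries` times on failure."""
    for attempt in range(retries):
        try:
            response = requests.get(BASE_URL.format(page), timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            continue
    return None  # all attempts failed

def parse_books(html: str) -> List[Dict[str, str]]:
    """Extract title, price string, and rating word from each book listing."""
    soup = BeautifulSoup(html, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.h3.a["title"],
            "price": article.select_one("p.price_color").text,
            # The rating word is the second CSS class, e.g. "star-rating Three"
            "rating": article.select_one("p.star-rating")["class"][1],
        })
    return books
```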
- Python 3.8 or higher
- pip (Python package manager)
```bash
# Clone the repository
git clone https://github.com/your-username/web-scrapper.git
cd web-scrapper

# Install dependencies
pip install -r requirements.txt
```

Run the scraper:

```bash
python main.py
```

The script will display progress in the terminal and generate output files in the `output/` directory.
| File | Description |
|---|---|
| `output/books_raw.csv` | Raw scraped data before any processing |
| `output/books_cleaned.csv` | Cleaned and validated data in CSV format |
| `output/books_cleaned.json` | Cleaned and validated data in JSON format |
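For reference, the dual-format export in the final step can be reproduced with pandas (already a project dependency). The sample record below is illustrative:

```python
import os

import pandas as pd

# One illustrative cleaned record; the real pipeline builds this from scraped data
books = [{"title": "A Light in the Attic", "price": 51.77, "rating": 3}]
df = pd.DataFrame(books)

# Write both formats from the same DataFrame
os.makedirs("output", exist_ok=True)
df.to_csv("output/books_cleaned.csv", index=False)
df.to_json("output/books_cleaned.json", orient="records", indent=2)
```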
This project demonstrates a practical, real-world scraping workflow broken into discrete, reusable modules. It is designed as a learning project that covers:
- HTTP request handling with error recovery
- HTML parsing and data extraction
- Data cleaning and type conversion
- Data validation and quality enforcement
- Structured file output
It is well-suited for developers building a portfolio or learning how to work with web data in Python.
- Add command-line arguments for page count and output directory
- Support scraping additional fields (availability, description, category)
- Store data in a SQLite or PostgreSQL database
- Add logging with the `logging` module instead of print statements
- Implement asynchronous scraping with `aiohttp` for better performance
- Add unit tests for each module
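The first improvement above (command-line arguments) could be sketched with the standard library's `argparse`; the flag names here are hypothetical:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI accepting a page count and an output directory."""
    parser = argparse.ArgumentParser(description="Scrape books.toscrape.com")
    parser.add_argument("--pages", type=int, default=1,
                        help="number of catalogue pages to scrape")
    parser.add_argument("--output-dir", default="output",
                        help="directory for generated CSV/JSON files")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Scraping {args.pages} page(s) into {args.output_dir}/")
```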