A modular Python web scraping project that extracts book data from books.toscrape.com, a sandbox website designed for scraping practice. The project follows a structured pipeline: scrape, clean, validate, and export. It is built for clarity and real-world applicability, making it suitable for entry- to intermediate-level developers.
- Multi-page scraping with configurable page count
- Extraction of book title, price, and rating
- Automatic retry logic for failed HTTP requests
- Request timeout to prevent hanging connections
- Data cleaning: currency symbols removed, ratings converted from text to numeric
- Data validation: missing values dropped, price and rating range enforced
- Output in both CSV and JSON formats
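The cleaning and validation rules listed above could look roughly like this. This is a hedged sketch; the function names are illustrative and not necessarily the project's actual API:

```python
# Map the site's rating words to numeric values
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def clean_price(raw: str) -> float:
    """Strip a leading currency symbol (e.g. '£51.77') and convert to float."""
    return float(raw.lstrip("£$€"))

def clean_rating(word: str) -> int:
    """Convert a rating word (e.g. 'Three') to its integer value (e.g. 3)."""
    return RATING_WORDS[word]

def is_valid(row: dict) -> bool:
    """Enforce the validation rules: no missing values, price > 0, rating in 1-5."""
    if any(value is None for value in row.values()):
        return False
    return row["price"] > 0 and 1 <= row["rating"] <= 5
```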
| Tool | Purpose |
|---|---|
| Python 3.8+ | Core language |
| requests | HTTP requests |
| BeautifulSoup | HTML parsing |
| pandas | Data manipulation and export |
| lxml | Fast HTML parser backend |
```
web-scrapper/
├── main.py             # Entry point — orchestrates the full pipeline
├── extractor.py        # Fetches pages and parses HTML for book data
├── cleaner.py          # Cleans price and rating fields
├── validator.py        # Validates data quality (missing values, ranges)
├── requirements.txt    # Python dependencies
├── README.md
└── output/             # Generated at runtime
    ├── books_raw.csv
    ├── books_cleaned.csv
    └── books_cleaned.json
```
1. **Scraping** — `extractor.py` sends HTTP requests to each catalogue page of books.toscrape.com. It retries up to 3 times on failure and uses a 10-second timeout. BeautifulSoup parses the HTML to extract the title, price, and rating from each book listing.
2. **Raw Export** — The unprocessed data is saved to `output/books_raw.csv` for reference.
3. **Cleaning** — `cleaner.py` converts price strings (e.g., `£51.77`) to floats and rating words (e.g., `Three`) to integers (e.g., `3`).
4. **Validation** — `validator.py` removes rows that have missing values, prices less than or equal to zero, or ratings outside the 1–5 range.
5. **Final Export** — The validated dataset is saved to both `output/books_cleaned.csv` and `output/books_cleaned.json`.
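The scraping step described above can be sketched as follows. The helper names and CSS selectors are illustrative assumptions, not necessarily the project's actual code, though the selectors match the markup books.toscrape.com uses:

```python
from typing import Dict, List, Optional

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

def fetch_page(page: int, retries: int = 3, timeout: int = 10) -> Optional[str]:
    """Fetch one catalogue page, retrying up to `retries` times on failure."""
    for attempt in range(retries):
        try:
            response = requests.get(BASE_URL.format(page), timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            continue
    return None  # all attempts failed

def parse_books(html: str) -> List[Dict[str, str]]:
    """Extract title, price string, and rating word from each book listing."""
    soup = BeautifulSoup(html, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.h3.a["title"],
            "price": article.select_one("p.price_color").text,
            # The rating word is the second CSS class, e.g. "star-rating Three"
            "rating": article.select_one("p.star-rating")["class"][1],
        })
    return books
```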
- Python 3.8 or higher
- pip (Python package manager)
```bash
# Clone the repository
git clone https://github.com/your-username/web-scrapper.git
cd web-scrapper

# Install dependencies
pip install -r requirements.txt
```

Run the scraper:

```bash
python main.py
```

The script will display progress in the terminal and generate output files in the `output/` directory.
| File | Description |
|---|---|
| `output/books_raw.csv` | Raw scraped data before any processing |
| `output/books_cleaned.csv` | Cleaned and validated data in CSV format |
| `output/books_cleaned.json` | Cleaned and validated data in JSON format |
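For reference, the dual-format export in the final step can be reproduced with pandas (already a project dependency). The sample record below is illustrative:

```python
import os

import pandas as pd

# One illustrative cleaned record; the real pipeline builds this from scraped data
books = [{"title": "A Light in the Attic", "price": 51.77, "rating": 3}]
df = pd.DataFrame(books)

# Write both formats from the same DataFrame
os.makedirs("output", exist_ok=True)
df.to_csv("output/books_cleaned.csv", index=False)
df.to_json("output/books_cleaned.json", orient="records", indent=2)
```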
This project demonstrates a practical, real-world scraping workflow broken into discrete, reusable modules. It is designed as a learning project that covers:
- HTTP request handling with error recovery
- HTML parsing and data extraction
- Data cleaning and type conversion
- Data validation and quality enforcement
- Structured file output
It is well-suited for developers building a portfolio or learning how to work with web data in Python.
- Add command-line arguments for page count and output directory
- Support scraping additional fields (availability, description, category)
- Store data in a SQLite or PostgreSQL database
- Add logging with the `logging` module instead of print statements
- Implement asynchronous scraping with `aiohttp` for better performance
- Add unit tests for each module
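The first improvement above (command-line arguments) could be sketched with the standard library's `argparse`; the flag names here are hypothetical:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI accepting a page count and an output directory."""
    parser = argparse.ArgumentParser(description="Scrape books.toscrape.com")
    parser.add_argument("--pages", type=int, default=1,
                        help="number of catalogue pages to scrape")
    parser.add_argument("--output-dir", default="output",
                        help="directory for generated CSV/JSON files")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Scraping {args.pages} page(s) into {args.output_dir}/")
```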