
Business Registry Web Scraper

Overview

This project is a Python web scraper that extracts business registry records from a demo website:

https://scraping-trial-test.vercel.app

The website:

  • Is built with React / Next.js
  • Uses multi-page result navigation
  • Serves business listings and profiles via server-rendered HTML

Although the content is currently accessible via direct HTTP requests, Selenium was intentionally chosen as the scraping interface to model real browser behavior and to avoid relying on assumptions about the site’s current or future rendering and data-delivery strategy.

The script (sketched below):

  • Accepts a user-provided search term
  • Navigates through all result pages
  • Opens each business profile
  • Extracts detailed business and registered agent information
  • Saves results to both JSON and CSV formats
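
A minimal sketch of this flow is shown below. The CSS selectors (input[name='q'], a.profile-link, a.next-page) and the single extracted field are illustrative assumptions rather than the site's actual markup; scraper.py binds to the real page structure.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def run(search_term: str) -> list:
    driver = webdriver.Firefox()
    records = []
    try:
        driver.get("https://scraping-trial-test.vercel.app")
        # Submit the search term through the site's search form.
        box = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "input[name='q']"))
        )
        box.send_keys(search_term)
        box.submit()
        while True:
            # Collect profile URLs before navigating away, so element
            # references are never reused across page loads.
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "a.profile-link"))
            )
            urls = [a.get_attribute("href")
                    for a in driver.find_elements(By.CSS_SELECTOR, "a.profile-link")]
            results_url = driver.current_url
            for url in urls:
                driver.get(url)
                # Full field extraction is sketched in the next section.
                records.append({"business_name": driver.find_element(By.CSS_SELECTOR, "h1").text})
            driver.get(results_url)
            next_links = driver.find_elements(By.CSS_SELECTOR, "a.next-page")
            if not next_links:
                break
            next_links[0].click()
    finally:
        driver.quit()
    return records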

What Data Is Extracted

For each business, the scraper extracts the following fields from the Business Profile page:

  • Business Name
  • Registration ID
  • Status
  • Filing Date
  • Registered Agent Name
  • Registered Agent Address
  • Registered Agent Email (if available)

All fields are scraped directly from the site.
No data is inferred, synthesized, or fabricated.
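
The extraction step can be sketched as follows. The selectors are hypothetical placeholders; the pattern that matters is an explicit wait for the profile content plus a guarded lookup for the optional email field.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

def extract_profile(driver) -> dict:
    # Wait for the profile heading before reading any field.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".business-name"))
    )
    record = {
        "business_name": driver.find_element(By.CSS_SELECTOR, ".business-name").text,
        "registration_id": driver.find_element(By.CSS_SELECTOR, ".registration-id").text,
        "status": driver.find_element(By.CSS_SELECTOR, ".status").text,
        "filing_date": driver.find_element(By.CSS_SELECTOR, ".filing-date").text,
        "agent_name": driver.find_element(By.CSS_SELECTOR, ".agent-name").text,
        "agent_address": driver.find_element(By.CSS_SELECTOR, ".agent-address").text,
    }
    try:
        record["agent_email"] = driver.find_element(By.CSS_SELECTOR, ".agent-email").text
    except NoSuchElementException:
        record["agent_email"] = ""  # optional field; absent on some profiles
    return record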


Output Files

After execution, two output files are created in the project directory.

output.json (example)

{
  "business_name": "Silver Tech CORP",
  "registration_id": "SD0000001",
  "status": "Active",
  "filing_date": "1999-12-04",
  "agent_name": "Sara Smith",
  "agent_address": "1545 Maple Ave",
  "agent_email": "sara.smith@example.com"
}

output.csv (example)

business_name,registration_id,status,filing_date,agent_name,agent_address,agent_email
Silver Tech CORP,SD0000001,Active,1999-12-04,Sara Smith,1545 Maple Ave,sara.smith@example.com
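
Records in this shape can be written to both files with the standard library alone. The field names match the examples above; the function itself is an illustrative sketch, not necessarily the exact code in scraper.py.

import csv
import json

FIELDS = ["business_name", "registration_id", "status", "filing_date",
          "agent_name", "agent_address", "agent_email"]

def save_results(records: list) -> None:
    # JSON: the full list of record dictionaries.
    with open("output.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    # CSV: one header row, then one row per record.
    with open("output.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)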

Requirements

  • Python 3.x
  • Mozilla Firefox
  • GeckoDriver (Firefox WebDriver)
  • Internet connection
  • Git (optional)

Version Information

This repository follows Semantic Versioning.

Latest stable release:
v1.3.1

v1.3.1 is a documentation-only update that corrects technical descriptions and clarifies architectural and tooling decisions.
No functional or behavioral changes were introduced relative to v1.3.0.


Step-by-Step Installation (Beginner Friendly)

Option A: Download the Latest Stable Release

  1. Go to: https://github.com/Shnxxx/scraping-trial-test
  2. Open the Releases section
  3. Download the latest release archive
  4. Extract the files
  5. Open a terminal inside the project directory

Option B: Clone Using Git

git clone https://github.com/Shnxxx/scraping-trial-test.git
cd scraping-trial-test
git checkout v1.3.1

(Optional) Create a Virtual Environment

python -m venv .venv

Activate it:

Windows

.venv\Scripts\activate

macOS / Linux

source .venv/bin/activate

Install Dependencies

pip install selenium

How to Run the Script

python scraper.py

  1. Firefox launches automatically
  2. Enter a search term (minimum 3 characters)
  3. The script navigates all result pages and business profiles
  4. Results are written to output.json and output.csv

Search Term Rules

  • Minimum of 3 characters
  • Empty searches are not allowed
  • Wildcard searches are not supported
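
A minimal sketch of enforcing these rules before the search is submitted (treating * and % as wildcard characters is an assumption):

def validate_search_term(term: str) -> str:
    term = term.strip()
    if not term:
        raise ValueError("Empty searches are not allowed")
    if len(term) < 3:
        raise ValueError("Search term must be at least 3 characters")
    if "*" in term or "%" in term:
        raise ValueError("Wildcard searches are not supported")
    return term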

Pagination Handling

  • URL-based pagination
  • All result pages are processed sequentially
  • Stable page-level indicators are used to avoid stale element references caused by frontend re-renders
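
A sketch of this approach, assuming a hypothetical page query parameter and a .results-table page-level indicator. Re-locating elements after every navigation, instead of holding references across re-renders, is what prevents stale element errors.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def iter_result_pages(driver, results_url: str, total_pages: int):
    for page in range(1, total_pages + 1):
        driver.get(f"{results_url}?page={page}")
        # Wait on a page-level container, not an individual row, so a
        # mid-render row swap cannot hand back a stale reference.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".results-table"))
        )
        yield driver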

Error Handling & Logging

  • Explicit waits ensure DOM readiness
  • Optional fields are handled safely
  • Errors are logged to scraper.log
  • Individual record failures do not halt execution
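
A sketch of the logging and failure-isolation pattern. scraper.log matches the file named above; the extract callable stands in for the profile-extraction step.

import logging

logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def scrape_all(driver, profile_urls, extract):
    records = []
    for url in profile_urls:
        try:
            driver.get(url)
            records.append(extract(driver))
        except Exception:
            # Log the failure with a traceback, then continue with the
            # next record instead of halting the run.
            logging.exception("Failed to scrape %s", url)
    return records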

Performance Notes

  • Browser automation is inherently slower than direct HTTP scraping
  • This tradeoff is intentional and accepted
  • The implementation prioritizes correctness, transparency, and resilience over raw throughput

Limitations

  • Requires a local browser and WebDriver
  • Slower than a requests-based scraper
  • Resume-after-crash is not implemented
  • Full registry enumeration without a search term is not supported

Repository Structure

scraping-trial-test/
├── scraper.py
├── output.json
├── output.csv
├── scraper.log
├── README.md
└── .gitignore

Versioning

  • v1.0.0 – Initial working scraper
  • v1.1.0 – Business profile navigation
  • v1.2.0 – CSV output and stability fixes
  • v1.3.0 – Agent scraping and performance tuning
  • v1.3.1 – Documentation corrections and clarifications

Author’s Notes

Why Selenium Instead of requests + BeautifulSoup

While the site’s current HTML is server-rendered and technically scrapeable via requests and BeautifulSoup, Selenium was selected based on engineering risk management, not minimum technical feasibility.

Specifically:

  • The site is built with React / Next.js, where rendering strategy (SSR, SSG, CSR) can change without altering visible browser behavior but can silently break request-based scrapers.
  • Pagination and navigation are expressed through user-facing UI flows rather than a documented or stable backend API.
  • A browser-driven approach avoids assumptions about where data originates, how it is rendered, or whether it will remain present in initial HTML responses.
  • Selenium ensures continued correctness if content delivery shifts toward client-side rendering, hydration, or JavaScript-triggered navigation.

In this context, Selenium intentionally trades performance for robustness and maintainability. This mirrors real-world scraping constraints, where browser automation is often the only stable interface available.

Scope Decisions

Persistent resume logic and API-based scraping were intentionally excluded to keep the solution focused, transparent, and aligned with browser-oriented scraping constraints.


Author

Senjo
GitHub: https://github.com/Shnxxx