E-Commerces-WebScraper

A production-ready web scraper for extracting product information from multiple e-commerce platforms with authenticated session support, intelligent path resolution, batch processing, and AI-powered marketing content generation.
- Table of Contents
- Introduction
- Features
- Supported Platforms
- Architecture
- Requirements
- Installation
- Configuration
- Usage
- Authenticated Scraping
- Output Structure
- Product Directory Naming Rule
- AI-Powered Marketing Content
- Dependencies
- File Structure
- Implementation Details
- Troubleshooting
- Performance Considerations
- Ethical Considerations
- Contributing
- Collaborators
- License
E-Commerces-WebScraper is a comprehensive, production-ready Python application designed to automate the extraction of product information from multiple e-commerce platforms. Built with maintainability and extensibility in mind, it supports both traditional HTTP scraping and advanced authenticated browser automation for JavaScript-heavy websites.
The scraper extracts detailed product data including names, prices, discount information, descriptions, and high-resolution images. It features intelligent duplicate detection, asset optimization, batch processing capabilities, and optional AI-powered marketing content generation via Google Gemini.
- Multi-Platform Support: Scrapes AliExpress, Amazon, Mercado Livre, Shein, and Shopee with dedicated, platform-specific scrapers
- Authenticated Scraping: Uses existing Chrome profiles to bypass login requirements for Shopee and Shein
- Intelligent Path Resolution: Automatically resolves local HTML paths with multiple fallback strategies
- Batch Processing: Process multiple URLs from input files with configurable delays between requests
- Offline Scraping: Support for scraping from local HTML files and zip archives
- Image Optimization: Automatic duplicate detection and removal of low-quality images
- Asset Localization: Downloads and localizes external assets (images, CSS, JavaScript)
- AI Integration: Optional marketing content generation using Google Gemini API
- Comprehensive Logging: Detailed logs for all operations with timestamp tracking
- Error Recovery: Robust exception handling with detailed error reporting
- Platform-Specific Output: Organized directory structure with platform prefixes
- Product Validation: Validates scraped data to filter out placeholder entries
| Platform | Scraping Method | Authentication Required | Status |
|---|---|---|---|
| AliExpress | Browser Automation (Playwright) | Yes | ✅ Active |
| Amazon | Browser Automation (Playwright) | Yes | ✅ Active |
| Mercado Livre | HTTP Requests | No | ✅ Active |
| Shein | Browser Automation (Playwright) | Yes | ✅ Active |
| Shopee | Browser Automation (Playwright) | Yes | ✅ Active |
The application follows a modular, class-based architecture with clear separation of concerns:
- main.py: Orchestration layer that handles URL routing, batch processing, validation, and output management
- AliExpress.py: Browser automation scraper for AliExpress using `Playwright` for JavaScript-rendered pages
- Amazon.py: Browser automation scraper for Amazon Brasil using `Playwright` for JavaScript-rendered pages
- Gemini.py: AI integration module for generating marketing content via the Google Gemini API
- Logger.py: Custom logging utility for simultaneous terminal and file output
- MercadoLivre.py: HTTP-based scraper using `requests` and `BeautifulSoup` for static content extraction
- Shein.py: Browser automation scraper for Shein using `Playwright` for JavaScript-rendered pages
- Shopee.py: Browser automation scraper for Shopee using `Playwright` for JavaScript-rendered pages
- URL Loading: Reads URLs from `Inputs/urls.txt` or test constants
- Platform Detection: Analyzes URL patterns to determine the appropriate scraper
- Path Resolution: Resolves local HTML paths with fallback mechanisms for offline scraping
- Scraping Execution: Invokes platform-specific scraper with appropriate parameters
- Data Validation: Verifies product data completeness and authenticity
- Asset Processing: Downloads images, removes duplicates, excludes small files
- Output Generation: Creates organized directories with product descriptions
- AI Enhancement: Optionally generates marketing content via Gemini API
- Cleanup: Removes temporary files and extracted archives
```text
User Authentication (One-time)
        ↓
Chrome Profile Creation
        ↓
Session Cookies Saved
        ↓
Playwright Launches Chrome with Profile
        ↓
Automatic Authentication via Cookies
        ↓
Page Rendering with JavaScript
        ↓
Content Extraction
```
- Python: >= 3.8
- Operating System: Windows, macOS, or Linux
- Chrome Browser: Required for authenticated scraping (Shopee/Shein)
- Internet Connection: Required for online scraping and AI features
- Google Gemini API Key: Optional, for AI-powered marketing content generation
1. Clone the Repository

```bash
git clone https://github.com/BrenoFariasdaSilva/E-Commerces-WebScraper.git
cd E-Commerces-WebScraper
```

2. Create Virtual Environment (Recommended)

```bash
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
```

3. Install Dependencies

```bash
pip install -r requirements.txt
```

4. Install Playwright Browsers (Required for Shopee/Shein)

```bash
python -m playwright install chromium
```

5. Configure Environment Variables

Create a `.env` file in the project root (see Configuration section).
Create a `.env` file in the project root directory:

```env
# AI Integration (Optional - for marketing content generation)
GEMINI_API_KEY=your_gemini_api_key_here

# Browser Authentication (Required for Shopee and Shein)
CHROME_PROFILE_PATH=C:/Users/YourUsername/AppData/Local/Google/Chrome/User Data
CHROME_EXECUTABLE_PATH=
HEADLESS=False
```

GEMINI_API_KEY (Optional)
- Google Gemini API key for AI-powered marketing content generation
- Obtain from: https://makersuite.google.com/app/apikey
- Leave empty to skip AI content generation
CHROME_PROFILE_PATH (Required for Shopee/Shein)
- Path to your Chrome user data directory with authenticated sessions
- Windows: `C:/Users/YourUsername/AppData/Local/Google/Chrome/User Data`
- macOS: `/Users/YourUsername/Library/Application Support/Google/Chrome`
- Linux: `/home/YourUsername/.config/google-chrome`
- ⚠️ Use forward slashes (`/`) even on Windows
- ⚠️ Close all Chrome windows before running the scraper
CHROME_EXECUTABLE_PATH (Optional)
- Path to Chrome executable if not in default location
- Leave empty if Chrome is installed in the standard location
HEADLESS (Optional)
- `False`: Show browser window (recommended for debugging)
- `True`: Run browser in background without window
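These variables can be loaded at startup with `python-dotenv` (pinned in `requirements.txt`). A minimal sketch, assuming the `.env` keys shown above; the `env_bool` helper name is illustrative and the exact parsing in `main.py` may differ:

```python
import os

try:  # python-dotenv is listed in requirements.txt; the sketch degrades gracefully without it
    from dotenv import load_dotenv
    load_dotenv()  # reads .env from the project root into os.environ
except ImportError:
    pass

def env_bool(name: str, default: bool = False) -> bool:
    """Parse 'True'/'False'-style environment flags such as HEADLESS."""
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "")
CHROME_PROFILE_PATH = os.getenv("CHROME_PROFILE_PATH", "")
CHROME_EXECUTABLE_PATH = os.getenv("CHROME_EXECUTABLE_PATH", "")
HEADLESS = env_bool("HEADLESS", default=False)
```

Leaving `GEMINI_API_KEY` empty simply yields an empty string, which the AI step can treat as "disabled".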
For Shopee and Shein scraping, you must authenticate once in your regular Chrome browser:
- Open Google Chrome normally
- Navigate to https://shopee.com.br and https://br.shein.com
- Log into both websites with your credentials
- Verify you can access product pages while logged in
- Close all Chrome windows completely
- Configure `CHROME_PROFILE_PATH` in the `.env` file
- Run the scraper - it will automatically use your saved sessions
The scraper will reuse your authenticated session without requiring credentials in the code.
1. Add URLs to Input File

Edit `Inputs/urls.txt` and add one URL per line:

```text
https://mercadolivre.com.br/product-url
https://br.shein.com/product-url
https://shopee.com.br/product-url
```

2. Run the Scraper

```bash
python main.py
```

Or using Make:

```bash
make run
```

3. Check Outputs

Results are saved in the `Outputs/` directory, organized by platform and product name.
Each line of the `Inputs/urls.txt` file supports one of two formats: the URL alone, or the URL followed by a local HTML path or zip path.
Online Scraping (URL only):

```text
https://mercadolivre.com.br/product-url
```

Offline Scraping (URL + Local HTML Path):

```text
https://shopee.com.br/product-url ./Inputs/shopee-product/index.html
```
The scraper automatically detects which format is provided and routes accordingly.
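That detection boils down to a whitespace split per line. A sketch of the routing described above; the function name `parse_input_line` is illustrative, not necessarily what `main.py` uses:

```python
def parse_input_line(line: str):
    """Split one Inputs/urls.txt line into (url, local_path_or_None)."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        return None  # blank line, nothing to scrape
    url = parts[0]
    local_path = parts[1].strip() if len(parts) > 1 else None
    return url, local_path
```

A line with only a URL routes to online scraping; a line with a second token routes to offline scraping against the local path.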
Process multiple products in sequence with automatic delay:
```python
# In main.py
DELAY_BETWEEN_REQUESTS = 5  # Seconds between requests (default: 5)
```

The scraper processes all URLs in `Inputs/urls.txt` with rate limiting to avoid triggering anti-bot measures.
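The batch loop itself is essentially "scrape, then pause". A sketch with the sleep function injected so the delay logic is observable; the names are illustrative rather than the exact code in `main.py`:

```python
import time

DELAY_BETWEEN_REQUESTS = 5  # seconds, as configured in main.py

def process_batch(urls, scrape, delay=DELAY_BETWEEN_REQUESTS, sleep=time.sleep):
    """Scrape each URL in order, pausing between requests (but not after the last)."""
    results = []
    for i, url in enumerate(urls):
        results.append(scrape(url))
        if i < len(urls) - 1:
            sleep(delay)  # rate limiting between consecutive requests
    return results
```

Injecting `sleep` also makes the loop trivial to unit-test without real waiting.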
The scraper supports offline scraping from local HTML files or zip archives:
From HTML File:

```text
https://product-url ./Inputs/product-directory/index.html
```

From Zip Archive:

```text
https://product-url ./Inputs/product-archive.zip
```
The scraper will:
- Extract zip files to temporary directories
- Scrape product information from local HTML
- Copy associated assets (images, scripts, styles)
- Clean up temporary files after processing
The scraper includes intelligent path resolution with multiple fallback strategies:
If a path like product-dir/index.html is specified but not found, it automatically tries:
- Original path as provided
- With `./Inputs/` prefix
- With `.zip` suffix
- With `/index.html` suffix
- All combinations of prefixes and suffixes
- Base directory extraction for `.html` files
This ensures maximum flexibility in specifying input paths.
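The fallback order can be expressed as a prefix × suffix product. A sketch of the candidate list only; the real resolver in `main.py` additionally checks the filesystem and handles base-directory extraction for `.html` files:

```python
def candidate_paths(raw: str):
    """Generate path variants to try, in order, for a user-supplied input path."""
    prefixes = ("", "./Inputs/")
    suffixes = ("", ".zip", "/index.html")
    seen, candidates = set(), []
    for prefix in prefixes:
        for suffix in suffixes:
            candidate = f"{prefix}{raw}{suffix}"
            if candidate not in seen:  # skip duplicates when input is already prefixed
                seen.add(candidate)
                candidates.append(candidate)
    return candidates
```

The first candidate that exists on disk wins, so the original path always takes priority over the fallbacks.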
Shopee and Shein require JavaScript rendering and authenticated sessions. The scraper uses Playwright browser automation with existing Chrome profiles.
Instead of storing credentials or automating logins, the scraper:
- Reuses your existing Chrome profile with saved cookies
- Launches Chrome with `--user-data-dir` pointing to your profile
- Inherits authentication automatically from saved session cookies
- No credentials stored in code or configuration files
- Works with 2FA/MFA-enabled accounts
1. Authenticate in Chrome (One-time)
   - Open Chrome normally
   - Log into Shopee and Shein
   - Verify access to product pages
   - Close all Chrome windows

2. Configure Environment

```env
CHROME_PROFILE_PATH=C:/Users/YourUsername/AppData/Local/Google/Chrome/User Data
HEADLESS=False
```

3. Run Scraper

```bash
python main.py
```
The browser will launch with your authenticated profile and scrape products automatically.
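Under the hood this maps to Playwright's persistent-context API. A sketch under stated assumptions: `build_launch_kwargs` and `fetch_rendered_html` are hypothetical helpers, and the per-platform scrapers add their own waiting, scrolling, and extraction logic:

```python
import os

def build_launch_kwargs(profile_path: str, executable_path: str = "", headless: bool = False) -> dict:
    """Assemble launch options for chromium.launch_persistent_context."""
    kwargs = {
        "user_data_dir": profile_path,  # your Chrome profile => saved session cookies
        "channel": "chrome",            # use installed Chrome, not bundled Chromium
        "headless": headless,
    }
    if executable_path:                 # only needed for non-standard installs
        kwargs["executable_path"] = executable_path
    return kwargs

def fetch_rendered_html(url: str) -> str:
    """Open a page with the authenticated profile and return the rendered HTML."""
    from playwright.sync_api import sync_playwright
    launch = build_launch_kwargs(
        os.environ["CHROME_PROFILE_PATH"],
        os.environ.get("CHROME_EXECUTABLE_PATH", ""),
        os.environ.get("HEADLESS", "False").lower() == "true",
    )
    with sync_playwright() as p:
        user_data_dir = launch.pop("user_data_dir")
        context = p.chromium.launch_persistent_context(user_data_dir, **launch)
        page = context.new_page()
        page.goto(url, wait_until="networkidle")  # let JS-rendered content settle
        html = page.content()
        context.close()
    return html
```

Because the context points at your real profile, no credentials ever appear in code, which is why Chrome must be fully closed before a run.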
Each run creates a timestamped run directory under Outputs/ and places all product directories for that run inside it. The timestamped run folder is created by main.py using the format <index>. YYYY-MM-DD - HHhMMmSSs (for example 1. 2026-02-15 - 16h26m31s). Inside the run folder each product gets its own directory named with the platform prefix and sanitized product name. A typical output tree for one run looks like:
```text
Outputs/
└── 1. 2026-02-15 - 16h26m31s/          # Timestamped run folder (created by create_timestamped_output_directory)
    ├── Amazon - Product Name/          # Product directory (created by each scraper)
    │   ├── Product Name.txt            # Product description file created by scraper (product_name_safe + .txt)
    │   ├── Product Name_Template.txt   # AI-generated marketing content (optional, created when Gemini is enabled)
    │   ├── image_1.webp                # Downloaded product images (image_N.ext)
    │   ├── image_2.webp
    │   ├── video_1.mp4                 # Downloaded product videos (video_N.ext) if any
    │   ├── index.html                  # Localized page snapshot (saved as index.html)
    │   ├── assets/                     # Localized assets referenced by the snapshot (images, css, js)
    │   │   ├── asset_1.jpg
    │   │   └── ...
    │   └── original_input/             # Optional: copy of the original input file/archive when available
    ├── Shopee - Other Product/
    │   └── ...
    └── Logs/                           # Per-run or aggregated logs may be placed alongside product folders
```
Notes:
- Timestamped Run Folder: `main.py` creates a timestamped folder under `Outputs/` for every execution; product folders for that run are created inside it. The folder name begins with an incremental index for the day, followed by the date and time (e.g., `1. 2026-02-15 - 16h26m31s`).
- Product Directory Name: Product directory names use the platform prefix (from `PLATFORM_PREFIXES`) plus the sanitized product name, separated by ` - `. All product directory names are generated via a single shared helper function, `product_utils.normalize_product_name(...)`, which applies the existing sanitization rules and then enforces a strict, deterministic 80-character limit (truncation via slicing) AFTER sanitization. All scrapers and `main.py` use this helper for both directory creation and lookup to guarantee consistency.
- Description File: The scraper writes a description file named exactly `{product_name_safe}.txt` (not necessarily with a `_description` suffix) containing the text generated from the product data and the `PRODUCT_DESCRIPTION_TEMPLATE`.
- AI Template File: When Gemini is enabled, the marketing text is saved as `{product_name_safe}_Template.txt` inside the same product directory.
- Snapshot & Assets: The full page snapshot is saved as `index.html` and external assets are localized under an `assets/` subfolder; scrapers may reference `index.html` or `page.html` internally, but the current implementation saves snapshots as `index.html` inside the product folder.
- Original Input Copy: If the input was a local HTML file, directory, or zip archive, `main.py` may copy the original input into the product directory (under `original_input/`) for traceability.
- Logs: The `Logs/` directory at the repository root contains global logs; per-run logs may also be present inside the timestamped run folder depending on runtime configuration.
This layout matches the directory creation and naming performed by main.py and the per-scraper create_output_directory and media/snapshot routines.
Problem
- Very long product names were previously used directly to create product directories. Some operating systems truncate long filesystem names, which caused directory lookup and move operations to fail when code used the original (non-truncated) name.
Solution
- A single, centralized helper function, `product_utils.normalize_product_name(raw_name, replace_with, title_case)`, is now the authoritative way to produce product-directory-safe names. The helper:
  - Preserves the existing sanitization behavior (NBSP normalization, whitespace collapse, title-casing where used, and replacement/removal of filesystem-invalid characters).
  - Enforces a strict maximum length of 80 characters AFTER sanitization using deterministic slicing (no hashing, no randomness).
  - Returns the final directory-safe string.
Usage and requirements for developers
- ALWAYS use `product_utils.normalize_product_name(...)` when creating, searching, or referencing product directories in code. This applies to `AliExpress.py`, `Amazon.py`, `MercadoLivre.py`, `Shein.py`, `Shopee.py`, and `main.py`.
- The 80-character truncation is applied after sanitization; developers must not re-implement truncation or bypass the helper. Bypassing the helper will break directory-name consistency and may cause runtime failures (missing directories, failed moves, or lookup mismatches).
Deterministic behavior
- Truncation uses simple slicing of the sanitized string to 80 characters. Directory names and lookups are therefore reproducible and stable across runs and platforms.
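The contract above can be sketched as follows. This is an illustration of the described behavior (NBSP normalization, invalid-character removal, whitespace collapse, optional title-casing, then an 80-character slice), not the exact body of `product_utils.normalize_product_name`:

```python
import re

MAX_DIR_NAME_LEN = 80  # strict limit applied AFTER sanitization

def normalize_product_name(raw_name: str, replace_with: str = "", title_case: bool = True) -> str:
    """Produce a deterministic, directory-safe product name."""
    name = raw_name.replace("\u00a0", " ")                 # NBSP normalization
    name = re.sub(r'[<>:"/\\|?*]', replace_with, name)     # strip filesystem-invalid chars
    name = re.sub(r"\s+", " ", name).strip()               # whitespace collapse
    if title_case:
        name = name.title()
    return name[:MAX_DIR_NAME_LEN]                         # deterministic slicing, no hashing
```

Because slicing is deterministic, creating a directory with this name and later looking it up always agree, even for very long product titles.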
```text
Product Name: Wireless Gaming Mouse
Price: From R$89.90 to R$149.90 (40% OFF)
Description: High-precision wireless gaming mouse with RGB lighting...

🛒 Encontre na Shopee:
🔗 https://shopee.com.br/product-url
```
When GEMINI_API_KEY is configured, the scraper automatically generates marketing content for each product.
Generated Content Includes:
- Professional product descriptions
- Key feature highlights
- Usage scenarios
- Target audience recommendations
- Call-to-action text
Output: {Product Name}_Template.txt in the product directory.
Processing:
- Automatically triggered after successful scrape
- Validates and fixes formatting issues
- Retries on failures with error logging
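With the `google-genai` client (pinned in the dependencies below), generation reduces to a single API call. A sketch only — the prompt wording and the model name `gemini-2.0-flash` are illustrative assumptions, not necessarily what `Gemini.py` uses:

```python
import os

def build_marketing_prompt(name: str, price: str, description: str) -> str:
    """Assemble the prompt sent to Gemini for marketing copy."""
    return (
        "Write a short, persuasive marketing text for this product.\n"
        f"Product: {name}\nPrice: {price}\nDetails: {description}\n"
        "Include key features, a usage scenario, and a call to action."
    )

def generate_marketing_text(name: str, price: str, description: str) -> str:
    """Call the Gemini API; requires GEMINI_API_KEY in the environment."""
    from google import genai  # google-genai package from requirements.txt
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # illustrative; any available Gemini model works
        contents=build_marketing_prompt(name, price, description),
    )
    return response.text
```

The returned text is what gets written to `{Product Name}_Template.txt`; wrapping the call with `tenacity` retries (also in the dependency list) covers transient API failures.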
The project uses the following production dependencies:
Core Libraries:
- `beautifulsoup4==4.14.3` - HTML parsing and extraction
- `requests==2.32.5` - HTTP requests for web scraping
- `lxml==5.3.0` - Fast XML/HTML parsing backend

Browser Automation:
- `playwright==1.49.1` - Headless browser automation framework
- `pyee==12.0.0` - Event emitter for Playwright
- `greenlet==3.1.1` - Asynchronous support for Playwright

Image Processing:
- `pillow==12.1.0` - Image processing and optimization

AI Integration:
- `google-genai==1.61.0` - Google Gemini API client
- `google-auth==2.48.0` - Google authentication
- `tenacity==9.1.2` - Retry logic for API calls

Utilities:
- `colorama==0.4.6` - Terminal color formatting
- `python-dotenv==1.2.1` - Environment variable management

HTTP & Networking:
- `httpx==0.28.1` - Modern HTTP client
- `httpcore==1.0.9` - Low-level HTTP transport
- `urllib3==2.6.3` - HTTP connection pooling
- `certifi==2026.1.4` - SSL certificate bundle

Data Validation:
- `pydantic==2.12.5` - Data validation using Python type hints
- `pydantic_core==2.41.5` - Core validation logic
For a complete list, see requirements.txt.
```text
E-Commerces-WebScraper/
├── main.py                       # Main orchestration script
├── AliExpress.py                 # AliExpress scraper class
├── Amazon.py                     # Amazon Brasil scraper class
├── MercadoLivre.py               # Mercado Livre scraper class
├── product_utils.py              # Product name sanitization and directory utilities
├── Shein.py                      # Shein scraper class
├── Shopee.py                     # Shopee scraper class
├── Gemini.py                     # AI integration module
├── Logger.py                     # Custom logging utility
├── requirements.txt              # Python dependencies
├── Makefile                      # Build and run commands
├── .env                          # Environment configuration (not tracked)
├── .env.example                  # Environment template
├── README.md                     # This file
├── CONTRIBUTING.md               # Contribution guidelines
├── LICENSE                       # Apache 2.0 license
├── Inputs/                       # Input files directory
│   └── urls.txt                  # URLs to scrape
├── Outputs/                      # Scraped data output directory
│   └── {Platform} - {Product}/   # Product-specific directories
├── Logs/                         # Execution logs
│   └── main.log                  # Main script log file
└── .assets/                      # Project assets
    ├── Icons/                    # Icon files
    └── Sounds/                   # Notification sounds
```
The scraper automatically detects platforms by analyzing URL patterns:
```python
PLATFORMS_MAP = {
    "AliExpress": "aliexpress",
    "Amazon": "amazon",
    "MercadoLivre": "mercadolivre",
    "Shein": "shein",
    "Shopee": "shopee",
}
```

Detection logic checks for platform-specific domain keywords in the URL and routes to the appropriate scraper class.
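A sketch of that routing check, repeating the map so the snippet is self-contained; the real `main.py` logic may differ in details such as iteration order or URL normalization:

```python
PLATFORMS_MAP = {
    "AliExpress": "aliexpress",
    "Amazon": "amazon",
    "MercadoLivre": "mercadolivre",
    "Shein": "shein",
    "Shopee": "shopee",
}

def detect_platform(url: str):
    """Return the platform whose domain keyword appears in the URL, or None."""
    lowered = url.lower()
    for platform, keyword in PLATFORMS_MAP.items():
        if keyword in lowered:
            return platform
    return None  # unknown platform -> caller can skip or log the URL
```

A `None` result lets the orchestrator report unsupported URLs instead of crashing on them.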
Intelligent path resolution with 6+ variation attempts:
- Path as provided
- With `./Inputs/` prefix
- With `.zip` suffix
- With `/index.html` suffix
- All combinations of the above
- Base directory extraction for `.html` files
This ensures maximum user convenience when specifying local HTML paths.
Duplicate Detection:
- Normalizes images to minimum dimensions
- Computes MD5 hash of resized versions
- Groups duplicates by hash
- Keeps highest resolution version
- Deletes lower resolution duplicates
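The grouping step can be sketched over (path, pixel_count, fingerprint) triples, where the fingerprint hashes a normalized (resized) copy of each image. `fingerprint` uses Pillow as the project does; `select_duplicates_to_delete` is pure bookkeeping, and both names are illustrative:

```python
import hashlib

def fingerprint(image_path: str, size=(64, 64)) -> str:
    """MD5 of a resized RGB copy, so visually identical images hash alike."""
    from PIL import Image  # pillow, pinned in requirements.txt
    with Image.open(image_path) as im:
        normalized = im.convert("RGB").resize(size)
        return hashlib.md5(normalized.tobytes()).hexdigest()

def select_duplicates_to_delete(entries):
    """entries: iterable of (path, pixel_count, fingerprint).

    Keep the highest-resolution image per fingerprint; return paths to delete.
    """
    best = {}
    for path, pixels, fp in entries:
        if fp not in best or pixels > best[fp][1]:
            best[fp] = (path, pixels)
    keep = {path for path, _ in best.values()}
    return [path for path, _, _ in entries if path not in keep]
```

Resizing before hashing is what makes a thumbnail and its full-size original collide on the same fingerprint, so only the larger copy survives.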
Size Filtering:
- Removes images smaller than 2KB (configurable)
- Filters out thumbnails and placeholder images
- Ensures only high-quality images are retained
Playwright Configuration:
- Uses existing Chrome profile for authentication
- Waits for network idle before extraction
- Auto-scrolls to trigger lazy-loaded content
- Captures complete page snapshots with assets
- Localizes external resources for offline viewing
Asset Collection:
- Downloads images, CSS, JavaScript files
- Rewrites URLs to use local paths
- Saves complete page snapshot with dependencies
Besides the AutoHotkey version, this repository now includes cross-platform alternatives:
- `Scripts/affiliate_pages_downloader.py`: Python implementation (Linux/macOS/Windows) that mirrors the same image-detection + fallback-coordinates flow and condensed grouped report.

For Windows users, the original AutoHotkey automation remains available at `Scripts/Affiliate Pages Downloader.ahk`. If you run on Windows and prefer the native AHK flow, use that script with AutoHotkey installed (double-click the file or run via the AutoHotkey tray menu).
Run it with:

```bash
python Scripts/affiliate_pages_downloader.py
```

Optional arguments:
- `--tab-count 0` (default): process all URLs from `Inputs/urls.txt`
- `--urls-file /custom/path/urls.txt`
- `--assets-dir /custom/path/.assets/`

Browser
Issue: Unable to open user data directory
- Cause: Chrome is already running with the same profile
- Solution: Close all Chrome windows and check Task Manager for lingering chrome.exe processes
Issue: No product data extracted
- Cause: Not logged in, website structure changed, or anti-bot detection
- Solution: Verify login status in Chrome, try with `HEADLESS=False`, check logs for selector errors
Issue: `playwright._impl._api_types.Error: Executable doesn't exist`
- Cause: Playwright browsers not installed
- Solution: Run `python -m playwright install chromium`
Issue: Could not resolve local HTML path
- Cause: Local HTML file or directory not found
- Solution: Verify file paths, ensure the `./Inputs/` prefix is correct, check zip file integrity
Issue: Rate limiting or IP blocking
- Cause: Too many requests in short time
- Solution: Increase `DELAY_BETWEEN_REQUESTS` in `main.py`, use a VPN if necessary
For more detailed troubleshooting, see AUTHENTICATED_SCRAPING_SETUP.md.
Execution Speed:
- Mercado Livre: ~5-10 seconds per product (HTTP requests)
- Shopee/Shein: ~15-30 seconds per product (browser automation with rendering)
Resource Usage:
- CPU: Moderate during image processing and hash computation
- Memory: ~500MB-1GB for browser automation
- Disk: Depends on image quantity and quality
- Network: Varies by product image count and asset size
Optimization Tips:
- Process large batches during off-peak hours
- Use `HEADLESS=True` for production runs
- Increase `DELAY_BETWEEN_REQUESTS` to avoid rate limiting
- Consider parallel execution for independent URLs (not implemented)
Respect Website Policies:
- Review and comply with each platform's Terms of Service
- Respect `robots.txt` directives
- Implement appropriate rate limiting
- Do not overload servers with excessive requests
Data Usage:
- Scraped data is for personal analysis and monitoring
- Do not republish copyrighted content without permission
- Respect intellectual property rights
- Use product information ethically and legally
Authentication:
- Only scrape content you have legitimate access to
- Do not share or expose authentication credentials
- Do not circumvent security measures
- Use authenticated scraping responsibly
Anti-Bot Measures:
- The scraper mimics normal user behavior
- Uses authenticated sessions to avoid detection
- Implements delays between requests
- Does not attempt to bypass CAPTCHAs or security challenges
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. If you have suggestions for improving the code, your insights will be highly welcome.
Please follow the guidelines in CONTRIBUTING.md for detailed information about the commit standards and the entire pull request process.
1. Set Up Your Environment: Follow the Installation section

2. Make Your Changes:
   - Create a branch: `git checkout -b feature/YourFeatureName`
   - Implement your changes with tests
   - Commit with clear messages:
     - Features: `git commit -m "FEAT: Add some AmazingFeature"`
     - Bug fixes: `git commit -m "FIX: Resolve Issue #123"`
     - Documentation: `git commit -m "DOCS: Update README with new instructions"`
     - Refactoring: `git commit -m "REFACTOR: Enhance component for better aspect"`

3. Submit Your Contribution:
   - Push changes: `git push origin feature/YourFeatureName`
   - Open a Pull Request with a detailed description

4. Stay Engaged: Respond to feedback and make necessary adjustments
We thank the following people who contributed to this project:
- Breno Farias da Silva
This project is licensed under the Apache License 2.0. This license permits use, modification, distribution, and sublicense of the code for both private and commercial purposes, provided that the original copyright notice and a disclaimer of warranty are included in all copies or substantial portions of the software. It also requires a clear attribution back to the original author(s) of the repository. For more details, see the LICENSE file in this repository.