Yengeshwaran/DataScraping_Pipeline

Unicorn Intelligence System

An automated, AI-powered pipeline for tracking, analyzing, and generating narrative reports on unicorn companies.

🚀 Overview

The Unicorn Intelligence System is a production-hardened data pipeline designed to:

  1. Scrape the latest unicorn company data from Crunchbase.
  2. Maintain a persistent Excel database (unicorn_companies.xlsx) with intelligent incremental updates.
  3. Generate comprehensive, narrative-driven company profiles using advanced AI models (OpenRouter, OpenAI, Gemini).
  4. Enrich reports with real-time web data via Tavily and Serper to ensure freshness and accuracy for critical metrics like Valuation and Funding.
  5. Validate all outputs with strict logic to prevent hallucinations and ensure data integrity.

🏗️ Architecture

The system operates in a modular pipeline:

  1. Scraper Module (scraper.py): Fetches basic metadata (Company, Country, Valuation, Investors) and updates the master Excel sheet.
  2. Generator Module (ai_story_generator.py):
    • Orchestrator: Reads from Excel, manages flow.
    • AI Engine: Supports multiple providers (openai, openrouter, mock).
    • Enrichment Layer:
      • Tavily: Fills missing narrative gaps ("Not mentioned" -> Search).
      • Serper: Verifies and updates numeric data (Valuation, Funding) with strict date-based logic.
    • Validation: Enforces template structure and data completeness.
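
The modular flow above can be sketched as follows. This is a minimal, illustrative outline only; the function names (enrich, validate, run_pipeline) are stand-ins, not the repo's actual API, and the web lookups are replaced with placeholders:

```python
# Illustrative sketch of the pipeline: orchestrate -> enrich -> validate.
# All names here are hypothetical; the real logic lives in ai_story_generator.py.

def enrich(profile: dict) -> dict:
    """Fill narrative gaps, standing in for the Tavily/Serper enrichment layer."""
    enriched = dict(profile)
    for field, value in profile.items():
        if value == "Not mentioned":
            enriched[field] = f"<searched: {field}>"  # placeholder for a web lookup
    return enriched

def validate(profile: dict, required: list[str]) -> bool:
    """Reject profiles that still contain placeholders or miss required sections."""
    return all(profile.get(f) not in (None, "", "Not mentioned") for f in required)

def run_pipeline(rows: list[dict], required: list[str]) -> list[dict]:
    reports = []
    for row in rows:                          # orchestrator: rows read from Excel
        profile = enrich(row)                 # enrichment layer
        if validate(profile, required):       # validation gate
            reports.append(profile)
    return reports
```

A profile that fails validation is simply dropped here; the real pipeline regenerates it instead.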

✨ Features

  • Multi-Model Support: Seamlessly switch between OpenAI (GPT-4o), Google Gemini, and OpenRouter models via .env.
  • Intelligent Freshness: Automatically verifies valuation and funding numbers against Google Search results (Serper) and updates them only if a newer date is confirmed.
  • Strict Validation: Rejects and regenerates reports that contain placeholders or miss required sections.
  • Cost-Safe Verification: Includes a verify pipeline mode that simulates the entire logic flow without incurring API costs.
  • Robust Error Handling: Automatic retries, rate limiting (10s delay for OpenAI), and graceful failure recovery.
  • Clean Output: Generates structured .txt reports in the stories/ directory.
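
The "Intelligent Freshness" rule above boils down to a date comparison: a stored number is replaced only when the search result carries a strictly newer date. A minimal sketch of that rule (illustrative only; the repo's actual Serper logic may differ):

```python
# Hedged sketch of the date-based freshness check: newer evidence wins,
# otherwise the stored value is kept unchanged.
from datetime import date

def maybe_update(current_value: str, current_date: date,
                 search_value: str, search_date: date) -> tuple[str, date]:
    if search_date > current_date:
        return search_value, search_date   # a confirmed newer date: update
    return current_value, current_date     # otherwise keep the stored number
```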

🛠️ Installation

Prerequisites

  • Python 3.10+
  • An API key for your chosen AI provider (OpenAI, OpenRouter, or Gemini)
  • Tavily API Key (for enrichment)
  • Serper API Key (for freshness checks)

Setup

  1. Clone the repository

    git clone https://github.com/yourusername/unicorn-intelligence.git
    cd unicorn-intelligence
  2. Create a virtual environment

    python -m venv venv
    # Windows
    .\venv\Scripts\activate
    # Mac/Linux
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure the environment. Copy the example environment file and add your keys:

    cp .env.example .env

    Then edit .env:

    # 1. Pipeline Mode
    PIPELINE_MODE=execute # or verify (cost-free simulation)
    
    # 2. AI Provider
    AI_MODE=openrouter # or openai, gemini, mock
    
    # 3. Keys
    OPENROUTER_API_KEY=sk-or-...
    OPENAI_API_KEY=sk-...
    GEMINI_API_KEY=AIza...
    
    # 4. Enrichment (Tavily & Serper)
    TAVILY_API_KEY_1=tvly-...
    SERPER_API_KEY=...
    
    # 5. Filters
    TARGET_COUNTRY=India
    MAX_COMPANIES=2
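
The settings above are read from the environment at startup. A minimal sketch of how such a loader might look (assumes python-dotenv or pre-exported variables; the helper name load_config is hypothetical):

```python
# Illustrative config loader; defaults mirror the .env example above.
import os

def load_config() -> dict:
    return {
        "pipeline_mode": os.getenv("PIPELINE_MODE", "verify"),  # default to cost-free mode
        "ai_mode": os.getenv("AI_MODE", "mock"),
        "target_country": os.getenv("TARGET_COUNTRY", "India"),
        "max_companies": int(os.getenv("MAX_COMPANIES", "2")),
    }
```

Defaulting to verify/mock keeps a misconfigured run from incurring API costs.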

🏃 Usage

1. Scrape Data

Initialize or update the company database:

python scraper.py

Output: unicorn_companies.xlsx

2. Generate Reports

Run the main AI pipeline:

python main.py

Output: Structured text files in the stories/ directory.

3. Verify Logic (Cost-Free)

Run a simulation to verify logic without API calls:

# In .env: PIPELINE_MODE=verify
python main.py
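
Conceptually, verify mode runs the same orchestration but swaps real API calls for mock responses. A hedged sketch of that gate (the function name and mock format are illustrative, not the repo's actual implementation):

```python
# Illustrative verify-mode gate: mock output instead of a billed API call.
def generate(prompt: str, mode: str) -> str:
    if mode == "verify":
        return f"[MOCK REPORT for prompt: {prompt[:30]}]"  # no API cost
    raise NotImplementedError("execute mode would call the configured provider")
```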

📂 Output Structure

unicorn/
├── stories/                 # Generated Reports
│   ├── Unacademy_openai_gpt-4o.txt
│   └── Razorpay_openai_gpt-4o_enriched.txt
├── unicorn_companies.xlsx   # Master Database
├── scraper.py               # Data Collection
├── main.py                  # Pipeline Entry Point
├── ai_story_generator.py    # Core Logic
└── requirements.txt         # Dependencies

🔒 Security Notes

  • API Keys: Never commit .env. It is added to .gitignore.
  • Validation: The system automatically masks API keys in logs (api_usage.log).
  • Sanitization: Verification mode uses sanitized mock templates to prevent data leaks or cost overruns during testing.
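
Key masking of the kind described above is typically a regex substitution applied before a message is logged. A sketch under that assumption (the key-prefix pattern is inferred from the .env example; the repo's actual masking may differ):

```python
# Illustrative masking of provider key prefixes (sk-or-, sk-, tvly-, AIza)
# before a message reaches api_usage.log.
import re

KEY_PATTERN = re.compile(r"(sk-or-|sk-|tvly-|AIza)[A-Za-z0-9_\-]+")

def mask_keys(message: str) -> str:
    return KEY_PATTERN.sub(lambda m: m.group(1) + "***", message)
```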

📜 License

MIT License. See LICENSE for details.
