An automated, AI-powered pipeline for tracking, analyzing, and generating narrative reports on unicorn companies.
The Unicorn Intelligence System is a production-hardened data pipeline designed to:
- Scrape the latest unicorn company data from Crunchbase.
- Maintain a persistent Excel database (`unicorn_companies.xlsx`) with intelligent incremental updates.
- Generate comprehensive, narrative-driven company profiles using advanced AI models (OpenRouter, OpenAI, Gemini).
- Enrich reports with real-time web data via Tavily and Serper to ensure freshness and accuracy for critical metrics like Valuation and Funding.
- Validate all outputs with strict logic to prevent hallucinations and ensure data integrity.
The system operates in a modular pipeline:
- Scraper Module (`scraper.py`): Fetches basic metadata (Company, Country, Valuation, Investors) and updates the master Excel sheet.
- Generator Module (`ai_story_generator.py`):
  - Orchestrator: Reads from Excel and manages the flow.
  - AI Engine: Supports multiple providers (`openai`, `openrouter`, `mock`).
  - Enrichment Layer:
    - Tavily: Fills missing narrative gaps ("Not mentioned" -> Search).
    - Serper: Verifies and updates numeric data (Valuation, Funding) with strict date-based logic.
  - Validation: Enforces template structure and data completeness.
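The flow above can be sketched roughly as follows. This is an illustrative stand-in, not the actual API of `ai_story_generator.py` — all function names here are hypothetical:

```python
# Illustrative sketch of the generate -> enrich -> validate flow.
# Function names are hypothetical, not the module's real API.

def generate_story(company: dict) -> str:
    # Stand-in for the AI engine call (openai / openrouter / mock).
    return f"{company['Company']} is a unicorn valued at {company['Valuation']}."

def enrich(story: str) -> str:
    # Stand-in for the Tavily/Serper layer: fill gaps, verify numbers.
    return story.replace("Not mentioned", "(filled via search)")

def validate(story: str) -> bool:
    # Reject empty reports or leftover placeholders.
    return bool(story) and "Not mentioned" not in story

def run_pipeline(companies: list) -> list:
    reports = []
    for company in companies:
        story = enrich(generate_story(company))
        if validate(story):
            reports.append(story)
    return reports

demo = [{"Company": "ExampleCo", "Valuation": "$1B"}]
print(run_pipeline(demo)[0])  # → ExampleCo is a unicorn valued at $1B.
```

The real orchestrator additionally reads companies from Excel and regenerates reports that fail validation.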
- Multi-Model Support: Seamlessly switch between OpenAI (GPT-4o), Google Gemini, and OpenRouter models via `.env`.
- Intelligent Freshness: Automatically verifies valuation and funding numbers against Google Search results (Serper) and updates them only if a newer date is confirmed.
- Strict Validation: Rejects and regenerates reports that contain placeholders or are missing required sections.
- Cost-Safe Verification: Includes a `verify` pipeline mode that simulates the entire logic flow without incurring API costs.
- Robust Error Handling: Automatic retries, rate limiting (10s delay for OpenAI), and graceful failure recovery.
- Clean Output: Generates structured `.txt` reports in the `stories/` directory.
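The date-based freshness rule can be illustrated with a small helper. This is a simplified sketch, not the project's actual implementation: the stored figure is kept unless the search hit carries a strictly newer date.

```python
from datetime import date

def pick_fresher(stored_value: str, stored_date: date,
                 found_value: str, found_date: date) -> str:
    # Keep the stored figure unless the search result is strictly newer.
    if found_date > stored_date:
        return found_value
    return stored_value

# A 2024 search result supersedes a 2022 entry:
print(pick_fresher("$7.5B", date(2022, 6, 1), "$8.0B", date(2024, 3, 15)))  # → $8.0B
```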
To run the pipeline you will need:

- Python 3.10+
- An API key for your chosen AI provider (OpenAI, OpenRouter, or Gemini)
- Tavily API key (for enrichment)
- Serper API key (for freshness checks)
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/unicorn-intelligence.git
   cd unicorn-intelligence
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv

   # Windows
   .\venv\Scripts\activate

   # Mac/Linux
   source venv/bin/activate
   ```
3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. Configure the environment. Copy the example environment file and add your keys:

   ```bash
   cp .env.example .env
   ```

   Then edit `.env`:

   ```ini
   # 1. Pipeline Mode
   PIPELINE_MODE=execute   # or verify (cost-free simulation)

   # 2. AI Provider
   AI_MODE=openrouter      # or openai, gemini, mock

   # 3. Keys
   OPENROUTER_API_KEY=sk-or-...
   OPENAI_API_KEY=sk-...
   GEMINI_API_KEY=AIza...

   # 4. Enrichment (Tavily & Serper)
   TAVILY_API_KEY_1=tvly-...
   SERPER_API_KEY=...

   # 5. Filters
   TARGET_COUNTRY=India
   MAX_COMPANIES=2
   ```
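Once these values are loaded into the environment (e.g. via python-dotenv), provider selection might look like the sketch below. The mapping and function name are assumptions for illustration; the real module may differ:

```python
import os
from typing import Optional

# Assumed mapping from AI_MODE to the env var holding its key.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def resolve_api_key(ai_mode: str) -> Optional[str]:
    # The mock provider needs no key; real providers read theirs from the env.
    if ai_mode == "mock":
        return None
    env_var = PROVIDER_KEYS.get(ai_mode)
    if env_var is None:
        raise ValueError(f"Unknown AI_MODE: {ai_mode}")
    return os.getenv(env_var)

print(resolve_api_key("mock"))  # → None
```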
Initialize or update the company database:
```bash
python scraper.py
```

Output: `unicorn_companies.xlsx`
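The "intelligent incremental update" behaviour can be pictured as a keyed merge on the company name. This is a simplified sketch using plain dicts, not the actual Excel handling:

```python
def incremental_update(existing: list, scraped: list) -> list:
    # Keyed merge on company name: new companies are appended, known ones
    # get their fields refreshed. (Simplified stand-in for the Excel logic.)
    merged = {row["Company"]: dict(row) for row in existing}
    for row in scraped:
        merged.setdefault(row["Company"], {}).update(row)
    return list(merged.values())

existing = [{"Company": "A", "Valuation": "$1B"}]
scraped = [{"Company": "A", "Valuation": "$2B"}, {"Company": "B", "Valuation": "$1B"}]
print(incremental_update(existing, scraped))
```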
Run the main AI pipeline:
```bash
python main.py
```

Output: structured text files in the `stories/` directory.
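Judging by the example files under `stories/`, report names appear to follow a `{Company}_{provider}_{model}[_enriched].txt` pattern. A hypothetical helper (the exact scheme is inferred, not confirmed by the source):

```python
def report_filename(company: str, provider: str, model: str,
                    enriched: bool = False) -> str:
    # Assumed naming scheme, inferred from the example files in stories/.
    suffix = "_enriched" if enriched else ""
    return f"{company}_{provider}_{model}{suffix}.txt"

print(report_filename("Unacademy", "openai", "gpt-4o"))  # → Unacademy_openai_gpt-4o.txt
```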
Run a simulation to verify logic without API calls:
```bash
# In .env: PIPELINE_MODE=verify
python main.py
```

Project structure:

```
unicorn/
├── stories/                        # Generated Reports
│   ├── Unacademy_openai_gpt-4o.txt
│   └── Razorpay_openai_gpt-4o_enriched.txt
├── unicorn_companies.xlsx          # Master Database
├── scraper.py                      # Data Collection
├── main.py                         # Pipeline Entry Point
├── ai_story_generator.py           # Core Logic
└── requirements.txt                # Dependencies
```
- API Keys: Never commit `.env`; it is listed in `.gitignore`.
- Log Masking: The system automatically masks API keys in logs (`api_usage.log`).
- Sanitization: Verification mode uses sanitized mock templates to prevent data leaks or cost overruns during testing.
MIT License. See LICENSE for details.