Peterase-1/Career-Page-Job-Indexing-System-Job-Index-Bot

Career Page Job Indexing System (Job Index Bot)

A scalable job indexing system that collects job metadata from the career pages of Indian and global tech companies and exposes it via a centralized REST API.

🚀 Features

  • Two-Tier Crawling:
    • Tier 1: Structured data extraction (JSON-LD) for accuracy.
    • Tier 2: Heuristic HTML parsing fallback for sites without structured data.
  • Smart Fetching:
    • Uses Axios for static pages and Playwright for dynamic/SPA sites.
    • User-Agent rotation to bypass basic anti-bot protections.
  • Queue Management: Powered by BullMQ and Redis for robust job scheduling and processing.
  • Deduplication: SHA-256 content hashing to avoid duplicate job entries.
  • API Documentation: Integrated Swagger UI for easy API exploration.

🛠 Project Structure

├── Docs/               # Technical specifications and walkthroughs
├── src/
│   ├── api/            # Express API Server
│   │   ├── routes.js   # API Endpoints
│   │   └── server.js   # Server entry point with Swagger
│   ├── crawler/        # Crawling Logic
│   │   ├── crawler.js  # Main crawler service & seed list
│   │   ├── fetcher.js  # Fetching (Axios/Playwright)
│   │   ├── parser.js   # Parsing (JSON-LD/HTML)
│   │   ├── processor.js # Job processing pipeline
│   │   └── queue.js    # Queue configuration
│   ├── database/       # Database Layer (PostgreSQL)
│   └── scripts/        # Utility scripts (Init DB)
├── docker-compose.yml  # Infrastructure setup (Postgres/Redis)
└── package.json        # Dependencies
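
As a rough illustration of the two-tier approach implemented in `src/crawler/parser.js`, here is a dependency-free sketch. Tier 1 looks for schema.org `JobPosting` JSON-LD; Tier 2 falls back to a crude heuristic (here, just the `<title>` tag). The real parser uses cheerio and far more robust selectors:

```javascript
function parseJobs(html) {
  // Tier 1: structured data (JSON-LD script blocks)
  const ldRegex =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const jobs = [];
  let m;
  while ((m = ldRegex.exec(html)) !== null) {
    try {
      const data = JSON.parse(m[1]);
      const items = Array.isArray(data) ? data : [data];
      for (const item of items) {
        if (item['@type'] === 'JobPosting') {
          jobs.push({ title: item.title, source: 'json-ld' });
        }
      }
    } catch {
      // Malformed JSON-LD: ignore this block and keep scanning.
    }
  }
  if (jobs.length > 0) return jobs;

  // Tier 2: heuristic HTML fallback for pages without structured data
  const title = html.match(/<title>([^<]*)<\/title>/i);
  return title ? [{ title: title[1].trim(), source: 'heuristic' }] : [];
}
```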

🕷️ Sites to Crawl

The system is configured to crawl 40+ top career pages including:

  • Indian IT: TCS, Infosys, Wipro, HCLTech, Tech Mahindra, LTIMindtree, etc.
  • Global Tech: Google, Microsoft, Amazon, Meta, Apple, Netflix, etc.

See src/crawler/crawler.js for the full list.

🔧 Technical Report & Libraries

Core Stack

  • Node.js: Runtime environment.
  • PostgreSQL: Relational database for storing indexed jobs.
  • Redis: In-memory store for BullMQ job queues.

Key Libraries

  • express: Web framework for the REST API.
  • playwright: A browser automation library used to render JavaScript-heavy career pages (SPAs).
  • cheerio: Fast/flexible HTML parser for extracting data from static HTML or rendered content.
  • bullmq: A message queue based on Redis, ensuring robust job processing and retries.
  • node-cron: For scheduling the crawler to run periodically (every 6 hours).
  • swagger-ui-express: Generates interactive API documentation from JSDoc comments.
  • robots-parser: Ensures compliance with robots.txt rules of target sites.
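
To illustrate what robots-parser handles for the crawler, here is a deliberately minimal allow/deny check. It only honours `Disallow` rules under `User-agent: *` and ignores `Allow`, wildcards, and per-agent groups, so treat it as an approximation of the library, not a replacement:

```javascript
// Minimal robots.txt check: collect Disallow prefixes from the
// "User-agent: *" group and reject any path matching one of them.
function isAllowed(robotsTxt, path) {
  let inStarGroup = false;
  const disallowed = [];
  for (const raw of robotsTxt.split('\n')) {
    const [rawKey, ...rest] = raw.trim().split(':');
    const key = rawKey.toLowerCase();
    const value = rest.join(':').trim();
    if (key === 'user-agent') inStarGroup = value === '*';
    else if (inStarGroup && key === 'disallow' && value) disallowed.push(value);
  }
  return !disallowed.some((prefix) => path.startsWith(prefix));
}
```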

📖 API Documentation

Once the server is running, visit: http://localhost:3000/api-docs

🏃‍♂️ How to Run

  1. Prerequisites: Docker & Node.js installed.
  2. Start Infrastructure:
    docker-compose up -d
  3. Install Dependencies:
    npm install
  4. Initialize Database:
    node src/scripts/init-db.js
  5. Start Services:
    • Crawler: npm run crawler
    • API Server: npm start

📄 License

ISC
