A scalable job indexing system that collects job metadata from the career pages of Indian and global tech companies and exposes it via a centralized REST API.
- Two-Tier Crawling:
  - Tier 1: Structured data extraction (JSON-LD) for accuracy.
  - Tier 2: Heuristic HTML parsing fallback for sites without structured data.
- Smart Fetching:
  - Uses Axios for static pages and Playwright for dynamic/SPA sites.
  - User-Agent rotation to bypass basic anti-bot protections.
- Queue Management: Powered by BullMQ and Redis for robust job scheduling and processing.
- Deduplication: SHA-256 content hashing to avoid duplicate job entries.
- API Documentation: Integrated Swagger UI for easy API exploration.
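The two-tier idea can be sketched as follows. This is a minimal, illustrative version, not the real parser.js API: it uses a plain regex stand-in for the JSON-LD lookup (the actual code may use cheerio), and the function names are hypothetical.

```javascript
// Tier 1: look for a <script type="application/ld+json"> JobPosting block.
function extractJsonLdJob(html) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match;
  while ((match = re.exec(html)) !== null) {
    try {
      const data = JSON.parse(match[1]);
      if (data['@type'] === 'JobPosting') {
        return { title: data.title, company: data.hiringOrganization?.name ?? null };
      }
    } catch {
      // Malformed JSON-LD: ignore this block and keep scanning.
    }
  }
  return null;
}

// Tier 2: crude heuristic fallback -- take the first <h1> as the job title.
function extractHeuristicJob(html) {
  const m = html.match(/<h1[^>]*>([^<]+)<\/h1>/i);
  return m ? { title: m[1].trim(), company: null } : null;
}

// Try structured data first; fall back to the heuristic only if it fails.
function parseJob(html) {
  return extractJsonLdJob(html) ?? extractHeuristicJob(html);
}
```

Tier 1 is preferred because schema.org JobPosting JSON-LD gives unambiguous field names; the heuristic tier trades accuracy for coverage on sites that omit it.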
├── Docs/ # Technical specifications and walkthroughs
├── src/
│ ├── api/ # Express API Server
│ │ ├── routes.js # API Endpoints
│ │ └── server.js # Server entry point with Swagger
│ ├── crawler/ # Crawling Logic
│ │ ├── crawler.js # Main crawler service & seed list
│ │ ├── fetcher.js # Fetching (Axios/Playwright)
│ │ ├── parser.js # Parsing (JSON-LD/HTML)
│ │ ├── processor.js # Job processing pipeline
│ │ └── queue.js # Queue configuration
│ ├── database/ # Database Layer (PostgreSQL)
│ └── scripts/ # Utility scripts (Init DB)
├── docker-compose.yml # Infrastructure setup (Postgres/Redis)
└── package.json # Dependencies
The system is configured to crawl 40+ top career pages including:
- Indian IT: TCS, Infosys, Wipro, HCLTech, Tech Mahindra, LTIMindtree, etc.
- Global Tech: Google, Microsoft, Amazon, Meta, Apple, Netflix, etc.
See src/crawler/crawler.js for the full list.
- Node.js: Runtime environment.
- PostgreSQL: Relational database for storing indexed jobs.
- Redis: In-memory store for BullMQ job queues.
- express: Web framework for the REST API.
- playwright: Browser automation library used to render JavaScript-heavy career pages (SPAs).
- cheerio: Fast, flexible HTML parser for extracting data from static HTML or rendered content.
- bullmq: Redis-backed message queue ensuring robust job processing and retries.
- node-cron: Schedules the crawler to run periodically (every 6 hours).
- swagger-ui-express: Generates interactive API documentation from JSDoc comments.
- robots-parser: Ensures compliance with the robots.txt rules of target sites.
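The robots.txt compliance check works roughly like this minimal stand-in. The real robots-parser library additionally handles Allow precedence, wildcards, and crawl-delay; this sketch only checks simple Disallow prefixes for a matching user-agent group.

```javascript
// Parse Disallow rules for a user agent and return an isAllowed() checker.
// Simplified stand-in for robots-parser, not its actual API.
function parseRobots(txt, userAgent = '*') {
  const disallows = [];
  let applies = false;
  for (const raw of txt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments
    const [field, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(field)) {
      applies = value === '*' || value.toLowerCase() === userAgent.toLowerCase();
    } else if (applies && /^disallow$/i.test(field) && value) {
      disallows.push(value);
    }
  }
  return { isAllowed: (path) => !disallows.some((d) => path.startsWith(d)) };
}
```

Checking robots.txt before each fetch keeps the crawler polite and avoids wasting queue slots on pages the site forbids.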
Once the server is running, visit: http://localhost:3000/api-docs
- Prerequisites: Docker & Node.js installed.
- Start Infrastructure:
docker-compose up -d
- Initialize Database:
node src/scripts/init-db.js
- Install Dependencies:
npm install
- Start Services:
  - Crawler:
npm run crawler
  - API Server:
npm start
ISC