Peterase-1/Career-Page-Job-Indexing-System-Job-Index-Bot

Career Page Job Indexing System (Job Index Bot)

A scalable job indexing system that collects job metadata from the career pages of Indian and global tech companies and exposes it via a centralized REST API.

🚀 Features

  • Two-Tier Crawling:
    • Tier 1: Structured data extraction (JSON-LD) for accuracy.
    • Tier 2: Heuristic HTML parsing fallback for sites without structured data.
  • Smart Fetching:
    • Uses Axios for static pages and Playwright for dynamic/SPA sites.
    • User-Agent rotation to bypass basic anti-bot protections.
  • Queue Management: Powered by BullMQ and Redis for robust job scheduling and processing.
  • Deduplication: SHA-256 content hashing to avoid duplicate job entries.
  • API Documentation: Integrated Swagger UI for easy API exploration.

🛠 Project Structure

├── Docs/               # Technical specifications and walkthroughs
├── src/
│   ├── api/            # Express API Server
│   │   ├── routes.js   # API Endpoints
│   │   └── server.js   # Server entry point with Swagger
│   ├── crawler/        # Crawling Logic
│   │   ├── crawler.js  # Main crawler service & seed list
│   │   ├── fetcher.js  # Fetching (Axios/Playwright)
│   │   ├── parser.js   # Parsing (JSON-LD/HTML)
│   │   ├── processor.js # Job processing pipeline
│   │   └── queue.js    # Queue configuration
│   ├── database/       # Database Layer (PostgreSQL)
│   └── scripts/        # Utility scripts (Init DB)
├── docker-compose.yml  # Infrastructure setup (Postgres/Redis)
└── package.json        # Dependencies
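
As a rough illustration of the two-tier approach implemented in `src/crawler/parser.js`, here is a dependency-free sketch. Tier 1 looks for schema.org `JobPosting` JSON-LD; Tier 2 falls back to a crude heuristic (here, just the `<title>` tag). The real parser uses cheerio and far more robust selectors:

```javascript
function parseJobs(html) {
  // Tier 1: structured data (JSON-LD script blocks)
  const ldRegex =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const jobs = [];
  let m;
  while ((m = ldRegex.exec(html)) !== null) {
    try {
      const data = JSON.parse(m[1]);
      const items = Array.isArray(data) ? data : [data];
      for (const item of items) {
        if (item['@type'] === 'JobPosting') {
          jobs.push({ title: item.title, source: 'json-ld' });
        }
      }
    } catch {
      // Malformed JSON-LD: ignore this block and keep scanning.
    }
  }
  if (jobs.length > 0) return jobs;

  // Tier 2: heuristic HTML fallback for pages without structured data
  const title = html.match(/<title>([^<]*)<\/title>/i);
  return title ? [{ title: title[1].trim(), source: 'heuristic' }] : [];
}
```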

🕷️ Sites to Crawl

The system is configured to crawl 40+ top career pages including:

  • Indian IT: TCS, Infosys, Wipro, HCLTech, Tech Mahindra, LTIMindtree, etc.
  • Global Tech: Google, Microsoft, Amazon, Meta, Apple, Netflix, etc.

See src/crawler/crawler.js for the full list.

🔧 Technical Report & Libraries

Core Stack

  • Node.js: Runtime environment.
  • PostgreSQL: Relational database for storing indexed jobs.
  • Redis: In-memory store for BullMQ job queues.

Key Libraries

  • express: Web framework for the REST API.
  • playwright: A browser automation library used to render JavaScript-heavy career pages (SPAs).
  • cheerio: Fast/flexible HTML parser for extracting data from static HTML or rendered content.
  • bullmq: A message queue based on Redis, ensuring robust job processing and retries.
  • node-cron: For scheduling the crawler to run periodically (every 6 hours).
  • swagger-ui-express: Generates interactive API documentation from JSDoc comments.
  • robots-parser: Ensures compliance with robots.txt rules of target sites.
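
To illustrate what robots-parser handles for the crawler, here is a deliberately minimal allow/deny check. It only honours `Disallow` rules under `User-agent: *` and ignores `Allow`, wildcards, and per-agent groups, so treat it as an approximation of the library, not a replacement:

```javascript
// Minimal robots.txt check: collect Disallow prefixes from the
// "User-agent: *" group and reject any path matching one of them.
function isAllowed(robotsTxt, path) {
  let inStarGroup = false;
  const disallowed = [];
  for (const raw of robotsTxt.split('\n')) {
    const [rawKey, ...rest] = raw.trim().split(':');
    const key = rawKey.toLowerCase();
    const value = rest.join(':').trim();
    if (key === 'user-agent') inStarGroup = value === '*';
    else if (inStarGroup && key === 'disallow' && value) disallowed.push(value);
  }
  return !disallowed.some((prefix) => path.startsWith(prefix));
}
```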

📖 API Documentation

Once the server is running, visit: http://localhost:3000/api-docs

🏃‍♂️ How to Run

  1. Prerequisites: Docker & Node.js installed.
  2. Start Infrastructure:
    docker-compose up -d
  3. Install Dependencies:
    npm install
  4. Initialize Database:
    node src/scripts/init-db.js
  5. Start Services:
    • Crawler: npm run crawler
    • API Server: npm start

📄 License

ISC
