
Job Search: BM25 vs. Semantic vs. Hybrid Search with Weaviate (RAG)

This project demonstrates and compares three search strategies for a job search application: keyword (BM25), semantic, and hybrid. It uses a dataset of job descriptions, ingests them into a Weaviate vector database, and provides a Streamlit interface to query the data.

Features

  • Data Processing: Cleans and prepares raw job posting data for vectorization.
  • Weaviate Integration: Sets up a Weaviate collection with a custom schema using Ollama for local embeddings.
  • Search Comparison: Implements and allows for testing of:
    • Keyword Search (BM25)
    • Semantic Search
    • Hybrid Search
  • Interactive UI: A simple Streamlit application to perform searches and view results.

Tech Stack

  • Python with Pandas for data manipulation.
  • Weaviate as the vector database.
  • Ollama for generating text embeddings locally.
  • Postman (optional) for interacting with the Weaviate API.
  • Streamlit for the user interface.
  • Docker for running Weaviate.

1. Data Preparation

  1. Download the data from Kaggle: Job Description Dataset [1.5 GB]. Place the CSV file in a data directory.
  2. Clean the data with the provided Python script (prepare_data.py). It uses Pandas to sample the dataset down to 300 rows and adds a search_text column that concatenates several fields for vector indexing. Refer: datacleaning.py
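The sampling and concatenation step can be sketched as below. The column names in TEXT_COLUMNS are assumptions; match them to the actual headers in the Kaggle CSV.

```python
import pandas as pd

# Columns assumed to exist in the Kaggle CSV -- adjust to the actual headers.
TEXT_COLUMNS = ["Job Title", "Role", "skills", "Job Description"]

def prepare_jobs(df: pd.DataFrame, n_rows: int = 300, seed: int = 42) -> pd.DataFrame:
    """Sample the dataset and build the concatenated search_text column."""
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed).copy()
    cols = [c for c in TEXT_COLUMNS if c in sample.columns]
    # Join the selected fields into one string per row for vector indexing.
    sample["search_text"] = (
        sample[cols].fillna("").astype(str).agg(" ".join, axis=1).str.strip()
    )
    return sample
```

Typical usage (file paths are illustrative): `prepare_jobs(pd.read_csv("data/job_descriptions.csv")).to_csv("data/jobs_prepared.csv", index=False)`.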

2. Weaviate and Ollama Setup

You need a running instance of Weaviate with the text2vec-ollama module enabled.

  1. Follow the official Weaviate guide to run a local instance using Docker: Weaviate Local Quickstart. Ensure you configure it to use the Ollama module.
  2. Make sure your local Ollama service is running and has the required embedding model (e.g., nomic-embed-text) downloaded. Ref: docker-compose.yml
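To confirm both services are up before moving on, you can probe them from the standard library alone; the sketch below assumes the default ports (8080 for Weaviate, 11434 for Ollama) and uses Weaviate's /v1/.well-known/ready readiness endpoint and Ollama's /api/tags model listing.

```python
import json
import urllib.request

def service_ready(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def ollama_models(base_url: str = "http://localhost:11434") -> list:
    """List locally available Ollama models via the /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3.0) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

# Weaviate readiness probe (default Docker port):
#   service_ready("http://localhost:8080/v1/.well-known/ready")
# Check that the embedding model is pulled:
#   any(n.startswith("nomic-embed-text") for n in ollama_models())
```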

3. Create the Weaviate Collection

Before ingesting data, you must create the JobPosting collection in Weaviate with the correct schema. You can do this via a cURL command, a Python script, or Postman.

The schema should define the properties of your job postings and configure the vectorizer. Crucially, the search_text property should be configured for vectorization, while other properties can be set to skip: true to avoid indexing them.
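One way to express such a schema is the JSON payload for Weaviate's REST schema API (POST /v1/schema). The sketch below is illustrative, not the project's exact schema: the property names and the Ollama endpoint are assumptions, and only search_text is left for the vectorizer while the other properties carry skip: true.

```python
# Payload for: POST http://localhost:8080/v1/schema
# Property names below are illustrative -- match them to your CSV columns.
JOB_POSTING_SCHEMA = {
    "class": "JobPosting",
    "vectorizer": "text2vec-ollama",
    "moduleConfig": {
        "text2vec-ollama": {
            # From inside the Weaviate container, the host's Ollama service
            # is usually reachable via host.docker.internal.
            "apiEndpoint": "http://host.docker.internal:11434",
            "model": "nomic-embed-text",
        }
    },
    "properties": [
        # Only search_text is vectorized.
        {"name": "search_text", "dataType": ["text"]},
        # Metadata fields are stored but skipped by the vectorizer.
        {
            "name": "job_title",
            "dataType": ["text"],
            "moduleConfig": {"text2vec-ollama": {"skip": True}},
        },
        {
            "name": "location",
            "dataType": ["text"],
            "moduleConfig": {"text2vec-ollama": {"skip": True}},
        },
    ],
}
```

Serialized with `json.dumps`, this can be sent with cURL (`curl -X POST http://localhost:8080/v1/schema -H 'Content-Type: application/json' -d @schema.json`) or pasted into Postman.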

4. Data Ingestion

Run the Python ingestion script (load_dataset_into_weaviate.py) to read the prepared CSV file and load the 300 records into your Weaviate collection.

After the script completes, you can verify that the objects were created successfully using the Weaviate Console or the Weaviate Studio plugin in VS Code.
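The read-and-load step boils down to turning each CSV row into a plain dict that matches the collection's properties. A minimal sketch, assuming the column names from the preparation step:

```python
import csv

def read_job_objects(csv_path: str) -> list:
    """Turn the prepared CSV into dicts matching the JobPosting properties."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [
        {
            "search_text": row.get("search_text", ""),
            "job_title": row.get("Job Title", ""),  # column names are assumptions
            "location": row.get("location", ""),
        }
        for row in rows
    ]

# With the v4 Python client, the batch load is then roughly:
#   client = weaviate.connect_to_local()
#   jobs = client.collections.get("JobPosting")
#   jobs.data.insert_many(read_job_objects("data/jobs_prepared.csv"))
```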

5. Run the Streamlit UI

Launch the user interface to interact with the search application. Refer: job_search_api.py, streamlit_app.py

  1. Navigate to the project's root directory in your terminal.
  2. Run the following command:
    streamlit run streamlit_app.py
    
  3. Open your web browser to the URL provided by Streamlit. You can now run keyword (BM25), semantic, and hybrid searches on the job postings.

Idea borrowed from BlogYourCode. Built with Weaviate, FastAPI, Streamlit, Python, and Llama models. [10/Aug/2025] I will continue updating this project, adding a Llama model to generate the desired response.
