
Job Search: BM25 vs. Semantic vs. Hybrid Search with Weaviate (RAG)

This project demonstrates and compares three search strategies for a job search application: keyword (BM25), semantic, and hybrid. It uses a dataset of job descriptions, ingests them into a Weaviate vector database, and provides a Streamlit interface to query the data.

Features

  • Data Processing: Cleans and prepares raw job posting data for vectorization.
  • Weaviate Integration: Sets up a Weaviate collection with a custom schema using Ollama for local embeddings.
  • Search Comparison: Implements and allows for testing of:
    • Keyword Search (BM25)
    • Semantic Search
    • Hybrid Search
  • Interactive UI: A simple Streamlit application to perform searches and view results.

Tech Stack

  • Python with Pandas for data manipulation.
  • Weaviate as the vector database.
  • Ollama for generating text embeddings locally.
  • Postman (optional) for interacting with the Weaviate API.
  • Streamlit for the user interface.
  • Docker for running Weaviate.

1. Data Preparation

  1. Download the data from Kaggle: Job Description Dataset [1.5 GB]. Place the CSV file in a data directory.
  2. Clean the data with the provided Python script (prepare_data.py). It uses Pandas to sample the dataset down to 300 rows and adds a search_text column that concatenates several fields for vector indexing. Refer: datacleaning.py
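The sampling and concatenation step can be sketched as below. The column names in TEXT_COLUMNS are assumptions; match them to the actual headers in the Kaggle CSV.

```python
import pandas as pd

# Columns assumed to exist in the Kaggle CSV -- adjust to the actual headers.
TEXT_COLUMNS = ["Job Title", "Role", "skills", "Job Description"]

def prepare_jobs(df: pd.DataFrame, n_rows: int = 300, seed: int = 42) -> pd.DataFrame:
    """Sample the dataset and build the concatenated search_text column."""
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed).copy()
    cols = [c for c in TEXT_COLUMNS if c in sample.columns]
    # Join the selected fields into one string per row for vector indexing.
    sample["search_text"] = (
        sample[cols].fillna("").astype(str).agg(" ".join, axis=1).str.strip()
    )
    return sample
```

Typical usage (file paths are illustrative): `prepare_jobs(pd.read_csv("data/job_descriptions.csv")).to_csv("data/jobs_prepared.csv", index=False)`.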

2. Weaviate and Ollama Setup

You need a running instance of Weaviate with the text2vec-ollama module enabled.

  1. Follow the official Weaviate guide to run a local instance using Docker: Weaviate Local Quickstart. Ensure you configure it to use the Ollama module.
  2. Make sure your local Ollama service is running and has the required embedding model (e.g., nomic-embed-text) downloaded. Ref: docker-compose.yml
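To confirm both services are up before moving on, you can probe them from the standard library alone; the sketch below assumes the default ports (8080 for Weaviate, 11434 for Ollama) and uses Weaviate's /v1/.well-known/ready readiness endpoint and Ollama's /api/tags model listing.

```python
import json
import urllib.request

def service_ready(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def ollama_models(base_url: str = "http://localhost:11434") -> list:
    """List locally available Ollama models via the /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3.0) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

# Weaviate readiness probe (default Docker port):
#   service_ready("http://localhost:8080/v1/.well-known/ready")
# Check that the embedding model is pulled:
#   any(n.startswith("nomic-embed-text") for n in ollama_models())
```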

3. Create the Weaviate Collection

Before ingesting data, you must create the JobPosting collection in Weaviate with the correct schema. You can do this via a cURL command, a Python script, or Postman.

The schema should define the properties of your job postings and configure the vectorizer. Crucially, the search_text property should be configured for vectorization, while other properties can be set to skip: true to avoid indexing them.
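One way to express such a schema is the JSON payload for Weaviate's REST schema API (POST /v1/schema). The sketch below is illustrative, not the project's exact schema: the property names and the Ollama endpoint are assumptions, and only search_text is left for the vectorizer while the other properties carry skip: true.

```python
# Payload for: POST http://localhost:8080/v1/schema
# Property names below are illustrative -- match them to your CSV columns.
JOB_POSTING_SCHEMA = {
    "class": "JobPosting",
    "vectorizer": "text2vec-ollama",
    "moduleConfig": {
        "text2vec-ollama": {
            # From inside the Weaviate container, the host's Ollama service
            # is usually reachable via host.docker.internal.
            "apiEndpoint": "http://host.docker.internal:11434",
            "model": "nomic-embed-text",
        }
    },
    "properties": [
        # Only search_text is vectorized.
        {"name": "search_text", "dataType": ["text"]},
        # Metadata fields are stored but skipped by the vectorizer.
        {
            "name": "job_title",
            "dataType": ["text"],
            "moduleConfig": {"text2vec-ollama": {"skip": True}},
        },
        {
            "name": "location",
            "dataType": ["text"],
            "moduleConfig": {"text2vec-ollama": {"skip": True}},
        },
    ],
}
```

Serialized with `json.dumps`, this can be sent with cURL (`curl -X POST http://localhost:8080/v1/schema -H 'Content-Type: application/json' -d @schema.json`) or pasted into Postman.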

4. Data Ingestion

Run the Python ingestion script (load_dataset_into_weaviate.py) to read the prepared CSV file and load the 300 records into your Weaviate collection.

After the script completes, you can verify that the objects were created successfully using the Weaviate Console or the Weaviate Studio plugin in VS Code.
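The read-and-load step boils down to turning each CSV row into a plain dict that matches the collection's properties. A minimal sketch, assuming the column names from the preparation step:

```python
import csv

def read_job_objects(csv_path: str) -> list:
    """Turn the prepared CSV into dicts matching the JobPosting properties."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [
        {
            "search_text": row.get("search_text", ""),
            "job_title": row.get("Job Title", ""),  # column names are assumptions
            "location": row.get("location", ""),
        }
        for row in rows
    ]

# With the v4 Python client, the batch load is then roughly:
#   client = weaviate.connect_to_local()
#   jobs = client.collections.get("JobPosting")
#   jobs.data.insert_many(read_job_objects("data/jobs_prepared.csv"))
```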

5. Run the Streamlit UI

Launch the user interface to interact with the search application. Refer: job_search_api.py, streamlit_app.py

  1. Navigate to the project's root directory in your terminal.
  2. Run the following command:
    streamlit run streamlit_app.py
    
  3. Open your web browser to the URL provided by Streamlit. You can now run keyword (BM25), semantic, and hybrid searches on the job postings.

Idea borrowed from BlogYourCode. Built with Weaviate, FastAPI, Streamlit, Python, and Llama models. [10/Aug/2025] I will continue updating this project, adding a Llama model to generate the desired response.
