This project demonstrates and compares different search strategies - keyword (BM25), semantic, and hybrid - for a job search application. It uses a dataset of job descriptions, ingests them into a Weaviate vector database, and provides a Streamlit interface to query the data.
- Data Processing: Cleans and prepares raw job posting data for vectorization.
- Weaviate Integration: Sets up a Weaviate collection with a custom schema using Ollama for local embeddings.
- Search Comparison: Implements and allows for testing of:
- Keyword Search (BM25)
- Semantic Search
- Hybrid Search
- Interactive UI: A simple Streamlit application to perform searches and view results.
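Hybrid search blends the other two modes. A minimal, self-contained sketch of the relative-score fusion idea (Weaviate's default hybrid fusion method): each result list's scores are min-max normalized, then blended with an `alpha` weight that plays the same role as Weaviate's hybrid `alpha` parameter (0 = pure BM25, 1 = pure semantic). This is an illustration of the concept, not Weaviate's internal code.

```python
def normalize(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, vector, alpha=0.5):
    """Blend normalized keyword and vector scores per document.

    alpha=0 -> pure keyword (BM25); alpha=1 -> pure semantic.
    """
    b, v = normalize(bm25), normalize(vector)
    return [alpha * vs + (1 - alpha) * bs for bs, vs in zip(b, v)]
```

Sliding `alpha` between 0 and 1 is what makes hybrid search tunable between exact keyword matching and semantic similarity.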
- Python with Pandas for data manipulation.
- Weaviate as the vector database.
- Ollama for generating text embeddings locally.
- Postman (optional) for interacting with the Weaviate API.
- Streamlit for the user interface.
- Docker for running Weaviate.
- Download the data from Kaggle: Job Description Dataset [1.5 GB]. Place the CSV file in a `data` directory.
- Clean the data using the provided Python script (`prepare_data.py`). This script uses the Pandas library to sample the dataset down to 300 rows and creates a new column named `search_text`, which concatenates several fields to be used for vector indexing. Refer: `datacleaning.py`
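The cleaning step can be sketched as below. The column names passed to the function are assumptions; align them with the actual Kaggle CSV headers.

```python
import pandas as pd

def prepare(df: pd.DataFrame, text_cols, n=300, seed=42) -> pd.DataFrame:
    """Sample the dataset down and build a search_text column for vectorization."""
    sample = df.sample(n=min(n, len(df)), random_state=seed).reset_index(drop=True)
    # Concatenate the chosen fields into one string per row for embedding.
    sample["search_text"] = (
        sample[text_cols].fillna("").astype(str).agg(" ".join, axis=1)
    )
    return sample

# Usage sketch (file path and column names are assumptions):
# df = pd.read_csv("data/job_descriptions.csv")
# clean = prepare(df, ["Job Title", "Job Description", "skills"])
# clean.to_csv("data/jobs_clean.csv", index=False)
```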
You need a running instance of Weaviate with the text2vec-ollama module enabled.
- Follow the official Weaviate guide to run a local instance using Docker: Weaviate Local Quickstart. Ensure you configure it to use the Ollama module.
- Make sure your local Ollama service is running and has the required embedding model (e.g., `nomic-embed-text`) downloaded. Refer: `docker-compose.yml`
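A minimal `docker-compose.yml` sketch for this setup, assuming Ollama runs directly on the host; the image tag and port mappings are illustrative, so adjust them to your environment:

```yaml
# Sketch only -- see the official Weaviate Docker guide for the full file.
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.0
    ports:
      - "8080:8080"    # REST
      - "50051:50051"  # gRPC
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      DEFAULT_VECTORIZER_MODULE: text2vec-ollama
      ENABLE_MODULES: text2vec-ollama
```

The Ollama endpoint itself is supplied later, in the collection's module config, so the compose file only needs to enable the module.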
Before ingesting data, you must create the JobPosting collection in Weaviate with the correct schema. You can do this via a cURL command, a Python script, or Postman.
The schema should define the properties of your job postings and configure the vectorizer. Crucially, the search_text property should be configured for vectorization, while other properties can be set to skip: true to avoid indexing them.
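The schema described above can be expressed as a plain dict in the shape Weaviate's REST `/v1/schema` endpoint accepts. The property names besides `search_text` are assumptions for illustration; `host.docker.internal` is how Weaviate running in Docker typically reaches an Ollama service on the host.

```python
# Sketch of the JobPosting class definition; extend "properties" with the
# rest of your job-posting fields, each with skip: True.
job_posting_schema = {
    "class": "JobPosting",
    "vectorizer": "text2vec-ollama",
    "moduleConfig": {
        "text2vec-ollama": {
            "apiEndpoint": "http://host.docker.internal:11434",
            "model": "nomic-embed-text",
        }
    },
    "properties": [
        {
            "name": "search_text",
            "dataType": ["text"],
            # The only property that should be vectorized.
            "moduleConfig": {"text2vec-ollama": {"skip": False}},
        },
        {
            "name": "job_title",  # hypothetical property
            "dataType": ["text"],
            "moduleConfig": {"text2vec-ollama": {"skip": True}},
        },
    ],
}

# To create the collection, POST this dict to Weaviate, e.g.:
# requests.post("http://localhost:8080/v1/schema", json=job_posting_schema)
```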
Run the Python ingestion script (load_dataset_into_weaviate.py) to read the prepared CSV file and load the 300 records into your Weaviate collection.
After the script completes, you can verify that the objects were created successfully using the Weaviate Console or the Weaviate Studio plugin in VS Code.
Launch the user interface to interact with the search application. Refer: `job_search_api.py`, `streamlit_app.py`
- Navigate to the project's root directory in your terminal.
- Run the following command: `streamlit run app.py`
- Open your web browser to the URL provided by Streamlit. You can now perform keyword, semantic, and hybrid searches on the job postings.
Idea borrowed from BlogYourCode. Built with Weaviate, FastAPI, Streamlit, Python, and Llama models. [10/Aug/2025] I will continue to update this project, adding a Llama model to generate the desired response.