Nikki-oo7/hybrid-semantic-search-engine

Hybrid Semantic Search Engine for Educational Video Subtitles (KIS v1.1)

Long-form educational videos are valuable learning resources, but navigating them efficiently is difficult. Traditional keyword search over subtitles struggles when users phrase queries differently from the original speech.

This project implements a hybrid subtitle retrieval system that combines two retrieval methods:

  • Sparse keyword search (BM25)
  • Dense semantic search (SBERT + FAISS)

Together they enable fast, accurate, and explainable navigation of lecture videos via timestamped deep links.

The system is containerised with Docker and deployed publicly on Hugging Face Spaces with GPU-backed inference (NVIDIA T4).


Live Demo

Hugging Face Spaces: https://huggingface.co/spaces/NIKKI77/ks-version-1-1

Note: The first request may take longer due to container cold start and model initialisation.


What This System Provides

  • Dual-mode retrieval:
    • Keyword Mode (BM25) for precise term-based ranking
    • Semantic Mode (SBERT + FAISS) for paraphrase-aware retrieval
  • 384-dimensional sentence embeddings (MiniLM)
  • Exact nearest-neighbour search with FAISS IndexFlatL2
  • Abstractive summaries (DistilBART) with safe fallback handling
  • Secure match highlighting using an escape-then-mark pattern
  • Timestamped YouTube deep-link navigation
  • Bigram-based autocomplete suggestions
  • Deterministic ranking for reproducibility
  • Dockerised deployment (GPU-enabled on HF Spaces)
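
The bigram-based autocomplete listed above could work roughly as follows. This is a minimal sketch, not the project's code: the function names `build_bigram_counts` and `suggest`, the regex tokenizer, and the three-sentence corpus are all assumptions made for illustration.

```python
import re
from collections import Counter

def build_bigram_counts(texts):
    """Count adjacent word pairs across a list of transcript strings."""
    counts = Counter()
    for text in texts:
        # Hypothetical tokenizer: lowercase alphanumeric runs and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return counts

def suggest(prefix, counts, limit=5):
    """Return the most frequent bigrams whose text starts with the typed prefix."""
    prefix = prefix.lower().strip()
    if not prefix:
        return []
    matches = [
        (" ".join(pair), n)
        for pair, n in counts.items()
        if " ".join(pair).startswith(prefix)
    ]
    # Sort by descending count, then alphabetically, so ties are deterministic.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in matches[:limit]]

# Illustrative corpus (not taken from the project's subtitle data).
corpus = [
    "gradient descent minimises the loss function",
    "stochastic gradient descent uses mini batches",
    "the loss function measures prediction error",
]
counts = build_bigram_counts(corpus)
print(suggest("grad", counts))  # ['gradient descent']
```

Counting happens once offline, so a runtime suggestion is only a filter and a sort over the stored counts.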

System Architecture

The system separates offline preprocessing from runtime retrieval.

Offline Pipeline

  1. Parse WebVTT subtitle files
  2. Clean and normalise transcript text
  3. Chunk subtitles into ~40-line windows
  4. Restore punctuation (oliverguhr/fullstop-punctuation-multilang-large)
  5. Generate SBERT embeddings (384-d)
  6. Build FAISS IndexFlatL2 index
  7. Extract frequent bigrams for autocomplete
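
Steps 1 to 3 of the offline pipeline can be sketched as below. This is an illustration under stated assumptions rather than the repository's parser: it treats each cue as one "line", keeps only the start time of each cue, strips inline tags with a simple regex, and the names `parse_vtt` and `chunk_cues` are hypothetical. The example uses a window of 2 so the grouping is visible; the pipeline above uses about 40.

```python
import re

# Matches a WebVTT cue timing line; the hours field is optional.
CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)

def parse_vtt(text):
    """Return a list of (start_seconds, cue_text) tuples from WebVTT content."""
    cues = []
    start, lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        match = CUE_TIMING.match(line)
        if match:
            hours, minutes, seconds, millis = match.group(1, 2, 3, 4)
            start = (int(hours or 0) * 3600 + int(minutes) * 60
                     + int(seconds) + int(millis) / 1000)
            lines = []
        elif not line:
            # A blank line closes the current cue, if one is open.
            if start is not None and lines:
                cues.append((start, " ".join(lines)))
            start, lines = None, []
        elif start is not None:
            # Drop inline markup such as <b> or <c.colour> before storing text.
            lines.append(re.sub(r"<[^>]+>", "", line))
    if start is not None and lines:
        cues.append((start, " ".join(lines)))
    return cues

def chunk_cues(cues, window=40):
    """Group consecutive cues into windows, keeping the first cue's start time."""
    chunks = []
    for i in range(0, len(cues), window):
        group = cues[i:i + window]
        chunks.append({
            "start": group[0][0],
            "text": " ".join(cue_text for _, cue_text in group),
        })
    return chunks

sample = """WEBVTT

00:00:01.000 --> 00:00:03.000
Welcome to the lecture.

00:00:03.500 --> 00:00:06.000
Today we cover <b>gradient descent</b>.

00:01:10.250 --> 00:01:12.000
First, the loss function.
"""
cues = parse_vtt(sample)
chunks = chunk_cues(cues, window=2)
print(chunks)
```

Keeping the start time of the first cue in each chunk is what later allows a result to link back to the right moment in the video.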

Runtime Pipeline

  1. Accept user query
  2. Route to BM25 (sparse) or SBERT+FAISS (semantic)
  3. Retrieve top-ranked segments
  4. Generate short summaries (with fallback handling)
  5. Safely highlight matches
  6. Render results with timestamped deep links
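
The last step, building a timestamped deep link, might look like the sketch below. It assumes the video ID and start time come from the indexed segment metadata rather than from the user's query, and that IDs are 11 URL-safe characters; the function name `deep_link` and the placeholder ID are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumption: video IDs are 11 characters from the URL-safe alphabet.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")

def deep_link(video_id, start_seconds):
    """Build a watch URL that starts playback at the segment's start time."""
    if not VIDEO_ID.fullmatch(video_id):
        raise ValueError(f"unexpected video id: {video_id!r}")
    # Whole seconds only; negative values are clamped to zero.
    seconds = max(0, int(start_seconds))
    return "https://www.youtube.com/watch?" + urlencode(
        {"v": video_id, "t": f"{seconds}s"}
    )

# "abcdefghijk" is a placeholder, not a real video.
print(deep_link("abcdefghijk", 70.25))
# https://www.youtube.com/watch?v=abcdefghijk&t=70s
```

Validating the ID and encoding the parameters with `urlencode` keeps the URL built only from known-safe parts.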

Retrieval Modes

Keyword Mode (BM25)

  • Probabilistic sparse retrieval
  • Deterministic ranking
  • Exact-phrase prioritisation
  • Strong precision for literal queries
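
For readers unfamiliar with BM25, the scoring and deterministic ranking can be sketched in plain Python. This is not the project's implementation: the `BM25` class, the parameters `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the tie-break on segment index are assumptions chosen for illustration, and the sketch expects a non-empty corpus.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(d)) for d in documents]
        self.lengths = [sum(c.values()) for c in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: in how many segments each term appears.
        df = Counter()
        for counts in self.docs:
            df.update(counts.keys())
        n = len(self.docs)
        # IDF variant that never goes negative (an assumption of this sketch).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        counts, length = self.docs[index], self.lengths[index]
        # Length normalisation: longer segments need more hits for the same score.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def rank(self, query, top_k=3):
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        # Highest score first; equal scores fall back to segment order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(i, s) for s, i in scored[:top_k] if s > 0]

segments = [
    "gradient descent minimises the loss function",
    "the learning rate controls the step size",
    "stochastic gradient descent uses mini batches",
]
bm25 = BM25(segments)
print(bm25.rank("learning rate"))
print(bm25.rank("gradient descent"))
```

Sorting on the pair (negative score, segment index) is one simple way to keep the ranking stable when two segments score identically.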

Semantic Mode (SBERT + FAISS)

  • Handles paraphrases and conceptual similarity
  • Uses MiniLM sentence embeddings (384 dimensions)
  • Exact L2 nearest neighbour search via FAISS
  • Lemma/synonym-aware highlighting
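
Conceptually, an exact flat L2 search compares the query vector against every stored vector and keeps the closest ones. The sketch below shows that idea in plain Python with toy 3-dimensional vectors standing in for the 384-dimensional MiniLM embeddings. It does not call FAISS or SBERT; the names `squared_l2` and `search` are hypothetical, and distances are left squared, which is also what `IndexFlatL2` reports.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query_vec, vectors, top_k=2):
    """Brute-force exact search: score every vector, return the nearest top_k."""
    scored = [(squared_l2(query_vec, v), i) for i, v in enumerate(vectors)]
    # Ascending distance; tuples tie-break on index, so results are deterministic.
    scored.sort()
    return [(i, dist) for dist, i in scored[:top_k]]

# Toy stand-ins for segment embeddings (assumed values, not model output).
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
query = [1.0, 0.0, 0.0]
print(search(query, vectors))  # [(0, 0.0), (2, 0.5)]
```

Because every vector is compared, the result is exact rather than approximate, at the cost of search time that grows linearly with the number of segments.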

Core Technologies

  • Python
  • PyTorch
  • Sentence-BERT (all-MiniLM-L6-v2)
  • FAISS (IndexFlatL2)
  • BM25
  • DistilBART (summarisation)
  • Punctuation Restoration (oliverguhr/fullstop-punctuation-multilang-large)
  • Flask
  • Docker
  • Hugging Face Spaces (NVIDIA T4 GPU)

Security & Engineering Considerations

  • Escape-then-highlight rendering prevents HTML injection
  • Deterministic ranking ensures stable evaluation
  • Separation of offline index building and runtime inference
  • Explicit deep-link construction (no user input echoed into URLs)
  • Containerised deployment with Gunicorn
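
One way to realise the escape-then-highlight idea is sketched below. It is an assumption about the general pattern, not the repository's code: matches are located in the raw text, every piece of text is HTML-escaped before it is emitted, and only literal `<mark>` tags are added unescaped. The function name `highlight` is hypothetical.

```python
import html
import re

def highlight(text, terms):
    """Escape all text, wrapping case-insensitive term matches in <mark> tags."""
    # Longest terms first so "gradient descent" wins over "gradient".
    cleaned = sorted({t.strip() for t in terms if t.strip()},
                     key=lambda t: (-len(t), t))
    if not cleaned:
        return html.escape(text)
    pattern = re.compile("|".join(re.escape(t) for t in cleaned), re.IGNORECASE)
    parts, last = [], 0
    for match in pattern.finditer(text):
        # Text between matches is escaped, never passed through raw.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)

print(highlight("<script>alert(1)</script> gradient descent", ["gradient"]))
# &lt;script&gt;alert(1)&lt;/script&gt; <mark>gradient</mark> descent
```

Since subtitle text and query terms only ever reach the page in escaped form, markup hidden in either one is rendered as inert text.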

Repository Structure

KIS_PROJECT_V1.1/
│
├── backend/          # Retrieval logic and pipelines
├── templates/        # Frontend (Jinja2 templates)
├── Dockerfile        # Containerised deployment
├── requirements.txt
├── README.md
└── .gitignore

Data & Artifacts

Large artifacts (FAISS index, embeddings, subtitle corpus, and bigram model) are intentionally excluded from this repository to keep it lightweight.

The publicly deployed Hugging Face Space includes the fully built artifacts.

License

MIT License

About

Hybrid semantic search engine combining BM25, SBERT embeddings, and FAISS for fast GPU-accelerated retrieval.
