Long-form educational videos are valuable learning resources, but navigating them efficiently is difficult. Traditional keyword search over subtitles struggles when users phrase queries differently from the original speech.
This project implements a hybrid subtitle retrieval system that combines:
- Sparse keyword search (BM25)
- Dense semantic search (SBERT + FAISS)
to enable fast, accurate, and explainable navigation of lecture videos via timestamped deep links.
The system is containerised with Docker and deployed publicly on Hugging Face Spaces with GPU-backed inference (NVIDIA T4).
Hugging Face Spaces: https://huggingface.co/spaces/NIKKI77/ks-version-1-1
Note: The first request may take longer due to container cold start and model initialisation.
- Dual-mode retrieval:
  - Keyword Mode (BM25) for precise term-based ranking
  - Semantic Mode (SBERT + FAISS) for paraphrase-aware retrieval
- 384-dimensional sentence embeddings (MiniLM)
- FAISS `IndexFlatL2` exact nearest-neighbour search
- Abstractive summaries (DistilBART) with safe fallback handling
- Secure match highlighting using an escape-then-mark pattern
- Timestamped YouTube deep-link navigation
- Bigram-based autocomplete suggestions
- Deterministic ranking for reproducibility
- Dockerised deployment (GPU-enabled on HF Spaces)
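The bigram-based autocomplete listed above could be built along these lines. This is an illustrative sketch with hypothetical function names (`build_bigrams`, `suggest`), not the repository's actual implementation:

```python
from collections import Counter

def build_bigrams(corpus, min_count=2):
    """Count adjacent word pairs across transcript lines (illustrative sketch)."""
    counts = Counter()
    for line in corpus:
        tokens = line.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    # Keep only bigrams frequent enough to be useful suggestions.
    return {bg: c for bg, c in counts.items() if c >= min_count}

def suggest(prefix_word, bigrams, k=3):
    """Suggest the k most frequent continuations of the last typed word,
    with alphabetical tie-breaking so suggestions are deterministic."""
    cands = [(w2, c) for (w1, w2), c in bigrams.items() if w1 == prefix_word]
    return [w for w, _ in sorted(cands, key=lambda x: (-x[1], x[0]))[:k]]
```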
The system separates offline preprocessing from runtime retrieval.

**Offline indexing pipeline**
- Parse WebVTT subtitle files
- Clean and normalise transcript text
- Chunk subtitles into ~40-line windows
- Restore punctuation (oliverguhr/fullstop-punctuation-multilang-large)
- Generate SBERT embeddings (384-d)
- Build FAISS `IndexFlatL2` index
- Extract frequent bigrams for autocomplete
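The chunking step above can be sketched as follows. It assumes the WebVTT parser yields `(start_seconds, text)` pairs; the pair format and function name are assumptions, not the repository's API:

```python
def chunk_subtitles(cues, window=40):
    """Group subtitle cues into fixed-size windows of ~`window` lines,
    keeping the first cue's start time so each chunk can be deep-linked
    later. `cues` is a list of (start_seconds, text) pairs."""
    chunks = []
    for i in range(0, len(cues), window):
        group = cues[i:i + window]
        chunks.append({
            "start": group[0][0],                    # timestamp for the deep link
            "text": " ".join(t for _, t in group),   # text to embed / index
        })
    return chunks
```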
**Runtime retrieval**

- Accept user query
- Route to BM25 (sparse) or SBERT+FAISS (semantic)
- Retrieve top-ranked segments
- Generate short summaries (with fallback handling)
- Safely highlight matches
- Render results with timestamped deep links
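The routing and deep-link steps can be sketched as below. `bm25_search` and `semantic_search` stand in for the real retrievers, and only server-side values (video ID, chunk start time) are placed in the URL, matching the "no user input echoed into URLs" design noted later:

```python
import urllib.parse

def deep_link(video_id, start_seconds):
    """Build a timestamped YouTube URL from server-side values only."""
    query = urllib.parse.urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"

def route(query, mode, bm25_search, semantic_search):
    """Dispatch a query to the sparse or dense retriever (hypothetical names)."""
    if mode == "keyword":
        return bm25_search(query)
    return semantic_search(query)
```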
**Keyword Mode (BM25)**

- Probabilistic sparse retrieval
- Deterministic ranking
- Exact-phrase prioritisation
- Strong precision for literal queries
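A minimal pure-Python rendering of BM25 scoring as used in Keyword Mode. The deployed system's exact variant and parameters may differ; `k1` and `b` below are common defaults, and the non-negative idf form is used:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against the query tokens.
    Returns one score per document, in document order (deterministic)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```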
**Semantic Mode (SBERT + FAISS)**

- Handles paraphrases and conceptual similarity
- Uses MiniLM sentence embeddings (384 dimensions)
- Exact L2 nearest neighbour search via FAISS
- Lemma/synonym-aware highlighting
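The exact search performed by FAISS's `IndexFlatL2` is an exhaustive scan over squared L2 distances; a plain-Python equivalent is shown here purely for illustration (the real system calls FAISS over the 384-d embeddings):

```python
def l2_nearest(query_vec, index_vecs, k=3):
    """Return the indices of the k nearest vectors by squared L2 distance.
    This mirrors what IndexFlatL2 computes: FAISS also reports squared
    distances, and the ranking is identical to true L2."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(query_vec, v)), i)
        for i, v in enumerate(index_vecs)
    )
    return [i for _, i in dists[:k]]
```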
- Python
- PyTorch
- Sentence-BERT (`all-MiniLM-L6-v2`)
- FAISS (`IndexFlatL2`)
- BM25
- DistilBART (summarisation)
- Punctuation Restoration (oliverguhr/fullstop-punctuation-multilang-large)
- Flask
- Docker
- Hugging Face Spaces (NVIDIA T4 GPU)
- Escape-then-highlight rendering prevents HTML injection
- Deterministic ranking ensures stable evaluation
- Separation of offline index building and runtime inference
- Explicit deep-link construction (no user input echoed into URLs)
- Containerised deployment with Gunicorn
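The escape-then-mark pattern mentioned above can be illustrated as follows; this is a sketch of the idea, not the repository's code:

```python
import html
import re

def highlight(text, term):
    """Escape-then-mark: escape the raw text FIRST, then wrap matches of
    the (also escaped) term in <mark> tags. Because escaping happens before
    any markup is added, user-supplied text can never inject HTML."""
    safe = html.escape(text)
    pattern = re.escape(html.escape(term))
    return re.sub(pattern, lambda m: f"<mark>{m.group(0)}</mark>",
                  safe, flags=re.IGNORECASE)
```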
KIS_PROJECT_V1.1/
│
├── backend/ # Retrieval logic and pipelines
├── templates/ # Frontend (Jinja2 templates)
├── Dockerfile # Containerised deployment
├── requirements.txt
├── README.md
└── .gitignore
Large artifacts (FAISS index, embeddings, subtitle corpus, and bigram model) are intentionally excluded from this repository to keep it lightweight.
The publicly deployed Hugging Face Space includes the fully built artifacts.
MIT License