Long-form educational videos are valuable learning resources, but navigating them efficiently is difficult. Traditional keyword search over subtitles struggles when users phrase queries differently from the original speech.
This project implements a hybrid subtitle retrieval system that combines:
- Sparse keyword search (BM25)
- Dense semantic search (SBERT + FAISS)
to enable fast, accurate, and explainable navigation of lecture videos via timestamped deep links.
The system is containerised with Docker and deployed publicly on Hugging Face Spaces with GPU-backed inference (NVIDIA T4).
Hugging Face Spaces: https://huggingface.co/spaces/NIKKI77/ks-version-1-1
Note: The first request may take longer due to container cold start and model initialisation.
- Dual-mode retrieval:
  - Keyword Mode (BM25) for precise term-based ranking
  - Semantic Mode (SBERT + FAISS) for paraphrase-aware retrieval
- 384-dimensional sentence embeddings (MiniLM)
- FAISS `IndexFlatL2` exact nearest-neighbour search
- Abstractive summaries (DistilBART) with safe fallback handling
- Secure match highlighting using an escape-then-mark pattern
- Timestamped YouTube deep-link navigation
- Bigram-based autocomplete suggestions
- Deterministic ranking for reproducibility
- Dockerised deployment (GPU-enabled on HF Spaces)
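The bigram-based autocomplete listed above could be built along these lines. This is an illustrative sketch with hypothetical function names (`build_bigrams`, `suggest`), not the repository's actual implementation:

```python
from collections import Counter

def build_bigrams(corpus, min_count=2):
    """Count adjacent word pairs across transcript lines (illustrative sketch)."""
    counts = Counter()
    for line in corpus:
        tokens = line.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    # Keep only bigrams frequent enough to be useful suggestions.
    return {bg: c for bg, c in counts.items() if c >= min_count}

def suggest(prefix_word, bigrams, k=3):
    """Suggest the k most frequent continuations of the last typed word,
    with alphabetical tie-breaking so suggestions are deterministic."""
    cands = [(w2, c) for (w1, w2), c in bigrams.items() if w1 == prefix_word]
    return [w for w, _ in sorted(cands, key=lambda x: (-x[1], x[0]))[:k]]
```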
The system separates offline preprocessing from runtime retrieval.

**Offline indexing pipeline**
- Parse WebVTT subtitle files
- Clean and normalise transcript text
- Chunk subtitles into ~40-line windows
- Restore punctuation (oliverguhr/fullstop-punctuation-multilang-large)
- Generate SBERT embeddings (384-d)
- Build FAISS `IndexFlatL2` index
- Extract frequent bigrams for autocomplete
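The chunking step above can be sketched as follows. It assumes the WebVTT parser yields `(start_seconds, text)` pairs; the pair format and function name are assumptions, not the repository's API:

```python
def chunk_subtitles(cues, window=40):
    """Group subtitle cues into fixed-size windows of ~`window` lines,
    keeping the first cue's start time so each chunk can be deep-linked
    later. `cues` is a list of (start_seconds, text) pairs."""
    chunks = []
    for i in range(0, len(cues), window):
        group = cues[i:i + window]
        chunks.append({
            "start": group[0][0],                    # timestamp for the deep link
            "text": " ".join(t for _, t in group),   # text to embed / index
        })
    return chunks
```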
**Runtime retrieval**

- Accept user query
- Route to BM25 (sparse) or SBERT+FAISS (semantic)
- Retrieve top-ranked segments
- Generate short summaries (with fallback handling)
- Safely highlight matches
- Render results with timestamped deep links
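The routing and deep-link steps can be sketched as below. `bm25_search` and `semantic_search` stand in for the real retrievers, and only server-side values (video ID, chunk start time) are placed in the URL, matching the "no user input echoed into URLs" design noted later:

```python
import urllib.parse

def deep_link(video_id, start_seconds):
    """Build a timestamped YouTube URL from server-side values only."""
    query = urllib.parse.urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"

def route(query, mode, bm25_search, semantic_search):
    """Dispatch a query to the sparse or dense retriever (hypothetical names)."""
    if mode == "keyword":
        return bm25_search(query)
    return semantic_search(query)
```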
**Keyword Mode (BM25)**

- Probabilistic sparse retrieval
- Deterministic ranking
- Exact-phrase prioritisation
- Strong precision for literal queries
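A minimal pure-Python rendering of BM25 scoring as used in Keyword Mode. The deployed system's exact variant and parameters may differ; `k1` and `b` below are common defaults, and the non-negative idf form is used:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against the query tokens.
    Returns one score per document, in document order (deterministic)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```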
**Semantic Mode (SBERT + FAISS)**

- Handles paraphrases and conceptual similarity
- Uses MiniLM sentence embeddings (384 dimensions)
- Exact L2 nearest neighbour search via FAISS
- Lemma/synonym-aware highlighting
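The exact search performed by FAISS's `IndexFlatL2` is an exhaustive scan over squared L2 distances; a plain-Python equivalent is shown here purely for illustration (the real system calls FAISS over the 384-d embeddings):

```python
def l2_nearest(query_vec, index_vecs, k=3):
    """Return the indices of the k nearest vectors by squared L2 distance.
    This mirrors what IndexFlatL2 computes: FAISS also reports squared
    distances, and the ranking is identical to true L2."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(query_vec, v)), i)
        for i, v in enumerate(index_vecs)
    )
    return [i for _, i in dists[:k]]
```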
- Python
- PyTorch
- Sentence-BERT (`all-MiniLM-L6-v2`)
- FAISS (`IndexFlatL2`)
- BM25
- DistilBART (summarisation)
- Punctuation Restoration (oliverguhr/fullstop-punctuation-multilang-large)
- Flask
- Docker
- Hugging Face Spaces (NVIDIA T4 GPU)
- Escape-then-highlight rendering prevents HTML injection
- Deterministic ranking ensures stable evaluation
- Separation of offline index building and runtime inference
- Explicit deep-link construction (no user input echoed into URLs)
- Containerised deployment with Gunicorn
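The escape-then-mark pattern mentioned above can be illustrated as follows; this is a sketch of the idea, not the repository's code:

```python
import html
import re

def highlight(text, term):
    """Escape-then-mark: escape the raw text FIRST, then wrap matches of
    the (also escaped) term in <mark> tags. Because escaping happens before
    any markup is added, user-supplied text can never inject HTML."""
    safe = html.escape(text)
    pattern = re.escape(html.escape(term))
    return re.sub(pattern, lambda m: f"<mark>{m.group(0)}</mark>",
                  safe, flags=re.IGNORECASE)
```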
KIS_PROJECT_V1.1/
│
├── backend/ # Retrieval logic and pipelines
├── templates/ # Frontend (Jinja2 templates)
├── Dockerfile # Containerised deployment
├── requirements.txt
├── README.md
└── .gitignore
Large artifacts (FAISS index, embeddings, subtitle corpus, and bigram model) are intentionally excluded from this repository to keep it lightweight.
The publicly deployed Hugging Face Space includes the fully built artifacts.
MIT License