    Repositories list

    • warc2text

      Public
      Extracts plain text, language identification and more metadata from WARC records
      C++
      MIT License
      62383 Updated Apr 16, 2026
    • Cython wrapper on Hunspell Dictionary
      Python
      Other
      28000 Updated Apr 10, 2026
    • Bicleaner fork that uses neural networks
      Python
      GNU General Public License v3.0
      44010 Updated Feb 23, 2026
    • Pre-filtering step for bicleaner
      Python
      GNU General Public License v3.0
      3510 Updated Jan 16, 2026
    • monocleaner

      Public
      Python
      GNU General Public License v3.0
      1710 Updated Nov 3, 2025
    • bifixer

      Public
      Tool to fix bitexts and tag near-duplicates for removal
      Python
      GNU General Public License v3.0
      33500 Updated Sep 4, 2025
    • biroamer

      Public
      Utility that will help you to ROAM (Random Omit Anonymize and Mix) your parallel corpus.
      Python
      GNU General Public License v3.0
      21101 Updated Mar 3, 2025
    • cld2

      Public
      Compact Language Detector 2
      C++
      Apache License 2.0
      143000 Updated Feb 4, 2025
    • Monocleaner models repository
      GNU General Public License v3.0
      0100 Updated Jan 8, 2025
    • scrawl

      Public
      Playwright-based web crawler
      Python
      GNU General Public License v3.0
      0100 Updated Nov 14, 2024
    • bitextor

      Public
      Bitextor generates translation memories from multilingual websites
      Python
      GNU General Public License v3.0
      4029934 Updated Nov 11, 2024
    • PDF parser and converter to HTML
      Java
      GNU General Public License v3.0
      139341 Updated Oct 3, 2024
    • bicleaner

      Public
      Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
      Python
      GNU General Public License v3.0
      2216011 Updated Jun 18, 2024
    • Python
      GNU General Public License v3.0
      1700 Updated May 31, 2023
    • Repository for storing testing outputs from Bitextor
      GNU General Public License v3.0
      0000 Updated May 29, 2023
    • Extracts plain text, language identification and more metadata from Spiderling prevertical files
      C++
      MIT License
      0200 Updated May 17, 2023
    • fastText

      Public
      Library for fast text representation and classification.
      HTML
      MIT License
      4.8k000 Updated May 4, 2023
    • Reconstructs sentences using deferred crawling standoff annotations from Bitextor
      Python
      MIT License
      0000 Updated May 4, 2023
    • Repository of Bicleaner AI models
      Other
      0500 Updated Mar 28, 2023
    • C++
      GNU General Public License v3.0
      2920 Updated Mar 10, 2023
    • Repository for data models, dictionaries and more resources for Bicleaner
      GNU General Public License v3.0
      0600 Updated Dec 15, 2022
    • vecalign

      Public
      Improved Sentence Alignment in Linear Time and Space
      Python
      Apache License 2.0
      35200 Updated Dec 4, 2022
    • Python interface to Apache Tika, HTML extraction from PDF
      Python
      Other
      139000 Updated Nov 30, 2022
    • Python module to interface with Java Loomchild sentence segmenter
      Python
      GNU General Public License v3.0
      1110 Updated Nov 28, 2022
    • Fork of glove-python to distribute binary builds
      Python
      Apache License 2.0
      319000 Updated Aug 12, 2022
    • Document aligner which uses neural technologies to search matches across bilingual documents
      Python
      GNU General Public License v3.0
      4800 Updated Jun 9, 2022
    • bitextor-neural

      Public archive
      Bitextor Neural generates translation memories from multilingual websites using state-of-the-art Machine Learning tools
      Python
      GNU General Public License v3.0
      0300 Updated Jun 3, 2022
    • hunalign

      Public
      Sentence aligner
      C++
      GNU Lesser General Public License v3.0
      43000 Updated May 21, 2021
    • Python interface to pdf-extract, HTML extraction from PDF
      Python
      Other
      139600 Updated Sep 3, 2020
    • Repository for data models, dictionaries and more resources for Bitextor
      GNU General Public License v3.0
      0600 Updated Feb 7, 2020