Contributing to ztoken

Thank you for your interest in contributing to ztoken, the BPE tokenizer library with HuggingFace compatibility for the Zerfoo ML ecosystem. This guide will help you get started.

Development Setup
Building from Source
Running Tests
Code Style
Commit Conventions
Pull Request Process
Issue Reporting
Good First Issues
Key Conventions

Development Setup

Prerequisites

Go 1.25+
Git

No GPU, C compiler, or external libraries are required. ztoken depends only on the Go standard library and golang.org/x/text.

Clone and Verify

git clone https://github.com/zerfoo/ztoken.git
cd ztoken
go mod tidy
go test ./...

Building from Source

go build ./...

ztoken compiles on every platform Go supports with no additional setup.

Running Tests

# Run all tests
go test ./...

# Run tests with race detector
go test -race ./...

# Run tests with coverage
go test -cover ./...
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html

All new code must have tests. Aim for at least 80% coverage on new packages.

Code Style

Formatting and Linting

gofmt — all code must be formatted with gofmt
goimports — imports must be organized (stdlib, external)
golangci-lint — run golangci-lint run before submitting

Go Conventions

Follow standard Go naming: PascalCase for exported symbols, camelCase for unexported
Use table-driven tests with t.Run subtests
Write documentation comments for all exported functions, types, and methods

Commit Conventions

We use Conventional Commits for automated versioning with release-please.

Format

<type>(<scope>): <description>

[optional body]

[optional footer(s)]

Types

Type	Description
`feat`	A new feature
`fix`	A bug fix
`perf`	A performance improvement
`docs`	Documentation only changes
`test`	Adding or correcting tests
`chore`	Maintenance tasks, CI, dependencies
`refactor`	Code change that neither fixes a bug nor adds a feature

Examples

feat(bpe): add support for Llama 3 tokenizer format
fix(decode): handle surrogate pairs in Unicode decoding
perf(encode): optimize merge priority queue for long sequences
docs: update HuggingFace compatibility notes
test: add round-trip encoding/decoding tests for Gemma vocabulary
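
The header format above can be checked mechanically. The following is a rough sketch, not a check the project is known to run: `ValidHeader` and `headerPattern` are hypothetical names, the type list is taken from the table above, and the set of characters allowed in a scope is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// headerPattern matches "<type>(<scope>): <description>" with an optional
// scope. The allowed scope characters are an assumption for this sketch.
var headerPattern = regexp.MustCompile(
	`^(feat|fix|perf|docs|test|chore|refactor)(\([a-z0-9_-]+\))?: \S.*$`,
)

// ValidHeader reports whether the first line of a commit message follows
// the format described in this guide.
func ValidHeader(header string) bool {
	return headerPattern.MatchString(header)
}

func main() {
	headers := []string{
		"feat(bpe): add support for Llama 3 tokenizer format",
		"docs: update HuggingFace compatibility notes",
		"updated some stuff", // no type prefix, so it is rejected
	}
	for _, h := range headers {
		fmt.Printf("%t  %s\n", ValidHeader(h), h)
	}
}
```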

Pull Request Process

One logical change per PR — keep PRs focused and reviewable
Branch from main and keep your branch up to date with rebase
All CI checks must pass — tests, linting, formatting
Rebase and merge — we do not use squash merges or merge commits
Reference related issues — use Fixes #123 or Closes #123 in the PR description
Respond to review feedback promptly

Before Submitting

go test ./...
go test -race ./...
go vet ./...
golangci-lint run

Issue Reporting

Bug Reports

Please include:

Description: Clear summary of the bug
Steps to reproduce: Minimal code with the tokenizer model file used
Expected behavior: Expected token IDs or decoded text
Actual behavior: Actual token IDs or decoded text
Environment: Go version, OS
Tokenizer model: Which HuggingFace model's tokenizer was used

Feature Requests

Please include:

Problem statement: What problem does this solve?
Proposed solution: How should it work?
Alternatives considered: Other approaches you thought about
Use case: How would you use this feature in practice?

Good First Issues

Look for issues labeled "good first issue" on GitHub. These are scoped, well-defined tasks suitable for new contributors.

Good areas for first contributions:

Adding test coverage for edge cases in encoding/decoding
Documentation improvements
Supporting additional HuggingFace tokenizer configurations
Performance optimizations in the BPE merge loop

Key Conventions

These conventions are critical to maintaining consistency across the codebase:

HuggingFace compatibility

ztoken must produce identical token IDs to the HuggingFace tokenizers library for all supported models. When adding support for a new tokenizer format, verify against the Python reference implementation:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("model-name")
print(tok.encode("test string"))
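
One way to carry those reference IDs into a Go test is to record them once from the Python snippet and compare against them. The sketch below makes several assumptions: `refCase`, `CheckAgainstReference`, and the JSON layout are hypothetical, and the encoder is passed in as a plain function so that no ztoken API is assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"slices"
)

// refCase pairs an input string with the token IDs recorded from the
// Python reference. The JSON layout is an assumption made for this sketch.
type refCase struct {
	Text string `json:"text"`
	IDs  []int  `json:"ids"`
}

// CheckAgainstReference decodes the recorded cases and returns one message
// for each case whose encoded IDs differ from the reference IDs.
func CheckAgainstReference(golden []byte, encode func(string) []int) ([]string, error) {
	var cases []refCase
	if err := json.Unmarshal(golden, &cases); err != nil {
		return nil, fmt.Errorf("parse reference cases: %w", err)
	}
	var mismatches []string
	for _, c := range cases {
		if got := encode(c.Text); !slices.Equal(got, c.IDs) {
			mismatches = append(mismatches,
				fmt.Sprintf("%q: got %v, want %v", c.Text, got, c.IDs))
		}
	}
	return mismatches, nil
}

func main() {
	// Placeholder data; real IDs would come from the Python reference.
	golden := []byte(`[{"text": "ab", "ids": [97, 98]}, {"text": "c", "ids": [0]}]`)

	// Stand-in encoder that maps each byte to its value. Not a real tokenizer.
	byteEncoder := func(s string) []int {
		ids := make([]int, 0, len(s))
		for i := 0; i < len(s); i++ {
			ids = append(ids, int(s[i]))
		}
		return ids
	}

	mismatches, err := CheckAgainstReference(golden, byteEncoder)
	if err != nil {
		panic(err)
	}
	for _, m := range mismatches {
		fmt.Println(m) // reports the deliberately wrong "c" case
	}
}
```

Inside a test, each returned message could be passed to `t.Errorf` so that every mismatching input is reported in a single run.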

Zero external dependencies

ztoken depends only on the Go standard library and golang.org/x/text. Do not add third-party dependencies. This keeps the library lightweight and easy to embed.

Stdlib-only testing

Tests use only the testing package from the standard library. Do not introduce test frameworks like testify.
