Thank you for your interest in contributing to ztoken, the BPE tokenizer library with HuggingFace compatibility for the Zerfoo ML ecosystem. This guide will help you get started.
- Development Setup
- Building from Source
- Running Tests
- Code Style
- Commit Conventions
- Pull Request Process
- Issue Reporting
- Good First Issues
- Key Conventions
- Go 1.25+
- Git
No GPU, C compiler, or external libraries are required. ztoken depends only on the Go standard library and golang.org/x/text.
git clone https://github.com/zerfoo/ztoken.git
cd ztoken
go mod tidy
go test ./...
go build ./...
ztoken compiles on every platform Go supports with no additional setup.
# Run all tests
go test ./...
# Run tests with race detector
go test -race ./...
# Run tests with coverage
go test -cover ./...
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html
All new code must have tests. Aim for at least 80% coverage on new packages.
- `gofmt`: all code must be formatted with `gofmt`
- `goimports`: imports must be organized (stdlib, external)
- `golangci-lint`: run `golangci-lint run` before submitting
- Follow standard Go naming: PascalCase for exported symbols, camelCase for unexported
- Use table-driven tests with `t.Run` subtests
- Write documentation comments for all exported functions, types, and methods
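As an illustration of the style rules above, here is a minimal sketch of a documented exported function with a table-driven test that uses `t.Run` subtests and only the standard library. `SplitOnWhitespace` is a hypothetical helper invented for this example and is not part of ztoken's API; the helper and its test share one file only for brevity, and the file is assumed to be saved with a `_test.go` suffix and run with `go test`.

```go
package main

import (
	"slices"
	"strings"
	"testing"
)

// SplitOnWhitespace splits s into fields separated by runs of whitespace.
// It is a hypothetical helper used only to illustrate the style rules.
func SplitOnWhitespace(s string) []string {
	return strings.Fields(s)
}

// TestSplitOnWhitespace lists its cases in a table and runs each one as a
// named subtest, so a failure reports which case broke.
func TestSplitOnWhitespace(t *testing.T) {
	tests := []struct {
		name string
		in   string
		want []string
	}{
		{name: "empty input", in: "", want: nil},
		{name: "single word", in: "hello", want: []string{"hello"}},
		{name: "collapses runs of spaces", in: "a  b", want: []string{"a", "b"}},
		{name: "trims surrounding whitespace", in: "\t a b\n", want: []string{"a", "b"}},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got := SplitOnWhitespace(tt.in)
			// slices.Equal treats a nil slice and an empty slice as equal.
			if !slices.Equal(got, tt.want) {
				t.Errorf("SplitOnWhitespace(%q) = %q, want %q", tt.in, got, tt.want)
			}
		})
	}
}
```

A single subtest can be selected with `go test -run 'TestSplitOnWhitespace/single_word'`; spaces in subtest names are replaced with underscores.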
We use Conventional Commits for automated versioning with release-please.
<type>(<scope>): <description>
[optional body]
[optional footer(s)]
| Type | Description |
|---|---|
| `feat` | A new feature |
| `fix` | A bug fix |
| `perf` | A performance improvement |
| `docs` | Documentation only changes |
| `test` | Adding or correcting tests |
| `chore` | Maintenance tasks, CI, dependencies |
| `refactor` | Code change that neither fixes a bug nor adds a feature |
feat(bpe): add support for Llama 3 tokenizer format
fix(decode): handle surrogate pairs in Unicode decoding
perf(encode): optimize merge priority queue for long sequences
docs: update HuggingFace compatibility notes
test: add round-trip encoding/decoding tests for Gemma vocabulary
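If you want to check a commit header locally before pushing, the format can be approximated with a regular expression. The sketch below is not part of ztoken or release-please; `ValidCommitHeader` is a hypothetical helper, the set of allowed scope characters is an assumption, and the pattern is deliberately stricter than the full Conventional Commits specification (for example, it does not accept the `!` breaking-change marker).

```go
package main

import (
	"fmt"
	"regexp"
)

// commitHeader matches "<type>(<scope>): <description>" with an optional
// scope, restricted to the commit types listed in this guide. The scope
// character set (lowercase letters, digits, "_" and "-") is an assumption.
var commitHeader = regexp.MustCompile(
	`^(feat|fix|perf|docs|test|chore|refactor)(\([a-z0-9_-]+\))?: \S.*$`,
)

// ValidCommitHeader reports whether the first line of a commit message
// follows the header format described in this guide.
func ValidCommitHeader(header string) bool {
	return commitHeader.MatchString(header)
}

func main() {
	headers := []string{
		"feat(bpe): add support for Llama 3 tokenizer format",
		"docs: update HuggingFace compatibility notes",
		"update stuff",
	}
	for _, h := range headers {
		fmt.Printf("valid=%-5v %s\n", ValidCommitHeader(h), h)
	}
}
```

A check like this could run from a local Git `commit-msg` hook, but CI remains the authoritative gate.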
- One logical change per PR: keep PRs focused and reviewable
- Branch from `main` and keep your branch up to date with rebase
- All CI checks must pass: tests, linting, formatting
- Rebase and merge: we do not use squash merges or merge commits
- Reference related issues: use `Fixes #123` or `Closes #123` in the PR description
- Respond to review feedback promptly
go test ./...
go test -race ./...
go vet ./...
golangci-lint run
For bug reports, please include:
- Description: Clear summary of the bug
- Steps to reproduce: Minimal code with the tokenizer model file used
- Expected behavior: Expected token IDs or decoded text
- Actual behavior: Actual token IDs or decoded text
- Environment: Go version, OS
- Tokenizer model: Which HuggingFace model's tokenizer was used
For feature requests, please include:
- Problem statement: What problem does this solve?
- Proposed solution: How should it work?
- Alternatives considered: Other approaches you thought about
- Use case: How would you use this feature in practice?
Look for issues labeled `good first issue` on GitHub. These are scoped, well-defined tasks suitable for new contributors.
Good areas for first contributions:
- Adding test coverage for edge cases in encoding/decoding
- Documentation improvements
- Supporting additional HuggingFace tokenizer configurations
- Performance optimizations in the BPE merge loop
These conventions are critical to maintaining consistency across the codebase:
ztoken must produce identical token IDs to the HuggingFace tokenizers library for all supported models. When adding support for a new tokenizer format, verify against the Python reference implementation:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("model-name")
print(tok.encode("test string"))
ztoken depends only on the Go standard library and golang.org/x/text. Do not add third-party dependencies. This keeps the library lightweight and easy to embed.
Tests use only the testing package from the standard library. Do not introduce test frameworks like testify.
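One way to combine the HuggingFace compatibility rule with the standard-library-only testing rule is to paste the IDs printed by the Python reference into Go and compare them element by element. The sketch below shows only the comparison step. `firstMismatch` is a hypothetical helper and the ID values are made-up placeholders; the call that produces `got` from ztoken is deliberately left out because it depends on the API under test.

```go
package main

import "fmt"

// firstMismatch returns the index of the first position where want and got
// differ, or -1 if the slices are identical. A length difference counts as a
// mismatch at the end of the shorter slice.
func firstMismatch(want, got []int) int {
	n := min(len(want), len(got))
	for i := 0; i < n; i++ {
		if want[i] != got[i] {
			return i
		}
	}
	if len(want) != len(got) {
		return n
	}
	return -1
}

func main() {
	// Placeholder values: in practice want is copied from the Python
	// reference output and got comes from the tokenizer under test.
	want := []int{101, 2023, 2003, 102}
	got := []int{101, 2023, 2004, 102}

	if i := firstMismatch(want, got); i >= 0 {
		fmt.Printf("token IDs diverge at index %d\n", i)
	} else {
		fmt.Println("token IDs match the reference")
	}
}
```

Reporting the first diverging index, rather than only pass or fail, tends to make it easier to locate which merge or pre-tokenization step differs from the reference.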