Commit dbbe29a

Parse tokenize (#1497)
* ✨ feat: add tokenize command to CLI for token processing

  Adds a new tokenize command to parse text into hex tokens with options.
* docs generator
* remove comments
* updated docs prompt
1 parent f24fabd commit dbbe29a

5 files changed

Lines changed: 127 additions & 62 deletions

docs/src/content/docs/reference/cli/commands.md

Lines changed: 17 additions & 0 deletions
@@ -588,6 +588,8 @@ Commands:
   code <file> [query]          Parse code using tree sitter and executes a
                                query
   tokens [options] <files...>  Count tokens in a set of files
+  tokenize [options] <file>    Tokenizes a piece of text and display the
+                               tokens (in hex format)
   jsonl2json                   Converts JSONL files to a JSON file
   prompty [options] <file...>  Converts .prompty files to genaiscript
   jinja2 [options] <file>      Renders Jinja2 or prompty template
@@ -684,6 +686,21 @@ Options:
   -h, --help  display help for command
 ```
 
+### `parse tokenize`
+
+```
+Usage: genaiscript parse tokenize [options] <file>
+
+Tokenizes a piece of text and display the tokens (in hex format)
+
+Arguments:
+  file                  file to tokenize
+
+Options:
+  -m, --model <string>  encoding model
+  -h, --help            display help for command
+```
+
 ### `parse jsonl2json`
 
 ```

genaisrc/docs.genai.mts

Lines changed: 12 additions & 3 deletions
@@ -162,7 +162,7 @@ async function generateDocs(file: WorkspaceFile, fileStats: any) {
         _.def("FILE", missingDoc.getRoot().root().text())
         _.def("FUNCTION", missingDoc.text())
         // this needs more eval-ing
-        _.$`Generate a function documentation for <FUNCTION>.
+        _.$`Generate a TypeScript function documentation for <FUNCTION>.
         - Make sure parameters are documented.
         - Be concise. Use technical tone.
         - do NOT include types, this is for TypeScript.
@@ -257,8 +257,8 @@ rule:
         _.def("DOCSTRING", comment.text(), { flex: 10 })
         _.def("FUNCTION", match.text(), { flex: 10 })
         // this needs more eval-ing
-        _.$`Update the docstring <DOCSTRING> to match the code in function <FUNCTION>.
-        - If the docstring is up to date, return /NOP/.
+        _.$`Update the TypeScript docstring <DOCSTRING> to match the code in function <FUNCTION>.
+        - If the docstring is up to date, return /NOP/. It's ok to leave it as is.
         - do not rephrase an existing sentence if it is correct.
         - Make sure parameters are documented.
         - do NOT include types, this is for TypeScript.
@@ -268,6 +268,15 @@ rule:
         The full source of the file is in <FILE> for reference.
         The source of the function is in <FUNCTION>.
         The current docstring is <DOCSTRING>.
+
+        docstring:
+
+        /**
+         * description
+         * @param param1 - description
+         * @param param2 - description
+         * @returns description
+         */
         `
     },
     {
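
The updated prompt asks the model either to return `/NOP/` or to answer with a docstring in the `/** ... */` shape shown above. The commit does not show how the response is consumed, so the following is only a minimal sketch of one way a caller could interpret it. The helper name `resolveDocstring` and its fallback behavior are assumptions made for illustration, not code from this repository.

```typescript
// Hypothetical helper (not part of the commit): decides which docstring to
// keep, given the current one and the raw model response.
function resolveDocstring(current: string, response: string): string {
    // The prompt tells the model to answer /NOP/ when nothing needs to change.
    if (response.includes("/NOP/")) return current
    // Otherwise keep only the first /** ... */ block, ignoring any extra prose.
    const match = /\/\*\*[\s\S]*?\*\//.exec(response)
    // Assumption: a response without a comment block leaves the docstring as is.
    return match ? match[0] : current
}

console.log(resolveDocstring("/** old */", "/NOP/"))
```

Falling back to the current docstring when no comment block is found keeps a malformed response from deleting existing documentation.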

packages/cli/src/cli.ts

Lines changed: 58 additions & 11 deletions
@@ -22,6 +22,7 @@ import {
     parseMarkdown,
     parsePDF,
     parseSecrets,
+    parseTokenize,
     parseTokens,
     prompty2genaiscript,
 } from "./parse" // Parsing functions
@@ -77,17 +78,55 @@ import { DEBUG_CATEGORIES } from "../../core/src/dbg"
 /**
  * Main function to initialize and run the CLI.
  *
- * Sets up global error handling for uncaught exceptions.
- * Verifies Node.js version compatibility.
- * Configures CLI options and commands, including:
- * - `configure`: Interactive help to configure providers.
- * - `run`: Executes a GenAIScript against files with various options for output, retries, and caching.
- * - `runs`: Commands to manage and list previous runs.
- * - `test`: Group of commands for running and managing tests, including listing and viewing tests.
- * - `convert`: Converts files through a GenAIScript with options for output, concurrency, and file-specific settings.
- * Handles environment setup and NodeHost installation.
- * Adds support for various CLI options such as working directory, environment files, color output, verbosity, and performance logging.
- * Includes error handling for request errors and runtime compatibility issues.
+ * @param script - The script to execute.
+ * @param files - Optional list of files to process.
+ * @param cwd - Working directory for the CLI.
+ * @param env - Paths to environment files.
+ * @param noColors - Disable color output.
+ * @param quiet - Disable verbose output.
+ * @param debug - Debug categories to enable.
+ * @param perf - Enable performance logging.
+ * @param provider - Preferred LLM provider aliases.
+ * @param accept - Comma-separated list of accepted file extensions.
+ * @param excludedFiles - List of files to exclude.
+ * @param ignoreGitIgnore - Disable exclusion of files ignored by .gitignore.
+ * @param fallbackTools - Enable prompt-based tools instead of built-in LLM tool calls.
+ * @param out - Output folder for results.
+ * @param removeOut - Remove output folder if it exists.
+ * @param outTrace - Output file for trace.
+ * @param outOutput - Output file for output.
+ * @param outData - Output file for data, including JSON schema validation.
+ * @param outAnnotations - Output file for annotations.
+ * @param outChangelog - Output file for changelogs.
+ * @param pullRequest - Pull request identifier.
+ * @param pullRequestComment - Create a comment on a pull request with a unique ID.
+ * @param pullRequestDescription - Create a comment on a pull request description with a unique ID.
+ * @param pullRequestReviews - Create pull request reviews from annotations.
+ * @param teamsMessage - Post a message to the Teams channel.
+ * @param json - Emit full JSON response to output.
+ * @param yaml - Emit full YAML response to output.
+ * @param failOnErrors - Fail on detected annotation errors.
+ * @param retry - Number of retries for the run.
+ * @param retryDelay - Minimum delay between retries.
+ * @param maxDelay - Maximum delay between retries.
+ * @param label - Label for the run.
+ * @param temperature - Temperature for the run.
+ * @param topP - Top-p for the run.
+ * @param maxTokens - Maximum completion tokens for the run.
+ * @param maxDataRepairs - Maximum data repairs.
+ * @param maxToolCalls - Maximum tool calls for the run.
+ * @param toolChoice - Tool choice for the run.
+ * @param seed - Seed for the run.
+ * @param cache - Enable LLM result cache.
+ * @param cacheName - Custom cache file name.
+ * @param csvSeparator - CSV separator.
+ * @param fenceFormat - Fence format for output.
+ * @param applyEdits - Apply file edits.
+ * @param vars - Variables as name=value pairs.
+ * @param runRetry - Number of retries for the entire run.
+ * @param noRunTrace - Disable automatic trace generation.
+ * @param noOutputTrace - Disable automatic output generation.
+ * @returns Exit code indicating success or failure.
  */
 export async function cli() {
     let nodeHost: NodeHost // Variable to hold NodeHost instance
@@ -611,6 +650,14 @@ export async function cli() {
         .arguments("<files...>")
         .option("-ef, --excluded-files <string...>", "excluded files")
         .action(parseTokens) // Action to count tokens in files
+    parser
+        .command("tokenize")
+        .argument("<file>", "file to tokenize")
+        .description(
+            "Tokenizes a piece of text and display the tokens (in hex format)"
+        )
+        .option("-m, --model <string>", "encoding model")
+        .action(parseTokenize)
     parser
         .command("jsonl2json", "Converts JSONL files to a JSON file")
         .argument("<file...>", "input JSONL files")
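
The registration above declares one required positional argument, `<file>`, and one optional value flag, `-m, --model <string>`. To make that argument shape concrete without depending on the CLI framework, here is a hand-rolled, dependency-free sketch. `parseTokenizeArgs` and `TokenizeArgs` are hypothetical names introduced for illustration; the real command delegates parsing to the `parser` object shown in the diff.

```typescript
// Hypothetical, dependency-free sketch of the argument shape declared for
// `parse tokenize`; the actual CLI relies on its command framework instead.
interface TokenizeArgs {
    file: string
    model: string | undefined
}

function parseTokenizeArgs(argv: string[]): TokenizeArgs {
    let file: string | undefined
    let model: string | undefined
    for (let i = 0; i < argv.length; i++) {
        const arg = argv[i]
        if (arg === "-m" || arg === "--model") {
            // The flag consumes the next element as its value.
            model = argv[++i]
            if (model === undefined)
                throw new Error("option --model requires a value")
        } else if (file === undefined) {
            file = arg
        } else {
            throw new Error(`unexpected argument: ${arg}`)
        }
    }
    if (file === undefined) throw new Error("missing required argument <file>")
    return { file, model }
}

console.log(parseTokenizeArgs(["notes.txt", "-m", "gpt-4o"]))
```

The resulting values line up with the `(file, options)` signature that the `parseTokenize` action receives.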

packages/cli/src/parse.ts

Lines changed: 40 additions & 2 deletions
@@ -14,7 +14,11 @@ import { parsePdf } from "../../core/src/pdf"
 import { estimateTokens } from "../../core/src/tokens"
 import { YAMLStringify } from "../../core/src/yaml"
 import { resolveTokenEncoder } from "../../core/src/encoders"
-import { MD_REGEX, PROMPTY_REGEX } from "../../core/src/constants"
+import {
+    CONSOLE_TOKEN_COLORS,
+    MD_REGEX,
+    PROMPTY_REGEX,
+} from "../../core/src/constants"
 import { promptyParse, promptyToGenAIScript } from "../../core/src/prompty"
 import { basename, join } from "node:path"
 import { CSVStringify, dataToMarkdownTable } from "../../core/src/csv"
@@ -31,6 +35,10 @@ import { chunkMarkdown } from "../../core/src/mdchunk"
 import { normalizeInt } from "../../core/src/cleaners"
 import { prettyBytes } from "../../core/src/pretty"
 import { terminalSize } from "../../core/src/terminal"
+import { consoleColors, wrapColor } from "../../core/src/consolecolor"
+import { genaiscriptDebug } from "../../core/src/debug"
+import { stderr, stdout } from "../../core/src/stdio"
+const dbg = genaiscriptDebug("cli:parse")
 
 /**
  * This module provides various parsing utilities for different file types such
@@ -230,7 +238,7 @@ export async function jsonl2json(files: string[]) {
 /**
  * Estimates the number of tokens in the content of files and logs the results.
  * @param filesGlobs - An array of files or glob patterns to process.
- * @param options - Options for excluding files, specifying the model, and ignoring .gitignore.
+ * @param options - Options for processing files.
  *   - excludedFiles - A list of files to exclude from processing.
  *   - model - The name of the model used for token encoding.
  *   - ignoreGitIgnore - Whether to ignore .gitignore rules when expanding files.
@@ -261,6 +269,36 @@ export async function parseTokens(
     console.log(text)
 }
 
+/**
+ * Tokenizes the content of a specified file using a provided model and logs the tokens.
+ *
+ * @param file - Path to the file to tokenize.
+ * @param options - Object containing the following properties:
+ *   - model - The name of the model used for token encoding.
+ *
+ * The function reads the content of the file, tokenizes it using the given model,
+ * and logs each token along with its hexadecimal representation.
+ * Debug information about the process is also logged.
+ */
+export async function parseTokenize(file: string, options: { model: string }) {
+    const text = await readText(file)
+    dbg(`text: %s`, text)
+    const { model } = options || {}
+    const {
+        model: tokenModel,
+        encode: encoder,
+        decode: decoder,
+    } = await resolveTokenEncoder(model)
+
+    console.debug(`model: %s`, tokenModel)
+    const tokens = encoder(text)
+    for (const token of tokens) {
+        stdout.write(
+            `(${wrapColor(CONSOLE_TOKEN_COLORS[0], decoder([token]))}, x${wrapColor(CONSOLE_TOKEN_COLORS[1], token.toString(16))})`
+        )
+    }
+}
+
 /**
  * Converts "prompty" format files to GenAI script files.
  *
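
`parseTokenize` writes each token as a pair of its decoded text and its id in hexadecimal, prefixed with `x`. The sketch below reproduces only that output format so it can run on its own. The toy encoder and decoder, which treat every UTF-16 code unit as one token, are assumptions standing in for whatever `resolveTokenEncoder` returns, and the ANSI coloring done by `wrapColor` is left out.

```typescript
// Hypothetical stand-ins for the encoder/decoder pair; the commit obtains the
// real ones from resolveTokenEncoder. Here each UTF-16 code unit is a "token".
type Encoder = (text: string) => number[]
type Decoder = (tokens: number[]) => string

const toyEncode: Encoder = (text) =>
    text.split("").map((ch) => ch.charCodeAt(0))
const toyDecode: Decoder = (tokens) => String.fromCharCode(...tokens)

// Formats every token as "(decoded text, x<hex id>)", following the template
// string used in parseTokenize but without terminal colors.
function formatTokens(text: string, encode: Encoder, decode: Decoder): string {
    return encode(text)
        .map((token) => `(${decode([token])}, x${token.toString(16)})`)
        .join("")
}

console.log(formatTokens("Hi!", toyEncode, toyDecode))
// prints "(H, x48)(i, x69)(!, x21)"
```

With a real model encoder the decoded pieces would usually span several characters and the ids would be vocabulary indices, but the pairing of text and hex id stays the same.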

packages/core/src/runpromptcontext.ts

Lines changed: 0 additions & 46 deletions
@@ -108,25 +108,6 @@ import { dotGenaiscriptPath } from "./workdir"
 import { prettyBytes } from "./pretty"
 import { createCache } from "./cache"
 
-/**
- * Creates a context for generating chat turn prompts.
- *
- * @param options - Contains generation options such as model, lineNumbers, and fenceFormat.
- * @param trace - Trace object used to log the process and record outputs.
- *
- * @returns A context object used to generate prompt nodes and manage output-related functionalities.
- *
- * The returned context includes methods:
- * - `writeText`: Appends a text node to the prompt for a specified role, priority, and max token limit.
- * - `assistant`: Shortcut for writing text with an "assistant" role.
- * - `$`: Creates and appends a template string node, returning a chainable interface for modifiers (e.g., priority, transforms).
- * - `def`: Creates and appends a definition node for body content or external files, supporting error handling for empty definitions.
- * - `defData`: Adds structured data as a definition.
- * - `defDiff`: Adds a diff comparison between two data sets.
- * - `fence`: Shortcut for creating definition nodes with code fences.
- * - `importTemplate`: Imports a pre-defined template with associated data and appends it to the node tree.
- * - `console`: Diagnostic methods (`log`, `debug`, `warn`, `error`) for capturing and printing logs or errors during the generation process.
- */
 export function createChatTurnGenerationContext(
     options: GenerationOptions,
     trace: MarkdownTrace,
@@ -427,33 +408,6 @@ export interface RunPromptContextNode extends ChatGenerationContext {
     node: PromptNode
 }
 
-/**
- * Creates a chat generation context for handling prompts and related tasks in a conversational context.
- *
- * @param options - Configuration options for generation, including cancellation token, info callback, and user state.
- * @param trace - Tracing utilities for logging and debugging execution.
- * @param projectOptions - Project-specific parameters, including the project instance and environment variables.
- * @returns A context object with various utility functions and properties for managing prompts and AI interactions.
- *
- * Utility Functions:
- * - `defAgent(name, description, fn, options)`: Defines an agent with tools, memory, and task-solving capabilities.
- * - `defTool(name, description, parameters, fn, defOptions)`: Registers a tool for use in the chat session. Supports multiple formats for tool definitions, including callbacks and MCP server configurations.
- * - `defSchema(name, schema, defOptions)`: Defines a JSON schema for validation or metadata.
- * - `defImages(files, defOptions)`: Processes and encodes image files for their integration into prompts. Supports tiling and slicing of images.
- * - `defChatParticipant(generator, options)`: Adds chat participant logic (e.g., other agents or external systems).
- * - `defFileOutput(pattern, description, options)`: Specifies output file patterns for tracking in the session.
- * - `defOutputProcessor(fn)`: Adds output post-processing logic.
- * - `defFileMerge(fn)`: Declares logic for merging file changes in the session output.
- * - `prompt(strings, ...args)`: Runs a prompt with given input and additional options.
- * - `runPrompt(generator, runOptions)`: Executes prompt logic and generates results, supporting inner execution.
- * - `transcribe(audio, options)`: Transcribes audio input into text, with optional caching and language options.
- * - `speak(input, options)`: Converts text to speech and saves the audio file.
- * - `generateImage(prompt, imageOptions)`: Generates an image based on a textual description and custom options.
- *
- * Context Properties:
- * - `node`: Root node capturing children elements in the prompt structure.
- * - `env`: Environment variables passed to the project context for dynamic adjustments.
- */
 export function createChatGenerationContext(
     options: GenerationOptions,
     trace: MarkdownTrace,
