|
| 1 | +# TOON (Token-Oriented Object Notation) |
| 2 | + |
| 3 | +[English](README.md) | [中文](README.zh-CN.md) |
| 4 | + |
| 5 | +A compact, human-readable serialization format designed for passing structured data to large language models with significantly reduced token usage. |
| 6 | + |
| 7 | +[Python](https://www.python.org/downloads/) |
| 8 | +[License: MIT](https://opensource.org/licenses/MIT) |
| 9 | + |
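| 10 | +## Overview |
| 11 | + |
| 12 | +TOON achieves **CSV-like compactness** while adding **explicit structure**, which makes it well suited for: |
| 13 | +- Reducing token costs for LLM API calls |
| 14 | +- Improving context window efficiency |
| 15 | +- Keeping data human-readable |
| 16 | +- Preserving data structure and types |
| 17 | + |
| 18 | +### Key Features |
| 19 | + |
| 20 | +- ✅ **Compact**: 30-60% smaller than JSON for structured data |
| 21 | +- ✅ **Readable**: clean, indentation-based syntax |
| 22 | +- ✅ **Structured**: preserves nested objects and arrays |
| 23 | +- ✅ **Type-safe**: supports strings, numbers, booleans, and null |
| 24 | +- ✅ **Flexible**: multiple delimiter options (comma, tab, pipe) |
| 25 | +- ✅ **Smart**: automatically uses tabular format for uniform arrays |
| 26 | +- ✅ **Efficient**: key folding for deeply nested objects |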
| 27 | + |
| 28 | +## Installation |
| 29 | + |
| 30 | +```bash |
| 31 | +pip install toonify |
| 32 | +``` |
| 33 | + |
| 34 | +For a development install: |
| 35 | +```bash |
| 36 | +pip install toonify[dev] |
| 37 | +``` |
| 38 | + |
| 39 | +## Quick Start |
| 40 | + |
| 41 | +### Python API |
| 42 | + |
| 43 | +```python |
| 44 | +from toon import encode, decode |
| 45 | + |
| 46 | +# Encode a Python dict as TOON |
| 47 | +data = { |
| 48 | +    'products': [ |
| 49 | +        {'sku': 'LAP-001', 'name': 'Gaming Laptop', 'price': 1299.99}, |
| 50 | +        {'sku': 'MOU-042', 'name': 'Wireless Mouse', 'price': 29.99} |
| 51 | +    ] |
| 52 | +} |
| 53 | + |
| 54 | +toon_string = encode(data) |
| 55 | +print(toon_string) |
| 56 | +# Output: |
| 57 | +# products[2]{sku,name,price}: |
| 58 | +#   LAP-001,Gaming Laptop,1299.99 |
| 59 | +#   MOU-042,Wireless Mouse,29.99 |
| 60 | + |
| 61 | +# Decode TOON back to Python |
| 62 | +result = decode(toon_string) |
| 63 | +assert result == data |
| 64 | +``` |
| 65 | + |
| 66 | +### Command Line |
| 67 | + |
| 68 | +```bash |
| 69 | +# Encode JSON to TOON |
| 70 | +toon input.json -o output.toon |
| 71 | + |
| 72 | +# Decode TOON to JSON |
| 73 | +toon input.toon -o output.json |
| 74 | + |
| 75 | +# Use with pipes |
| 76 | +cat data.json | toon -e > data.toon |
| 77 | + |
| 78 | +# Show token statistics |
| 79 | +toon data.json --stats |
| 80 | +``` |
| 81 | + |
| 82 | +## TOON Format Specification |
| 83 | + |
| 84 | +### Basic Syntax |
| 85 | + |
| 86 | +```toon |
| 87 | +# Simple key-value pairs |
| 88 | +title: Machine Learning Basics |
| 89 | +chapters: 12 |
| 90 | +published: true |
| 91 | +``` |
| 92 | + |
| 93 | +### Arrays |
| 94 | + |
| 95 | +**Primitive arrays** (inline): |
| 96 | +```toon |
| 97 | +temperatures: [72.5,68.3,75.1,70.8,73.2] |
| 98 | +categories: [electronics,computers,accessories] |
| 99 | +``` |
| 100 | + |
| 101 | +**Tabular arrays** (uniform objects with a header): |
| 102 | +```toon |
| 103 | +inventory[3]{sku,product,stock}: |
| 104 | +  KB-789,Mechanical Keyboard,45 |
| 105 | +  MS-456,RGB Mouse Pad,128 |
| 106 | +  HD-234,USB Headset,67 |
| 107 | +``` |
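
To make the tabular layout concrete, here is a minimal pure-Python sketch of how a list of uniform dicts could be rendered in this form. It is an illustration only, not toonify's implementation: `encode_table` is a hypothetical helper, it assumes every row has the same keys in the same order, and it does not apply the quoting rules.

```python
def encode_table(key, rows, delimiter=","):
    """Render a list of uniform dicts as a TOON-style tabular array (illustrative only)."""
    if not rows:
        raise ValueError("rows must be non-empty")
    fields = list(rows[0].keys())
    # The tabular form only applies when every row has the same keys in the same order.
    if any(list(row.keys()) != fields for row in rows):
        raise ValueError("rows are not uniform")
    # Header: key, row count in brackets, field names in braces.
    header = f"{key}[{len(rows)}]{{{delimiter.join(fields)}}}:"
    lines = [header]
    for row in rows:
        # Each row is indented by two spaces and lists values in header order.
        lines.append("  " + delimiter.join(str(row[f]) for f in fields))
    return "\n".join(lines)


inventory = [
    {"sku": "KB-789", "product": "Mechanical Keyboard", "stock": 45},
    {"sku": "MS-456", "product": "RGB Mouse Pad", "stock": 128},
    {"sku": "HD-234", "product": "USB Headset", "stock": 67},
]
table = encode_table("inventory", inventory)
print(table)
```

The field names appear once in the header instead of once per row, which is where most of the size saving over JSON comes from.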
| 108 | + |
| 109 | +**List arrays** (non-uniform or nested): |
| 110 | +```toon |
| 111 | +tasks[2]: |
| 112 | +  Complete documentation |
| 113 | +  Review pull requests |
| 114 | +``` |
| 115 | + |
| 116 | +### Nested Objects |
| 117 | + |
| 118 | +```toon |
| 119 | +server: |
| 120 | +  hostname: api-prod-01 |
| 121 | +  config: |
| 122 | +    port: 8080 |
| 123 | +    region: us-east |
| 124 | +``` |
| 125 | + |
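| 126 | +### Quoting Rules |
| 127 | + |
| 128 | +Strings are quoted only when necessary, that is, when they: |
| 129 | +- Contain special characters (`,`, `:`, `"`, newlines) |
| 130 | +- Have leading or trailing whitespace |
| 131 | +- Look like literals (`true`, `false`, `null`) |
| 132 | +- Are empty |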
| 133 | + |
| 134 | +```toon |
| 135 | +simple: ProductName |
| 136 | +quoted: "Product, Description" |
| 137 | +escaped: "Size: 15\" display" |
| 138 | +multiline: "First feature\nSecond feature" |
| 139 | +``` |
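
The four rules above can be sketched as a small predicate. This is an illustration under stated assumptions rather than the library's code: `needs_quotes` and `quote_if_needed` are hypothetical names, only the rules listed above are checked, and escaping of backslashes is an assumption added so the output stays unambiguous.

```python
SPECIAL_CHARS = (",", ":", '"', "\n")
LITERALS = ("true", "false", "null")


def needs_quotes(value):
    """Return True if a string falls under one of the four quoting rules listed above."""
    if value == "":
        return True  # empty string
    if value != value.strip():
        return True  # leading or trailing whitespace
    if value in LITERALS:
        return True  # would otherwise be read as a boolean or null
    return any(ch in value for ch in SPECIAL_CHARS)  # special characters


def quote_if_needed(value):
    """Quote and escape a string only when the rules require it."""
    if not needs_quotes(value):
        return value
    # Backslash escaping is an assumption; the source only shows \" and \n.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


print(quote_if_needed("ProductName"))
print(quote_if_needed("Product, Description"))
print(quote_if_needed('Size: 15" display'))
```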
| 140 | + |
| 141 | +## API Reference |
| 142 | + |
| 143 | +### `encode(data, options=None)` |
| 144 | + |
| 145 | +Converts a Python object to a TOON string. |
| 146 | + |
| 147 | +**Parameters:** |
| 148 | +- `data`: Python dict or list |
| 149 | +- `options`: optional dict containing: |
| 150 | +  - `delimiter`: `'comma'` (default), `'tab'`, or `'pipe'` |
| 151 | +  - `indent`: number of spaces per indentation level (default: 2) |
| 152 | +  - `key_folding`: `'off'` (default) or `'safe'` |
| 153 | +  - `flatten_depth`: maximum depth for key folding (default: None) |
| 154 | + |
| 155 | +**Example:** |
| 156 | +```python |
| 157 | +toon = encode(data, { |
| 158 | +    'delimiter': 'tab', |
| 159 | +    'indent': 4, |
| 160 | +    'key_folding': 'safe' |
| 161 | +}) |
| 162 | +``` |
| 163 | + |
| 164 | +### `decode(toon_string, options=None)` |
| 165 | + |
| 166 | +Converts a TOON string to a Python object. |
| 167 | + |
| 168 | +**Parameters:** |
| 169 | +- `toon_string`: TOON-formatted string |
| 170 | +- `options`: optional dict containing: |
| 171 | +  - `strict`: validate structure strictly (default: True) |
| 172 | +  - `expand_paths`: `'off'` (default) or `'safe'` |
| 173 | +  - `default_delimiter`: default delimiter (default: `','`) |
| 174 | + |
| 175 | +**Example:** |
| 176 | +```python |
| 177 | +data = decode(toon_string, { |
| 178 | +    'expand_paths': 'safe', |
| 179 | +    'strict': False |
| 180 | +}) |
| 181 | +``` |
| 182 | + |
| 183 | +## CLI Usage |
| 184 | + |
| 185 | +``` |
| 186 | +usage: toon [-h] [-o OUTPUT] [-e] [-d] [--delimiter {comma,tab,pipe}] |
| 187 | +            [--indent INDENT] [--stats] [--no-strict] |
| 188 | +            [--key-folding {off,safe}] [--flatten-depth DEPTH] |
| 189 | +            [--expand-paths {off,safe}] |
| 190 | +            [input] |
| 191 | + |
| 192 | +TOON (Token-Oriented Object Notation) - convert between JSON and TOON formats |
| 193 | + |
| 194 | +positional arguments: |
| 195 | +  input                 Input file path (or "-" for stdin) |
| 196 | + |
| 197 | +options: |
| 198 | +  -h, --help            Show this help message and exit |
| 199 | +  -o, --output OUTPUT   Output file path (default: stdout) |
| 200 | +  -e, --encode          Force encode mode (JSON to TOON) |
| 201 | +  -d, --decode          Force decode mode (TOON to JSON) |
| 202 | +  --delimiter {comma,tab,pipe} |
| 203 | +                        Array delimiter (default: comma) |
| 204 | +  --indent INDENT       Indentation size (default: 2) |
| 205 | +  --stats               Show token statistics |
| 206 | +  --no-strict           Disable strict validation (decode only) |
| 207 | +  --key-folding {off,safe} |
| 208 | +                        Key folding mode (encode only) |
| 209 | +  --flatten-depth DEPTH Maximum key folding depth (encode only) |
| 210 | +  --expand-paths {off,safe} |
| 211 | +                        Path expansion mode (decode only) |
| 212 | +``` |
| 213 | + |
| 214 | +## Advanced Features |
| 215 | + |
| 216 | +### Key Folding |
| 217 | + |
| 218 | +Collapses chains of single-key objects into dot-separated paths: |
| 219 | + |
| 220 | +```python |
| 221 | +data = { |
| 222 | +    'api': { |
| 223 | +        'response': { |
| 224 | +            'product': { |
| 225 | +                'title': 'Wireless Keyboard' |
| 226 | +            } |
| 227 | +        } |
| 228 | +    } |
| 229 | +} |
| 230 | + |
| 231 | +# With key_folding='safe' |
| 232 | +toon = encode(data, {'key_folding': 'safe'}) |
| 233 | +# Output: api.response.product.title: Wireless Keyboard |
| 234 | +``` |
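
The folding step itself can be pictured as a walk down each chain of single-key dicts. The sketch below is one plausible way to do it, not the library's algorithm: `fold_keys` is a hypothetical helper that operates on plain dicts, and it ignores whatever extra checks the `'safe'` mode or `flatten_depth` may apply.

```python
def fold_keys(obj):
    """Collapse chains of single-key dicts into dotted keys (illustrative only)."""
    folded = {}
    for key, value in obj.items():
        path = [key]
        # Follow the chain while the value is a dict with exactly one key.
        while isinstance(value, dict) and len(value) == 1:
            next_key, value = next(iter(value.items()))
            path.append(next_key)
        # A dict with several keys ends the chain; fold its children separately.
        if isinstance(value, dict):
            value = fold_keys(value)
        folded[".".join(path)] = value
    return folded


nested = {"api": {"response": {"product": {"title": "Wireless Keyboard"}}}}
print(fold_keys(nested))
# → {'api.response.product.title': 'Wireless Keyboard'}
```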
| 235 | + |
| 236 | +### Path Expansion |
| 237 | + |
| 238 | +Expands dot-separated keys into nested objects: |
| 239 | + |
| 240 | +```python |
| 241 | +toon = 'store.location.zipcode: 10001' |
| 242 | + |
| 243 | +# With expand_paths='safe' |
| 244 | +data = decode(toon, {'expand_paths': 'safe'}) |
| 245 | +# Result: {'store': {'location': {'zipcode': 10001}}} |
| 246 | +``` |
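
Path expansion is the inverse of key folding. As a rough sketch, assuming the TOON text has already been parsed into a flat dict with dotted keys, the expansion could look like this. `expand_paths` here is a hypothetical stand-alone function, and raising on conflicting paths is an assumption about what a "safe" mode might do.

```python
def expand_paths(flat):
    """Expand dotted keys of a flat dict into nested dicts (illustrative only)."""
    nested = {}
    for dotted_key, value in flat.items():
        parts = dotted_key.split(".")
        node = nested
        # Create (or reuse) one dict per intermediate path segment.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"path conflict at {part!r} in {dotted_key!r}")
        # Refuse to overwrite an existing subtree with a scalar.
        if isinstance(node.get(parts[-1]), dict):
            raise ValueError(f"path conflict at {parts[-1]!r} in {dotted_key!r}")
        node[parts[-1]] = value
    return nested


expanded = expand_paths({"store.location.zipcode": 10001})
print(expanded)
# → {'store': {'location': {'zipcode': 10001}}}
```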
| 247 | + |
| 248 | +### Custom Delimiters |
| 249 | + |
| 250 | +Choose the delimiter that best fits your data: |
| 251 | + |
| 252 | +```python |
| 253 | +# Tab delimiter (better for spreadsheet-like data) |
| 254 | +toon = encode(data, {'delimiter': 'tab'}) |
| 255 | + |
| 256 | +# Pipe delimiter (when the data contains commas) |
| 257 | +toon = encode(data, {'delimiter': 'pipe'}) |
| 258 | +``` |
| 259 | + |
| 260 | +## Format Comparison |
| 261 | + |
| 262 | +### JSON vs TOON |
| 263 | + |
| 264 | +**JSON** (247 bytes): |
| 265 | +```json |
| 266 | +{ |
| 267 | + "products": [ |
| 268 | + {"id": 101, "name": "Laptop Pro", "price": 1299}, |
| 269 | + {"id": 102, "name": "Magic Mouse", "price": 79}, |
| 270 | + {"id": 103, "name": "USB-C Cable", "price": 19} |
| 271 | + ] |
| 272 | +} |
| 273 | +``` |
| 274 | + |
| 275 | +**TOON** (98 bytes, **60% reduction**): |
| 276 | +```toon |
| 277 | +products[3]{id,name,price}: |
| 278 | +  101,Laptop Pro,1299 |
| 279 | +  102,Magic Mouse,79 |
| 280 | +  103,USB-C Cable,19 |
| 281 | +``` |
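
Exact byte counts depend on whitespace and serialization settings, so it can be worth measuring on your own data. The sketch below uses only the standard library and a hand-written TOON string copied from the example above; it measures bytes, not tokens, since token counts would need a model-specific tokenizer.

```python
import json

records = {
    "products": [
        {"id": 101, "name": "Laptop Pro", "price": 1299},
        {"id": 102, "name": "Magic Mouse", "price": 79},
        {"id": 103, "name": "USB-C Cable", "price": 19},
    ]
}

# JSON as produced by the standard library with two-space indentation.
json_text = json.dumps(records, indent=2)

# TOON text written by hand to match the example above.
toon_text = "\n".join([
    "products[3]{id,name,price}:",
    "  101,Laptop Pro,1299",
    "  102,Magic Mouse,79",
    "  103,USB-C Cable,19",
])

json_bytes = len(json_text.encode("utf-8"))
toon_bytes = len(toon_text.encode("utf-8"))
reduction = 1 - toon_bytes / json_bytes
print(f"JSON: {json_bytes} bytes, TOON: {toon_bytes} bytes, reduction: {reduction:.0%}")
```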
| 282 | + |
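| 283 | +### When to Use TOON |
| 284 | + |
| 285 | +**Use TOON when:** |
| 286 | +- ✅ Passing data to LLM APIs (lower token costs) |
| 287 | +- ✅ Working with uniform tabular data |
| 288 | +- ✅ The context window is limited |
| 289 | +- ✅ Human readability matters |
| 290 | + |
| 291 | +**Use JSON when:** |
| 292 | +- ❌ You need maximum compatibility |
| 293 | +- ❌ The data is highly irregular or deeply nested |
| 294 | +- ❌ Your existing tooling supports only JSON |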
| 295 | + |
| 296 | +## Development |
| 297 | + |
| 298 | +### Setup |
| 299 | + |
| 300 | +```bash |
| 301 | +git clone https://github.com/ScrapeGraphAI/toonify.git |
| 302 | +cd toonify |
| 303 | +pip install -e .[dev] |
| 304 | +``` |
| 305 | + |
| 306 | +### Running Tests |
| 307 | + |
| 308 | +```bash |
| 309 | +pytest |
| 310 | +pytest --cov=toon --cov-report=term-missing |
| 311 | +``` |
| 312 | + |
| 313 | +### Running Examples |
| 314 | + |
| 315 | +```bash |
| 316 | +python examples/basic_usage.py |
| 317 | +python examples/advanced_features.py |
| 318 | +``` |
| 319 | + |
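| 320 | +## Performance |
| 321 | + |
| 322 | +TOON typically achieves: |
| 323 | +- **30-60% smaller size** than JSON for structured data |
| 324 | +- **40-70% fewer tokens** for tabular data |
| 325 | +- **Minimal overhead** for encoding and decoding (<1 ms for typical payloads) |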
| 326 | + |
| 327 | +## Contributing |
| 328 | + |
| 329 | +Contributions are welcome! Please: |
| 330 | + |
| 331 | +1. Fork the repository |
| 332 | +2. Create a feature branch (`git checkout -b feature/amazing-feature`) |
| 333 | +3. Make your changes and write tests |
| 334 | +4. Run the tests (`pytest`) |
| 335 | +5. Commit your changes (`git commit -m 'Add amazing feature'`) |
| 336 | +6. Push to the branch (`git push origin feature/amazing-feature`) |
| 337 | +7. Open a Pull Request |
| 338 | + |
| 339 | +## License |
| 340 | + |
| 341 | +MIT License. See the [LICENSE](LICENSE) file for details. |
| 342 | + |
| 343 | +## Acknowledgements |
| 344 | + |
| 345 | +This Python implementation is inspired by the TypeScript TOON library at [toon-format/toon](https://github.com/toon-format/toon). |
| 346 | + |
| 347 | +## Links |
| 348 | + |
| 349 | +- **GitHub**: https://github.com/ScrapeGraphAI/toonify |
| 350 | +- **PyPI**: https://pypi.org/project/toonify/ |
| 351 | +- **Documentation**: https://github.com/ScrapeGraphAI/toonify#readme |
| 352 | +- **Format specification**: https://github.com/toon-format/toon |
| 353 | + |
| 354 | +--- |
| 355 | + |
| 356 | +Made with care by the [ScrapeGraph team](https://scrapegraphai.com) |
| 357 | + |
| 358 | +<p align="center"> |
| 359 | + <img src="https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/main/docs/assets/scrapegraphai_logo.png" alt="ScrapeGraphAI Logo" width="250"> |
| 360 | +</p> |
| 361 | + |