This document describes the ReCoco transform functions implemented for Thread's semantic extraction capabilities.
The Thread-ReCoco integration provides dataflow-based code analysis through transform functions that extract semantic information from source code. These functions follow the ReCoco SimpleFunctionFactory/SimpleFunctionExecutor pattern.
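The factory/executor split mentioned above can be pictured with a small sketch. The traits below are simplified, synchronous stand-ins written for illustration; they are not ReCoco's actual SimpleFunctionFactory or SimpleFunctionExecutor definitions, which are async and take arguments this document elides. `UppercaseFactory` and `UppercaseExecutor` are hypothetical names.

```rust
/// Simplified stand-in for an executor: evaluates one set of inputs.
/// (Assumption: the real trait is async and works on ReCoco values.)
trait Executor {
    fn evaluate(&self, inputs: Vec<String>) -> Result<String, String>;
}

/// Simplified stand-in for a factory: builds a reusable executor once.
trait Factory {
    type Exec: Executor;
    fn build(&self) -> Result<Self::Exec, String>;
}

/// Hypothetical function that upper-cases its single input.
struct UppercaseFactory;
struct UppercaseExecutor;

impl Factory for UppercaseFactory {
    type Exec = UppercaseExecutor;

    fn build(&self) -> Result<Self::Exec, String> {
        // A real factory would validate argument schemas here.
        Ok(UppercaseExecutor)
    }
}

impl Executor for UppercaseExecutor {
    fn evaluate(&self, inputs: Vec<String>) -> Result<String, String> {
        match inputs.as_slice() {
            [only] => Ok(only.to_uppercase()),
            other => Err(format!("expected 1 input, got {}", other.len())),
        }
    }
}
```

The point of the split is that `build` runs once per flow while `evaluate` runs once per row, so any setup cost is paid only in the factory.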
Factory: ThreadParseFactory
Executor: ThreadParseExecutor
Input:
- content (String): Source code content
- language (String): Language identifier or file extension
- file_path (String, optional): Path for context
Output: Struct containing three tables:
- symbols: LTable of symbol definitions
- imports: LTable of import statements
- calls: LTable of function calls
Features:
- Content-addressable caching enabled
- 30-second timeout
- Automatic language detection from extensions
- Hash-based content identification
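Extension-based language detection, as listed above, might look roughly like the following. This is a minimal sketch rather than Thread's actual lookup: `detect_language`, `lookup`, and the small mapping table are hypothetical, and the real implementation presumably covers more languages.

```rust
use std::path::Path;

/// Hypothetical helper: resolve a language from the `language` argument
/// (a name or an extension), falling back to the extension of `file_path`.
fn detect_language(language: &str, file_path: Option<&str>) -> Option<&'static str> {
    // Accept ".rs" as well as "rs".
    let from_hint = lookup(language.trim_start_matches('.'));
    from_hint.or_else(|| {
        let ext = Path::new(file_path?).extension()?.to_str()?;
        lookup(ext)
    })
}

/// Hypothetical mapping table; matching is case-insensitive.
fn lookup(key: &str) -> Option<&'static str> {
    match key.to_ascii_lowercase().as_str() {
        "rs" | "rust" => Some("rust"),
        "py" | "python" => Some("python"),
        "ts" | "typescript" => Some("typescript"),
        "js" | "javascript" => Some("javascript"),
        "go" => Some("go"),
        _ => None,
    }
}
```

Returning `None` for an unrecognized language leaves it to the caller to decide how to report the failure.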
Factory: ExtractSymbolsFactory
Executor: ExtractSymbolsExecutor
Input:
- parsed_document (Struct): Output from ThreadParse
Output: LTable with schema:
- name (String): Symbol name
- kind (String): Symbol type (Function, Class, Variable, etc.)
- scope (String): Lexical scope path
Features:
- Extracts first field from parsed document
- Caching enabled
- 30-second timeout
Factory: ExtractImportsFactory
Executor: ExtractImportsExecutor
Input:
- parsed_document (Struct): Output from ThreadParse
Output: LTable with schema:
- symbol_name (String): Imported symbol name
- source_path (String): Import source module/file
- kind (String): Import type (Named, Default, Namespace, etc.)
Features:
- Extracts second field from parsed document
- Caching enabled
- 30-second timeout
Factory: ExtractCallsFactory
Executor: ExtractCallsExecutor
Input:
- parsed_document (Struct): Output from ThreadParse
Output: LTable with schema:
- function_name (String): Called function name
- arguments_count (Int64): Number of arguments
Features:
- Extracts third field from parsed document
- Caching enabled
- 30-second timeout
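Each of the three extract functions picks one positional field out of the ThreadParse output. The sketch below illustrates that idea with a simplified `Value` enum that stands in for ReCoco's value type; the enum, the field-index constants, and `extract_table` are all hypothetical and exist only for this example.

```rust
/// Simplified stand-in for a dataflow value; NOT ReCoco's actual `Value` type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Int64(i64),
    /// A table modeled as a list of rows, each row a list of values.
    LTable(Vec<Vec<Value>>),
    /// A struct modeled as an ordered list of fields.
    Struct(Vec<Value>),
}

/// Field positions in the parsed document, following the order listed above.
const SYMBOLS_FIELD: usize = 0;
const IMPORTS_FIELD: usize = 1;
const CALLS_FIELD: usize = 2;

/// Hypothetical helper: borrow the table stored at `index` in a parsed document.
fn extract_table(parsed: &Value, index: usize) -> Result<&Value, String> {
    match parsed {
        Value::Struct(fields) => match fields.get(index) {
            Some(v) if matches!(v, Value::LTable(_)) => Ok(v),
            Some(other) => Err(format!("field {index} is not a table: {other:?}")),
            None => Err(format!("parsed document has no field {index}")),
        },
        other => Err(format!("expected a struct, got {other:?}")),
    }
}
```

Because the lookup is a single indexed access into the field list, its cost does not depend on the size of the source file.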
All schema types are defined in conversion.rs:
pub fn symbol_type() -> ValueType { /* ... */ }
pub fn import_type() -> ValueType { /* ... */ }
pub fn call_type() -> ValueType { /* ... */ }
These schemas use ReCoco's type system (ValueType, StructSchema, FieldSchema) to define the structure of extracted data.
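To make the three row schemas concrete, the sketch below spells out their fields using simplified stand-in types. `BasicType`, `FieldSketch`, and the `*_fields` functions are hypothetical and are not the ReCoco types or the functions in conversion.rs; only the field names and types are taken from the tables described earlier in this document.

```rust
/// Stand-in for a primitive column type; NOT a ReCoco type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BasicType {
    Str,
    Int64,
}

/// Stand-in for one field of a row schema; NOT ReCoco's `FieldSchema`.
#[derive(Debug, Clone, PartialEq)]
struct FieldSketch {
    name: &'static str,
    ty: BasicType,
}

/// Build a field list from (name, type) pairs.
fn fields(spec: &[(&'static str, BasicType)]) -> Vec<FieldSketch> {
    spec.iter().map(|&(name, ty)| FieldSketch { name, ty }).collect()
}

/// Row layout of the symbols table.
fn symbol_fields() -> Vec<FieldSketch> {
    fields(&[
        ("name", BasicType::Str),
        ("kind", BasicType::Str),
        ("scope", BasicType::Str),
    ])
}

/// Row layout of the imports table.
fn import_fields() -> Vec<FieldSketch> {
    fields(&[
        ("symbol_name", BasicType::Str),
        ("source_path", BasicType::Str),
        ("kind", BasicType::Str),
    ])
}

/// Row layout of the calls table.
fn call_fields() -> Vec<FieldSketch> {
    fields(&[
        ("function_name", BasicType::Str),
        ("arguments_count", BasicType::Int64),
    ])
}
```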
crates/flow/src/
├── functions/
│   ├── mod.rs        # Exports all factories
│   ├── parse.rs      # ThreadParseFactory
│   ├── symbols.rs    # ExtractSymbolsFactory
│   ├── imports.rs    # ExtractImportsFactory
│   └── calls.rs      # ExtractCallsFactory
├── conversion.rs     # Schema definitions and serialization
├── bridge.rs         # CocoIndexAnalyzer integration
└── lib.rs            # Main library entry
use thread_flow::functions::{
    ThreadParseFactory,
    ExtractSymbolsFactory,
    ExtractImportsFactory,
    ExtractCallsFactory,
};
// Create flow pipeline
let parse_op = ThreadParseFactory;
let symbols_op = ExtractSymbolsFactory;
let imports_op = ExtractImportsFactory;
let calls_op = ExtractCallsFactory;
// Build executors
let parse_executor = parse_op.build(/* ... */).await?;
let symbols_executor = symbols_op.build(/* ... */).await?;
// Execute pipeline
let parsed_doc = parse_executor.evaluate(vec![
    Value::Str("fn main() {}".into()),
    Value::Str("rs".into()),
    Value::Str("main.rs".into()),
]).await?;
let symbols_table = symbols_executor.evaluate(vec![parsed_doc]).await?;
These transform functions integrate with CocoIndex's dataflow framework to provide:
- Content-Addressed Caching: Parse results are cached by content hash
- Incremental Updates: Only re-analyze changed files
- Dependency Tracking: Track symbol usage across files
- Storage Backend: Results can be persisted to Postgres, D1, or Qdrant
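Content-addressed caching and incremental updates rest on the same idea: key each result by a hash of its input, so unchanged content is never re-parsed. The following is a minimal sketch of that idea, not the framework's cache. `ParseCache` is a hypothetical type, the cached "result" is just a `usize`, and `DefaultHasher` is used only to keep the example dependency-free; it is not a cryptographic hash, so a real content-addressed store would likely use a stronger one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical cache keyed by a hash of (language, content).
struct ParseCache {
    entries: HashMap<u64, usize>,
    /// Number of times the parse closure actually ran.
    misses: usize,
}

impl ParseCache {
    fn new() -> Self {
        ParseCache { entries: HashMap::new(), misses: 0 }
    }

    /// Derive the cache key from the inputs that determine the result.
    fn key(language: &str, content: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        language.hash(&mut hasher);
        content.hash(&mut hasher);
        hasher.finish()
    }

    /// Return the cached result, or run `parse` and store its result on a miss.
    fn get_or_parse(
        &mut self,
        language: &str,
        content: &str,
        parse: impl FnOnce(&str) -> usize,
    ) -> usize {
        let key = Self::key(language, content);
        if let Some(&hit) = self.entries.get(&key) {
            return hit;
        }
        self.misses += 1;
        let result = parse(content);
        self.entries.insert(key, result);
        result
    }
}
```

Editing a file changes its hash, so only that file misses the cache on the next run; every unchanged file is served from `entries`.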
- Parse: O(n) where n = source code length
- Extract: O(1) field access from parsed struct
- Caching: Near-instant for cache hits
- Timeout: 30 seconds per operation (configurable)
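A per-operation timeout can be illustrated with the standard library alone. This sketch is thread-based to stay self-contained, whereas the functions in this document are async, so the actual timeout is presumably enforced differently. `run_with_timeout` is a hypothetical helper.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: run `work` on a worker thread and stop waiting after
/// `limit`. Returns `None` if the result did not arrive in time.
fn run_with_timeout<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}
```

Note that the worker thread is not cancelled when the limit expires; it keeps running detached, and the caller simply stops waiting for it.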
All functions use ReCoco's error system:
- Error::client(): Invalid input or unsupported language
- Error::internal_msg(): Internal processing errors
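The distinction between the two error categories can be sketched as follows. `FlowError` is a stand-in enum written for this example and is not ReCoco's `Error` type; `check_language`, `check_output_arity`, and the `SUPPORTED` list are likewise hypothetical.

```rust
use std::fmt;

/// Stand-in for the two error categories; NOT ReCoco's actual `Error` type.
#[derive(Debug, PartialEq)]
enum FlowError {
    /// The caller supplied something invalid (bad input, unsupported language).
    Client(String),
    /// Something went wrong inside the function itself.
    Internal(String),
}

impl fmt::Display for FlowError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FlowError::Client(msg) => write!(f, "client error: {msg}"),
            FlowError::Internal(msg) => write!(f, "internal error: {msg}"),
        }
    }
}

impl std::error::Error for FlowError {}

/// Hypothetical list of supported language identifiers.
const SUPPORTED: &[&str] = &["rs", "py", "ts"];

/// An unsupported language is the caller's mistake, so it is a client error.
fn check_language(language: &str) -> Result<(), FlowError> {
    if SUPPORTED.iter().any(|&known| known == language) {
        Ok(())
    } else {
        Err(FlowError::Client(format!("unsupported language: {language}")))
    }
}

/// A malformed parse result is not something the caller caused, so it is
/// reported as an internal error.
fn check_output_arity(field_count: usize) -> Result<(), FlowError> {
    if field_count == 3 {
        Ok(())
    } else {
        Err(FlowError::Internal(format!(
            "expected 3 output tables, got {field_count}"
        )))
    }
}
```

Keeping the two categories separate lets a caller decide whether to fix its input or to report a bug.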
Potential additions:
- Type information extraction
- Control flow graph generation
- Complexity metrics calculation
- Documentation extraction
- Cross-reference resolution