This document details the design of language-specific extractors for open-cql. Extractors transform source code into relational databases that QL queries run against.
Source Files
│
▼
┌──────────────┐
│ tree-sitter │ ←── Language grammar
│ parsing │
└──────┬───────┘
│ CST (Concrete Syntax Tree)
▼
┌──────────────┐
│ AST/Type │ ←── Symbol table, type resolution
│ Analysis │
└──────┬───────┘
│ Typed AST
▼
┌──────────────┐
│ Fact │ ←── .dbscheme (target schema)
│ Emission │
└──────┬───────┘
│ TRAP-like records
▼
┌──────────────┐
│ Database │
│ Builder │
└──────┬───────┘
│
▼
open-cql Database
We parse CodeQL's .dbscheme files directly to understand the target schema.
This lets us:
- Validate that our extractors produce the right tables
- Potentially read CodeQL-generated databases for testing
- Stay aligned with the expected relational structure
// Entity type declarations
@file = @file_default;
// Table declarations
files(
unique int id: @file,
string name: string ref
);
locations_default(
unique int id: @location_default,
int file: @file ref,
int beginLine: int ref,
int beginColumn: int ref,
int endLine: int ref,
int endColumn: int ref
);
// Case/union types
@expr = @literal
| @unary_expr
| @binary_expr
| @call_expr
| ...;
struct DbScheme {
entity_types: HashMap<String, EntityType>,
tables: HashMap<String, Table>,
unions: HashMap<String, Vec<String>>,
}
struct Table {
name: String,
columns: Vec<Column>,
key_columns: Vec<usize>,
}
struct Column {
name: String,
col_type: ColumnType,
is_unique: bool,
is_ref: bool,
}
enum ColumnType {
Int,
Float,
String,
Boolean,
Entity(String), // Reference to @entity_type
}C/C++ is the most complex language to extract due to:
- Preprocessor: Macros, conditional compilation, includes
- Templates: Template instantiation, specialization, SFINAE
- Overload resolution: Complex rules for function overloading
- Name lookup: ADL, qualified lookup, dependent names
- Multiple translation units: Each .cpp file is compiled independently
Phase 1 (MVP): tree-sitter for AST extraction, limited type resolution
- Parse with
tree-sitter-cpp - Extract AST structure (classes, functions, statements, expressions)
- Basic type resolution for simple cases
- Skip template instantiation initially
Phase 2: Integrate with clang for semantic analysis
- Use
libclangorclang-sysfor type checking and name resolution - Clang provides accurate AST with full semantic information
- Map clang AST nodes to our relational schema
From semmlecode.cpp.dbscheme, the essential tables:
File system:
files(id, name)— Source filesfolders(id, name)— Directorieslocations_default(id, file, beginLine, beginColumn, endLine, endColumn)
Types:
builtintypes(id, name, kind, size, sign, alignment)— Primitive typesderivedtypes(id, name, kind, type)— Pointers, references, arraysusertypes(id, name, kind)— Classes, structs, unions, enums
Declarations:
functions(id, name, type, kind)— Function declarationsfunction_entry_point(id, stmt)— Function body entryvariables(id, name, type, kind)— Variable declarationsfieldoffsets(id, type, offset, bitoffset)— Field layouts
Expressions (89+ kinds):
exprs(id, kind, type, location)— All expressionsexpr_types(expr, type, value_category)— Expression type info
Statements (26+ kinds):
stmts(id, kind, location)— All statementsstmt_parent(stmt, index, parent)— Statement tree
Relationships:
derivations(id, sub, super, index)— Inheritanceoverrides(id, base, derived)— Virtual overridefunbind(expr, function)— Call target binding
Example: extracting a function declaration
int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n - 1);
}tree-sitter CST:
(function_definition
type: (primitive_type) → "int"
declarator: (function_declarator
declarator: (identifier) → "factorial"
parameters: (parameter_list
(parameter_declaration
type: (primitive_type) → "int"
declarator: (identifier)))) → "n"
body: (compound_statement ...))
Emitted relations:
files(1, "example.cpp")
locations_default(1, 1, 1, 1, 4, 1)
functions(1, "factorial", 100, "normal") -- type 100 = int(int)
variables(1, "n", 101, "parameter") -- type 101 = int
stmts(1, "if_stmt", ...)
stmts(2, "return_stmt", ...)
stmts(3, "return_stmt", ...)
exprs(1, "le_expr", ...) -- n <= 1
exprs(2, "literal", ...) -- 1
exprs(3, "call_expr", ...) -- factorial(n - 1)
...
- No preprocessor
- Clear package/import structure
- Generics are simpler than templates (type erasure)
- Well-defined name resolution
- Single compilation unit per file (mostly)
Phase 1 (MVP): tree-sitter for AST extraction
- Parse with
tree-sitter-java - Extract full AST including type declarations, methods, fields
- Basic type resolution using import declarations
- Handle generic types at the surface level
Phase 2: Deeper semantic analysis
- Full type resolution including generics
- Method overload resolution
- Annotation processing
- Lambda expression desugaring
From semmlecode.dbscheme (Java):
Packages and types:
packages(id, name)— Java packagesclasses_or_interfaces(id, name, package, kind)— Type declarationsprimitives(id, name)— Primitive typesarrays(id, name, component_type)— Array types
Members:
fields(id, name, type, parent)— Field declarationsmethods(id, name, signature, type, parent)— Method declarationsconstructors(id, name, signature, parent)— Constructor declarationsparams(id, type, index, callable)— Method/constructor parameters
Expressions and statements:
exprs(id, kind, type, parent, index)— 89 expression kindsstmts(id, kind, parent, index)— 26 statement kindsvariableBinding(expr, variable)— Variable reference resolutioncallableBinding(expr, callable)— Method call resolution
Generics:
typeVars(id, name, pos, parent)— Type parameterstypeBounds(id, type, pos, typevar)— Type boundswildcards(id, name, kind)— Wildcard types
Both extractors need to construct control flow graphs for the database. The CFG is essential for data flow analysis.
struct CfgBuilder {
nodes: Vec<CfgNode>,
edges: Vec<(CfgNodeId, CfgNodeId, CfgEdgeKind)>,
}
enum CfgNode {
Entry(FunctionId),
Exit(FunctionId),
Expr(ExprId),
Stmt(StmtId),
Guard(ExprId, bool), // Branch condition (true/false arm)
}
enum CfgEdgeKind {
Normal,
True, // Condition is true
False, // Condition is false
Exception, // Exception thrown
Break,
Continue,
Return,
}CFG construction rules:
- Sequential statements: edge from end of S1 to start of S2
- If statement: edges from condition to true/false branches, both converge after
- While/for: edge from condition to body (true) and exit (false), back-edge from body end to condition
- Try/catch: edges from any statement to catch handlers
- Return/break/continue: edge to function exit / loop exit / loop header
All extractors share a common ID allocation scheme:
struct IdAllocator {
next_id: AtomicU64,
}
impl IdAllocator {
fn alloc(&self) -> EntityId {
EntityId(self.next_id.fetch_add(1, Ordering::Relaxed))
}
}Extractors emit facts in a common intermediate format:
struct Fact {
table: TableName,
values: Vec<Value>,
}
trait FactEmitter {
fn emit(&mut self, fact: Fact);
fn emit_location(&mut self, entity: EntityId, file: FileId,
start_line: u32, start_col: u32,
end_line: u32, end_col: u32);
}The database includes a copy of the analyzed source files for displaying results in context:
database/source/
├── src/
│ ├── main.cpp
│ ├── utils.h
│ └── ...
└── lib/
└── ...
Create minimal C++/Java programs that exercise specific language features. Extract them and verify the resulting database has the expected relations.
For programs where we can run both CodeQL and open-cql:
- Extract with both tools
- Compare the resulting databases (modulo ID differences)
- Run the same queries and compare results
Verify that tree-sitter can parse all our test programs correctly. Track any parsing failures or inaccuracies.