Feature/json indexes#2477
Open
palmer159 wants to merge 42 commits into
added 30 commits
May 14, 2026 17:03
Adds the design document for BSON-path functional indexes (Phase 0–5) plus six self-contained implementation plans. Plans are written so a subagent can execute each phase end-to-end with TDD discipline and local verification.
…okup
Indexed BSON expressions are stored in canonical JSONPath form ('$.name'),
but CommonComparisonExpressionUtils.getFieldFromDocument was the legacy
non-canonical walker that treats the leading '$' as a top-level field name
and returns null for any indexed lookup. As a result, BsonValueFunction.evaluate
took the missing-path branch on every Put, sparse-skip kicked in, and the
index never received any rows.
Add a canonical-aware walker that handles '$.field', '$['quoted field']', and
'$.field[n]' forms, and dispatch to it when the path begins with '$'. Legacy
non-canonical paths still flow through the original walker unchanged.
Incremental changes for Expression-Based Secondary Indexes on BSON/JSON Paths
IndexExpressionParseNodeRewriter.leaveCompoundNode matches ParseNode by equals().
FunctionParseNode.equals compares name + children; a path literal child is a
LiteralParseNode whose equals is byte-for-byte on the value. So
BSON_VALUE(doc, 'a.b', 'VARCHAR') and BSON_VALUE(doc, '$.a.b', 'VARCHAR'), which are
semantically identical, do not match the same index. A user whose
query spells the path one way misses an index created with the other spelling.
(Bson5IT) shows BSON_VALUE(COL, p, t) = ? hitting the index. Range (<, <=, >, >=),
BETWEEN, and IN are not exercised. Because the rewriter relies on AST equality of the
indexed expression against the predicate LHS, range predicates
will rewrite correctly for the indexed column (that part is generic scan-range
derivation), but canonicalization differences will still cause misses.
type.
BSON_VALUE(doc, 'x', 'VARCHAR') returns VARCHAR bytes, which is fine. But
BSON_VALUE(doc, 'x', 'DOUBLE') at the write path sets ptr to
PDouble.INSTANCE.toBytes(double) (IEEE 754 bits), which is not order-preserving
under unsigned byte comparison for negative values. Range scans on such indexes will
return incorrect ordering across sign boundaries. Must be verified; if confirmed,
fixed at index-key time with a sign-flip or by routing through Phoenix's fixed-width
numeric encoders that already provide order-preserving bytes.
IndexStatementRewriter.translate runs on every SELECT, walking the index column list and
parsing every indexed expression. On tables with no BSON columns and no BSON indexes,
this is pure overhead. It does not
scale as BSON indexes proliferate.
BSON_VALUE, BSON_VALUE_TYPE, BSON_CONDITION_EXPRESSION. There is no -> or ->> in the
grammar (grep of PhoenixSQL.g, PhoenixBsonExpression.g: no hits). Adding them is
a self-contained grammar change independent of the indexing work.
EXPLAIN hint showing which canonical path matched, no counter for partial-index
skips. Operators debugging "why didn't my query hit the index" have no signal.
2. What this design delivers
A focused, incremental enhancement program that closes the five real gaps above without
inventing new infrastructure where Phoenix already has working infrastructure.
Non-goals (unchanged from prior spec, reaffirmed)
@>) predicates.
Goals, restated
index as the DDL that created it.
<, <=, >, >=), BETWEEN, and IN predicates on BSON-path expressions
must use the index correctly, including with correct sort order for numeric types.
indexes.
Bson5IT breaks.
3. Architecture
Two modest additions, both client-side; no on-disk format change; no coprocessor change.
BsonPathNormalizer is applied at two points: (a) during CREATE INDEX compilation,
before the expression string is persisted into SYSTEM.CATALOG, so equivalent paths
produce identical expression strings; (b) during query rewrite, to the WHERE-clause
parse nodes, so a differently-spelled path rewrites to the canonical form and then
matches by ParseNode.equals.
Everything else (IndexMaintainer.buildRowKey, IndexRegionObserver.preBatchMutate,
partial-index WHERE compilation, CONSISTENCY modes, INCLUDE semantics) is reused
without modification.
4. Components
4.1 BsonPathNormalizer
phoenix-core-client/src/main/java/org/apache/phoenix/parse/bson/.
ParseNode tree and, whenever it finds a BsonValueParseNode whose path literal (second
argument) is a constant string, replaces that literal with the canonical form of the
same path.
$. or $ if present (Phoenix's BSON_VALUE uses paths without the $. prefix, confirmed by
Bson5IT.java:111 which uses 'rather[3].outline. clock', no leading $). Accept both forms
on input; emit no-prefix form.
quotes (['a'] → .a); otherwise keep exact bracketed form.
*), filter expressions ([?(...)]), recursive descent (..), and slice syntax ([a:b]) with
a clear SQLException that identifies the offending segment.
(type, default). Those must match byte-for-byte between DDL and query, as today.
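The rules above can be sketched at the string level as follows. This is an illustration under stated assumptions, not the proposed class: BsonPathNormalizerSketch and its helpers are hypothetical, it operates on a path string where the real normalizer would rewrite a ParseNode literal, it treats "simple identifier" as [A-Za-z_][A-Za-z0-9_]*, and it assumes quoted keys contain no ']' character.

```java
import java.sql.SQLException;
import java.util.regex.Pattern;

/** Hypothetical string-level sketch of the 4.1 rules; emits the no-prefix form. */
final class BsonPathNormalizerSketch {
    private static final Pattern SIMPLE_IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    static String normalize(String path) throws SQLException {
        // Drop a leading '$'; the dot that may follow it is consumed by the loop below.
        String p = path.startsWith("$") ? path.substring(1) : path;
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < p.length()) {
            if (p.charAt(i) == '[') {
                int close = p.indexOf(']', i);
                if (close < 0) {
                    throw unsupported(path, p.substring(i));
                }
                String inner = p.substring(i + 1, close);
                String segment = p.substring(i, close + 1);
                if (isQuoted(inner)) {
                    String key = inner.substring(1, inner.length() - 1);
                    if (SIMPLE_IDENTIFIER.matcher(key).matches()) {
                        appendField(out, key);   // ['a'] collapses to .a
                    } else {
                        out.append(segment);     // keep the exact bracketed form
                    }
                } else if (DIGITS.matcher(inner).matches()) {
                    out.append(segment);         // array index
                } else {
                    throw unsupported(path, segment); // [*], [?(...)], [a:b]
                }
                i = close + 1;
            } else {
                if (p.charAt(i) == '.') {
                    i++;
                }
                int end = i;
                while (end < p.length() && p.charAt(end) != '.' && p.charAt(end) != '[') {
                    end++;
                }
                String name = p.substring(i, end);
                if (name.isEmpty()) {
                    throw unsupported(path, ".."); // recursive descent or a dangling dot
                }
                if (name.equals("*")) {
                    throw unsupported(path, name);
                }
                appendField(out, name);
                i = end;
            }
        }
        return out.toString();
    }

    private static boolean isQuoted(String s) {
        return s.length() >= 2
            && ((s.startsWith("'") && s.endsWith("'")) || (s.startsWith("\"") && s.endsWith("\"")));
    }

    private static void appendField(StringBuilder out, String name) {
        if (out.length() > 0) {
            out.append('.');
        }
        out.append(name);
    }

    private static SQLException unsupported(String path, String segment) {
        return new SQLException("Unsupported BSON path segment '" + segment + "' in path: " + path);
    }
}
```

With these rules 'a.b', '$.a.b', and '$['a'].b' all normalize to the same string, while a path that is already in no-prefix form, such as the one quoted from Bson5IT, passes through unchanged.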
4.2 Fast-path guard in IndexExpressionParseNodeRewriter
IndexExpressionParseNodeRewriter's constructor today parses every index column's
expression string (IndexExpressionParseNodeRewriter.java:62-75). For a table with ten
BSON-path indexes, that's ten SQLParser.parseCondition(...) calls per query.
Change:
hasBsonIndex check on the index PTable at construction time. If no index
column has a BsonValueParseNode-rooted expression, skip the BSON normalizer
invocation on the WHERE clause entirely. This is the M6 fast-path from review.
indexedParseNodeToColumnParseNodeMap with the already-canonical
ParseNode (see §4.1); when normalizing the WHERE clause, look up by canonical form.
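A rough model of the guard, assuming expression strings stand in for ParseNodes and a caller-supplied function stands in for BsonPathNormalizer. FastPathRewriterSketch and its members are hypothetical names; only the shape of the check (decide once at construction, skip normalization of the WHERE clause when no indexed expression is BSON-rooted) is taken from the text above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical model: strings stand in for ParseNodes and index columns. */
final class FastPathRewriterSketch {
    private final boolean hasBsonIndex;
    private final Map<String, String> canonicalExprToIndexColumn = new HashMap<>();
    private final UnaryOperator<String> normalizer;

    FastPathRewriterSketch(Map<String, String> indexedExprToColumn,
                           UnaryOperator<String> normalizer) {
        this.normalizer = normalizer;
        boolean sawBson = false;
        for (Map.Entry<String, String> e : indexedExprToColumn.entrySet()) {
            boolean isBson = e.getKey().startsWith("BSON_VALUE(");
            sawBson |= isBson;
            // Only BSON-rooted expressions are canonicalized; others are stored as-is.
            String key = isBson ? normalizer.apply(e.getKey()) : e.getKey();
            canonicalExprToIndexColumn.put(key, e.getValue());
        }
        this.hasBsonIndex = sawBson;
    }

    /** Returns the index column that replaces the expression, or null when none matches. */
    String rewrite(String whereExpression) {
        // Fast path: without any BSON index the normalizer is never invoked.
        String key = hasBsonIndex ? normalizer.apply(whereExpression) : whereExpression;
        return canonicalExprToIndexColumn.get(key);
    }
}
```

Because the map keys are already canonical, a differently spelled predicate matches by plain equality after one normalization pass, which is the lookup-by-canonical-form behavior described in the second item.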
4.3 Numeric sort order verification + fix
This is the one place where existing production behavior is likely wrong and we must
change real code.
For BSON_VALUE(doc, 'x', 'DOUBLE') used as an index key, the write path currently
routes through BsonValueFunction.evaluate → PDouble.INSTANCE.toBytes(double). This
uses Double.doubleToLongBits raw bytes; they are not order-preserving under
unsigned byte comparison across sign boundaries (negatives sort after positives). The
same issue applies to PFloat. PInteger, PLong, PSmallint, PTinyint use
Phoenix's offset-encoded integers which are order-preserving. PDecimal uses its own
encoding which is order-preserving.
Action:
negatives and positives, runs BSON_VALUE(...) BETWEEN -1 AND 1, and asserts correctness.
Phoenix's existing order-preserving encoder for those types. This may already exist
in IndexUtil.getIndexColumnDataType / PDataType.coerceBytes; it must be traced on
actual execution paths, not assumed.
BSON_VALUE type codes in a range predicate.
This is the one fix that must happen regardless of everything else: it is a latent
correctness bug, not a feature gap.
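The failure mode and the sign-flip remedy can be shown in isolation. The sketch below is plain Java with hypothetical names (DoubleSortOrderSketch, rawBytes, orderPreservingBytes); it makes no claim about what PDouble.INSTANCE.toBytes actually emits, which is exactly what the verification step is meant to establish. It requires Java 9 or later for Arrays.compareUnsigned.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Demonstrates why raw IEEE 754 bits misorder negatives and how a sign-flip repairs it. */
final class DoubleSortOrderSketch {

    /** Raw IEEE 754 bits, big-endian. */
    static byte[] rawBytes(double v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(Double.doubleToLongBits(v)).array();
    }

    /** Sign-flip: negatives get every bit inverted, non-negatives only the sign bit. */
    static byte[] orderPreservingBytes(double v) {
        long bits = Double.doubleToLongBits(v);
        // (bits >> 63) is all ones for negatives and zero otherwise.
        bits ^= (bits >> 63) | Long.MIN_VALUE;
        return ByteBuffer.allocate(Long.BYTES).putLong(bits).array();
    }

    /** Lexicographic comparison treating each byte as unsigned, like a row-key comparator. */
    static int compareUnsigned(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        // Raw bits: -1.0 compares greater than 1.0 because its sign bit is set.
        System.out.println(compareUnsigned(rawBytes(-1.0), rawBytes(1.0)) > 0);
        // Sign-flipped bits: -1.0 compares less than 1.0, as a range scan needs.
        System.out.println(compareUnsigned(orderPreservingBytes(-1.0), orderPreservingBytes(1.0)) < 0);
    }
}
```

A test along the lines of BETWEEN -1 AND 1 exercises exactly this boundary: with raw bits the encoded key for -1.0 sorts after the key for 1.0, so the scan range is empty or wrong; with the transform the byte order agrees with numeric order.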
4.4 Predicate rewrite coverage for range / BETWEEN / IN
Phoenix's scan-range derivation over an indexed column already supports all of
=, <, <=, >, >=, BETWEEN, IN, and != (see WhereCompiler). The machinery works
once the LHS of the predicate is matched to an index column, which the
IndexExpressionParseNodeRewriter does today.
So the work here is not new rewrite code; it is test coverage and verification that
canonicalized BSON path predicates flow through existing scan-range derivation for all
predicate forms. Concretely:
BsonPathIndexPredicateIT: for each of =, <, <=, >, >=, BETWEEN, IN, !=, assert the
plan uses the index and the result set matches a no-index baseline.
Cover VARCHAR, BIGINT, DOUBLE, DECIMAL, DATE, BOOLEAN paths.
LIKE, IS NULL / IS NOT NULL (the latter works today via partial index, not via rewrite;
a user's explicit IS NOT NULL predicate hits the index because Phoenix's scan machinery
treats non-empty key as present; this is the existing behavior in
Bson5IT), CAST wrapping the BSON_VALUE on the query side, arithmetic wrappers.
4.5 Observability
phoenix.index.bson.rewrite.hit and phoenix.index.bson.rewrite.miss: client-side
counters tagged with table_name, index_name. Incremented whenever the rewriter
runs against a table with BSON indexes.
[BSON path: <canonical-path>, type: <TYPE>] to the existing RANGE SCAN plan line.
The existing code path for plan-line generation lives in
ExplainPlan / ScanPlan.getExplainSteps(); adding a suffix from IndexMaintainer
metadata is a small change.
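As a rough illustration of both pieces, the sketch below keeps tagged hit/miss counters in a concurrent map and formats the plan-line suffix. Only the two metric names, the two tag names, and the suffix format come from the text above; BsonRewriteMetricsSketch and its methods are hypothetical and are not wired into Phoenix's monitoring or EXPLAIN code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical client-side counters plus the EXPLAIN suffix formatter. */
final class BsonRewriteMetricsSketch {
    static final String HIT = "phoenix.index.bson.rewrite.hit";
    static final String MISS = "phoenix.index.bson.rewrite.miss";

    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    private static String key(String metric, String tableName, String indexName) {
        return metric + "{table_name=" + tableName + ",index_name=" + indexName + "}";
    }

    /** Increments one counter; safe to call from concurrent query threads. */
    void record(String metric, String tableName, String indexName) {
        counters.computeIfAbsent(key(metric, tableName, indexName), k -> new LongAdder())
                .increment();
    }

    /** Current value of a counter, or 0 when it was never incremented. */
    long value(String metric, String tableName, String indexName) {
        LongAdder adder = counters.get(key(metric, tableName, indexName));
        return adder == null ? 0L : adder.sum();
    }

    /** Suffix to append to a RANGE SCAN plan line. */
    static String explainSuffix(String canonicalPath, String type) {
        return " [BSON path: " + canonicalPath + ", type: " + type + "]";
    }
}
```

LongAdder is chosen here because the counters are write-heavy and read rarely; any equivalent counter type would serve.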
4.6 Operator sugar (-> and ->>), optional separate phase
The reviewer correctly noted this was smuggled in. Separated out: add -> and ->>
operators to the ANTLR grammar (PhoenixSQL.g), with PG-equivalent semantics:
bson_col -> 'field' → BSON_VALUE(bson_col, 'field', 'BSON') (returns sub-document)
bson_col ->> 'field' → BSON_VALUE(bson_col, 'field', 'VARCHAR') (returns scalar as
string, matching PG ->> behavior)
bson_col -> 'a' -> 'b' ->> 'c' → BSON_VALUE(bson_col, 'a.b.c', 'VARCHAR').
Desugaring happens in the parse-tree phase, producing canonical BsonValueParseNode.
Owns its own ticket and grammar-review cycle. Not blocking the indexing improvements.
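The three mappings above reduce to one rule: join the keys with dots and let the final operator pick the type argument. A minimal sketch, assuming string output where the real change would build a BsonValueParseNode, and assuming every operator before the last is ->; ArrowDesugarSketch and desugar are hypothetical names.

```java
/** Hypothetical desugaring of an arrow chain into a BSON_VALUE call string. */
final class ArrowDesugarSketch {

    /**
     * @param column        the BSON column the chain starts from
     * @param finalOperator "->" (sub-document) or "->>" (scalar as string)
     * @param keys          the quoted keys of the chain, in order, without quotes
     */
    static String desugar(String column, String finalOperator, String... keys) {
        if (keys.length == 0) {
            throw new IllegalArgumentException("an arrow chain needs at least one key");
        }
        String type;
        switch (finalOperator) {
            case "->":
                type = "BSON";
                break;
            case "->>":
                type = "VARCHAR";
                break;
            default:
                throw new IllegalArgumentException("unknown operator: " + finalOperator);
        }
        return "BSON_VALUE(" + column + ", '" + String.join(".", keys) + "', '" + type + "')";
    }
}
```

For example, desugar("bson_col", "->>", "a", "b", "c") yields the same call as the third mapping, so an index created with the function spelling and a query written with arrows would meet at one expression.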
5. What does not change
IndexMaintainer.buildRowKey: unchanged. No new "is_bson_path" protobuf flag (M6 of
the review: redundant). No sparse-null skip branch; the existing null-in-index-key
encoding plus partial-index WHERE already gives users both dense and sparse options.
IndexRegionObserver: unchanged. Existing pre-image/post-image logic is already correct.
MetaDataClient.createIndex: unchanged. The isJsonFragment guard does not block
BSON and does not need relaxation. The determinism and stateless gates pass today.
SYSTEM.CATALOG schema: unchanged.
CREATE INDEX: unchanged. No mandatory AS <type> (type is already
an argument of BSON_VALUE). No reserved USING PATH keyword in this scope (GIN is a
separate design; reserve in that design if needed).
6. Error handling and edge cases
BSON_VALUE returns default; index encodes null; behavior matches today
returnDefaultValue → empty ptr; index encodes null; if user has partial-index WHERE ... IS NOT NULL, row is skipped from index
BsonValueFunction.evaluate throws IllegalArgumentException("function data type does not match with actual data type"). This aborts the mutation. (Verified at BsonValueFunction.java:164-165.)
BsonPathNormalizer throws SQLException pointing at offending segment
'a.b' vs '$.a.b')
'a.b'; second CREATE INDEX gets existing duplicate-index error
CREATE INDEX; leave existing catalog rows alone. Queries still match the non-canonical path string byte-for-byte.
The mutation-aborting behavior on type mismatch is a latent surprise that the
reviewer flagged (as part of Mod4). Filed as a separate issue to decide whether to keep
throwing, coerce-to-null, or add a new BSON_VALUE overload with lenient semantics.
Out of scope for this design; do not change BsonValueFunction behavior here.
7. Phased delivery
Each phase is one PHOENIX JIRA ticket, mergeable independently, passing all existing
tests. Master is coherent after each phase.
Phase 0 — Verify the numeric sort-order correctness bug
Phase 1 —
BsonPathNormalizer (unwired)
parse/bson/, package-private.
$. stripping, bracketed/dot form equivalence, rejection of unsupported syntax.
Phase 2 — Fix numeric sort-order (if Phase 0 confirmed it)
BsonPathIndexPredicateIT.
produce correct scan results. Provide an ALTER INDEX ... REBUILD note.
migration that marks existing DOUBLE-path indexes as requiring rebuild.
Phase 3 — Wire the normalizer into DDL and query rewrite
MetaDataClient.createIndex: call BsonPathNormalizer on each indexed parse-node
before computing expressionStr.
IndexExpressionParseNodeRewriter: call BsonPathNormalizer on each indexed
expression after parsing, and on the WHERE clause before map lookup. Add the
hasBsonIndex fast-path guard.
Bson5IT must still pass without modification; its paths already round-trip
through a no-op canonicalization.
BsonPathCanonicalizationIT: same index created two ways ('a.b' vs '$.a.b')
→ second fails as duplicate; query with either spelling hits the same index.
phoenix.index.bson.normalize.enabled, default true. Flip off to
revert to byte-for-byte matching if the normalizer misbehaves.
Bson5IT green; BsonPathCanonicalizationIT green; no perf regression on
non-BSON-table query benchmarks.
Phase 4 — Predicate coverage for range / BETWEEN / IN
BsonPathIndexPredicateIT: exhaustive matrix of (predicate type) × (BSON_VALUE output
type). Assert plan uses index and results match no-index baseline.
once the LHS matches. If any predicate form is silently not matching, this phase
files a follow-up ticket rather than forcing a v1 fix.
Phase 5 — Observability
phoenix.index.bson.rewrite.hit / .miss, tagged per index.
[BSON path: <path>, type: <TYPE>] on RANGE SCAN lines over a
BSON-path index.
phoenix-pherf scenario: write+read mix against a BSON-path index; publish a baseline
report as an artifact on the phase JIRA.
BsonPathIndexPredicateIT; perf report attached.
(the cost of the hasBsonIndex check). < 10% write-path p99 overhead on a workload
with one BSON-path index over a 4KB document. If exceeded, revisit the fast-path.
Phase 6 (optional, separate ticket) — Operator sugar
-> / ->>
PhoenixSQL.g.
BsonValueParseNode at parse time; then canonicalization and everything
else works unchanged.
>) resolved in the parser.
Phase 7 (out of scope for this spec)
8. Rollback strategy
operators can ALTER INDEX … DISABLE and fall back to full scan.
phoenix.index.bson.normalize.enabled=false reverts to
byte-for-byte matching. Existing indexes stay correctly maintained either way.
on master without the grammar bump until confident.
9. Testing strategy
Unit: BsonPathNormalizer covered by golden-file tests; fast-path check in
IndexExpressionParseNodeRewriter has dedicated tests for the no-BSON-index short
circuit.
Integration: BsonPathCanonicalizationIT (Phase 3), BsonPathIndexPredicateIT
(Phase 4), BsonPathNumericSortOrderIT (Phase 0/2), plus continued passage of
existing Bson5IT.
Correctness invariant: for any query Q and matching BSON-path index I, the
result set with I enabled must equal the result set after ALTER INDEX I DISABLE.
Encoded as a randomized IT.
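The shape of such a randomized check can be sketched against an in-memory model. Everything here is a stand-in: a list of nullable doubles plays the table (null models a missing path), a TreeMap plays the index, and a BETWEEN range plays the query. A real IT would run the same comparison over JDBC with the index enabled and disabled. IndexInvariantSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

/** In-memory model of "index result equals no-index result" for BETWEEN lo AND hi. */
final class IndexInvariantSketch {

    /** Baseline: full scan; a null value models a row whose document lacks the path. */
    static List<Integer> fullScan(List<Double> rows, double lo, double hi) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < rows.size(); id++) {
            Double v = rows.get(id);
            if (v != null && v >= lo && v <= hi) {
                ids.add(id);
            }
        }
        return ids;
    }

    /** Index path: rows with a missing path are not indexed; the range is read in key order. */
    static List<Integer> indexScan(List<Double> rows, double lo, double hi) {
        TreeMap<Double, List<Integer>> index = new TreeMap<>();
        for (int id = 0; id < rows.size(); id++) {
            Double v = rows.get(id);
            if (v != null) {
                index.computeIfAbsent(v, k -> new ArrayList<>()).add(id);
            }
        }
        List<Integer> ids = new ArrayList<>();
        if (lo <= hi) {
            for (List<Integer> bucket : index.subMap(lo, true, hi, true).values()) {
                ids.addAll(bucket);
            }
        }
        Collections.sort(ids); // compare as sets of row ids, independent of scan order
        return ids;
    }

    /** Runs seeded random trials; returns false on the first mismatch. */
    static boolean invariantHolds(long seed, int trials) {
        Random random = new Random(seed);
        for (int t = 0; t < trials; t++) {
            List<Double> rows = new ArrayList<>();
            for (int i = 0; i < 50; i++) {
                if (random.nextInt(5) == 0) {
                    rows.add(null);
                } else {
                    rows.add((random.nextInt(21) - 10) / 2.0); // values in [-5.0, 5.0]
                }
            }
            double a = (random.nextInt(21) - 10) / 2.0;
            double b = (random.nextInt(21) - 10) / 2.0;
            double lo = Math.min(a, b);
            double hi = Math.max(a, b);
            if (!fullScan(rows, lo, hi).equals(indexScan(rows, lo, hi))) {
                return false;
            }
        }
        return true;
    }
}
```

Seeding the generator keeps any failure reproducible, and generating values on both sides of zero keeps the sign-boundary case from §4.3 inside the sampled space.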
Upgrade test: create indexes on pre-change master, bounce to post-change master,
verify queries still match; DOUBLE-path indexes are marked for rebuild.
Perf test: phoenix-pherf scenarios for (a) write-path overhead with one BSON
index on a 4KB doc, (b) query-path overhead on a table with no BSON indexes.
BSON Path Functional Indexes — User Guide
This is a short companion to the design spec at
docs/superpowers/specs/2026-05-05-bson-path-functional-indexes-design.md.
What you can do today
Define a secondary index on a path inside a BSON column:
Queries that name the same canonical BSON path will use the index automatically:
Both forms canonicalize to BSON_VALUE(DOC, '$.customer.id', 'VARCHAR') and hit the index.
Sparse semantics
If a row's BSON document does not contain the indexed path, no index entry is written for
that row (sparse index). Consequence: you cannot use a BSON path index to find missing-path
rows via IS NULL.
Type contract
BSON_VALUE's third argument fixes the SQL type of the indexed key. Match the WHERE clause to
the same type: index built AS BIGINT requires the predicate to be a numeric literal, not a
string. v1 does not yet rewrite CAST(BSON_VALUE(...) AS BIGINT) = 1 for you.
Predicate forms that hit the index
BSON_VALUE(doc, p, 'VARCHAR') = 'x'
BSON_VALUE(doc, p, 'VARCHAR') IN (...)
BSON_VALUE(doc, p, 'VARCHAR') BETWEEN ...
BSON_VALUE(doc, p, 'VARCHAR') > 'x'
UPPER(BSON_VALUE(doc, p, 'VARCHAR')) = 'X'
BSON_VALUE(doc, p, 'VARCHAR') LIKE 'a%'
BSON_VALUE(doc, p, 'VARCHAR') IS NULL
Path language supported in v1
$.a.b.c
$.a[0], $.a[10][3]
$['weird key'], $["odd"]
a.b, a[0] (canonicalized to $.a.b)
$.*, $[*]
$[?(@.x>1)]
$..x
$[0:2]
Feature flags
false
phoenix.index.bson.enabled   true   CREATE INDEX on BSON paths is rejected
phoenix.index.bson.rewrite.enabled   true
Observability
Client-process counters in org.apache.phoenix.monitoring.BsonPathMetrics:
getSparseSkips(): number of UPSERT rows that hit a missing-path branch and were
skipped from the index.
getRewriteHits(): number of WHERE-clause sub-expressions that matched a BSON path index
after canonicalization.
getRewriteMisses(): number of BSON-path WHERE expressions that did not match any indexed
expression (typically: wrapped LHS, or no relevant index defined).
What's not yet supported
USING PATH is reserved but not implemented.
IS NULL rewrite, LIKE, function-wrapped LHS.
-> / ->> operator sugar.
promoted to Phoenix's MetricInfo enum yet.