Arena-backed tensor-of-tensors: 8-byte view inner cells through einsum/contraction #548

Merged
evaleev merged 26 commits into master from feature/arena_tensor on May 19, 2026

@evaleev evaleev commented May 19, 2026

Summary

Deploys TA::Tensor<TA::ArenaTensor<T>> — a tensor-of-tensors (ToT) tile
type whose inner cells are 8-byte non-owning views into an arena "slab"
owned by the outer tile — through TiledArray's einsum and contraction
machinery. This shrinks the ToT inner-tile footprint dramatically (from
~304 B to ~8 B per cell) while keeping Tensor<ArenaTensor> behaving like
an ordinary Tensor<Tensor> at the expression-DSL and einsum level.
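
As a rough illustration of the footprint claim, the sketch below models a one-pointer view cell over a slab owned by the outer tile. Every name here (CellHeader, CellView, OuterTileSketch, make_nested_tile_sketch) is hypothetical, and keeping the per-cell headers in a side vector rather than inside the slab is a simplification for readability, not a description of how ArenaTensor is actually laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-cell metadata; assumed to live next to the element data.
struct CellHeader {
  std::size_t size = 0;    // number of elements in this inner cell
  double* data = nullptr;  // start of this cell's elements inside the slab
};

// The non-owning inner cell: a single pointer, so 8 bytes on 64-bit targets.
struct CellView {
  CellHeader* cell = nullptr;
};
static_assert(sizeof(CellView) == sizeof(void*), "view is one pointer wide");

// The outer tile owns the slab (and, in this sketch, the headers).
struct OuterTileSketch {
  std::vector<double> slab;
  std::vector<CellHeader> headers;
  std::vector<CellView> cells;
  OuterTileSketch() = default;
  OuterTileSketch(OuterTileSketch&&) = default;       // moves keep pointers valid
  OuterTileSketch(const OuterTileSketch&) = delete;   // a copy would alias the slab
};

// Two passes: size the slab from the inner sizes, then bind each view.
inline OuterTileSketch make_nested_tile_sketch(
    const std::vector<std::size_t>& inner_sizes) {
  OuterTileSketch t;
  std::size_t total = 0;
  for (std::size_t s : inner_sizes) total += s;
  t.slab.assign(total, 0.0);  // one allocation for all inner cells
  t.headers.resize(inner_sizes.size());
  t.cells.resize(inner_sizes.size());
  std::size_t offset = 0;
  for (std::size_t c = 0; c < inner_sizes.size(); ++c) {
    t.headers[c] = CellHeader{inner_sizes[c], t.slab.data() + offset};
    t.cells[c].cell = &t.headers[c];
    offset += inner_sizes[c];
  }
  return t;
}
```

The deleted copy constructor mirrors the hazard discussed throughout this PR: copying the views without rebuilding the slab would leave them pointing at storage owned by another tile.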

Highlights:

  • Arena infrastructure — allocator, slab plan helpers, and ToT
    construction kernels (arena_outer_init, make_nested_tile,
    arena_trivial_{unary,binary}); arena-aware fill/set/init_elements.
  • Tile ops — arena-aware add/subt/scale/neg, and a corrected
    Mult consume path: in-place mult_to on shallow arena-ToT operand
    tiles corrupts shared slabs, so view-cell results are routed through a
    fresh-result whole-tile op (uses_tile_op_).
  • einsum / contraction engine — regime-A (outer-Hadamard) arena plans
    and dispatch; ToT×ToT Hadamard with view inner cells routed via outer
    ops; permuted inner contractions handled via slab-level hoist;
    MultEngine support for Hadamard-outer × inner-contraction.
  • Multi-rank einsum: the replicate_array path now builds fresh
    slab-backed tiles instead of copying 8-byte views (which would dangle),
    and the Replicator reserve hint is guarded for pmaps without a known
    local size (HashPmap).
  • Bug fix — permuting Tensor::axpy_to now initializes an empty
    target instead of asserting in inplace_tensor_op; this also fixes a
    pre-existing einsum abort on different_nested_ranks.

Test plan

  • Full TA serial unit suite (tiledarray/unit/run-np-1) passes 100%.
  • ToT einsum end-to-end harness and arena-kernel tests pass.
  • different_nested_ranks einsum case (previously aborting) passes.
  • Downstream MPQC csv-cck validation tests pass (debug + release),
    including the np=2 he10-csv-cck case.
  • CI green on this PR (the prior CI failure — ContractionArenaPlan
    not nameable for non-ToT operands — is fixed in 6e6e8efd6).

zhihao-deng and others added 26 commits May 13, 2026 22:03
- arena_sizeof_invariant_suite: drop platform-specific absolute baselines
  (328/16/248 were Apple-arm64/libc++ only); keep relative
  ImplLayoutAllocator == ImplLayoutMaster invariant + monostate static_asserts.
- cont_engine: reset arena_plan_ after std::move into op_ so later reads
  see "no plan" rather than a moved-from optional.
- arena_kernels: one-line intent note on trivial kernels' tight packing.
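
The cont_engine item relies on a standard-library detail that is easy to miss: moving from a std::optional does not disengage it. A minimal sketch, with hypothetical Plan, Op, and Engine types standing in for the real engine members:

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins; only the optional handling is the point here.
struct Plan {
  std::vector<int> inner_extents;
};

struct Op {
  std::optional<Plan> plan;
};

struct Engine {
  std::optional<Plan> arena_plan_;
  Op op_;

  void hand_off_plan() {
    op_.plan = std::move(arena_plan_);
    // After the move, arena_plan_ still reports has_value() == true; it
    // holds a moved-from Plan. Reset it so later reads see "no plan".
    arena_plan_.reset();
  }
};
```

Without the reset, a later `if (arena_plan_)` check would pass and read a moved-from Plan.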
Treat ArenaTensor as a first-class tensor in TA's trait machinery so
kernel-level dispatches (tensor_reduce, inplace_tensor_op, tensor_op,
unary, binary, ...) and operators match the same overloads they do for
TA::Tensor<double> -- without bespoke arena overloads in kernels.h.

The key insight: ArenaTensor is structurally a flat contiguous tensor
(.data() + .size()); the only reason it was kept out of is_tensor_helper
was to avoid being dragged into value-returning operator paths it can't
fulfill (a view has no allocator to materialize fresh storage). Address
that as a separate gate rather than excluding ArenaTensor from being a
tensor at all.

Trait split:
- is_tensor_helper<ArenaTensor> = true. is_contiguous_tensor_helper too.
- is_tensor_view<T> -- new predicate; true for non-owning views that
  lack value-returning member arithmetic (ArenaTensor, btas::TensorView).
  TensorInterface is *not* a view here because it materializes a fresh
  result from member add/subt/mult/etc.
- ta_ops_match_tensor (operators_body.ipp value-returning gate): adds
  `&& !is_tensor_view_v<T>`; views opt out of `+`, `-`, `*`, unary `-`,
  tensor*scalar, scalar*tensor, permute*tensor.
- ta_ops_match_tensor_inplace (new; operators_body.ipp compound-
  assignment gate): default is `is_nested_tensor_v<T>`, which accepts
  views since they can be mutated in place.
- btas.h specializes both predicates to false for btas::Tensor (was
  already done for the freestanding one).
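
The trait split can be pictured with a small, self-contained model. This is only a sketch under assumed names (Owning, View, ops_match_tensor_v, ops_match_tensor_inplace_v, has_plus); the real predicates in operators_body.ipp and type_traits.h are more general.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical miniature tensors: one owning, one non-owning view.
struct Owning {
  std::vector<double> v;
  double* data() { return v.data(); }
  const double* data() const { return v.data(); }
  std::size_t size() const { return v.size(); }
};
struct View {
  double* p = nullptr;
  std::size_t n = 0;
  double* data() const { return p; }  // a view never owns its elements
  std::size_t size() const { return n; }
};

template <class T>
inline constexpr bool is_tensor_v =
    std::is_same_v<T, Owning> || std::is_same_v<T, View>;
template <class T>
inline constexpr bool is_tensor_view_v = std::is_same_v<T, View>;

// Value-returning gate: tensors that are not views.
template <class T>
inline constexpr bool ops_match_tensor_v = is_tensor_v<T> && !is_tensor_view_v<T>;
// Compound-assignment gate: views are fine, they can be mutated in place.
template <class T>
inline constexpr bool ops_match_tensor_inplace_v = is_tensor_v<T>;

template <class T, std::enable_if_t<ops_match_tensor_v<T>, int> = 0>
T operator+(const T& l, const T& r) {
  assert(l.size() == r.size());
  T out = l;  // needs fresh storage, which a view cannot provide
  for (std::size_t i = 0; i < out.size(); ++i) out.data()[i] += r.data()[i];
  return out;
}

template <class T, class U,
          std::enable_if_t<ops_match_tensor_inplace_v<T> && is_tensor_v<U>, int> = 0>
T& operator+=(T& l, const U& r) {
  assert(l.size() == r.size());
  for (std::size_t i = 0; i < l.size(); ++i) l.data()[i] += r.data()[i];
  return l;
}

// Detector used to confirm that `a + b` is not available for views.
template <class T, class = void>
struct has_plus : std::false_type {};
template <class T>
struct has_plus<T, std::void_t<decltype(std::declval<const T&>() +
                                        std::declval<const T&>())>>
    : std::true_type {};
```

With these gates, `has_plus<Owning>::value` is true and `has_plus<View>::value` is false, while `view += owning` compiles and writes through to the viewed storage.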

ArenaTensor surface:
- Member compound operators (operator+=, -=, *=, scalar *=) routing to
  free CPOs.
- Member in-place CPO mirrors (add_to, subt_to, mult_to, scale_to,
  neg_to) so tile_interface paths that call `arg.add_to(other)` work
  uniformly.
- No member operator+/-/* or value-returning member add/subt/mult/
  scale/neg/clone. Those would require allocation.

Tensor.h changes:
- value_converter falls through to identity for views (rebind-on-copy,
  no clone).
- Tensor(range, value) ctor: when value_type is a view, copy each cell
  by value instead of calling Clone (which delegates to a missing
  member).
- Permutation paths (Tensor::neg(perm), subt(right, perm), scale(perm),
  inner-permute in copy-with-perm ctor) bail with TA_EXCEPTION for view
  inner cells -- views can't permute in place.
- Drop the view-aware Tensor<View>::scale_to/add_to/subt_to/mult_to
  overloads added in earlier iterations; the legacy ones now work
  because ArenaTensor has compound operators.
- Tighten one residual ambiguity: subt(Right) const value-returning
  excludes the arena-pair case (handled by a dedicated overload above).

Kernels.h: drop the ad-hoc tensor_reduce(ArenaTensor) and
inplace_tensor_op(ArenaTensor) overloads added in earlier iterations.

Tests:
- New arena_tensor.cpp / arena_tensor_kernels.cpp test suites covering
  ArenaTensor sizeof invariants, SIMD alignment, member compound
  operators, in-place CPO mirrors, ToT reductions (sum/product/
  squared_norm), serialization round-trip, regime-A einsum dispatch,
  and the in-place ops smoketest.
- Update the is_tensor_view_v predicate test to reflect that
  TensorInterface is no longer in the view set.

Introduce a fused in-place AXPY CPO so the einsum generic fallback and the
contraction engine's scale inner-product path no longer build a scaled
temporary tile. The temporary required value-returning `scale`, which a
view tile type (ArenaTensor) cannot provide -- it has no allocator. The
fused form needs only in-place mutation, which views support.

New CPO (tile_interface/add.h):
- axpy_to(result, arg, factor)            -- result += arg * factor
- axpy_to(result, arg, factor, perm)      -- result += (perm ^ arg) * factor
Semantics are BLAS AXPY: `_to` marks the in-place form (cf. add_to vs add).
Distinct from add_to(result, arg, factor), whose legacy semantics is
`(result + arg) * factor` -- that one scales the accumulated result too,
so it is not a drop-in for fused accumulation.
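
The difference between the two semantics is easiest to see on a flat vector. The functions below are toy stand-ins that borrow the CPO names for readability; they are not the TiledArray implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tile = std::vector<double>;  // flat stand-in for a tile

// BLAS-style AXPY, in place: result += arg * factor
inline Tile& axpy_to(Tile& result, const Tile& arg, double factor) {
  assert(result.size() == arg.size());
  for (std::size_t i = 0; i < result.size(); ++i) result[i] += arg[i] * factor;
  return result;
}

// Legacy scaled add_to as described above: result = (result + arg) * factor
inline Tile& add_to(Tile& result, const Tile& arg, double factor) {
  assert(result.size() == arg.size());
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = (result[i] + arg[i]) * factor;
  return result;
}
```

Starting from result = {1, 2}, arg = {10, 20}, and factor = 2, axpy_to yields {21, 42} while add_to yields {22, 44}: only the former leaves the previously accumulated value unscaled.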

Member overloads:
- Tensor::axpy_to(right, factor[, perm]) -- the lambda body dispatches by
  element type so one body serves flat and ToT tensors (leaf: l += r*f;
  cell: l.axpy_to(r, f)). The perm form bails for view inner cells.
- ArenaTensor::axpy_to(other, factor[, perm]) -- routes to the free
  arena axpy_to; perm form rejects (views cannot permute in place).
- has_member_function_axpy_to_anyreturn detector generated in
  type_traits.h.

Renamed the existing arena free function `axpy(dst, alpha, src)` to
`axpy_to(dst, src, alpha)` -- it is in-place, so the `_to` suffix and the
`(result, arg, factor)` argument order now match TA's CPO convention.

einsum/tiledarray.h: the generic-fallback mixed scalar x ToT loop now does
`axpy_to(el, tensor, scalar)` instead of `add_to(el, scale(tensor,
scalar))`.

cont_engine.h: the scale inner-product fallback drops the value-returning
`scal_op` lambda. The Contraction outer product is now a fused
`axpy_to(result, tot, scalar[, perm])` (this is the long-standing
"TODO implement X-permuting AXPY"); the Hadamard outer product keeps the
value-returning `scale` assignment, gated to non-view result cells.

Tests: axpy_to correctness for flat Tensor<double> and ToT
Tensor<ArenaTensor> (verifies axpy semantics -- factor scales only the
added operand).

- merge arena_tensor_kernels.h into arena_kernels.h; add make_nested_tile,
  a two-pass ToT outer-tile builder that dispatches on arena vs owning
  inner tiles (declare inner ranges -> allocate one slab -> fill cells)
- DistArray: add the ToT range_fn constructor and init_tiles_nested; route
  fill_local/fill/fill_random/init_elements/set through arena-aware tile
  builders (for_each_local_tile_inplace, make_arena_nested_tile) for
  Tensor<ArenaTensor> inners
- ArenaTensor::operator=: two regimes -- deep element copy when the
  assignee is bound, shallow rebind when null; drop the move-assignment so
  rvalues follow the same regimes
- type_traits: add default_freestanding_tensor (view -> owning-tensor map)
- tests/tot_construction.cpp: construction, fill, init_elements, set, and a
  multi-rank serialization round-trip, over plain and arena ToT inner tiles
- einsum.cpp: breadcrumb on the pre-existing different_nested_ranks abort
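
The two assignment regimes described for ArenaTensor::operator= can be sketched on a toy view. CellView is hypothetical and carries a pointer plus a size for clarity, whereas the real view is a single pointer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical non-owning cell view.
struct CellView {
  double* data = nullptr;
  std::size_t size = 0;

  CellView() = default;
  CellView(double* d, std::size_t n) : data(d), size(n) {}
  CellView(const CellView&) = default;

  bool bound() const { return data != nullptr; }

  // No move-assignment is declared, so rvalues bind here as well and
  // follow the same two regimes.
  CellView& operator=(const CellView& other) {
    if (this == &other) return *this;
    if (!bound()) {
      // Null assignee: shallow rebind onto the other cell's storage.
      data = other.data;
      size = other.size;
    } else {
      // Bound assignee: deep element copy into the storage it already views.
      assert(other.bound() && size == other.size);
      std::copy(other.data, other.data + other.size, data);
    }
    return *this;
  }
};
```

Because the copy-assignment operator is user-declared, the compiler does not generate a move-assignment operator, which is what lets `cell = CellView(...)` take the deep-copy path when `cell` is already bound.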
Route the elementwise binary/unary tile ops on Tensor<ArenaTensor> through
the arena binary kernel so c = a + b, a - b, scaled variants, scalar
scaling, and negation work for arena-inner tensors-of-tensors via the
expression DSL. add gains dedicated arena overloads (factor / perm /
factor+perm); subt and mult gain in-body arena branches. Permuted ToT ops
fall through to the non-permuted kernel on a trivial permutation and throw
otherwise, since cell reordering of arena inners is not yet supported.

Extend tot_construction.cpp with a generic run_tot_expr harness and
add/subt/mult/scale/neg/scaled-add/scaled-subt cases for both Tensor and
ArenaTensor inners. Hadamard ToT mult for arena inners is left unwired --
it routes through MultEngine/ContEngine, which needs a value-returning
inner tile op an arena cell cannot provide; tracked by TODO(arena-tot-mult).

… ops

MultEngine::init_struct always calls ContEngine::init_inner_tile_op, which
unconditionally instantiates a value-returning inner mult/contraction op on
the inner cell type. A non-owning view inner cell (ArenaTensor) cannot host
such an op, so a("i;j")*b("i;j") on DistArray<Tensor<ArenaTensor>> failed to
compile -- even though that inner op is dead code for a pure Hadamard:
MultEngine::make_tile_op passes no inner op and the outer Mult tile op
already recurses through Tensor<ArenaTensor>::mult.

Split init_inner_tile_op into a thin dispatcher and an init_inner_tile_op_-
owning_ helper holding the existing owning-cell builder. The one case that
cannot instantiate that helper -- ToT x ToT with a view inner cell -- is
handled directly: pure Hadamard needs no inner op, everything else throws
(deferred). tot x t scale paths keep their existing view-aware handling
since tot_x_tot is false there. Re-enables the mult_arena_inner test.

Adds test_tot_einsum_contraction -- runs TA::einsum("ij;mo,ij;on->ij;mn")
(outer Hadamard, inner contraction) and checks the result against a
Tensor<Tensor<double>> reference run of the same expression and data.

Wired up for a TA::Tensor inner cell. The ArenaTensor-inner case is left
as a TODO: TA::einsum does not compile for DistArray<Tensor<ArenaTensor>>
because the legacy per-cell einsum path (tensor_hadamard / tensor_contract
and the element_*_op lambdas in einsum/tiledarray.h) value-returns inner
tensors and calls ArenaTensor::mult / ArenaTensor::permute. That legacy
path coexists with the regime-A arena path and is still instantiated; it
needs to be if-constexpr-guarded out for view inner cells.

TA::einsum for DistArray<Tensor<ArenaTensor>> did not compile: the
outer-Hadamard ("hadamard reduction") branch's legacy per-cell ToT x ToT
path value-returns inner tensors (tensor_hadamard / tensor_contract via the
element_*_op lambdas) and calls ArenaTensor::mult / ArenaTensor::permute,
none of which a non-owning view inner cell supports. That path coexists
with the regime-A arena fast path (run_regime_a_arena, tried first) and was
still being instantiated.

if-constexpr-guard the legacy ToT x ToT element loop out for
is_tensor_view_v inner cells. For a view inner cell only the regime-A path
can produce results; if it was inactive (a permuted inner contraction --
see TODO(arena-einsum-perm)) the guard throws instead of falling through.
The tot x t scale path is untouched: it uses in-place axpy_to and is
already view-safe.

Switches the einsum_contraction_arena_inner test to a regime-A annotation
("ijk;mo,ijk;on->ij;mn" -- outer Hadamard + outer contraction + inner
contraction); a pure-Hadamard outer is delegated to the expression DSL
instead. The arena result matches the Tensor<Tensor<double>> reference.

arena_inner_permute<OuterTensor>(src, inner_perm) builds a fresh
slab-backed ToT tile with the same outer layout as src, but with every
inner cell's range and data permuted by inner_perm
(result_cell(inner_perm * i) == src_cell(i)). It is the slab-level
counterpart of a per-cell permute: the owning tile allocates one new slab
via arena_outer_init and rewrites each cell with a strided scatter, so no
non-owning view inner cell is ever asked to value-return.

This is the primitive the owning Tensor<ArenaTensor>::permute and the
regime-A einsum hoist will use to handle inner-mode permutations without a
per-cell permute(const ArenaTensor&).

Tests cover a transpose and a rank-3 permutation over non-uniform inner
cells, checked against a hand-rolled reference.
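
For a single inner cell, the strided scatter behind result_cell(inner_perm * i) == src_cell(i) can be sketched as below. The Cell type and inner_permute function are hypothetical, the layout is assumed row-major, and applying the permutation to an index is assumed to mean result[perm[k]] = index[k]. The real kernel writes into a slab shared by all cells of the tile rather than into a per-cell vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense row-major cell.
struct Cell {
  std::vector<std::size_t> extents;
  std::vector<double> data;
};

// Returns dst with dst(perm * i) == src(i), where (perm * i)[perm[k]] = i[k].
inline Cell inner_permute(const Cell& src, const std::vector<std::size_t>& perm) {
  const std::size_t rank = src.extents.size();
  assert(perm.size() == rank);

  Cell dst;
  dst.extents.resize(rank);
  for (std::size_t k = 0; k < rank; ++k) dst.extents[perm[k]] = src.extents[k];

  // Row-major strides of the destination cell.
  std::vector<std::size_t> dst_stride(rank, 1);
  for (std::size_t k = rank; k-- > 1;)
    dst_stride[k - 1] = dst_stride[k] * dst.extents[k];

  dst.data.resize(src.data.size());
  std::vector<std::size_t> idx(rank, 0);  // current source multi-index
  for (std::size_t ord = 0; ord < src.data.size(); ++ord) {
    // Source mode k lands on destination mode perm[k].
    std::size_t dst_ord = 0;
    for (std::size_t k = 0; k < rank; ++k) dst_ord += idx[k] * dst_stride[perm[k]];
    dst.data[dst_ord] = src.data[ord];
    // Advance the source multi-index, last mode fastest.
    for (std::size_t k = rank; k-- > 0;) {
      if (++idx[k] < src.extents[k]) break;
      idx[k] = 0;
    }
  }
  return dst;
}
```

For a 2x3 cell holding 0..5 and perm = {1, 0}, this produces a 3x2 cell holding {0, 3, 1, 4, 2, 5}, i.e. the transpose. The source is read contiguously and the destination is written with strides, so no view is ever asked to return a permuted copy of itself.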
Tensor<ArenaTensor>::permute previously routed through the generic
Tensor(other, perm) ctor, whose allocate-then-fill shape does not fit the
arena slab model: its inner-permute pass threw for view inner cells, and
its outer-permute path did not co-own the source slab.

Route arena-inner permute around that ctor: the outer part reorders the
8-byte cell views shallowly (sharing the source slab via keep-alive) with
arena_permute_shallow; a non-trivial inner part rewrites every cell into a
fresh slab with arena_inner_permute. This makes the owning tile's permute
carry out the inner-mode permutation -- no per-cell permute(ArenaTensor),
no view value-returns. Also wires up arena_permute_shallow, which existed
but was previously unused.

Test: a bipartite (outer + inner) transpose of a Tensor<ArenaTensor> tile.

regime-A einsum previously bailed (plan inactive -> legacy fallback, which
cannot run for ArenaTensor inners) whenever an inner contraction was not
canonically aligned -- i.e. TensorContractionPlan::do_perm.{A,B,C} set.

Instead of bailing, run_regime_a_arena now hoists the inner permutations to
slab-level rewrites: each operand tile whose cells need reordering is run
through arena_inner_permute (by c_plan.perm.{A,B}) before the per-cell op,
and the result tile is run through arena_inner_permute (by perm.C.inv())
after accumulation. The per-cell op is therefore always a single canonical
GEMM -- accumulate() now calls fused_contraction_inplace with the plan's
(canonical) GemmHelper directly. No per-cell view permute, no value-return;
tensor_contract / tensor_contract_to are untouched.

This unifies the path for arena and plain TA::Tensor inner cells (both are
arena-eligible outers). Tests: a canonical and a non-canonical
("ijk;om,ijk;on->ij;nm") inner contraction, arena and TA::Tensor inners,
each checked against a Tensor<Tensor<double>> reference.

regime-A's inner-Hadamard path previously only accepted the no-perm and
perm_b branches, and even perm_b was wrong: fused_hadamard_inplace does a
flat r += l * rr, which aborts the is_range_set_congruent check whenever
an operand's inner cells are not already in C-layout.

Treat a non-canonical inner Hadamard the same as a non-canonical inner
contraction: run_regime_a_arena now hoists each operand's inner
permutation (h_plan.perm.AC / perm.BC) to a slab-level arena_inner_permute
rewrite, so both operands reach C-layout before the per-cell op. This
fixes the pre-existing perm_b bug and adds perm_to_c / perm_a / else
coverage; make_regime_a_arena_plan no longer bails on those branches.

Adds regime-A-exercising cases to einsum_manual/equal_nested_ranks
(permuted inner contraction and permuted inner Hadamard, no outer
permutation) so manual_eval is an independent oracle, plus matching
arena/Tensor inner cases in tot_construction.

The 3-arg Tensor::mult(right, perm) only had an arena branch for the
tot x tot case; a plain scalar tile times an arena ToT tile (t x tot)
fell through to the generic binary() path, whose element op evaluates
`scalar * ArenaTensor` as a value. ArenaTensor is a non-owning view and
has no value-returning operator*, so that failed to compile.

This surfaced from MPQC: TA::einsum(plain_array, tot_array, ...) recurses
into einsum's pure-Hadamard branch (C = A * B), which the expression DSL
lowers to a Mult tile op of Tensor<double> x Tensor<ArenaTensor<double>>.

Route t x tot through the existing 2-arg arena overload (scales each
inner cell into a fresh slab) and apply any non-trivial result
permutation as a shallow outer reindex of that slab.

Adds einsum_t_x_tot_{tensor,arena}_inner cases checking the arena path
against an identical Tensor<double>-inner reference run.

Three gaps surfaced wiring TA::Tensor<TA::ArenaTensor<T>> through MPQC's
CSV coupled-cluster layer:

- The 3-arg Tensor::mult(right, perm) handled t x tot but not the mirror
  tot x t (arena ToT tile times a plain scalar tile); it fell through to
  the generic binary() whose element op needs a value-returning
  ArenaTensor * scalar. Route it through the 2-arg arena overload like
  t x tot.

- ArenaTensor had no sum(): einsum's DeNest path (sum_tot_2_tos) reduces
  each inner cell to a scalar via the free sum(), which calls .sum(). A
  scalar reduction allocates nothing, so it is valid on a view.

- size_of(Tensor<ArenaTensor>) recurses per inner cell into
  size_of(ArenaTensor), which did not exist. Add an overload counting the
  one-pointer view plus its in-arena cell; summed over the outer tile
  this accounts for the slab.

Make Tensor<ArenaTensor> tiles work through the einsum and SUMMA
contraction machinery, as exercised by MPQC's CSV-CCk integral
transformation:

- einsum: hoist the temporary sub-World vector ahead of the AB/C
  ArrayTerms so a sub-World outlives the DistArrays bound to it;
  lazy_deleter would otherwise dereference a destroyed World while
  unwinding.
- einsum regime-A: drop the over-conservative bail on a non-identity
  result permutation -- run_regime_a_arena already applies it
  (tile.permute(pc), as the legacy path does).
- cont_engine: support inner contraction (incl. inner outer-product)
  for view inner cells, routed through the arena fast path. A
  non-identity inner result permutation rides on op_'s post-processing
  permute step rather than the (perm-free) fused inner op.
- contraction SUMMA: union-shape the arena ToT result across K-panels
  via reserve_and_construct / grow_to_cover.
- make_with_new_trange: arena-ToT-aware retile that pulls source tiles
  and deep-copies inner cells instead of rebinding source arena slabs.

…nsume

Completes Tensor<ArenaTensor> support through the expression engine:

- MultEngine::make_tile_op routes a Hadamard-outer / contraction-inner
  ToT product to a whole-tile arena op (arena_hadamard_tile_op_) instead
  of a value-returning per-cell op, which a view inner cell cannot host.
- ContEngine builds that op as arena_hadamard_inner_contract: a fresh
  slab-backed result shaped per-cell by the inner GEMM, with the inner
  result permutation applied as a slab-level post-pass.
- Mult gains a whole-tile-op constructor (tile_op_tag); eval() delegates
  to it when set.
- Mult's consume eval overloads no longer do in-place mult_to on
  view-inner-cell (ArenaTensor) tiles: such a tile is a shallow handle
  whose arena slab may be aliased by a persistent array, so consuming it
  corrupts that operand. View-cell tiles always produce a fresh result.
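
The aliasing hazard in the last item can be reproduced with a toy shallow-handle tile. ShallowTile and both functions are hypothetical; a std::shared_ptr to a vector stands in for the reference-counted slab.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical shallow handle: copying it shares the slab.
struct ShallowTile {
  std::shared_ptr<std::vector<double>> slab;
};

// "Consume" style: reuses the left operand's storage for the result.
inline ShallowTile mult_consume_inplace(ShallowTile left, const ShallowTile& right) {
  assert(left.slab->size() == right.slab->size());
  for (std::size_t i = 0; i < left.slab->size(); ++i)
    (*left.slab)[i] *= (*right.slab)[i];
  return left;  // still aliases whatever else holds this slab
}

// Fresh-result style: allocates a new slab and leaves both operands intact.
inline ShallowTile mult_fresh(const ShallowTile& left, const ShallowTile& right) {
  assert(left.slab->size() == right.slab->size());
  auto out = std::make_shared<std::vector<double>>(*left.slab);
  for (std::size_t i = 0; i < out->size(); ++i) (*out)[i] *= (*right.slab)[i];
  return ShallowTile{out};
}
```

If a persistent array still holds a handle to the left operand's slab, the consume variant silently rewrites that array's data, whereas the fresh-result variant does not. That is the trade the Mult change makes for view-cell tiles: one extra allocation in exchange for never touching a shared slab.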
einsum's special-Hadamard branch replicates the small operand via
replicate_array -> make_replicated -> replicate_tensor. Two gaps broke
this for an arena tensor-of-tensors on a multi-rank run:

- Replicator's ctor called pmap()->local_size() purely as a reserve()
  hint, which asserts on a pmap that does not precompute local size
  (e.g. HashPmap). Skip the hint when known_local_size() is false.

- replicate_tensor std::copy'd the 8-byte ArenaTensor view cells, so the
  replicated tile aliased the source tile's arena slab and dangled once
  the (temporary, post-make_replicated) source array was destroyed.
  Build the replicated tile as a fresh slab-backed tile and deep-copy
  each replicated inner cell's element data.

ContractionArenaPlan::operand_inner_ranges declared its return type as
std::vector<Result::value_type::range_type>, ill-formed when Result is a
plain (non-ToT) tensor. make_contraction_arena_plan names the class in
its return type unconditionally (before an if-constexpr returns nullopt
for non-ToT), so the bad type was a hard error -- it broke the ta_test
build. Give operand_inner_ranges a deduced return type so the type is
only formed when the function is actually instantiated.
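
The mechanism is that instantiating a class template instantiates its member function declarations, including any spelled-out return type, but not their bodies. A reduced sketch with hypothetical types (ArenaPlanSketch, make_plan_sketch, NestedTile, FlatTile):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <type_traits>
#include <vector>

struct Range { std::size_t extent = 0; };
struct InnerCell { using range_type = Range; };
struct NestedTile { using value_type = InnerCell; };  // tensor-of-tensors stand-in
struct FlatTile { using value_type = double; };       // plain tensor stand-in

template <class Result>
struct ArenaPlanSketch {
  std::size_t n_cells = 0;

  // Spelling the return type out, e.g.
  //   std::vector<typename Result::value_type::range_type>
  // would make ArenaPlanSketch<FlatTile> ill-formed as soon as the class is
  // instantiated, because double has no range_type. With a deduced return
  // type, the nested name is only formed when the body is instantiated.
  auto operand_inner_ranges() const {
    return std::vector<typename Result::value_type::range_type>(n_cells);
  }
};

// The return type names the plan class for every Result, ToT or not.
template <class Result>
std::optional<ArenaPlanSketch<Result>> make_plan_sketch(std::size_t n_cells) {
  if constexpr (std::is_class_v<typename Result::value_type>) {
    ArenaPlanSketch<Result> plan;
    plan.n_cells = n_cells;
    return plan;
  } else {
    return std::nullopt;  // non-ToT: no plan, but the type must still be nameable
  }
}
```

Here `make_plan_sketch<FlatTile>(3)` compiles and returns an empty optional, which is the situation the ta_test build hit.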
Tensor::axpy_to(arg, factor, perm) went straight to inplace_binary,
which asserts the target is non-empty. The first contribution into an
unallocated tensor -- e.g. a tensor-of-tensors contraction result inner
cell, when the inner op carries a permutation -- therefore aborted at
kernels.h:395. Mirror the non-permuting axpy_to overload: when the
target is empty, initialize it to factor * (perm ^ arg).

Fixes the einsum_manual/different_nested_ranks "ik;mn,j->ijk;nm" case
(mixed-nested-rank outer product with an inner-mode permutation) and
drops its TODO(tot-einsum-empty-result) breadcrumb.
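
A minimal model of the fix, using a hypothetical row-major Mat and a transpose as the only permutation: the permuting AXPY shapes and zero-fills an empty target before accumulating, so the first contribution ends up as factor * (perm ^ arg).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix; an empty Mat models an unallocated tensor.
struct Mat {
  std::size_t rows = 0, cols = 0;
  std::vector<double> v;
  bool empty() const { return v.empty(); }
};

// result += factor * transpose(arg); initializes result if it is empty.
inline void axpy_to_transposed(Mat& result, const Mat& arg, double factor) {
  if (result.empty()) {
    // First contribution: give the target the permuted shape, zero-filled,
    // instead of asserting that it is already allocated.
    result.rows = arg.cols;
    result.cols = arg.rows;
    result.v.assign(arg.v.size(), 0.0);
  }
  assert(result.rows == arg.cols && result.cols == arg.rows);
  for (std::size_t i = 0; i < arg.rows; ++i)
    for (std::size_t j = 0; j < arg.cols; ++j)
      result.v[j * result.cols + i] += factor * arg.v[i * arg.cols + j];
}
```

Calling it twice with the same arguments doubles every element, which confirms that the empty-target branch only sets up storage and does not change the accumulation semantics.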
ArenaTensor carried both a member `operator*=(Scalar)` and was accepted by
the free `operator*=(T&&, N)` in operators_body.ipp (views satisfy
ta_ops_match_tensor_inplace). The two template candidates tie under
gcc-13's overload resolution, so `arena_cell *= scalar` -- exercised by
Tensor<ArenaTensor>::scale_to / subt_to -- fails to compile as ambiguous
on that toolchain (newer gcc and clang pick a candidate and build).

Remove the member; the free operator is the single provider of
`view *= scalar`, matching how TA::Tensor itself relies on it. The
tensor-tensor compound members stay -- they are non-template and win
cleanly over the free templates.

Copilot AI left a comment


Pull request overview

This PR introduces an arena-backed tensor-of-tensors tile type (TA::Tensor<TA::ArenaTensor<T>>) and integrates it through TiledArray’s expression-DSL, einsum, and contraction machinery to drastically reduce inner-cell footprint while preserving existing ToT semantics.

Changes:

  • Adds arena infrastructure + ArenaTensor view semantics and type traits to distinguish owning tensors from non-owning views during operator dispatch.
  • Extends tile ops and einsum/contraction engines to correctly handle view inner-cells (including avoiding unsafe in-place Hadamard paths and enabling slab-aware contraction planning).
  • Updates DistArray construction/replication/retile paths for arena-ToT correctness and adds extensive unit + case-level test coverage.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 6 comments.

File | Description
src/TiledArray/tensor/arena.h | Adds arena allocator/resource utilities used to back arena-ToT slabs and related helpers.
src/TiledArray/tensor/arena_tensor.h | Defines ArenaTensor (non-owning view tensor) behavior and kernel/operator hooks.
src/TiledArray/tensor/type_traits.h | Introduces is_tensor_view_v and related traits to gate operator dispatch for view tensors.
src/TiledArray/dist_array.h | Adds/adjusts ToT constructors and in-place mutation paths (fill*, set, init_elements, etc.) for arena-ToT.
src/TiledArray/array_impl.h | Special-cases retile/reshape for arena-ToT to avoid dangling views by deep-copy rebuild.
src/TiledArray/tile_op/mult.h | Routes view-result Hadamard ops through whole-tile ops to avoid slab alias corruption.
src/TiledArray/tile_op/contract_reduce.h | Adds arena-aware contraction reduction support (plan storage + growth/merge handling).
src/TiledArray/einsum/tiledarray.h | Integrates arena-ToT into einsum paths (replication/deep-copy, regime-A dispatch, lifetime fixes).
src/TiledArray/einsum/cont_engine.h | Wires arena planning/dispatch into contraction engine execution.
src/TiledArray/einsum/mult_engine.h | Extends mult engine behavior for outer-Hadamard × inner-contraction patterns with arena-ToT.
src/TiledArray/tensor/kernels.h | Adds/updates kernel entry points used by arena-aware contraction paths.
tests/tot_construction.cpp | New end-to-end tests for ToT construction + expression-DSL/einsum behavior for Tensor and ArenaTensor inners.
tests/arena.cpp | Adds unit tests for arena allocator/resource behavior and invariants.
tests/arena_tensor.cpp | Adds unit tests for ArenaTensor view semantics and operations.
tests/einsum.cpp | Extends existing einsum tests to cover new nested-rank / arena-ToT scenarios.
tests/cases/CMakeLists.txt (and tests/cases/*) | Adds case binaries/targets for additional coverage/validation scenarios.

Comment on lines +64 to +67
std::shared_ptr<T[]> slice(std::size_t offset, std::size_t /*n_elem*/) const {
  TA_ASSERT(slab_);
  TA_ASSERT(offset % alignof(T) == 0);
  TA_ASSERT(offset <= capacity_);
Member Author

Valid for this branch. It is resolved by the stacked PR #549, which rewrites Arena entirely: the one-shot slice() is removed in favor of claim_bytes(bytes, alignment), which bounds-checks every allocation against its page. Rather than patch slice() here only for #549 to delete it, leaving as-is.

Comment on lines +141 to +142
auto h = arena_->claim<std::byte>(arena_align_up(bytes, alignment));
return h.get();
Member Author

Valid catch — and fixed in the stacked PR #549: the rewritten Arena exposes claim_bytes(bytes, alignment) that aligns the bump cursor to the request, and ArenaResource::do_allocate calls it, so the requested alignment is honored.

Comment on lines +1905 to +1912
std::vector<Future<value_type>> done;
for (const auto& index : *(pmap())) {
  if (is_zero(index)) continue;
  Future<value_type>& fut = find_local(index);
  Future<value_type> mutated = w.taskq.add(
      [op](value_type& tile) -> value_type {
        op(tile);
        return tile;
Member Author

value_type here is the outer ToT tile (TA::Tensor<...>), which is a shallow-copy handle — return tile; copies a reference-counted handle (a pointer plus a refcount bump), not the tile's element data, and the resulting Future<value_type> just carries that handle. So this is not an expensive copy. Returning void would be marginally tidier but is neither a correctness nor a performance issue; leaving as-is.

Comment on lines +982 to +991
std::int64_t fill_local(const V& value = V(), bool skip_set = false) {
  if constexpr (detail::is_tensor_of_tensor_v<value_type> &&
                is_arena_tensor_v<element_type>) {
    return for_each_local_tile_inplace<fence>([value](value_type& outer) {
      for (std::size_t o = 0; o < outer.size(); ++o) {
        auto& cell = outer.data()[o];
        if (cell.empty()) continue;  // skip deliberately-null cells
        cell = value;                // deep copy into the bound arena cell
      }
    });
Member Author

Intentional and documented (see the \note on fill_local): for an arena-backed ToT the array is shaped first — every inner cell allocated — and fill_local is then a pure in-place mutator, so the tiles necessarily already exist and skip_set (whose role is to tolerate already-set tiles) has nothing to act on. The non-arena path keeps the standard skip_set semantics. Happy to add an explicit "skip_set is ignored on the arena-ToT branch" note if that reads clearer.

Comment on lines +1006 to +1024
std::map<std::size_t, Tile> src_tile_cache;
auto source_cell_at =
    [&](const auto& e) -> const typename Tile::value_type* {
  if (!source_elements.includes(e)) return nullptr;
  const auto src_tile_idx = source_array.trange().element_to_tile(e);
  const auto src_ord =
      source_array.trange().tiles_range().ordinal(src_tile_idx);
  auto it = src_tile_cache.find(src_ord);
  if (it == src_tile_cache.end()) {
    it = src_tile_cache
             .emplace(src_ord, source_array.is_zero(src_tile_idx)
                                   ? Tile{}
                                   : source_array.get(src_tile_idx).get())
             .first;
  }
  const Tile& st = it->second;
  if (st.empty()) return nullptr;
  return &st(e);
};
Member Author

The gather-based retile is deliberate for arena ToT (see the comment opening this branch): the generic scatter path (write_tile_block) would rebind the target's null inner cells onto the source tiles' arena slabs, leaving dangling views once the source array is destroyed — so the target rank must pull source tiles and deep-copy. Fetches are de-duplicated per rank via src_tile_cache, so the cost is O(distinct source tiles)/rank, not per target tile. Batching/prefetching those remote fetches is a worthwhile follow-up optimization but is orthogonal to correctness; noting it for later.

Comment on lines +123 to +125
/// `tensor/arena_tensor.h` (`ArenaTensor`, `detail::TensorInterface`) and
/// `external/btas.h` (`btas::TensorView`). Declared here so the operator-body
/// predicates below can consult it without including arena_tensor.h.
Member Author

Fixed in the stacked PR #549 (commit a02f28e): the comment no longer lists TensorInterface as an is_tensor_view specialization (it has none, and is deliberately not a view), and now notes that TensorInterface/TensorMap is excluded because it has value-returning member arithmetic.

@evaleev evaleev merged commit 11f6e8d into master May 19, 2026
13 checks passed
@evaleev evaleev deleted the feature/arena_tensor branch May 19, 2026 20:54