Arena-backed tensor-of-tensors: 8-byte view inner cells through einsum/contraction by evaleev · Pull Request #548 · ValeevGroup/tiledarray

evaleev · 2026-05-19T08:31:48Z

Summary

Deploys TA::Tensor<TA::ArenaTensor<T>> — a tensor-of-tensors (ToT) tile
type whose inner cells are 8-byte non-owning views into an arena "slab"
owned by the outer tile — through TiledArray's einsum and contraction
machinery. This shrinks the ToT inner-tile footprint dramatically (from
~304 B to ~8 B per cell) while keeping Tensor<ArenaTensor> behaving like
an ordinary Tensor<Tensor> at the expression-DSL and einsum level.
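The layout idea can be illustrated with a minimal, standalone sketch. It is not TiledArray code: `CellRecord`, `CellView`, and `OuterTile` are hypothetical names, and for simplicity the per-cell metadata lives in a side vector owned by the tile rather than inside the slab itself. What it shows is only that the per-cell handle held by the outer tile can be a single pointer, while one allocation backs all inner cells.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-cell metadata, owned by the outer tile.
struct CellRecord {
  std::size_t size;  // number of elements in this inner cell
  double* data;      // start of the cell's elements inside the slab
};

// Hypothetical non-owning view: exactly one pointer wide.
struct CellView {
  CellRecord* rec = nullptr;
  bool empty() const { return rec == nullptr; }
  std::size_t size() const { return rec ? rec->size : 0; }
  double* data() const { return rec ? rec->data : nullptr; }
};
static_assert(sizeof(CellView) == sizeof(void*), "view is one pointer wide");

// Hypothetical outer tile: owns one slab that backs every inner cell.
class OuterTile {
 public:
  explicit OuterTile(const std::vector<std::size_t>& inner_sizes) {
    std::size_t total = 0;
    for (std::size_t n : inner_sizes) total += n;
    slab_.assign(total, 0.0);              // one allocation for all cells
    records_.reserve(inner_sizes.size());  // no reallocation after this point
    std::size_t offset = 0;
    for (std::size_t n : inner_sizes) {
      records_.push_back(CellRecord{n, slab_.data() + offset});
      cells_.push_back(CellView{&records_.back()});
      offset += n;
    }
  }
  // Copying would leave the views pointing into the source tile's slab.
  OuterTile(const OuterTile&) = delete;
  OuterTile& operator=(const OuterTile&) = delete;

  CellView& cell(std::size_t i) {
    assert(i < cells_.size());
    return cells_[i];
  }

 private:
  std::vector<double> slab_;
  std::vector<CellRecord> records_;
  std::vector<CellView> cells_;
};
```

Because the views own nothing, their validity is tied to the lifetime of the tile that owns the slab. Several of the fixes listed in this PR (replication, retile, consume paths) deal with exactly that constraint.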

Highlights:

- Arena infrastructure: allocator, slab plan helpers, and ToT construction kernels (arena_outer_init, make_nested_tile, arena_trivial_{unary,binary}); arena-aware fill/set/init_elements.
- Tile ops: arena-aware add/subt/scale/neg, and a corrected Mult consume path. In-place mult_to on shallow arena-ToT operand tiles corrupts shared slabs, so view-cell results are routed through a fresh-result whole-tile op (uses_tile_op_).
- einsum / contraction engine: regime-A (outer-Hadamard) arena plans and dispatch; ToT×ToT Hadamard with view inner cells routed via outer ops; permuted inner contractions handled via slab-level hoist; MultEngine support for Hadamard-outer × inner-contraction.
- Multi-rank einsum: the replicate_array path now builds fresh slab-backed tiles instead of copying 8-byte views (which would dangle), and the Replicator reserve hint is guarded for pmaps without a known local size (HashPmap).
- Bug fix: permuting Tensor::axpy_to now initializes an empty target instead of asserting in inplace_tensor_op; this also fixes a pre-existing einsum abort on different_nested_ranks.

Test plan

- Full TA serial unit suite (tiledarray/unit/run-np-1) passes 100%.
- ToT einsum end-to-end harness and arena-kernel tests pass.
- different_nested_ranks einsum case (previously aborting) passes.
- Downstream MPQC csv-cck validation tests pass (debug + release), including the np=2 he10-csv-cck case.
- CI green on this PR (the prior CI failure, ContractionArenaPlan not nameable for non-ToT operands, is fixed in 6e6e8efd6).

…e binaries

- arena_sizeof_invariant_suite: drop platform-specific absolute baselines (328/16/248 were Apple-arm64/libc++ only); keep relative ImplLayoutAllocator == ImplLayoutMaster invariant + monostate static_asserts.
- cont_engine: reset arena_plan_ after std::move into op_ so later reads see "no plan" rather than a moved-from optional.
- arena_kernels: one-line intent note on trivial kernels' tight packing.

Treat ArenaTensor as a first-class tensor in TA's trait machinery so kernel-level dispatches (tensor_reduce, inplace_tensor_op, tensor_op, unary, binary, ...) and operators match the same overloads they do for TA::Tensor<double> -- without bespoke arena overloads in kernels.h.

The key insight: ArenaTensor is structurally a flat contiguous tensor (.data() + .size()); the only reason it was kept out of is_tensor_helper was to avoid being dragged into value-returning operator paths it can't fulfill (a view has no allocator to materialize fresh storage). Address that as a separate gate rather than excluding ArenaTensor from being a tensor at all.

Trait split:
- is_tensor_helper<ArenaTensor> = true. is_contiguous_tensor_helper too.
- is_tensor_view<T> -- new predicate; true for non-owning views that lack value-returning member arithmetic (ArenaTensor, btas::TensorView). TensorInterface is *not* a view here because it materializes a fresh result from member add/subt/mult/etc.
- ta_ops_match_tensor (operators_body.ipp value-returning gate): adds `&& !is_tensor_view_v<T>`; views opt out of `+`, `-`, `*`, unary `-`, tensor*scalar, scalar*tensor, permute*tensor.
- ta_ops_match_tensor_inplace (new; operators_body.ipp compound-assignment gate): default is `is_nested_tensor_v<T>`, which accepts views since they can be mutated in place.
- btas.h specializes both predicates to false for btas::Tensor (was already done for the freestanding one).

ArenaTensor surface:
- Member compound operators (operator+=, -=, *=, scalar *=) routing to free CPOs.
- Member in-place CPO mirrors (add_to, subt_to, mult_to, scale_to, neg_to) so tile_interface paths that call `arg.add_to(other)` work uniformly.
- No member operator+/-/* or value-returning member add/subt/mult/scale/neg/clone. Those would require allocation.

Tensor.h changes:
- value_converter falls through to identity for views (rebind-on-copy, no clone).
- Tensor(range, value) ctor: when value_type is a view, copy each cell by value instead of calling Clone (which delegates to a missing member).
- Permutation paths (Tensor::neg(perm), subt(right, perm), scale(perm), inner-permute in copy-with-perm ctor) bail with TA_EXCEPTION for view inner cells -- views can't permute in place.
- Drop the view-aware Tensor<View>::scale_to/add_to/subt_to/mult_to overloads added in earlier iterations; the legacy ones now work because ArenaTensor has compound operators.
- Tighten one residual ambiguity: subt(Right) const value-returning excludes the arena-pair case (handled by a dedicated overload above).

Kernels.h: drop the ad-hoc tensor_reduce(ArenaTensor) and inplace_tensor_op(ArenaTensor) overloads added in earlier iterations.

Tests:
- New arena_tensor.cpp / arena_tensor_kernels.cpp test suites covering ArenaTensor sizeof invariants, SIMD alignment, member compound operators, in-place CPO mirrors, ToT reductions (sum/product/squared_norm), serialization round-trip, regime-A einsum dispatch, and the in-place ops smoketest.
- Update the is_tensor_view_v predicate test to reflect that TensorInterface is no longer in the view set.
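The two-gate design described in this commit can be sketched in isolation. The snippet below is a toy under stated assumptions, not the TiledArray traits: `Owning`, `View`, `is_tensor`, `is_tensor_view`, `raw`, `len`, and `has_plus` are all hypothetical names in a `sketch` namespace. It shows a value-returning `operator+` that views opt out of, next to an in-place `operator+=` that views keep.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {

struct Owning { std::vector<double> v; };                 // can allocate a result
struct View { double* p = nullptr; std::size_t n = 0; };  // non-owning

// Both types count as tensors ...
template <typename T> struct is_tensor : std::false_type {};
template <> struct is_tensor<Owning> : std::true_type {};
template <> struct is_tensor<View> : std::true_type {};

// ... but only View is flagged as a view.
template <typename T> struct is_tensor_view : std::false_type {};
template <> struct is_tensor_view<View> : std::true_type {};

inline double* raw(Owning& t) { return t.v.data(); }
inline const double* raw(const Owning& t) { return t.v.data(); }
inline double* raw(const View& t) { return t.p; }
inline std::size_t len(const Owning& t) { return t.v.size(); }
inline std::size_t len(const View& t) { return t.n; }

// Value-returning gate: tensors that are NOT views (a fresh result is needed).
template <typename T, typename = std::enable_if_t<is_tensor<T>::value &&
                                                  !is_tensor_view<T>::value>>
T operator+(const T& a, const T& b) {
  assert(len(a) == len(b));
  T r = a;
  for (std::size_t i = 0; i < len(r); ++i) raw(r)[i] += raw(b)[i];
  return r;
}

// In-place gate: any tensor, views included (no allocation needed).
template <typename T, typename = std::enable_if_t<is_tensor<T>::value>>
T& operator+=(T& a, const T& b) {
  assert(len(a) == len(b));
  for (std::size_t i = 0; i < len(a); ++i) raw(a)[i] += raw(b)[i];
  return a;
}

// Detects whether `a + b` is well-formed for T.
template <typename T, typename = void>
struct has_plus : std::false_type {};
template <typename T>
struct has_plus<T, std::void_t<decltype(std::declval<const T&>() +
                                        std::declval<const T&>())>>
    : std::true_type {};

}  // namespace sketch
```

With this split, `has_plus<View>` is false, so an accidental `view + view` fails at overload resolution instead of reaching a code path that would need to allocate.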

Introduce a fused in-place AXPY CPO so the einsum generic fallback and the contraction engine's scale inner-product path no longer build a scaled temporary tile. The temporary required value-returning `scale`, which a view tile type (ArenaTensor) cannot provide -- it has no allocator. The fused form needs only in-place mutation, which views support.

New CPO (tile_interface/add.h):
- axpy_to(result, arg, factor) -- result += arg * factor
- axpy_to(result, arg, factor, perm) -- result += (perm ^ arg) * factor

Semantics are BLAS AXPY: `_to` marks the in-place form (cf. add_to vs add). Distinct from add_to(result, arg, factor), whose legacy semantics is `(result + arg) * factor` -- that one scales the accumulated result too, so it is not a drop-in for fused accumulation.

Member overloads:
- Tensor::axpy_to(right, factor[, perm]) -- the lambda body dispatches by element type so one body serves flat and ToT tensors (leaf: l += r*f; cell: l.axpy_to(r, f)). The perm form bails for view inner cells.
- ArenaTensor::axpy_to(other, factor[, perm]) -- routes to the free arena axpy_to; perm form rejects (views cannot permute in place).
- has_member_function_axpy_to_anyreturn detector generated in type_traits.h.

Renamed the existing arena free function `axpy(dst, alpha, src)` to `axpy_to(dst, src, alpha)` -- it is in-place, so the `_to` suffix and the `(result, arg, factor)` argument order now match TA's CPO convention.

einsum/tiledarray.h: the generic-fallback mixed scalar x ToT loop now does `axpy_to(el, tensor, scalar)` instead of `add_to(el, scale(tensor, scalar))`.

cont_engine.h: the scale inner-product fallback drops the value-returning `scal_op` lambda. The Contraction outer product is now a fused `axpy_to(result, tot, scalar[, perm])` (this is the long-standing "TODO implement X-permuting AXPY"); the Hadamard outer product keeps the value-returning `scale` assignment, gated to non-view result cells.

Tests: axpy_to correctness for flat Tensor<double> and ToT Tensor<ArenaTensor> (verifies axpy semantics -- factor scales only the added operand).
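The difference between the two semantics is easy to see on plain vectors. The sketch below uses hypothetical free functions (`axpy_to_sketch`, `scaled_add_to_sketch`) over `std::vector<double>`; it only mirrors the formulas quoted in this commit message and is not the TiledArray CPO.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// BLAS-style AXPY as described above: result += arg * factor.
// The factor scales only the operand being added.
inline void axpy_to_sketch(std::vector<double>& result,
                           const std::vector<double>& arg, double factor) {
  assert(result.size() == arg.size());
  for (std::size_t i = 0; i < result.size(); ++i) result[i] += arg[i] * factor;
}

// The legacy 3-argument add_to semantics as described above:
// result = (result + arg) * factor. The accumulated value is scaled too.
inline void scaled_add_to_sketch(std::vector<double>& result,
                                 const std::vector<double>& arg,
                                 double factor) {
  assert(result.size() == arg.size());
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = (result[i] + arg[i]) * factor;
}
```

Starting from result = {1, 2}, arg = {10, 20}, and factor = 2, the first form gives {21, 42} and the second gives {22, 44}. In a loop that accumulates many contributions, the second form would rescale earlier contributions on every step, which is why it cannot stand in for fused accumulation.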

- merge arena_tensor_kernels.h into arena_kernels.h; add make_nested_tile, a two-pass ToT outer-tile builder that dispatches on arena vs owning inner tiles (declare inner ranges -> allocate one slab -> fill cells)
- DistArray: add the ToT range_fn constructor and init_tiles_nested; route fill_local/fill/fill_random/init_elements/set through arena-aware tile builders (for_each_local_tile_inplace, make_arena_nested_tile) for Tensor<ArenaTensor> inners
- ArenaTensor::operator=: two regimes -- deep element copy when the assignee is bound, shallow rebind when null; drop the move-assignment so rvalues follow the same regimes
- type_traits: add default_freestanding_tensor (view -> owning-tensor map)
- tests/tot_construction.cpp: construction, fill, init_elements, set, and a multi-rank serialization round-trip, over plain and arena ToT inner tiles
- einsum.cpp: breadcrumb on the pre-existing different_nested_ranks abort
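The two assignment regimes can be sketched on a toy view. This is an illustration only: `CellViewSketch` is a hypothetical type that carries a pointer and a size directly, unlike the 8-byte ArenaTensor, and the extent check is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view with the two assignment regimes described above.
struct CellViewSketch {
  double* data = nullptr;
  std::size_t size = 0;

  CellViewSketch() = default;
  CellViewSketch(double* d, std::size_t n) : data(d), size(n) {}
  CellViewSketch(const CellViewSketch&) = default;

  bool bound() const { return data != nullptr; }

  // No move-assignment is declared, so rvalues also resolve to this operator
  // and follow the same two regimes.
  CellViewSketch& operator=(const CellViewSketch& other) {
    if (this == &other) return *this;
    if (bound()) {
      // Regime 1: the assignee is bound, so copy elements into its storage.
      assert(size == other.size);
      if (data != other.data) std::copy(other.data, other.data + size, data);
    } else {
      // Regime 2: the assignee is null, so rebind it to the other's storage.
      data = other.data;
      size = other.size;
    }
    return *this;
  }
};
```

The bound regime is what lets a fill loop write `cell = value` into pre-shaped cells without changing which storage they point at. The null regime is the one that can create aliases, so it is the one to watch when the source's storage may be destroyed first.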

Route the elementwise binary/unary tile ops on Tensor<ArenaTensor> through the arena binary kernel so c = a + b, a - b, scaled variants, scalar scaling, and negation work for arena-inner tensors-of-tensors via the expression DSL. add gains dedicated arena overloads (factor / perm / factor+perm); subt and mult gain in-body arena branches. Permuted ToT ops fall through to the non-permuted kernel on a trivial permutation and throw otherwise, since cell reordering of arena inners is not yet supported. Extend tot_construction.cpp with a generic run_tot_expr harness and add/subt/mult/scale/neg/scaled-add/scaled-subt cases for both Tensor and ArenaTensor inners. Hadamard ToT mult for arena inners is left unwired -- it routes through MultEngine/ContEngine, which needs a value-returning inner tile op an arena cell cannot provide; tracked by TODO(arena-tot-mult).

… ops

MultEngine::init_struct always calls ContEngine::init_inner_tile_op, which unconditionally instantiates a value-returning inner mult/contraction op on the inner cell type. A non-owning view inner cell (ArenaTensor) cannot host such an op, so a("i;j")*b("i;j") on DistArray<Tensor<ArenaTensor>> failed to compile -- even though that inner op is dead code for a pure Hadamard: MultEngine::make_tile_op passes no inner op and the outer Mult tile op already recurses through Tensor<ArenaTensor>::mult.

Split init_inner_tile_op into a thin dispatcher and an init_inner_tile_op_owning_ helper holding the existing owning-cell builder. The one case that cannot instantiate that helper -- ToT x ToT with a view inner cell -- is handled directly: pure Hadamard needs no inner op, everything else throws (deferred). tot x t scale paths keep their existing view-aware handling since tot_x_tot is false there.

Re-enables the mult_arena_inner test.

Adds test_tot_einsum_contraction -- runs TA::einsum("ij;mo,ij;on->ij;mn") (outer Hadamard, inner contraction) and checks the result against a Tensor<Tensor<double>> reference run of the same expression and data. Wired up for a TA::Tensor inner cell. The ArenaTensor-inner case is left as a TODO: TA::einsum does not compile for DistArray<Tensor<ArenaTensor>> because the legacy per-cell einsum path (tensor_hadamard / tensor_contract and the element_*_op lambdas in einsum/tiledarray.h) value-returns inner tensors and calls ArenaTensor::mult / ArenaTensor::permute. That legacy path coexists with the regime-A arena path and is still instantiated; it needs to be if-constexpr-guarded out for view inner cells.

TA::einsum for DistArray<Tensor<ArenaTensor>> did not compile: the outer-Hadamard ("hadamard reduction") branch's legacy per-cell ToT x ToT path value-returns inner tensors (tensor_hadamard / tensor_contract via the element_*_op lambdas) and calls ArenaTensor::mult / ArenaTensor::permute, none of which a non-owning view inner cell supports. That path coexists with the regime-A arena fast path (run_regime_a_arena, tried first) and was still being instantiated. if-constexpr-guard the legacy ToT x ToT element loop out for is_tensor_view_v inner cells. For a view inner cell only the regime-A path can produce results; if it was inactive (a permuted inner contraction -- see TODO(arena-einsum-perm)) the guard throws instead of falling through. The tot x t scale path is untouched: it uses in-place axpy_to and is already view-safe. Switches the einsum_contraction_arena_inner test to a regime-A annotation ("ijk;mo,ijk;on->ij;mn" -- outer Hadamard + outer contraction + inner contraction); a pure-Hadamard outer is delegated to the expression DSL instead. The arena result matches the Tensor<Tensor<double>> reference.

arena_inner_permute<OuterTensor>(src, inner_perm) builds a fresh slab-backed ToT tile with the same outer layout as src, but with every inner cell's range and data permuted by inner_perm (result_cell(inner_perm * i) == src_cell(i)). It is the slab-level counterpart of a per-cell permute: the owning tile allocates one new slab via arena_outer_init and rewrites each cell with a strided scatter, so no non-owning view inner cell is ever asked to value-return. This is the primitive the owning Tensor<ArenaTensor>::permute and the regime-A einsum hoist will use to handle inner-mode permutations without a per-cell permute(const ArenaTensor&). Tests cover a transpose and a rank-3 permutation over non-uniform inner cells, checked against a hand-rolled reference.
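A single-cell version of the strided scatter can be sketched as follows. The function and type names (`permute_cell_sketch`, `PermutedCell`, `row_major_strides`) are hypothetical, the layout is assumed row-major, and the permutation convention is assumed to be `(perm * i)[perm[k]] = i[k]`, which is one reading of the `result_cell(inner_perm * i) == src_cell(i)` relation quoted above. The real primitive works on a whole slab; this only shows the per-cell index arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Index = std::vector<std::size_t>;

// Row-major strides for the given extents.
inline Index row_major_strides(const Index& extents) {
  Index s(extents.size(), 1);
  for (std::size_t k = extents.size(); k-- > 1;) s[k - 1] = s[k] * extents[k];
  return s;
}

struct PermutedCell {
  Index extents;
  std::vector<double> data;
};

// Writes src into a fresh buffer so that dst(perm * i) == src(i),
// with (perm * i)[perm[k]] = i[k] (assumed convention).
inline PermutedCell permute_cell_sketch(const std::vector<double>& src,
                                        const Index& src_extents,
                                        const Index& perm) {
  const std::size_t rank = src_extents.size();
  assert(perm.size() == rank);
  std::size_t volume = 1;
  for (std::size_t e : src_extents) volume *= e;
  assert(src.size() == volume);

  Index dst_extents(rank);
  for (std::size_t k = 0; k < rank; ++k) dst_extents[perm[k]] = src_extents[k];
  const Index dst_strides = row_major_strides(dst_extents);

  // Destination stride of a unit step along source mode k.
  Index scatter(rank);
  for (std::size_t k = 0; k < rank; ++k) scatter[k] = dst_strides[perm[k]];

  std::vector<double> dst(src.size());
  Index idx(rank, 0);  // row-major odometer over the source cell
  for (std::size_t ord = 0; ord < src.size(); ++ord) {
    std::size_t dst_ord = 0;
    for (std::size_t k = 0; k < rank; ++k) dst_ord += idx[k] * scatter[k];
    dst[dst_ord] = src[ord];
    for (std::size_t k = rank; k-- > 0;) {  // advance the odometer
      if (++idx[k] < src_extents[k]) break;
      idx[k] = 0;
    }
  }
  return PermutedCell{dst_extents, dst};
}
```

For a 2x3 cell holding 0..5 and perm = {1, 0}, the result is the 3x2 transpose {0, 3, 1, 4, 2, 5}. Because the write goes into a buffer chosen by the caller, no view is ever asked to produce a new owning value.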

Tensor<ArenaTensor>::permute previously routed through the generic Tensor(other, perm) ctor, whose allocate-then-fill shape does not fit the arena slab model: its inner-permute pass threw for view inner cells, and its outer-permute path did not co-own the source slab. Route arena-inner permute around that ctor: the outer part reorders the 8-byte cell views shallowly (sharing the source slab via keep-alive) with arena_permute_shallow; a non-trivial inner part rewrites every cell into a fresh slab with arena_inner_permute. This makes the owning tile's permute carry out the inner-mode permutation -- no per-cell permute(ArenaTensor), no view value-returns. Also wires up arena_permute_shallow, which existed but was previously unused. Test: a bipartite (outer + inner) transpose of a Tensor<ArenaTensor> tile.

regime-A einsum previously bailed (plan inactive -> legacy fallback, which cannot run for ArenaTensor inners) whenever an inner contraction was not canonically aligned -- i.e. TensorContractionPlan::do_perm.{A,B,C} set. Instead of bailing, run_regime_a_arena now hoists the inner permutations to slab-level rewrites: each operand tile whose cells need reordering is run through arena_inner_permute (by c_plan.perm.{A,B}) before the per-cell op, and the result tile is run through arena_inner_permute (by perm.C.inv()) after accumulation. The per-cell op is therefore always a single canonical GEMM -- accumulate() now calls fused_contraction_inplace with the plan's (canonical) GemmHelper directly. No per-cell view permute, no value-return; tensor_contract / tensor_contract_to are untouched. This unifies the path for arena and plain TA::Tensor inner cells (both are arena-eligible outers). Tests: a canonical and a non-canonical ("ijk;om,ijk;on->ij;nm") inner contraction, arena and TA::Tensor inners, each checked against a Tensor<Tensor<double>> reference.

regime-A's inner-Hadamard path previously only accepted the no-perm and perm_b branches, and even perm_b was wrong: fused_hadamard_inplace does a flat r += l * rr, which aborts the is_range_set_congruent check whenever an operand's inner cells are not already in C-layout. Treat a non-canonical inner Hadamard the same as a non-canonical inner contraction: run_regime_a_arena now hoists each operand's inner permutation (h_plan.perm.AC / perm.BC) to a slab-level arena_inner_permute rewrite, so both operands reach C-layout before the per-cell op. This fixes the pre-existing perm_b bug and adds perm_to_c / perm_a / else coverage; make_regime_a_arena_plan no longer bails on those branches. Adds regime-A-exercising cases to einsum_manual/equal_nested_ranks (permuted inner contraction and permuted inner Hadamard, no outer permutation) so manual_eval is an independent oracle, plus matching arena/Tensor inner cases in tot_construction.

The 3-arg Tensor::mult(right, perm) only had an arena branch for the tot x tot case; a plain scalar tile times an arena ToT tile (t x tot) fell through to the generic binary() path, whose element op evaluates `scalar * ArenaTensor` as a value. ArenaTensor is a non-owning view and has no value-returning operator*, so that failed to compile. This surfaced from MPQC: TA::einsum(plain_array, tot_array, ...) recurses into einsum's pure-Hadamard branch (C = A * B), which the expression DSL lowers to a Mult tile op of Tensor<double> x Tensor<ArenaTensor<double>>. Route t x tot through the existing 2-arg arena overload (scales each inner cell into a fresh slab) and apply any non-trivial result permutation as a shallow outer reindex of that slab. Adds einsum_t_x_tot_{tensor,arena}_inner cases checking the arena path against an identical Tensor<double>-inner reference run.

Three gaps surfaced wiring TA::Tensor<TA::ArenaTensor<T>> through MPQC's CSV coupled-cluster layer:
- The 3-arg Tensor::mult(right, perm) handled t x tot but not the mirror tot x t (arena ToT tile times a plain scalar tile); it fell through to the generic binary() whose element op needs a value-returning ArenaTensor * scalar. Route it through the 2-arg arena overload like t x tot.
- ArenaTensor had no sum(): einsum's DeNest path (sum_tot_2_tos) reduces each inner cell to a scalar via the free sum(), which calls .sum(). A scalar reduction allocates nothing, so it is valid on a view.
- size_of(Tensor<ArenaTensor>) recurses per inner cell into size_of(ArenaTensor), which did not exist. Add an overload counting the one-pointer view plus its in-arena cell; summed over the outer tile this accounts for the slab.

Make Tensor<ArenaTensor> tiles work through the einsum and SUMMA contraction machinery, as exercised by MPQC's CSV-CCk integral transformation:
- einsum: hoist the temporary sub-World vector ahead of the AB/C ArrayTerms so a sub-World outlives the DistArrays bound to it; lazy_deleter would otherwise dereference a destroyed World while unwinding.
- einsum regime-A: drop the over-conservative bail on a non-identity result permutation -- run_regime_a_arena already applies it (tile.permute(pc), as the legacy path does).
- cont_engine: support inner contraction (incl. inner outer-product) for view inner cells, routed through the arena fast path. A non-identity inner result permutation rides on op_'s post-processing permute step rather than the (perm-free) fused inner op.
- contraction SUMMA: union-shape the arena ToT result across K-panels via reserve_and_construct / grow_to_cover.
- make_with_new_trange: arena-ToT-aware retile that pulls source tiles and deep-copies inner cells instead of rebinding source arena slabs.

…nsume

Completes Tensor<ArenaTensor> support through the expression engine:
- MultEngine::make_tile_op routes a Hadamard-outer / contraction-inner ToT product to a whole-tile arena op (arena_hadamard_tile_op_) instead of a value-returning per-cell op, which a view inner cell cannot host.
- ContEngine builds that op as arena_hadamard_inner_contract: a fresh slab-backed result shaped per-cell by the inner GEMM, with the inner result permutation applied as a slab-level post-pass.
- Mult gains a whole-tile-op constructor (tile_op_tag); eval() delegates to it when set.
- Mult's consume eval overloads no longer do in-place mult_to on view-inner-cell (ArenaTensor) tiles: such a tile is a shallow handle whose arena slab may be aliased by a persistent array, so consuming it corrupts that operand. View-cell tiles always produce a fresh result.
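The hazard in the consume path can be reproduced with any shallow handle. The sketch below uses a hypothetical `TileHandle` that shares its storage through `std::shared_ptr`; it is not the TiledArray tile type, and `mult_in_place` / `mult_fresh` are illustrative names for the two strategies.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical shallow tile handle: copies of the handle share one slab.
struct TileHandle {
  std::shared_ptr<std::vector<double>> slab;
};

inline TileHandle make_tile(std::vector<double> v) {
  return {std::make_shared<std::vector<double>>(std::move(v))};
}

// "Consume" strategy: reuse the left operand's storage for the result.
// Every other handle to that slab observes the overwrite.
inline TileHandle mult_in_place(TileHandle left, const TileHandle& right) {
  assert(left.slab->size() == right.slab->size());
  for (std::size_t i = 0; i < left.slab->size(); ++i)
    (*left.slab)[i] *= (*right.slab)[i];
  return left;
}

// Fresh-result strategy: allocate new storage and leave the operands intact.
inline TileHandle mult_fresh(const TileHandle& left, const TileHandle& right) {
  assert(left.slab->size() == right.slab->size());
  TileHandle result = make_tile(*left.slab);  // deep copy of the elements
  for (std::size_t i = 0; i < result.slab->size(); ++i)
    (*result.slab)[i] *= (*right.slab)[i];
  return result;
}
```

Passing a copy of the handle to `mult_in_place` looks like it hands over a temporary, but the copy still points at the same slab, so an array that keeps the original handle sees its data change. That is the reason given above for always producing a fresh result for view-cell tiles.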

einsum's special-Hadamard branch replicates the small operand via replicate_array -> make_replicated -> replicate_tensor. Two gaps broke this for an arena tensor-of-tensors on a multi-rank run:
- Replicator's ctor called pmap()->local_size() purely as a reserve() hint, which asserts on a pmap that does not precompute local size (e.g. HashPmap). Skip the hint when known_local_size() is false.
- replicate_tensor std::copy'd the 8-byte ArenaTensor view cells, so the replicated tile aliased the source tile's arena slab and dangled once the (temporary, post-make_replicated) source array was destroyed. Build the replicated tile as a fresh slab-backed tile and deep-copy each replicated inner cell's element data.
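The second fix amounts to rebuilding the views on top of a new slab instead of copying them. A minimal sketch, with hypothetical `View`, `Tile`, `make_tile`, and `replicate_deep` names and a `std::vector<double>` standing in for the arena slab:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct View {  // hypothetical non-owning inner cell
  double* data;
  std::size_t size;
};

struct Tile {  // hypothetical outer tile: one slab plus views into it
  std::shared_ptr<std::vector<double>> slab;
  std::vector<View> cells;
};

// Builds a slab-backed tile from per-cell element data.
inline Tile make_tile(const std::vector<std::vector<double>>& cells) {
  Tile t;
  std::size_t total = 0;
  for (const auto& c : cells) total += c.size();
  t.slab = std::make_shared<std::vector<double>>(total);
  std::size_t offset = 0;
  for (const auto& c : cells) {
    double* out = t.slab->data() + offset;
    std::copy(c.begin(), c.end(), out);
    t.cells.push_back(View{out, c.size()});
    offset += c.size();
  }
  return t;
}

// Deep replication: allocate a fresh slab, copy each cell's elements into it,
// and bind the new views to the new slab. Copying src.cells alone would leave
// the result pointing into src's slab.
inline Tile replicate_deep(const Tile& src) {
  Tile dst;
  std::size_t total = 0;
  for (const View& c : src.cells) total += c.size;
  dst.slab = std::make_shared<std::vector<double>>(total);
  std::size_t offset = 0;
  for (const View& c : src.cells) {
    double* out = dst.slab->data() + offset;
    std::copy(c.data, c.data + c.size, out);
    dst.cells.push_back(View{out, c.size});
    offset += c.size;
  }
  return dst;
}
```

After `replicate_deep`, the source tile can be destroyed and the replica's views remain valid, since every pointer they hold lies inside the replica's own slab.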

ContractionArenaPlan::operand_inner_ranges declared its return type as std::vector<Result::value_type::range_type>, ill-formed when Result is a plain (non-ToT) tensor. make_contraction_arena_plan names the class in its return type unconditionally (before an if-constexpr returns nullopt for non-ToT), so the bad type was a hard error -- it broke the ta_test build. Give operand_inner_ranges a deduced return type so the type is only formed when the function is actually instantiated.
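The language rule behind this fix can be shown with a small stand-in. All names below (`PlanSketch`, `NestedTile`, `FlatTile`, `make_plan_sketch`, `is_nested`) are hypothetical. The point is that a member function with a deduced return type does not form `Result::value_type::range_type` until the function is actually used, so the class template can be named for a flat tile type.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>
#include <vector>

struct Range { int extent = 0; };
struct Inner { using range_type = Range; Range r; };
struct NestedTile { using value_type = Inner; std::vector<Inner> cells; };
struct FlatTile { using value_type = double; };  // double has no range_type

template <typename Result>
struct PlanSketch {
  // If the return type were spelled out as
  //   std::vector<typename Result::value_type::range_type>
  // then instantiating PlanSketch<FlatTile> would be a hard error, because
  // member declarations are instantiated together with the class.
  // With `auto`, the type is only formed when the body is instantiated.
  auto inner_ranges(const Result& tile) const {
    std::vector<typename Result::value_type::range_type> out;
    for (const auto& c : tile.cells) out.push_back(c.r);
    return out;
  }
};

// True when T::value_type::range_type names a type.
template <typename T, typename = void>
struct is_nested : std::false_type {};
template <typename T>
struct is_nested<T, std::void_t<typename T::value_type::range_type>>
    : std::true_type {};

// Names PlanSketch<Result> in the return type unconditionally, as the factory
// described above does, and returns nullopt for flat tiles.
template <typename Result>
std::optional<PlanSketch<Result>> make_plan_sketch() {
  if constexpr (!is_nested<Result>::value)
    return std::nullopt;
  else
    return PlanSketch<Result>{};
}
```

`make_plan_sketch<FlatTile>()` compiles and returns an empty optional, while `make_plan_sketch<NestedTile>()` returns a plan whose `inner_ranges` can be called normally.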

Tensor::axpy_to(arg, factor, perm) went straight to inplace_binary, which asserts the target is non-empty. The first contribution into an unallocated tensor -- e.g. a tensor-of-tensors contraction result inner cell, when the inner op carries a permutation -- therefore aborted at kernels.h:395. Mirror the non-permuting axpy_to overload: when the target is empty, initialize it to factor * (perm ^ arg). Fixes the einsum_manual/different_nested_ranks "ik;mn,j->ijk;nm" case (mixed-nested-rank outer product with an inner-mode permutation) and drops its TODO(tot-einsum-empty-result) breadcrumb.

ArenaTensor carried both a member `operator*=(Scalar)` and was accepted by the free `operator*=(T&&, N)` in operators_body.ipp (views satisfy ta_ops_match_tensor_inplace). The two template candidates tie under gcc-13's overload resolution, so `arena_cell *= scalar` -- exercised by Tensor<ArenaTensor>::scale_to / subt_to -- fails to compile as ambiguous on that toolchain (newer gcc and clang pick a candidate and build). Remove the member; the free operator is the single provider of `view *= scalar`, matching how TA::Tensor itself relies on it. The tensor-tensor compound members stay -- they are non-template and win cleanly over the free templates.

Copilot

Pull request overview

This PR introduces an arena-backed tensor-of-tensors tile type (TA::Tensor<TA::ArenaTensor<T>>) and integrates it through TiledArray’s expression-DSL, einsum, and contraction machinery to drastically reduce inner-cell footprint while preserving existing ToT semantics.

Changes:

Adds arena infrastructure + ArenaTensor view semantics and type traits to distinguish owning tensors from non-owning views during operator dispatch.
Extends tile ops and einsum/contraction engines to correctly handle view inner-cells (including avoiding unsafe in-place Hadamard paths and enabling slab-aware contraction planning).
Updates DistArray construction/replication/retile paths for arena-ToT correctness and adds extensive unit + case-level test coverage.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 6 comments.


File	Description
src/TiledArray/tensor/arena.h	Adds arena allocator/resource utilities used to back arena-ToT slabs and related helpers.
src/TiledArray/tensor/arena_tensor.h	Defines `ArenaTensor` (non-owning view tensor) behavior and kernel/operator hooks.
src/TiledArray/tensor/type_traits.h	Introduces `is_tensor_view_v` and related traits to gate operator dispatch for view tensors.
src/TiledArray/dist_array.h	Adds/adjusts ToT constructors and in-place mutation paths (`fill*`, `set`, `init_elements`, etc.) for arena-ToT.
src/TiledArray/array_impl.h	Special-cases retile/reshape for arena-ToT to avoid dangling views by deep-copy rebuild.
src/TiledArray/tile_op/mult.h	Routes view-result Hadamard ops through whole-tile ops to avoid slab alias corruption.
src/TiledArray/tile_op/contract_reduce.h	Adds arena-aware contraction reduction support (plan storage + growth/merge handling).
src/TiledArray/einsum/tiledarray.h	Integrates arena-ToT into einsum paths (replication/deep-copy, regime-A dispatch, lifetime fixes).
src/TiledArray/einsum/cont_engine.h	Wires arena planning/dispatch into contraction engine execution.
src/TiledArray/einsum/mult_engine.h	Extends mult engine behavior for outer-Hadamard × inner-contraction patterns with arena-ToT.
src/TiledArray/tensor/kernels.h	Adds/updates kernel entry points used by arena-aware contraction paths.
tests/tot_construction.cpp	New end-to-end tests for ToT construction + expression-DSL/einsum behavior for Tensor and ArenaTensor inners.
tests/arena.cpp	Adds unit tests for arena allocator/resource behavior and invariants.
tests/arena_tensor.cpp	Adds unit tests for `ArenaTensor` view semantics and operations.
tests/einsum.cpp	Extends existing einsum tests to cover new nested-rank / arena-ToT scenarios.
tests/cases/CMakeLists.txt (and tests/cases/*)	Adds case binaries/targets for additional coverage/validation scenarios.


evaleev · 2026-05-19T20:48:11Z

+  std::shared_ptr<T[]> slice(std::size_t offset, std::size_t /*n_elem*/) const {
+    TA_ASSERT(slab_);
+    TA_ASSERT(offset % alignof(T) == 0);
+    TA_ASSERT(offset <= capacity_);


Valid for this branch. It is resolved by the stacked PR #549, which rewrites Arena entirely: the one-shot slice() is removed in favor of claim_bytes(bytes, alignment), which bounds-checks every allocation against its page. Rather than patch slice() here only for #549 to delete it, leaving as-is.

evaleev · 2026-05-19T20:48:13Z

+    auto h = arena_->claim<std::byte>(arena_align_up(bytes, alignment));
+    return h.get();


Valid catch — and fixed in the stacked PR #549: the rewritten Arena exposes claim_bytes(bytes, alignment) that aligns the bump cursor to the request, and ArenaResource::do_allocate calls it, so the requested alignment is honored.
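The distinction at issue is between rounding the requested byte count up to a multiple of the alignment and aligning the address that is handed out. A minimal bump-allocator sketch of the latter, assuming a single fixed page; `BumpPage` is a hypothetical class, and only the method name `claim_bytes(bytes, alignment)` is borrowed from the reply above, with no claim that this matches the implementation in #549.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-page bump allocator.
class BumpPage {
 public:
  explicit BumpPage(std::size_t capacity) : storage_(capacity), cursor_(0) {}

  // Returns a pointer aligned to `alignment`, or nullptr if the request does
  // not fit in the remaining space. `alignment` must be a power of two.
  void* claim_bytes(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const auto base = reinterpret_cast<std::uintptr_t>(storage_.data());
    const std::uintptr_t unaligned = base + cursor_;
    // Align the ADDRESS, not the size: move the cursor up to the next
    // multiple of `alignment`.
    const std::uintptr_t aligned =
        (unaligned + alignment - 1) & ~(std::uintptr_t(alignment) - 1);
    const std::size_t start = cursor_ + (aligned - unaligned);
    // Bounds-check every claim against the page.
    if (start > storage_.size() || bytes > storage_.size() - start)
      return nullptr;
    cursor_ = start + bytes;
    return storage_.data() + start;
  }

  std::size_t used() const { return cursor_; }

 private:
  std::vector<std::byte> storage_;
  std::size_t cursor_;
};
```

After a 1-byte claim, a following claim with alignment 64 lands on a 64-byte boundary because the cursor itself is moved. Rounding only the size would leave the start address wherever the previous claim ended.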

evaleev · 2026-05-19T20:48:15Z

+    std::vector<Future<value_type>> done;
+    for (const auto& index : *(pmap())) {
+      if (is_zero(index)) continue;
+      Future<value_type>& fut = find_local(index);
+      Future<value_type> mutated = w.taskq.add(
+          [op](value_type& tile) -> value_type {
+            op(tile);
+            return tile;


value_type here is the outer ToT tile (TA::Tensor<...>), which is a shallow-copy handle — return tile; copies a reference-counted handle (a pointer plus a refcount bump), not the tile's element data, and the resulting Future<value_type> just carries that handle. So this is not an expensive copy. Returning void would be marginally tidier but is neither a correctness nor a performance issue; leaving as-is.

evaleev · 2026-05-19T20:48:18Z

+  std::int64_t fill_local(const V& value = V(), bool skip_set = false) {
+    if constexpr (detail::is_tensor_of_tensor_v<value_type> &&
+                  is_arena_tensor_v<element_type>) {
+      return for_each_local_tile_inplace<fence>([value](value_type& outer) {
+        for (std::size_t o = 0; o < outer.size(); ++o) {
+          auto& cell = outer.data()[o];
+          if (cell.empty()) continue;  // skip deliberately-null cells
+          cell = value;                // deep copy into the bound arena cell
+        }
+      });


Intentional and documented (see the \note on fill_local): for an arena-backed ToT the array is shaped first — every inner cell allocated — and fill_local is then a pure in-place mutator, so the tiles necessarily already exist and skip_set (whose role is to tolerate already-set tiles) has nothing to act on. The non-arena path keeps the standard skip_set semantics. Happy to add an explicit "skip_set is ignored on the arena-ToT branch" note if that reads clearer.

evaleev · 2026-05-19T20:48:20Z

+    std::map<std::size_t, Tile> src_tile_cache;
+    auto source_cell_at =
+        [&](const auto& e) -> const typename Tile::value_type* {
+      if (!source_elements.includes(e)) return nullptr;
+      const auto src_tile_idx = source_array.trange().element_to_tile(e);
+      const auto src_ord =
+          source_array.trange().tiles_range().ordinal(src_tile_idx);
+      auto it = src_tile_cache.find(src_ord);
+      if (it == src_tile_cache.end()) {
+        it = src_tile_cache
+                 .emplace(src_ord, source_array.is_zero(src_tile_idx)
+                                       ? Tile{}
+                                       : source_array.get(src_tile_idx).get())
+                 .first;
      }
-
-      // loop over every target tile combination
-      TA::Range target_tile_ord_extent(target_tile_ord_extent_range);
-      for (auto& target_tile_ord : target_tile_ord_extent) {
-        TA::Index target_tile_idx(rank);
-        container::svector<TA::Range1> target_tile_rngs1(rank);
+      const Tile& st = it->second;
+      if (st.empty()) return nullptr;
+      return &st(e);
+    };


The gather-based retile is deliberate for arena ToT (see the comment opening this branch): the generic scatter path (write_tile_block) would rebind the target's null inner cells onto the source tiles' arena slabs, leaving dangling views once the source array is destroyed — so the target rank must pull source tiles and deep-copy. Fetches are de-duplicated per rank via src_tile_cache, so the cost is O(distinct source tiles)/rank, not per target tile. Batching/prefetching those remote fetches is a worthwhile follow-up optimization but is orthogonal to correctness; noting it for later.
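The de-duplication argument can be checked with a small memoizing wrapper. This sketch is generic C++ and does not use any TiledArray types: `TileCacheSketch` and its `fetch` callback are hypothetical, with the callback standing in for the remote `source_array.get(...).get()` call in the hunk above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical per-rank cache: each distinct source tile is fetched once.
template <typename Tile>
class TileCacheSketch {
 public:
  explicit TileCacheSketch(std::function<Tile(std::size_t)> fetch)
      : fetch_(std::move(fetch)) {}

  // Returns the cached tile for `ord`, fetching it on first use only.
  const Tile& get(std::size_t ord) {
    auto it = cache_.find(ord);
    if (it == cache_.end()) {
      it = cache_.emplace(ord, fetch_(ord)).first;
      ++fetches_;
    }
    return it->second;
  }

  std::size_t fetches() const { return fetches_; }

 private:
  std::function<Tile(std::size_t)> fetch_;
  std::map<std::size_t, Tile> cache_;
  std::size_t fetches_ = 0;
};
```

However many target elements map to the same source tile, the fetch count grows only with the number of distinct source ordinals requested, which matches the O(distinct source tiles) per rank cost stated in the reply.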

evaleev · 2026-05-19T20:48:22Z

+/// `tensor/arena_tensor.h` (`ArenaTensor`, `detail::TensorInterface`) and
+/// `external/btas.h` (`btas::TensorView`). Declared here so the operator-body
+/// predicates below can consult it without including arena_tensor.h.


Fixed in the stacked PR #549 (commit a02f28e): the comment no longer lists TensorInterface as an is_tensor_view specialization (it has none, and is deliberately not a view), and now notes that TensorInterface/TensorMap is excluded because it has value-returning member arithmetic.

zhihao-deng and others added 26 commits May 13, 2026 22:03

arena: add allocator + plan helper + tests

1c757e3

arena_kernels: add ToT kernels + tests

52fdeaa

arena_einsum: regime-A (outer-Hadamard) plans + dispatch + tests

7e6b58a

tensor: route ToT trivial ops through arena kernels + tests

582937a

cont_engine: thread arena plan + zero-overhead sizeof gate

463aa6e

einsum + tests/cases: hook regime-A arena into einsum + add hec_* cas…

d9e6a59

…e binaries

evaleev force-pushed the feature/arena_tensor branch from 8bdc40f to 5db3289 Compare May 19, 2026 09:07

This was referenced May 19, 2026

Arena ToT: incremental multi-page construction #549

Merged

Arena allocator for ToT einsum (cases involving outer-Hadamard) #545

Closed

evaleev requested a review from Copilot May 19, 2026 20:32

Copilot started reviewing on behalf of evaleev May 19, 2026 20:32

Copilot AI reviewed May 19, 2026

evaleev merged commit 11f6e8d into master May 19, 2026
13 checks passed

evaleev deleted the feature/arena_tensor branch May 19, 2026 20:54

evaleev mentioned this pull request May 20, 2026

build: bump TiledArray tag to master (arena PRs) ValeevGroup/SeQuant#519

Merged

1 task

evaleev merged 26 commits into master from feature/arena_tensor
