Arena-backed tensor-of-tensors: 8-byte view inner cells through einsum/contraction #548
- arena_sizeof_invariant_suite: drop platform-specific absolute baselines (328/16/248 were Apple-arm64/libc++ only); keep relative ImplLayoutAllocator == ImplLayoutMaster invariant + monostate static_asserts.
- cont_engine: reset arena_plan_ after std::move into op_ so later reads see "no plan" rather than a moved-from optional.
- arena_kernels: one-line intent note on trivial kernels' tight packing.
Treat ArenaTensor as a first-class tensor in TA's trait machinery so kernel-level dispatches (tensor_reduce, inplace_tensor_op, tensor_op, unary, binary, ...) and operators match the same overloads they do for TA::Tensor<double> -- without bespoke arena overloads in kernels.h.

The key insight: ArenaTensor is structurally a flat contiguous tensor (.data() + .size()); the only reason it was kept out of is_tensor_helper was to avoid being dragged into value-returning operator paths it can't fulfill (a view has no allocator to materialize fresh storage). Address that as a separate gate rather than excluding ArenaTensor from being a tensor at all.

Trait split:
- is_tensor_helper<ArenaTensor> = true. is_contiguous_tensor_helper too.
- is_tensor_view<T> -- new predicate; true for non-owning views that lack value-returning member arithmetic (ArenaTensor, btas::TensorView). TensorInterface is *not* a view here because it materializes a fresh result from member add/subt/mult/etc.
- ta_ops_match_tensor (operators_body.ipp value-returning gate): adds `&& !is_tensor_view_v<T>`; views opt out of `+`, `-`, `*`, unary `-`, tensor*scalar, scalar*tensor, permute*tensor.
- ta_ops_match_tensor_inplace (new; operators_body.ipp compound-assignment gate): default is `is_nested_tensor_v<T>`, which accepts views since they can be mutated in place.
- btas.h specializes both predicates to false for btas::Tensor (was already done for the freestanding one).

ArenaTensor surface:
- Member compound operators (operator+=, -=, *=, scalar *=) routing to free CPOs.
- Member in-place CPO mirrors (add_to, subt_to, mult_to, scale_to, neg_to) so tile_interface paths that call `arg.add_to(other)` work uniformly.
- No member operator+/-/* or value-returning member add/subt/mult/scale/neg/clone. Those would require allocation.

Tensor.h changes:
- value_converter falls through to identity for views (rebind-on-copy, no clone).
- Tensor(range, value) ctor: when value_type is a view, copy each cell by value instead of calling Clone (which delegates to a missing member).
- Permutation paths (Tensor::neg(perm), subt(right, perm), scale(perm), inner-permute in copy-with-perm ctor) bail with TA_EXCEPTION for view inner cells -- views can't permute in place.
- Drop the view-aware Tensor<View>::scale_to/add_to/subt_to/mult_to overloads added in earlier iterations; the legacy ones now work because ArenaTensor has compound operators.
- Tighten one residual ambiguity: subt(Right) const value-returning excludes the arena-pair case (handled by a dedicated overload above).

Kernels.h: drop the ad-hoc tensor_reduce(ArenaTensor) and inplace_tensor_op(ArenaTensor) overloads added in earlier iterations.

Tests:
- New arena_tensor.cpp / arena_tensor_kernels.cpp test suites covering ArenaTensor sizeof invariants, SIMD alignment, member compound operators, in-place CPO mirrors, ToT reductions (sum/product/squared_norm), serialization round-trip, regime-A einsum dispatch, and the in-place ops smoketest.
- Update the is_tensor_view_v predicate test to reflect that TensorInterface is no longer in the view set.
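
The two-gate idea in the trait split can be illustrated with a small self-contained sketch. Everything here (`Owning`, `View`, `is_tensor_v`, `is_view_v`, `has_plus`) is a hypothetical stand-in written for illustration, not TiledArray's actual trait machinery: the value-returning `operator+` is disabled for views, while the compound `operator+=` accepts them.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not TiledArray's real types.
struct Owning {  // owns its storage, so it can materialize a fresh result
  std::vector<double> v;
  double* data() { return v.data(); }
  const double* data() const { return v.data(); }
  std::size_t size() const { return v.size(); }
};
struct View {  // non-owning: pointer + length, no allocator
  double* p = nullptr;
  std::size_t n = 0;
  double* data() { return p; }
  const double* data() const { return p; }
  std::size_t size() const { return n; }
};

template <typename T>
inline constexpr bool is_view_v = std::is_same_v<T, View>;
template <typename T>
inline constexpr bool is_tensor_v =
    std::is_same_v<T, Owning> || std::is_same_v<T, View>;

// Value-returning gate: views opt out because the result needs new storage.
template <typename T,
          std::enable_if_t<is_tensor_v<T> && !is_view_v<T>, int> = 0>
T operator+(const T& a, const T& b) {
  T r = a;
  for (std::size_t i = 0; i < r.size(); ++i) r.data()[i] += b.data()[i];
  return r;
}

// In-place gate: views are accepted because they can be mutated in place.
template <typename T, std::enable_if_t<is_tensor_v<T>, int> = 0>
T& operator+=(T& a, const T& b) {
  for (std::size_t i = 0; i < a.size(); ++i) a.data()[i] += b.data()[i];
  return a;
}

// Detector used only to show which types still see operator+.
template <typename T, typename = void>
struct has_plus : std::false_type {};
template <typename T>
struct has_plus<T, std::void_t<decltype(std::declval<const T&>() +
                                        std::declval<const T&>())>>
    : std::true_type {};
```

With these definitions `has_plus<Owning>::value` is true and `has_plus<View>::value` is false, yet `view_a += view_b` still compiles and mutates the viewed storage.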
Introduce a fused in-place AXPY CPO so the einsum generic fallback and the contraction engine's scale inner-product path no longer build a scaled temporary tile. The temporary required value-returning `scale`, which a view tile type (ArenaTensor) cannot provide -- it has no allocator. The fused form needs only in-place mutation, which views support.

New CPO (tile_interface/add.h):
- axpy_to(result, arg, factor) -- result += arg * factor
- axpy_to(result, arg, factor, perm) -- result += (perm ^ arg) * factor

Semantics are BLAS AXPY: `_to` marks the in-place form (cf. add_to vs add). Distinct from add_to(result, arg, factor), whose legacy semantics is `(result + arg) * factor` -- that one scales the accumulated result too, so it is not a drop-in for fused accumulation.

Member overloads:
- Tensor::axpy_to(right, factor[, perm]) -- the lambda body dispatches by element type so one body serves flat and ToT tensors (leaf: l += r*f; cell: l.axpy_to(r, f)). The perm form bails for view inner cells.
- ArenaTensor::axpy_to(other, factor[, perm]) -- routes to the free arena axpy_to; perm form rejects (views cannot permute in place).
- has_member_function_axpy_to_anyreturn detector generated in type_traits.h.

Renamed the existing arena free function `axpy(dst, alpha, src)` to `axpy_to(dst, src, alpha)` -- it is in-place, so the `_to` suffix and the `(result, arg, factor)` argument order now match TA's CPO convention.

einsum/tiledarray.h: the generic-fallback mixed scalar x ToT loop now does `axpy_to(el, tensor, scalar)` instead of `add_to(el, scale(tensor, scalar))`.

cont_engine.h: the scale inner-product fallback drops the value-returning `scal_op` lambda. The Contraction outer product is now a fused `axpy_to(result, tot, scalar[, perm])` (this is the long-standing "TODO implement X-permuting AXPY"); the Hadamard outer product keeps the value-returning `scale` assignment, gated to non-view result cells.

Tests: axpy_to correctness for flat Tensor<double> and ToT Tensor<ArenaTensor> (verifies axpy semantics -- factor scales only the added operand).
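
To make the difference between the two semantics concrete, here is a minimal sketch over plain `std::vector<double>`. The function names mirror the commit message, but the bodies are illustrative stand-ins and `add_to_scaled` is a hypothetical name for the legacy `(result + arg) * factor` form; neither is TiledArray's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused in-place AXPY: result += arg * factor. No scaled temporary is built,
// and factor touches only the added operand.
inline void axpy_to(std::vector<double>& result,
                    const std::vector<double>& arg, double factor) {
  assert(result.size() == arg.size());
  for (std::size_t i = 0; i < result.size(); ++i) result[i] += arg[i] * factor;
}

// Legacy-style scaled accumulate (hypothetical name):
// result = (result + arg) * factor, so the existing contents are scaled too.
inline void add_to_scaled(std::vector<double>& result,
                          const std::vector<double>& arg, double factor) {
  assert(result.size() == arg.size());
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = (result[i] + arg[i]) * factor;
}
```

Starting from `result = {1, 2}`, `arg = {10, 20}` and `factor = 2`, the fused form yields `{21, 42}` while the legacy form yields `{22, 44}`, which is why the latter is not a drop-in for accumulation.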
- merge arena_tensor_kernels.h into arena_kernels.h; add make_nested_tile, a two-pass ToT outer-tile builder that dispatches on arena vs owning inner tiles (declare inner ranges -> allocate one slab -> fill cells)
- DistArray: add the ToT range_fn constructor and init_tiles_nested; route fill_local/fill/fill_random/init_elements/set through arena-aware tile builders (for_each_local_tile_inplace, make_arena_nested_tile) for Tensor<ArenaTensor> inners
- ArenaTensor::operator=: two regimes -- deep element copy when the assignee is bound, shallow rebind when null; drop the move-assignment so rvalues follow the same regimes
- type_traits: add default_freestanding_tensor (view -> owning-tensor map)
- tests/tot_construction.cpp: construction, fill, init_elements, set, and a multi-rank serialization round-trip, over plain and arena ToT inner tiles
- einsum.cpp: breadcrumb on the pre-existing different_nested_ranks abort
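
The two assignment regimes described for ArenaTensor::operator= can be sketched with a toy view type. `CellView` and its members are hypothetical and only model the behavior stated above (deep copy into a bound assignee, shallow rebind of a null one); the real class carries more state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical pointer-plus-length view; not the real ArenaTensor layout.
struct CellView {
  double* data = nullptr;
  std::size_t size = 0;

  CellView() = default;
  CellView(double* d, std::size_t n) : data(d), size(n) {}
  CellView(const CellView&) = default;  // copy construction stays shallow

  bool bound() const { return data != nullptr; }

  // Declaring only the copy assignment suppresses the implicit move
  // assignment, so rvalues go through the same two regimes.
  CellView& operator=(const CellView& other) {
    if (this == &other) return *this;
    if (bound()) {
      // Regime 1: assignee already points at storage -> deep element copy.
      assert(size == other.size);
      std::copy(other.data, other.data + other.size, data);
    } else {
      // Regime 2: assignee is null -> shallow rebind onto other's storage.
      data = other.data;
      size = other.size;
    }
    return *this;
  }
};
```

Assigning into a bound view leaves its pointer unchanged and overwrites the viewed elements; assigning into a default-constructed view makes it alias the source storage.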
Route the elementwise binary/unary tile ops on Tensor<ArenaTensor> through the arena binary kernel so c = a + b, a - b, scaled variants, scalar scaling, and negation work for arena-inner tensors-of-tensors via the expression DSL. add gains dedicated arena overloads (factor / perm / factor+perm); subt and mult gain in-body arena branches. Permuted ToT ops fall through to the non-permuted kernel on a trivial permutation and throw otherwise, since cell reordering of arena inners is not yet supported. Extend tot_construction.cpp with a generic run_tot_expr harness and add/subt/mult/scale/neg/scaled-add/scaled-subt cases for both Tensor and ArenaTensor inners. Hadamard ToT mult for arena inners is left unwired -- it routes through MultEngine/ContEngine, which needs a value-returning inner tile op an arena cell cannot provide; tracked by TODO(arena-tot-mult).
… ops
MultEngine::init_struct always calls ContEngine::init_inner_tile_op, which
unconditionally instantiates a value-returning inner mult/contraction op on
the inner cell type. A non-owning view inner cell (ArenaTensor) cannot host
such an op, so a("i;j")*b("i;j") on DistArray<Tensor<ArenaTensor>> failed to
compile -- even though that inner op is dead code for a pure Hadamard:
MultEngine::make_tile_op passes no inner op and the outer Mult tile op
already recurses through Tensor<ArenaTensor>::mult.
Split init_inner_tile_op into a thin dispatcher and an
init_inner_tile_op_owning_ helper holding the existing owning-cell builder.
The one case that
cannot instantiate that helper -- ToT x ToT with a view inner cell -- is
handled directly: pure Hadamard needs no inner op, everything else throws
(deferred). tot x t scale paths keep their existing view-aware handling
since tot_x_tot is false there. Re-enables the mult_arena_inner test.
Adds test_tot_einsum_contraction -- runs TA::einsum("ij;mo,ij;on->ij;mn")
(outer Hadamard, inner contraction) and checks the result against a
Tensor<Tensor<double>> reference run of the same expression and data.
Wired up for a TA::Tensor inner cell. The ArenaTensor-inner case is left
as a TODO: TA::einsum does not compile for DistArray<Tensor<ArenaTensor>>
because the legacy per-cell einsum path (tensor_hadamard / tensor_contract
and the element_*_op lambdas in einsum/tiledarray.h) value-returns inner
tensors and calls ArenaTensor::mult / ArenaTensor::permute. That legacy
path coexists with the regime-A arena path and is still instantiated; it
needs to be if-constexpr-guarded out for view inner cells.
TA::einsum for DistArray<Tensor<ArenaTensor>> did not compile: the
outer-Hadamard ("hadamard reduction") branch's legacy per-cell ToT x ToT
path value-returns inner tensors (tensor_hadamard / tensor_contract via the
element_*_op lambdas) and calls ArenaTensor::mult / ArenaTensor::permute,
none of which a non-owning view inner cell supports. That path coexists
with the regime-A arena fast path (run_regime_a_arena, tried first) and was
still being instantiated.
if-constexpr-guard the legacy ToT x ToT element loop out for
is_tensor_view_v inner cells. For a view inner cell only the regime-A path
can produce results; if it was inactive (a permuted inner contraction --
see TODO(arena-einsum-perm)) the guard throws instead of falling through.
The tot x t scale path is untouched: it uses in-place axpy_to and is
already view-safe.
Switches the einsum_contraction_arena_inner test to a regime-A annotation
("ijk;mo,ijk;on->ij;mn" -- outer Hadamard + outer contraction + inner
contraction); a pure-Hadamard outer is delegated to the expression DSL
instead. The arena result matches the Tensor<Tensor<double>> reference.
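
The guard pattern described above can be sketched in isolation. `OwningCell`, `ViewCell`, `legacy_hadamard`, and `legacy_fallback` are hypothetical names for illustration; the point is only that the discarded `if constexpr` branch is never instantiated for a view cell, so the value-returning legacy call does not need to compile for it, and the view case reports the problem at run time instead.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical cell types for illustration.
struct OwningCell {
  std::vector<double> v;
};
struct ViewCell {
  double* p;
  std::size_t n;
};

template <typename T>
inline constexpr bool is_view_v = std::is_same_v<T, ViewCell>;

// Legacy per-cell op: value-returns a new cell, so it only exists for the
// owning type. There is deliberately no ViewCell overload.
inline OwningCell legacy_hadamard(const OwningCell& a, const OwningCell& b) {
  OwningCell r{a.v};
  for (std::size_t i = 0; i < r.v.size(); ++i) r.v[i] *= b.v[i];
  return r;
}

// Fallback taken when the fast path did not run.
template <typename Cell>
void legacy_fallback(Cell& result, const Cell& a, const Cell& b) {
  if constexpr (is_view_v<Cell>) {
    // A view cell has no legacy path: fail loudly instead of falling through.
    throw std::logic_error("view inner cells require the arena fast path");
  } else {
    result = legacy_hadamard(a, b);
  }
}
```

Instantiating `legacy_fallback<ViewCell>` compiles even though `legacy_hadamard` has no `ViewCell` overload, because the `else` branch is discarded for that instantiation.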
arena_inner_permute<OuterTensor>(src, inner_perm) builds a fresh slab-backed ToT tile with the same outer layout as src, but with every inner cell's range and data permuted by inner_perm (result_cell(inner_perm * i) == src_cell(i)). It is the slab-level counterpart of a per-cell permute: the owning tile allocates one new slab via arena_outer_init and rewrites each cell with a strided scatter, so no non-owning view inner cell is ever asked to value-return. This is the primitive the owning Tensor<ArenaTensor>::permute and the regime-A einsum hoist will use to handle inner-mode permutations without a per-cell permute(const ArenaTensor&). Tests cover a transpose and a rank-3 permutation over non-uniform inner cells, checked against a hand-rolled reference.
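
The relation result_cell(inner_perm * i) == src_cell(i) can be sketched for a single dense row-major cell. This is a simplified illustration, not the slab-level routine: it allocates a fresh `std::vector` per cell rather than one slab per tile, and it assumes the convention that applying a permutation sends entry `k` of an index to position `perm[k]`. The helper names (`apply_perm`, `ordinal`, `permute_cell`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Extents = std::vector<std::size_t>;

// Assumed convention: out[perm[k]] = in[k].
inline Extents apply_perm(const std::vector<std::size_t>& perm,
                          const Extents& in) {
  Extents out(in.size());
  for (std::size_t k = 0; k < in.size(); ++k) out[perm[k]] = in[k];
  return out;
}

// Row-major ordinal of idx within a box of the given extents.
inline std::size_t ordinal(const Extents& ext, const Extents& idx) {
  std::size_t ord = 0;
  for (std::size_t k = 0; k < ext.size(); ++k) ord = ord * ext[k] + idx[k];
  return ord;
}

// Scatter src into a new buffer so that dst[perm * i] == src[i].
inline std::vector<double> permute_cell(const std::vector<double>& src,
                                        const Extents& src_ext,
                                        const std::vector<std::size_t>& perm) {
  const Extents dst_ext = apply_perm(perm, src_ext);
  std::vector<double> dst(src.size());
  Extents idx(src_ext.size(), 0);
  for (std::size_t ord = 0; ord < src.size(); ++ord) {
    dst[ordinal(dst_ext, apply_perm(perm, idx))] = src[ord];
    // Advance idx to the next element in row-major order.
    for (std::size_t k = idx.size(); k-- > 0;) {
      if (++idx[k] < src_ext[k]) break;
      idx[k] = 0;
    }
  }
  return dst;
}
```

For a 2x3 cell holding 1..6 and `perm = {1, 0}`, the result is the 3x2 transpose stored as `{1, 4, 2, 5, 3, 6}`.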
Tensor<ArenaTensor>::permute previously routed through the generic Tensor(other, perm) ctor, whose allocate-then-fill shape does not fit the arena slab model: its inner-permute pass threw for view inner cells, and its outer-permute path did not co-own the source slab. Route arena-inner permute around that ctor: the outer part reorders the 8-byte cell views shallowly (sharing the source slab via keep-alive) with arena_permute_shallow; a non-trivial inner part rewrites every cell into a fresh slab with arena_inner_permute. This makes the owning tile's permute carry out the inner-mode permutation -- no per-cell permute(ArenaTensor), no view value-returns. Also wires up arena_permute_shallow, which existed but was previously unused. Test: a bipartite (outer + inner) transpose of a Tensor<ArenaTensor> tile.
regime-A einsum previously bailed (plan inactive -> legacy fallback, which
cannot run for ArenaTensor inners) whenever an inner contraction was not
canonically aligned -- i.e. TensorContractionPlan::do_perm.{A,B,C} set.
Instead of bailing, run_regime_a_arena now hoists the inner permutations to
slab-level rewrites: each operand tile whose cells need reordering is run
through arena_inner_permute (by c_plan.perm.{A,B}) before the per-cell op,
and the result tile is run through arena_inner_permute (by perm.C.inv())
after accumulation. The per-cell op is therefore always a single canonical
GEMM -- accumulate() now calls fused_contraction_inplace with the plan's
(canonical) GemmHelper directly. No per-cell view permute, no value-return;
tensor_contract / tensor_contract_to are untouched.
This unifies the path for arena and plain TA::Tensor inner cells (both are
arena-eligible outers). Tests: a canonical and a non-canonical
("ijk;om,ijk;on->ij;nm") inner contraction, arena and TA::Tensor inners,
each checked against a Tensor<Tensor<double>> reference.
regime-A's inner-Hadamard path previously only accepted the no-perm and perm_b branches, and even perm_b was wrong: fused_hadamard_inplace does a flat r += l * rr, which aborts the is_range_set_congruent check whenever an operand's inner cells are not already in C-layout. Treat a non-canonical inner Hadamard the same as a non-canonical inner contraction: run_regime_a_arena now hoists each operand's inner permutation (h_plan.perm.AC / perm.BC) to a slab-level arena_inner_permute rewrite, so both operands reach C-layout before the per-cell op. This fixes the pre-existing perm_b bug and adds perm_to_c / perm_a / else coverage; make_regime_a_arena_plan no longer bails on those branches. Adds regime-A-exercising cases to einsum_manual/equal_nested_ranks (permuted inner contraction and permuted inner Hadamard, no outer permutation) so manual_eval is an independent oracle, plus matching arena/Tensor inner cases in tot_construction.
The 3-arg Tensor::mult(right, perm) only had an arena branch for the
tot x tot case; a plain scalar tile times an arena ToT tile (t x tot)
fell through to the generic binary() path, whose element op evaluates
`scalar * ArenaTensor` as a value. ArenaTensor is a non-owning view and
has no value-returning operator*, so that failed to compile.
This surfaced from MPQC: TA::einsum(plain_array, tot_array, ...) recurses
into einsum's pure-Hadamard branch (C = A * B), which the expression DSL
lowers to a Mult tile op of Tensor<double> x Tensor<ArenaTensor<double>>.
Route t x tot through the existing 2-arg arena overload (scales each
inner cell into a fresh slab) and apply any non-trivial result
permutation as a shallow outer reindex of that slab.
Adds einsum_t_x_tot_{tensor,arena}_inner cases checking the arena path
against an identical Tensor<double>-inner reference run.
Three gaps surfaced wiring TA::Tensor<TA::ArenaTensor<T>> through MPQC's CSV coupled-cluster layer:
- The 3-arg Tensor::mult(right, perm) handled t x tot but not the mirror tot x t (arena ToT tile times a plain scalar tile); it fell through to the generic binary() whose element op needs a value-returning ArenaTensor * scalar. Route it through the 2-arg arena overload like t x tot.
- ArenaTensor had no sum(): einsum's DeNest path (sum_tot_2_tos) reduces each inner cell to a scalar via the free sum(), which calls .sum(). A scalar reduction allocates nothing, so it is valid on a view.
- size_of(Tensor<ArenaTensor>) recurses per inner cell into size_of(ArenaTensor), which did not exist. Add an overload counting the one-pointer view plus its in-arena cell; summed over the outer tile this accounts for the slab.
Make Tensor<ArenaTensor> tiles work through the einsum and SUMMA contraction machinery, as exercised by MPQC's CSV-CCk integral transformation:
- einsum: hoist the temporary sub-World vector ahead of the AB/C ArrayTerms so a sub-World outlives the DistArrays bound to it; lazy_deleter would otherwise dereference a destroyed World while unwinding.
- einsum regime-A: drop the over-conservative bail on a non-identity result permutation -- run_regime_a_arena already applies it (tile.permute(pc), as the legacy path does).
- cont_engine: support inner contraction (incl. inner outer-product) for view inner cells, routed through the arena fast path. A non-identity inner result permutation rides on op_'s post-processing permute step rather than the (perm-free) fused inner op.
- contraction SUMMA: union-shape the arena ToT result across K-panels via reserve_and_construct / grow_to_cover.
- make_with_new_trange: arena-ToT-aware retile that pulls source tiles and deep-copies inner cells instead of rebinding source arena slabs.
…nsume
Completes Tensor<ArenaTensor> support through the expression engine:
- MultEngine::make_tile_op routes a Hadamard-outer / contraction-inner ToT product to a whole-tile arena op (arena_hadamard_tile_op_) instead of a value-returning per-cell op, which a view inner cell cannot host.
- ContEngine builds that op as arena_hadamard_inner_contract: a fresh slab-backed result shaped per-cell by the inner GEMM, with the inner result permutation applied as a slab-level post-pass.
- Mult gains a whole-tile-op constructor (tile_op_tag); eval() delegates to it when set.
- Mult's consume eval overloads no longer do in-place mult_to on view-inner-cell (ArenaTensor) tiles: such a tile is a shallow handle whose arena slab may be aliased by a persistent array, so consuming it corrupts that operand. View-cell tiles always produce a fresh result.
einsum's special-Hadamard branch replicates the small operand via replicate_array -> make_replicated -> replicate_tensor. Two gaps broke this for an arena tensor-of-tensors on a multi-rank run:
- Replicator's ctor called pmap()->local_size() purely as a reserve() hint, which asserts on a pmap that does not precompute local size (e.g. HashPmap). Skip the hint when known_local_size() is false.
- replicate_tensor std::copy'd the 8-byte ArenaTensor view cells, so the replicated tile aliased the source tile's arena slab and dangled once the (temporary, post-make_replicated) source array was destroyed. Build the replicated tile as a fresh slab-backed tile and deep-copy each replicated inner cell's element data.
ContractionArenaPlan::operand_inner_ranges declared its return type as std::vector<Result::value_type::range_type>, ill-formed when Result is a plain (non-ToT) tensor. make_contraction_arena_plan names the class in its return type unconditionally (before an if-constexpr returns nullopt for non-ToT), so the bad type was a hard error -- it broke the ta_test build. Give operand_inner_ranges a deduced return type so the type is only formed when the function is actually instantiated.
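
The failure mode and the fix can be reproduced with a small sketch. All names here (`Plan`, `make_plan`, `ToT`, `Plain`, `is_tot`) are hypothetical stand-ins. A member function's declared return type is formed when the class template is instantiated, whereas with a deduced (`auto`) return type the dependent type is only formed if the member is actually instantiated.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for illustration.
struct Range {};
struct InnerTensor {
  using range_type = Range;
  using value_type = double;
};
struct ToT {
  using value_type = InnerTensor;  // tensor-of-tensors: inner cell is a tensor
};
struct Plain {
  using value_type = double;  // plain tensor: double has no range_type
};

template <typename T, typename = void>
struct is_tot : std::false_type {};
template <typename T>
struct is_tot<T, std::void_t<typename T::value_type::range_type>>
    : std::true_type {};

template <typename Result>
struct Plan {
  // Spelling the return type out as
  //   std::vector<typename Result::value_type::range_type>
  // would make instantiating Plan<Plain> a hard error. With `auto`, the type
  // is only formed when this member is actually called.
  auto operand_inner_ranges() const {
    return std::vector<typename Result::value_type::range_type>(2);
  }
};

// The return type names Plan<Result> unconditionally, so Plan<Plain> must be
// a valid class even though the non-ToT branch returns nullopt.
template <typename Result>
std::optional<Plan<Result>> make_plan() {
  if constexpr (!is_tot<Result>::value) {
    return std::nullopt;
  } else {
    return Plan<Result>{};
  }
}
```

`make_plan<Plain>()` now compiles and returns an empty optional, while `make_plan<ToT>()` returns a plan whose `operand_inner_ranges()` is instantiated on demand.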
Tensor::axpy_to(arg, factor, perm) went straight to inplace_binary, which asserts the target is non-empty. The first contribution into an unallocated tensor -- e.g. a tensor-of-tensors contraction result inner cell, when the inner op carries a permutation -- therefore aborted at kernels.h:395. Mirror the non-permuting axpy_to overload: when the target is empty, initialize it to factor * (perm ^ arg). Fixes the einsum_manual/different_nested_ranks "ik;mn,j->ijk;nm" case (mixed-nested-rank outer product with an inner-mode permutation) and drops its TODO(tot-einsum-empty-result) breadcrumb.
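
A minimal sketch of the empty-target rule, using a toy row-major matrix with a transpose standing in for the permutation. `Matrix` and `axpy_to_transposed` are hypothetical names; the sketch zero-fills a freshly shaped target and then accumulates, which gives the same values as initializing it to factor * (perm ^ arg).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy row-major matrix; an empty data vector means "not yet allocated".
struct Matrix {
  std::size_t rows = 0, cols = 0;
  std::vector<double> data;
  bool empty() const { return data.empty(); }
};

// result += factor * transpose(arg). The transpose plays the role of perm.
inline void axpy_to_transposed(Matrix& result, const Matrix& arg,
                               double factor) {
  if (result.empty()) {
    // First contribution: shape the target from the permuted argument
    // instead of asserting that it is non-empty.
    result.rows = arg.cols;
    result.cols = arg.rows;
    result.data.assign(arg.data.size(), 0.0);
  }
  assert(result.rows == arg.cols && result.cols == arg.rows);
  for (std::size_t i = 0; i < arg.rows; ++i)
    for (std::size_t j = 0; j < arg.cols; ++j)
      result.data[j * result.cols + i] += factor * arg.data[i * arg.cols + j];
}
```

The first call on a default-constructed `Matrix` allocates and fills it; later calls accumulate into the existing storage.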
ArenaTensor carried both a member `operator*=(Scalar)` and was accepted by the free `operator*=(T&&, N)` in operators_body.ipp (views satisfy ta_ops_match_tensor_inplace). The two template candidates tie under gcc-13's overload resolution, so `arena_cell *= scalar` -- exercised by Tensor<ArenaTensor>::scale_to / subt_to -- fails to compile as ambiguous on that toolchain (newer gcc and clang pick a candidate and build). Remove the member; the free operator is the single provider of `view *= scalar`, matching how TA::Tensor itself relies on it. The tensor-tensor compound members stay -- they are non-template and win cleanly over the free templates.
8bdc40f to 5db3289
Pull request overview
This PR introduces an arena-backed tensor-of-tensors tile type (TA::Tensor<TA::ArenaTensor<T>>) and integrates it through TiledArray’s expression-DSL, einsum, and contraction machinery to drastically reduce inner-cell footprint while preserving existing ToT semantics.
Changes:
- Adds arena infrastructure + `ArenaTensor` view semantics and type traits to distinguish owning tensors from non-owning views during operator dispatch.
- Extends tile ops and einsum/contraction engines to correctly handle view inner-cells (including avoiding unsafe in-place Hadamard paths and enabling slab-aware contraction planning).
- Updates DistArray construction/replication/retile paths for arena-ToT correctness and adds extensive unit + case-level test coverage.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/TiledArray/tensor/arena.h | Adds arena allocator/resource utilities used to back arena-ToT slabs and related helpers. |
| src/TiledArray/tensor/arena_tensor.h | Defines ArenaTensor (non-owning view tensor) behavior and kernel/operator hooks. |
| src/TiledArray/tensor/type_traits.h | Introduces is_tensor_view_v and related traits to gate operator dispatch for view tensors. |
| src/TiledArray/dist_array.h | Adds/adjusts ToT constructors and in-place mutation paths (fill*, set, init_elements, etc.) for arena-ToT. |
| src/TiledArray/array_impl.h | Special-cases retile/reshape for arena-ToT to avoid dangling views by deep-copy rebuild. |
| src/TiledArray/tile_op/mult.h | Routes view-result Hadamard ops through whole-tile ops to avoid slab alias corruption. |
| src/TiledArray/tile_op/contract_reduce.h | Adds arena-aware contraction reduction support (plan storage + growth/merge handling). |
| src/TiledArray/einsum/tiledarray.h | Integrates arena-ToT into einsum paths (replication/deep-copy, regime-A dispatch, lifetime fixes). |
| src/TiledArray/einsum/cont_engine.h | Wires arena planning/dispatch into contraction engine execution. |
| src/TiledArray/einsum/mult_engine.h | Extends mult engine behavior for outer-Hadamard × inner-contraction patterns with arena-ToT. |
| src/TiledArray/tensor/kernels.h | Adds/updates kernel entry points used by arena-aware contraction paths. |
| tests/tot_construction.cpp | New end-to-end tests for ToT construction + expression-DSL/einsum behavior for Tensor and ArenaTensor inners. |
| tests/arena.cpp | Adds unit tests for arena allocator/resource behavior and invariants. |
| tests/arena_tensor.cpp | Adds unit tests for ArenaTensor view semantics and operations. |
| tests/einsum.cpp | Extends existing einsum tests to cover new nested-rank / arena-ToT scenarios. |
| tests/cases/CMakeLists.txt (and tests/cases/*) | Adds case binaries/targets for additional coverage/validation scenarios. |
    std::shared_ptr<T[]> slice(std::size_t offset, std::size_t /*n_elem*/) const {
      TA_ASSERT(slab_);
      TA_ASSERT(offset % alignof(T) == 0);
      TA_ASSERT(offset <= capacity_);

    auto h = arena_->claim<std::byte>(arena_align_up(bytes, alignment));
    return h.get();

Valid catch — and fixed in the stacked PR #549: the rewritten Arena exposes claim_bytes(bytes, alignment) that aligns the bump cursor to the request, and ArenaResource::do_allocate calls it, so the requested alignment is honored.
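
For context, the difference between rounding the request size up and aligning the bump cursor can be sketched as follows. `BumpArena` is a hypothetical toy class; only the name `claim_bytes(bytes, alignment)` is taken from the reply above, and the body is an assumption about what "aligns the bump cursor to the request" might look like, not the code in the stacked PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bump allocator; illustration only.
class BumpArena {
 public:
  explicit BumpArena(std::size_t capacity) : buf_(capacity), cursor_(0) {}

  // Returns storage for `bytes` bytes whose address is a multiple of
  // `alignment` (assumed to be a power of two), or nullptr if exhausted.
  void* claim_bytes(std::size_t bytes, std::size_t alignment) {
    const auto base = reinterpret_cast<std::uintptr_t>(buf_.data());
    const std::uintptr_t cur = base + cursor_;
    // Align the address itself, not just the size of the request.
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (cur + mask) & ~mask;
    const std::size_t start = static_cast<std::size_t>(aligned - base);
    if (start > buf_.size() || bytes > buf_.size() - start) return nullptr;
    cursor_ = start + bytes;
    return buf_.data() + start;
  }

 private:
  std::vector<std::byte> buf_;
  std::size_t cursor_;
};
```

After an odd-sized claim, a subsequent `claim_bytes(16, 64)` still returns a 64-byte-aligned address, which rounding only the byte count up would not guarantee.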
    std::vector<Future<value_type>> done;
    for (const auto& index : *(pmap())) {
      if (is_zero(index)) continue;
      Future<value_type>& fut = find_local(index);
      Future<value_type> mutated = w.taskq.add(
          [op](value_type& tile) -> value_type {
            op(tile);
            return tile;

value_type here is the outer ToT tile (TA::Tensor<...>), which is a shallow-copy handle — return tile; copies a reference-counted handle (a pointer plus a refcount bump), not the tile's element data, and the resulting Future<value_type> just carries that handle. So this is not an expensive copy. Returning void would be marginally tidier but is neither a correctness nor a performance issue; leaving as-is.
    std::int64_t fill_local(const V& value = V(), bool skip_set = false) {
      if constexpr (detail::is_tensor_of_tensor_v<value_type> &&
                    is_arena_tensor_v<element_type>) {
        return for_each_local_tile_inplace<fence>([value](value_type& outer) {
          for (std::size_t o = 0; o < outer.size(); ++o) {
            auto& cell = outer.data()[o];
            if (cell.empty()) continue;  // skip deliberately-null cells
            cell = value;  // deep copy into the bound arena cell
          }
        });

Intentional and documented (see the \note on fill_local): for an arena-backed ToT the array is shaped first — every inner cell allocated — and fill_local is then a pure in-place mutator, so the tiles necessarily already exist and skip_set (whose role is to tolerate already-set tiles) has nothing to act on. The non-arena path keeps the standard skip_set semantics. Happy to add an explicit "skip_set is ignored on the arena-ToT branch" note if that reads clearer.
    std::map<std::size_t, Tile> src_tile_cache;
    auto source_cell_at =
        [&](const auto& e) -> const typename Tile::value_type* {
      if (!source_elements.includes(e)) return nullptr;
      const auto src_tile_idx = source_array.trange().element_to_tile(e);
      const auto src_ord =
          source_array.trange().tiles_range().ordinal(src_tile_idx);
      auto it = src_tile_cache.find(src_ord);
      if (it == src_tile_cache.end()) {
        it = src_tile_cache
                 .emplace(src_ord, source_array.is_zero(src_tile_idx)
                                       ? Tile{}
                                       : source_array.get(src_tile_idx).get())
                 .first;
      }
      const Tile& st = it->second;
      if (st.empty()) return nullptr;
      return &st(e);
    };

    // loop over every target tile combination
    TA::Range target_tile_ord_extent(target_tile_ord_extent_range);
    for (auto& target_tile_ord : target_tile_ord_extent) {
      TA::Index target_tile_idx(rank);
      container::svector<TA::Range1> target_tile_rngs1(rank);

The gather-based retile is deliberate for arena ToT (see the comment opening this branch): the generic scatter path (write_tile_block) would rebind the target's null inner cells onto the source tiles' arena slabs, leaving dangling views once the source array is destroyed — so the target rank must pull source tiles and deep-copy. Fetches are de-duplicated per rank via src_tile_cache, so the cost is O(distinct source tiles)/rank, not per target tile. Batching/prefetching those remote fetches is a worthwhile follow-up optimization but is orthogonal to correctness; noting it for later.
    /// `tensor/arena_tensor.h` (`ArenaTensor`, `detail::TensorInterface`) and
    /// `external/btas.h` (`btas::TensorView`). Declared here so the operator-body
    /// predicates below can consult it without including arena_tensor.h.

Summary
Deploys `TA::Tensor<TA::ArenaTensor<T>>`, a tensor-of-tensors (ToT) tile type whose inner cells are 8-byte non-owning views into an arena "slab" owned by the outer tile, through TiledArray's einsum and contraction machinery. This shrinks the ToT inner-tile footprint dramatically (from ~304 B to ~8 B per cell) while keeping `Tensor<ArenaTensor>` behaving like an ordinary `Tensor<Tensor>` at the expression-DSL and `einsum` level.

Highlights:
- … construction kernels (`arena_outer_init`, `make_nested_tile`, `arena_trivial_{unary,binary}`); arena-aware `fill`/`set`/`init_elements`.
- … `add`/`subt`/`scale`/`neg`, and a corrected `Mult` consume path: in-place `mult_to` on shallow arena-ToT operand tiles corrupts shared slabs, so view-cell results are routed through a fresh-result whole-tile op (`uses_tile_op_`).
- … and dispatch; ToT×ToT Hadamard with view inner cells routed via outer ops; permuted inner contractions handled via slab-level hoist; MultEngine support for Hadamard-outer × inner-contraction.
- … `replicate_array` path now builds fresh slab-backed tiles instead of copying 8-byte views (which would dangle), and the `Replicator` reserve hint is guarded for pmaps without a known local size (`HashPmap`).
- … `Tensor::axpy_to` now initializes an empty target instead of asserting in `inplace_tensor_op`; this also fixes a pre-existing `einsum` abort on `different_nested_ranks`.

Test plan
- … `tiledarray/unit/run-np-1`) passes 100%.
- … `different_nested_ranks` einsum case (previously aborting) passes.
- … including the np=2 `he10-csv-cck` case.
- … `ContractionArenaPlan` not nameable for non-ToT operands, is fixed in 6e6e8efd6).