feat: SoA binary streaming output for high-throughput pipelines #206

Open
filiprumenovski wants to merge 2 commits into CompOmics:master from filiprumenovski:feat/binary-soa-output

filiprumenovski commented May 10, 2026

What

Adds --format BinarySoa (4) — a binary output that emits each scan as [128-byte header][filter string][f64 mz array][f32 intensity array][optional trailer dump]. Format is documented in BINARY_SOA_FORMAT.md.

Both --stdout and --output produce the same byte format.

Why

For pipelines that consume TRFP output and don't need mzML's portability, the XML round-trip is meaningful overhead. This format skips it — downstream consumers cast the bytes into native arrays directly.

Implementation notes

  • Output wrapped in a 1 MB BufferedStream. On a 3.7 GB Orbitrap DDA file this dropped sys time from ~72s to ~3s vs unbuffered.
  • mz array goes out as a span over the existing double[], no copy.
  • Intensity narrowing (f64→f32) uses ArrayPool<float>.Shared and a tight loop the JIT vectorizes.
  • 128-byte fixed scalar header per scan is built into a reusable buffer with inline little-endian writers.
  • The optional trailer dump captures every key/value pair from ILogEntryAccess verbatim, so consumers can pick out instrument-specific fields (lock-mass calibration, AGC, conversion params, etc.) without us having to enumerate them.
  • EThcD detection: uses FindLastReaction + SupplementalActivation == TriState.On to tag the spectrum as 5 (EThcD) rather than the supplemental HCD/CID's type.
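The peak-array notes above (span over the existing `double[]`, pooled f64→f32 narrowing) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `PeakArraySketch` and `WritePeaks` are hypothetical names, and only standard .NET APIs (`MemoryMarshal.AsBytes`, `ArrayPool<float>.Shared`) are used.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper; names are illustrative and not taken from the PR.
public static class PeakArraySketch
{
    // Writes mz as raw f64 bytes, then intensity narrowed to f32.
    public static void WritePeaks(Stream output, double[] mz, double[] intensity)
    {
        if (mz.Length != intensity.Length)
            throw new ArgumentException("mz and intensity must have the same length");

        // AsBytes reinterprets memory in native byte order, so guard the
        // little-endian assumption explicitly.
        if (!BitConverter.IsLittleEndian)
            throw new PlatformNotSupportedException("sketch assumes a little-endian host");

        // View the existing double[] as bytes; no intermediate copy is made.
        output.Write(MemoryMarshal.AsBytes(mz.AsSpan()));

        int n = intensity.Length;
        // Rent may return a larger array than requested; only the first n slots are used.
        float[] rented = ArrayPool<float>.Shared.Rent(n);
        try
        {
            for (int i = 0; i < n; i++)
                rented[i] = (float)intensity[i];

            output.Write(MemoryMarshal.AsBytes(rented.AsSpan(0, n)));
        }
        finally
        {
            ArrayPool<float>.Shared.Return(rented);
        }
    }
}
```

To get the write coalescing described in the first bullet, the destination would be wrapped once, for example `new BufferedStream(Console.OpenStandardOutput(), 1 << 20)`, and passed as `output`.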

Compatibility

Strictly additive:

  • New OutputFormat.BinarySoa = 4 enum value, inserted before None (so None's numeric code moves from 4 to 5). Existing format paths are untouched.
  • SpectrumWriter.ConfigureWriter adds a branch for binary destinations (raw FileStream, no text-encoding wrapper). --gzip works on file output.
  • CLI -f/--format help updated. Numeric and case-insensitive name parsing inherits the existing ParseToEnum helper, so --format 4, --format BinarySoa, --format binarysoa all work.
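As an illustration of the parsing behaviour described in the last bullet, here is a minimal sketch built on `Enum.TryParse`. The existing `ParseToEnum` helper is not shown in this PR, so this is not its implementation; `OutputFormatSketch` and `TryParseFormat` are hypothetical, and the member names and values are assumed from the updated help text.

```csharp
using System;

// Hypothetical mirror of the enum; values follow the updated CLI help text.
public enum OutputFormatSketch
{
    MGF = 0,
    MzML = 1,
    IndexMzML = 2,
    Parquet = 3,
    BinarySoa = 4,
    None = 5,
}

public static class FormatParsingSketch
{
    // Accepts a numeric code ("4") or a case-insensitive name ("binarysoa").
    public static bool TryParseFormat(string text, out OutputFormatSketch format)
    {
        // Enum.TryParse also succeeds for undefined numbers such as "99",
        // so IsDefined is needed to reject them.
        return Enum.TryParse(text, ignoreCase: true, out format)
               && Enum.IsDefined(typeof(OutputFormatSketch), format);
    }
}
```

With this sketch, `TryParseFormat("4", out var f)`, `TryParseFormat("BinarySoa", out f)` and `TryParseFormat("binarysoa", out f)` all yield `OutputFormatSketch.BinarySoa`.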

Tested

Built clean on .NET 8 (macOS arm64 dev with Rosetta'd x64 dotnet for the Thermo DLL; CI Ubuntu should be unaffected).

Smoke-tested on:

  • bundled Data/small.RAW (48 spectra, FTMS+ITMS hybrid)
  • bundled Data/small2.RAW (95 spectra, ETD-capable instrument with reagent-ion trailer fields)
  • a 143,136-spectrum / ~60M-peak Orbitrap DDA file from PXD028735 (0 errors, 0 warnings)

--stdout and --output produce byte-exact identical streams in all three cases.

Happy to add a WriterTests entry in the same shape as the MzML/Parquet ones if you'd prefer the test be in-tree.

…mers

Adds a new --format=BinarySoa (4) output that emits each scan as a
self-describing binary record laid out as Structure-of-Arrays (mz f64,
intensity f32). Designed for downstream pipelines (Rust engines, GPU
rescorers, columnar database loaders) that prefer zero-copy ingestion
over portable XML.

The format is fully documented in BINARY_SOA_FORMAT.md and consists of:

  - 32-byte file header (magic "RCIASTR1", format_version, flags)
  - per-spectrum records with a 128-byte fixed scalar header capturing
    every commonly-needed field (rt, precursor mz, isolation window,
    collision energy, FAIMS CV, ion injection time, base peak, TIC,
    low/high mass, charge, master scan, activation type, ...) with
    graceful nullability via NaN floats and -1 int sentinels
  - an optional verbatim trailer key/value dump preserving every
    per-scan vendor-reported field (AGC target, conversion parameters,
    lock-mass calibration, etc.) without selective filtering
  - SoA peak arrays (f64 mz, then f32 intensity), naturally aligned
  - u32 = 0 EOF marker
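
A rough sketch of how the framing pieces listed above could be produced with inline little-endian writers. Only the sizes, the magic string, the NaN / -1 sentinels and the u32 = 0 EOF marker come from the description; every byte offset below is an assumption made for illustration (the authoritative layout is in BINARY_SOA_FORMAT.md), and `HeaderSketch` is a hypothetical name.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

// Hypothetical helper; offsets are assumed, not taken from the specification.
public static class HeaderSketch
{
    public const int FileHeaderSize = 32;
    public const int ScanHeaderSize = 128;

    public static byte[] BuildFileHeader(uint formatVersion, uint flags)
    {
        var header = new byte[FileHeaderSize];                  // remaining bytes stay zero
        Encoding.ASCII.GetBytes("RCIASTR1").CopyTo(header, 0);  // 8-byte magic
        BinaryPrimitives.WriteUInt32LittleEndian(header.AsSpan(8), formatVersion); // assumed offset
        BinaryPrimitives.WriteUInt32LittleEndian(header.AsSpan(12), flags);        // assumed offset
        return header;
    }

    // Missing values become NaN (floats) or -1 (ints) instead of a separate null flag.
    public static void WriteNullableFields(Span<byte> scanHeader, double? precursorMz, int? charge)
    {
        if (scanHeader.Length < ScanHeaderSize)
            throw new ArgumentException("scan header buffer too small");

        BinaryPrimitives.WriteDoubleLittleEndian(scanHeader.Slice(0, 8), precursorMz ?? double.NaN); // assumed offset
        BinaryPrimitives.WriteInt32LittleEndian(scanHeader.Slice(8, 4), charge ?? -1);               // assumed offset
    }

    // End of stream is signalled by a u32 with value 0.
    public static void WriteEofMarker(Stream output)
    {
        Span<byte> eof = stackalloc byte[4];
        BinaryPrimitives.WriteUInt32LittleEndian(eof, 0u);
        output.Write(eof);
    }
}
```

A reader can then test a decoded float with `double.IsNaN` and a decoded int against -1 to detect absent values without any extra presence bitmap.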

Both --stdout and --output produce the identical byte format, so a
file written with --output can be played back through the same
downstream consumer that reads from a streaming pipe.

Performance notes from a 3.7 GB Orbitrap DDA benchmark
(143k spectra, 60M peaks):

  - Output is wrapped in a 1 MB BufferedStream to coalesce small
    writes into few large pipe syscalls (sys time dropped ~22x in
    measurement vs naive per-element writes)
  - mz array emitted via zero-copy MemoryMarshal.AsBytes over the
    existing double[]
  - intensity narrowing (f64->f32) uses ArrayPool<float>.Shared and
    a tight loop the JIT auto-vectorizes
  - per-spectrum header is built into a reusable 128-byte buffer with
    inline little-endian writers (no BinaryWriter virtual calls)
  - metadata block built into a reusable MemoryStream that's reset
    (not freed) between scans

Activation type encoding handles EThcD correctly: when the instrument
reports SupplementalActivation == TriState.On AND the primary reaction
is ETD/ECD followed by HCD/CID, the encoded byte is 5 (EThcD) rather
than the supplemental's HCD/CID value.
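
The rule above can be sketched as a small pure function. This is an illustration only: it does not call the Thermo API, the type and method names are hypothetical, the codes 2 to 5 follow the table in BINARY_SOA_FORMAT.md, and the CID code of 1 is an assumption because that row is not shown in this conversation.

```csharp
// Hypothetical encoding; only the values 2..5 are confirmed by the format table.
public enum ActivationCodeSketch : byte
{
    Cid = 1,   // assumed value
    Hcd = 2,
    Etd = 3,
    Ecd = 4,
    EThcD = 5,
}

public static class ActivationEncodingSketch
{
    // primary: activation of the first reaction; last: activation of the final reaction;
    // supplementalOn: stands in for SupplementalActivation == TriState.On.
    public static ActivationCodeSketch Encode(
        ActivationCodeSketch primary, ActivationCodeSketch last, bool supplementalOn)
    {
        bool primaryIsElectronBased =
            primary == ActivationCodeSketch.Etd || primary == ActivationCodeSketch.Ecd;
        bool lastIsCollisional =
            last == ActivationCodeSketch.Hcd || last == ActivationCodeSketch.Cid;

        // ETD/ECD followed by supplemental HCD/CID collapses to the combined code.
        if (supplementalOn && primaryIsElectronBased && lastIsCollisional)
            return ActivationCodeSketch.EThcD;

        // Otherwise report the last reaction's own type.
        return last;
    }
}
```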

Compatibility:

  - Additive: new OutputFormat.BinarySoa = 4 enum value, existing
    formats (MGF, mzML, IndexMzML, Parquet) untouched
  - SpectrumWriter.ConfigureWriter handles BinarySoa as a binary
    destination (no text-encoded StreamWriter wrapper, optional gzip
    via --gzip)
  - CLI help text updated to document the new format
@caetera caetera requested review from caetera and ypriverol May 11, 2026 13:45
@caetera caetera added the enhancement New feature or request label May 11, 2026
@ypriverol
Contributor

@filiprumenovski can this be aligned with the parquet representation, at least the column names?

Comment thread Writer/BinarySoaSpectrumWriter.cs Outdated
precursorMz = CalculateSelectedIonMz(reaction, monoisotopicMz, isolationWidthTrailer);
collisionEnergy = (float)reaction.CollisionEnergy;

double iw = isolationWidthTrailer ?? reaction.IsolationWidth;
Contributor

The isolation window offset needs to be implemented to make the output consistent with the other formats.

int filterLen = filterBytes.Length;
if (filterLen > MaxFilterStringLen)
{
Log.Warn($"Filter string for scan {scanNumber} truncated from {filterLen} to {MaxFilterStringLen} bytes");
Contributor

In general, issued warnings affect the error code through parseInput.NewWarn(). The warning is "silenced" here. Is that intended?

Comment thread BINARY_SOA_FORMAT.md Outdated
| 2 | HCD (Higher-Energy Collisional Dissociation) |
| 3 | ETD (Electron Transfer Dissociation) |
| 4 | ECD (Electron Capture Dissociation) |
| 5 | EThcD (ETD + HCD supplemental) |
Contributor

Is it planned to support ETciD?

Comment thread OutputFormat.cs
Contributor

This changes the code for None. Since we are doing a major version update now, I think it is justified.

Comment thread MainClass.cs Outdated
{
"f=|format=",
- "The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",
+ "The spectra output format: 0 for MGF, 1 for mzML, 2 for indexed mzML, 3 for Parquet, 4 for BinarySoa (RCIA streaming binary, see BINARY_SOA_FORMAT.md), 5 for None (no output); both numeric and text (case insensitive) value recognized. Defaults to indexed mzML if no format is specified.",
Contributor

Is it necessary to include a reference to the format specification in the help output?

Contributor

caetera commented May 11, 2026

Hi @filiprumenovski,
thank you for implementing the new format. I did a quick review of changes and added some inline comments/considerations.
I have a design question as well. I am not sure whether the format specification belongs in this repository. Should we have a separate place for the specification? @ypriverol, what is your view on this?

Author

filiprumenovski commented May 13, 2026

Thanks for the review. I pushed an update addressing the inline comments and tightened the BinarySoa implementation around compatibility, ergonomics, and the hot path.

What changed:

  • Aligned the isolation window fields with the parquet representation. The binary header now exposes isolation_lower and isolation_upper as absolute bounds, and the writer applies IsolationWidthOffset instead of assuming a symmetric window.
  • Fixed warning accounting. Truncation paths now call ParseInput.NewWarn(), so --warningsAreErrors behaves consistently with the rest of TRFP.
  • Documented ETciD/EThcD handling. Primary ETD/ECD with supplemental HCD/CID is encoded as activation type 5.
  • Removed the format-spec reference from CLI help, so normal help output stays focused on usage rather than internal documentation.
  • Updated README/help text so BinarySoa, --chargeData, and --noiseData are represented consistently with the existing CLI.
  • Extended BinarySoa so it can carry optional charge arrays and noise arrays when those existing TRFP flags are requested. Default output remains the fast mz/intensity-only path.
  • Added focused BinarySoa tests for centroid/profile output shape and precursor/isolation metadata.
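
For the isolation-window change in the first bullet, one plausible way to derive absolute bounds from a width and an offset is sketched below. The sign convention is an assumption (window centred at precursor m/z plus offset), since the writer's actual arithmetic is not shown in this conversation; `IsolationWindowSketch` and `AbsoluteBounds` are hypothetical names.

```csharp
using System;

// Hypothetical helper; the offset convention is assumed, not taken from the PR.
public static class IsolationWindowSketch
{
    // Returns absolute lower/upper m/z bounds for a possibly asymmetric window.
    public static (double Lower, double Upper) AbsoluteBounds(
        double precursorMz, double width, double offset)
    {
        if (width < 0)
            throw new ArgumentOutOfRangeException(nameof(width));

        // Assumption: the offset shifts the window centre away from the precursor m/z.
        double center = precursorMz + offset;
        double half = width / 2.0;
        return (center - half, center + half);
    }
}
```

With an offset of 0 this reduces to the symmetric window the earlier revision assumed; a non-zero offset shifts both bounds by the same amount.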

The main design intent is that BinarySoa should be useful for high-throughput consumers without becoming a second mzML. The default record stays compact and predictable: fixed scalar header, filter string, mz array, intensity array, optional trailer metadata. Extra per-peak arrays are opt-in through existing TRFP flags, so readers that only need mz/intensity can stay simple and fast, while users who need charge/noise data can still preserve it.

On the format specification question: I agree it is worth deciding where the canonical spec should live. For this PR I kept it in-repo so the implementation and byte layout evolve together during review. If we want the spec to live elsewhere long term, I can move it or reduce the in-repo file to a short implementation note once we agree on the destination.

One design question I would like your thoughts on: would you prefer this to be an Arrow IPC stream instead of a custom binary stream? Since TRFP already has parquet support, Arrow IPC would make this closer to "streaming parquet-style columnar data" and could reuse a familiar ecosystem for schemas, column names, and cross-language readers. The tradeoff is that the current BinarySoa layout is intentionally minimal and very cheap for pipe-based consumers: no row groups, no schema negotiation, no generic columnar container overhead, and direct scan-record framing. I am open to either direction if the maintainers prefer standard columnar interoperability over the smallest possible streaming format.
