Skip to content

Generated parse() re-decodes grammar tables on every call #630

@avityuk

Description

@avityuk

Hi! First, thanks for grmtools — I'm looking at using it in a project
that parses a lot of small inputs in a tight loop, and while profiling I
noticed something I wanted to flag.

The generated parse() from lrpar_mod! calls
::lrpar::ctbuilder::_reconstitute(__GRM_DATA, __STABLE_DATA) on every
invocation, which does two bincode::decode_from_slice calls to rebuild
the YaccGrammar and StateTable. For a single whole-buffer parse this
is invisible, but if you call parse() many times — e.g. parsing many
small chunks, or parsing inside a hot loop — that decode ends up
dominating. RTParserBuilder::new only borrows the tables, so the work
is pure repeat overhead.

In addition, the per-call setup in lrpar::parser has a couple of small
redundancies:

  • Parser::token_cost is stored as Box<&dyn Fn> — a heap allocation
    around an already-borrowed reference, on every parse.
  • parse_map / parse_actions collect the lexer iterator into a
    Vec<Result<_, _>> and then iterate it again to build the lexeme
    vec, instead of a single collect::<Result<Vec<_>, _>>().
  • The iter_tidxs cost-positive assert! loop runs in release builds,
    even though it's really a builder-side contract.

On the calc_actions example, with 200k iterations of a few short
expressions in release mode, I'm seeing roughly ~2.06 µs/parse → ~0.80 µs/parse
(about a 2.5× speedup) once these are addressed. Most of that is the
table decode; the rest is the small setup items.

I have a working patch and would be happy to send a PR. Two things I'd
appreciate guidance on before I do:

  1. Public API. To cache the tables across calls, generated code
    needs a concrete type to put in a OnceLock. Naming
    (YaccGrammar<S>, StateTable<S>) directly would force every
    downstream crate to depend on lrtable, which seems like a
    regression. My current approach is to introduce an opaque
    lrpar::ParserTables<S> newtype with grm() / stable()
    accessors, re-exported at the crate root, so generated code only
    names ::lrpar::ParserTables. Does that shape sound reasonable, or
    would you prefer it #[doc(hidden)], or somewhere else entirely?

  2. _reconstitute signature. It's already #[doc(hidden)], but its
    return type would change from a tuple to ParserTables<S>. Since
    generated code is regenerated against the matching lrpar version,
    I think this is fine, but happy to keep the old function and add a
    new one if you'd rather not touch it.

Happy to open the PR as-is, or to iterate on the design first — whatever
works best for you. Thanks again!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions