Rewrite output pipeline; add paginated bnz list #2

Open
schuay wants to merge 3 commits into victorgomes:main from schuay:feat-output-rewrite

@schuay schuay commented May 15, 2026

  • Rewrite output pipeline as flat-tree HTML to Turndown markdown
  • Tighten focused output after triage
  • bnz list: paginate, parse rich rows, filter --since

schuay added 3 commits May 15, 2026 12:35
The previous extractor read element.innerText from the rendered page and
ran anchored regexes to recover structure. That approach could silently
drop content when Buganizer's UI changed, and it lost link targets,
heading levels, and code-block fidelity along the way.

The new pipeline is three layers:

  1. Flatten: an in-page recursive walker descends light + shadow DOM,
     substitutes <slot> elements with assignedNodes({flatten:true}), and
     emits HTML mirroring the rendered flat tree.
  2. Convert: Node-side Turndown (with the GFM plugin) turns that HTML
     into markdown. Custom rules handle Polymer-specific compaction:
     drop action buttons / icon controls / empty avatars, flatten
     inline links, collapse the issue-metadata sidebar to one bullet
     per field, drop per-field-change time-of-day stamps, drop
     system-generated history events that have no body.
  3. Render: bug.js adds a synthesized header and appendix
     (attachments, downloaded reproducers).
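
The Flatten layer described above could look roughly like the sketch below. It is not the code from this PR: the names flattenToHtml, escapeText, and escapeAttr are hypothetical, and it assumes open shadow roots only (a closed root reports shadowRoot as null, so its host is serialized from its light children). It relies only on standard DOM properties: nodeType, shadowRoot, childNodes, localName, attributes, and HTMLSlotElement.assignedNodes().

```javascript
// Hypothetical sketch of a flat-tree serializer; not the PR's actual walker.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const VOID = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr']);

function escapeText(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function escapeAttr(s) {
  return escapeText(s).replace(/"/g, '&quot;');
}

function flattenToHtml(node) {
  if (node.nodeType === TEXT_NODE) return escapeText(node.nodeValue);
  if (node.nodeType !== ELEMENT_NODE) return ''; // skip comments and the like

  // A <slot> is replaced by the nodes distributed into it. With
  // { flatten: true } the browser resolves nested slots and falls back to
  // the slot's own children when nothing is assigned.
  if (node.localName === 'slot') {
    return node.assignedNodes({ flatten: true }).map(flattenToHtml).join('');
  }

  const attrs = Array.from(node.attributes || [])
    .map((a) => ` ${a.name}="${escapeAttr(a.value)}"`)
    .join('');
  if (VOID.has(node.localName)) return `<${node.localName}${attrs}>`;

  // An open shadow root replaces the host's light children in the rendered
  // tree; light children reappear only where a <slot> pulls them in.
  const root = node.shadowRoot || node;
  const inner = Array.from(root.childNodes).map(flattenToHtml).join('');
  return `<${node.localName}${attrs}>${inner}</${node.localName}>`;
}
```

In a page context this would be called as flattenToHtml(document.body), and the resulting string handed to the Convert layer.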

What remains of the structured-parsing layer are tiny targeted
extractors for things the CLI itself acts on: the testcase-key URL
inside an issue, the [Command line] flags on a CF page, attachment
URLs collected during the walk, reproducer-download endpoints.

Other changes bundled in:

- Split into lib/url, lib/dom, lib/browser, lib/cache, lib/render.
- bug list <query>: run a Buganizer search and emit the results.
- Disk cache at ~/.config/bug-cli/cache/ with a 5-minute TTL.
  --refresh busts a single fetch; --no-cache disables.
- --download-original (cf testcases) and --download-attachments[=DIR]
  (any issue) using the authenticated request context.
- A single Chromium launch handles multiple targets in one invocation
  (bug 1 2 3, bug cf a b c).
- Default output is focused (page chrome dropped, sidebar compacted,
  empty fields suppressed). Pass --full to disable the filter.
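
A minimal sketch of the disk cache with its 5-minute TTL follows. The helper names (cachePath, cacheGet, cachePut, cachedFetch), the SHA-256 file naming, and the JSON entry layout are assumptions made for illustration; only the directory and the TTL come from the description above. The flag handling is one plausible reading: refresh skips the read but still writes, while noCache bypasses the cache entirely.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const crypto = require('node:crypto');

const DEFAULT_DIR = path.join(os.homedir(), '.config', 'bug-cli', 'cache');
const TTL_MS = 5 * 60 * 1000; // 5-minute TTL

// Hypothetical: map a cache key (for example a URL) to a file under `dir`.
function cachePath(dir, key) {
  const name = crypto.createHash('sha256').update(key).digest('hex');
  return path.join(dir, `${name}.json`);
}

// Return the cached value, or undefined when missing, unreadable, or stale.
function cacheGet(dir, key, now = Date.now()) {
  try {
    const entry = JSON.parse(fs.readFileSync(cachePath(dir, key), 'utf8'));
    return now - entry.savedAt < TTL_MS ? entry.value : undefined;
  } catch {
    return undefined;
  }
}

function cachePut(dir, key, value, now = Date.now()) {
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(cachePath(dir, key), JSON.stringify({ savedAt: now, value }));
}

// Assumed flag semantics: noCache never touches disk; refresh ignores any
// existing entry but stores the fresh result.
async function cachedFetch(key, fetcher, opts = {}) {
  const { dir = DEFAULT_DIR, refresh = false, noCache = false } = opts;
  if (noCache) return fetcher();
  if (!refresh) {
    const hit = cacheGet(dir, key);
    if (hit !== undefined) return hit;
  }
  const value = await fetcher();
  cachePut(dir, key, value);
  return value;
}
```

Taking `now` as a parameter keeps the expiry check testable without waiting out the TTL.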

Five fixes surfaced while triaging a batch of open issues:

- Drop the redundant page-rendered <h2>Issue <id></h2> at the top of
  issue-details-wrapper (Turndown rule), and drop b-issue-id-picker /
  issue-chip-indicators / b-access-limits-chip from the walker. These
  were echoing the issue id and an empty 'Visibility' label that our
  synthesized header / sidebar already cover.

- Synthesize a 'Status: X  .  Type: Y  .  Priority: Z  .  Severity: W'
  summary line just under the URL. Buganizer doesn't render a chip
  for Status=New, so the status would otherwise only appear at the
  bottom of the page. Extracted via regex from the compacted sidebar.

- Pad b-formatted-date-time with a leading space so the comment header
  no longer renders '[victorgomes#2](url)2026-05-15 04:27' (link jammed against
  timestamp with no separator).

- Structure attachment listings: <b-attachment-viewer> renders as
  '- **filename** -- size -- [View](url) [Download](url)' instead of
  paragraph flow.

- Introduce a qsa() helper to iterate NodeLists returned by domino
  (the HTML parser Turndown uses on the Node side). domino's NodeList
  lacks Symbol.iterator, so for...of silently iterated zero times.
  Earlier rules that used for...of on querySelectorAll happened to
  fall through to acceptable defaults; the new attachment rule blew
  up outright.
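
For illustration, a helper along the lines of qsa() might be as small as the sketch below. The signature is an assumption (the PR does not show it), and it presumes the NodeList exposes length and indexed access, which is all an index loop needs.

```javascript
// Hypothetical qsa(): return matches as a real Array so that for...of, map,
// and filter work even when the parser's NodeList is not iterable.
function qsa(root, selector) {
  const list = root.querySelectorAll(selector);
  const out = [];
  // Index-based loop: depends only on `length` and `list[i]`, never on
  // Symbol.iterator.
  for (let i = 0; i < list.length; i++) {
    out.push(list[i]);
  }
  return out;
}
```

A Turndown rule could then iterate with for (const link of qsa(node, 'a')) regardless of which DOM implementation produced the node.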

Before this, bnz list returned only the first page of search results and
parsed bare issue links from it. The new pipeline walks Buganizer's
pagination and parses the rich result table:

- dumpPaginated() clicks the 'Go to next page' button until it is
  disabled (Buganizer ignores URL pagination parameters).
  --max-pages=N caps the walk at N pages of 50 results each
  (default: 30 pages).
- extractSearchRows() parses the markdown table Turndown emits:
  priority, type, title, assignee, status, 7d-views, id, modified.
  extractSearchHits is kept as a fallback for pages that don't render
  the table (empty results, error stubs).
- --since=<dur|date> filters hits by their LAST MODIFIED column.
  Duration syntax accepts h/d/w/m (e.g. 7d, 1w); anything else falls
  through to Date.parse, so ISO dates work too.
- listCmd dedups by issue id across pages and reports pagesFetched
  and filteredOut in both the markdown summary line and json output.
- renderListMarkdown emits a markdown table when hits have structured
  fields, falling back to the bullet list when only id/title/url are
  available.
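
The --since parsing and the cross-page dedup could be sketched as follows. The names parseSince and mergeAndFilter and the hit shape ({ id, modified }) are hypothetical. The sketch also assumes that "m" means a 30-day month and that the modified column is already in a form Date.parse accepts; neither is stated in the commit message.

```javascript
const HOUR_MS = 60 * 60 * 1000;
const UNIT_MS = {
  h: HOUR_MS,
  d: 24 * HOUR_MS,
  w: 7 * 24 * HOUR_MS,
  m: 30 * 24 * HOUR_MS, // ASSUMPTION: "m" is treated as a 30-day month
};

// Turn a --since value into a cutoff timestamp (ms since epoch). Durations
// such as "7d" count back from `now`; anything else goes to Date.parse.
function parseSince(value, now = Date.now()) {
  const match = /^(\d+)([hdwm])$/.exec(value);
  if (match) return now - Number(match[1]) * UNIT_MS[match[2]];
  const parsed = Date.parse(value);
  if (Number.isNaN(parsed)) throw new Error(`unrecognized --since value: ${value}`);
  return parsed;
}

// Merge pages of hits, keep the first occurrence of each id, then drop hits
// last modified before the cutoff (pass undefined to skip filtering).
function mergeAndFilter(pages, cutoff) {
  const seen = new Set();
  const hits = [];
  let filteredOut = 0;
  for (const page of pages) {
    for (const hit of page) {
      if (seen.has(hit.id)) continue;
      seen.add(hit.id);
      if (cutoff !== undefined && Date.parse(hit.modified) < cutoff) {
        filteredOut++;
        continue;
      }
      hits.push(hit);
    }
  }
  return { hits, pagesFetched: pages.length, filteredOut };
}
```

Deduplicating before filtering means a repeated id is counted once, so filteredOut reflects distinct issues and not rows.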