diff --git a/vignettes/writing_data_guides.Rmd b/vignettes/writing_data_guides.Rmd
@@ -0,0 +1,360 @@
+---
+title: "Writing data guides"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Writing data guides}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  echo = FALSE
+)
+```
+
+```{r setup}
+library(excelDataGuide)
+```
+
+# Writing a guide
+
+Every spreadsheet template should be accompanied by a data guide — a YAML file
+that documents where data are located within the template and how they should be
+interpreted. This guide serves as the blueprint that tells the `excelDataGuide`
+package exactly where to find and how to read your data.
+
+A data guide is a human-editable, computer-readable YAML file. YAML syntax is
+simple: use colons (`:`) for key-value pairs, hyphens (`-`) for lists, and
+indentation for nesting. This section walks you through constructing a guide
+step by step.
+
+## Step 1: Define the header
+
+Every guide begins with a header of five fields that identify the guide and the
+template it applies to:
+
+``` yaml
+guide.version: '1.0'
+template.name: myassay
+template.min.version: '1.0'
+template.max.version: '2.0'
+plate.format: 96
+```
+
+- **`guide.version`** (*required*): The version of the guide itself (not the template).
+  Use `major.minor` format, e.g., `'1.0'` or `'2.1'`. This allows you to track
+  changes to the guide independently from the template.
+
+- **`template.name`** (*required*): A short, descriptive name for the template
+  (e.g., `'myassay'`, `'fluorescence_assay'`). This is used to identify which
+  template a guide belongs to and is checked at runtime if you enable template
+  name validation.
+
+- **`template.min.version`** (*required*): The minimum template version for which
+  this guide is compatible. Use `major.minor` format (e.g., `'1.0'`).
+
+- **`template.max.version`** (*required*, can be null): The maximum template version
+  for which this guide is compatible. If there is no upper limit, use `~` (YAML
+  null) to indicate "any newer version is OK". For example, `template.max.version: ~`
+  means the guide works with all template versions from `template.min.version`
+  onwards.
+
+- **`plate.format`** (*conditionally required*): The microplate format used in your
+  experiments. Valid values are `24`, `48`, `96`, or `384` (referring to the
+  number of wells). This is **required only if your guide includes any
+  `platedata` locations** (see below). The guide uses this to validate that
+  your platedata ranges have the correct dimensions.
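+
+To see why the version fields are quoted, the sketch below parses a header with
+the `yaml` package. It is only an illustration: how `excelDataGuide` reads the
+file internally is not shown in this vignette, and the value `'1.10'` is a
+hypothetical version chosen to make the point.
+
+``` r
+library(yaml)
+
+# Hypothetical header; '1.10' is used here only for illustration
+header <- yaml.load("
+guide.version: '1.0'
+template.name: myassay
+template.min.version: '1.10'
+template.max.version: ~
+plate.format: 96
+")
+
+header$template.min.version           # "1.10" (a string, because it is quoted)
+yaml.load("v: 1.10")$v                # 1.1    (unquoted, so parsed as a number)
+is.null(header$template.max.version)  # TRUE   (`~` is YAML null)
+```
+
+Without the quotes, version `1.10` would be indistinguishable from `1.1`.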
+
+## Step 2: Add your first location — template version cell
+
+The `.template` location is a special, **required** location that records the
+template version from a single cell. This allows the package to check at runtime
+that the data file was created with a compatible template version.
+
+``` yaml
+locations:
+  - sheet: description
+    type: cells
+    varname: .template
+    translate: false
+    variables:
+      - name: version
+        cell: B2
+```
+
+**What this does:**
+
+- Reads cell `B2` from the sheet named `description`
+- Stores its value in a variable called `version` within the `.template` group
+- `translate: false` means no translation is applied (the cell value is used as-is)
+
+This location **must be present in every guide**. The package uses the version
+value to check compatibility with `template.min.version` and `template.max.version`.
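+
+The compatibility rule can be pictured with base R's `package_version()`. The
+helper `is_compatible()` below is hypothetical and not part of `excelDataGuide`;
+it is a minimal sketch of the rule as described here, assuming that both bounds
+are inclusive and that a null `template.max.version` means no upper limit.
+
+``` r
+# Hypothetical helper, not an excelDataGuide function.
+# TRUE when `version` lies within [min_version, max_version];
+# max_version = NULL means there is no upper limit.
+is_compatible <- function(version, min_version, max_version = NULL) {
+  v <- package_version(version)
+  if (v < package_version(min_version)) {
+    return(FALSE)
+  }
+  is.null(max_version) || v <= package_version(max_version)
+}
+
+is_compatible("1.5", "1.0", "2.0")  # TRUE
+is_compatible("2.1", "1.0", "2.0")  # FALSE
+is_compatible("3.0", "1.0", NULL)   # TRUE
+```
+
+Comparing versions as `package_version` objects rather than as numbers keeps
+`1.10` newer than `1.9`.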
+
+## Step 3: Add metadata as key-value pairs
+
+Metadata such as experimenter names, dates, and study IDs are typically stored as
+key-value pairs in your template. One location entry can index multiple ranges.
+
+``` yaml
+  - sheet: description
+    type: keyvalue
+    varname: metadata
+    translate: true
+    atomicclass:
+      - character
+      - character
+      - date
+      - numeric
+    ranges:
+      - A5:B8
+      - A10:B12
+```
+
+**What this does:**
+
+- Reads two ranges (`A5:B8` and `A10:B12`) from the `description` sheet
+- The left column of each range contains keys (variable names); the right column contains values
+- `translate: true` means the keys will be translated from long names (in the
+  spreadsheet) to short names (used in your script) using the `translations`
+  section defined later
+- `atomicclass` is a list of data types that is applied to the key-value pairs
+  in order. In this example, the first four pairs are coerced to character,
+  character, date, and numeric respectively. If there are more pairs than
+  `atomicclass` entries, coercion cycles through the list as needed. If
+  `atomicclass` is omitted, all values are coerced to character.
+
+**Combining multiple ranges into one variable:** Multiple ranges are combined into
+a single named list (`metadata` in this example). This is convenient when metadata
+is scattered across multiple sheets or ranges but logically belongs together.
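+
+How the class list lines up with the key-value pairs can be sketched in base R.
+This assumes that "cycles through the list" means R-style recycling from the
+start of the list; it illustrates the rule and is not the package's own code.
+
+``` r
+classes <- c("character", "character", "date", "numeric")
+
+# A5:B8 holds 4 key-value pairs and A10:B12 holds 3, so 7 in total
+n_pairs <- 4 + 3
+
+# Assumed behaviour: the class list is reused from the start when it runs out
+assigned <- rep_len(classes, n_pairs)
+assigned
+# character, character, date, numeric, character, character, date
+```
+
+Under that assumption the fifth pair is read as character again, so it is worth
+listing one class per pair whenever the types differ.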
+
+## Step 4: Add experiment parameters
+
+Parameters that are constant across all experiments (e.g., acceptance criteria,
+standard concentrations defined in SOPs) are best stored in a dedicated `_parameters`
+sheet. They are indexed like metadata:
+
+``` yaml
+  - sheet: _parameters
+    type: keyvalue
+    varname: parameters
+    translate: false
+    atomicclass: numeric
+    ranges:
+      - A2:B5
+      - A8:B10
+```
+
+Here, all values will be coerced to numeric. Use `translate: false` if your
+parameter names are already in the format you want in the script.
+
+## Step 5: Add measured plate data
+
+If your template records plate data (e.g., fluorescence measurements from a 96-well
+plate), use the `platedata` type:
+
+``` yaml
+  - sheet: _data
+    type: platedata
+    varname: plate
+    translate: false
+    atomicclass:
+      - character
+      - numeric
+      - numeric
+    ranges:
+      - A1:M9
+      - A11:M19
+```
+
+**What this does:**
+
+- Reads data in plate format (e.g., well positions and measurements)
+- Each range is a block of data in the same row-column layout as the physical plate
+- Multiple ranges are stacked vertically into a single data frame
+- The resulting data frame includes columns `row` and `col` (well positions) plus
+  one column per measured variable, each coerced to its corresponding `atomicclass`
+
+In this example, the first three variables will be character, numeric, and numeric,
+respectively. The first range (`A1:M9`) spans 9 rows and 13 columns: the 8 rows
+(A–H) and 12 columns of a 96-well plate, plus what is presumably a header row and
+a label column.
+
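+The dimension check that `plate.format` enables can be sketched in base R. The
+helpers below are hypothetical (they are not `excelDataGuide` functions) and
+assume that each range holds one label row and one label column in addition to
+the wells. The well layouts (4 × 6, 6 × 8, 8 × 12, 16 × 24) are the standard
+ones for 24-, 48-, 96-, and 384-well plates.
+
+``` r
+# Convert Excel column letters to a column number ("A" -> 1, "AA" -> 27)
+col_to_num <- function(x) {
+  chars <- strsplit(toupper(x), "")[[1]]
+  sum(match(chars, LETTERS) * 26^(rev(seq_along(chars)) - 1))
+}
+
+# Number of rows and columns spanned by a range such as "A1:M9"
+range_dims <- function(range) {
+  ends <- strsplit(range, ":", fixed = TRUE)[[1]]
+  cols <- vapply(gsub("[0-9]", "", ends), col_to_num, numeric(1))
+  rows <- as.integer(gsub("[A-Za-z]", "", ends))
+  c(rows = diff(rows) + 1, cols = unname(diff(cols)) + 1)
+}
+
+# Standard well layouts (rows, columns) per plate format
+plate_dims <- list(`24` = c(4, 6), `48` = c(6, 8),
+                   `96` = c(8, 12), `384` = c(16, 24))
+
+# Assumption: one extra row and one extra column hold the labels
+fits_plate <- function(range, format) {
+  all(range_dims(range) == plate_dims[[as.character(format)]] + 1)
+}
+
+fits_plate("A1:M9", 96)    # TRUE
+fits_plate("A11:M19", 96)  # TRUE
+fits_plate("A1:M9", 384)   # FALSE
+```
+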
+## Step 6: Add tabular data
+
+Tables with column headers and multiple rows of data are indexed using the `table`
+type:
+
+``` yaml
+  - sheet: results
+    type: table
+    varname: userresults
+    translate: false
+    atomicclass: numeric
+    ranges:
+      - C3:E10
+```
+
+**What this does:**
+
+- Reads a table from range `C3:E10`
+- The first row contains column names (variable names)
+- All values are coerced to the specified `atomicclass`
+- The result is a data frame, not a list
+
+## Step 7: Add single-cell values
+
+Sometimes you need to read isolated values from single cells and give them names
+in the guide (rather than relying on names in the spreadsheet):
+
+``` yaml
+  - sheet: analysis
+    type: cells
+    varname: qc_checks
+    translate: false
+    atomicclass: numeric
+    variables:
+      - name: spread_well_1
+        cell: G6
+      - name: spread_well_2
+        cell: G33
+```
+
+**What this does:**
+
+- Reads individual cells and assigns them names specified in the guide
+- `spread_well_1` gets the value from cell `G6`, `spread_well_2` from `G33`
+- All values are coerced to numeric
+- The result is a named list
+
+This is useful when values are scattered or when the spreadsheet's cell-naming
+scheme doesn't align with your script's variable-naming scheme.
+
+## Step 8: Add translations
+
+If your spreadsheet uses long, user-friendly variable names (e.g., "Date of experiment")
+but your script prefers short, code-friendly names (e.g., `date`), define a
+`translations` section:
+
+``` yaml
+translations:
+  - long: "Date of experiment"
+    short: date
+  - long: "Experimenter name"
+    short: experimenter
+  - long: "Study identifier"
+    short: studyID
+  - long: "Plate identifier"
+    short: plateID
+```
+
+**When to use translations:**
+
+- Set `translate: true` in a location to apply these translations to its variable
+  names
+- Translations are optional; if you don't define any, set `translate: false` in
+  all locations
+- The package provides `long_to_shortnames()` and `short_to_longnames()` functions
+  to convert between formats
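+
+Conceptually, a translation is a lookup from long to short names. The sketch
+below shows the idea with a named character vector in base R. It does not
+reproduce `long_to_shortnames()`, whose arguments are not documented in this
+vignette; the object names `translations`, `lookup`, and `keys` are chosen for
+illustration only.
+
+``` r
+# The translations section as it would look after parsing the YAML
+translations <- list(
+  list(long = "Date of experiment", short = "date"),
+  list(long = "Experimenter name",  short = "experimenter")
+)
+
+# Named vector: names are the long forms, values the short forms
+lookup <- setNames(
+  vapply(translations, `[[`, character(1), "short"),
+  vapply(translations, `[[`, character(1), "long")
+)
+
+# Keys as they might appear in the left column of a keyvalue range
+keys <- c("Experimenter name", "Date of experiment")
+short_names <- unname(lookup[keys])
+short_names  # "experimenter" "date"
+```
+
+In this sketch a key without a matching `long` entry yields `NA`, which is one
+reason to keep the `translations` section in sync with the spreadsheet labels.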
+
+## Complete example guide
+
+Below is a complete, working guide for a small fluorescence assay template:
+
+``` yaml
+guide.version: '1.0'
+template.name: simple_fluorescence
+template.min.version: '1.0'
+template.max.version: ~
+plate.format: 96
+
+locations:
+  # Required: template version
+  - sheet: description
+    type: cells
+    varname: .template
+    translate: false
+    variables:
+      - name: version
+        cell: B2
+
+  # Metadata: user and experiment information
+  - sheet: description
+    type: keyvalue
+    varname: metadata
+    translate: true
+    atomicclass:
+      - character
+      - character
+      - date
+      - numeric
+    ranges:
+      - A5:B8
+
+  # Parameters: SOP-defined constants
+  - sheet: _parameters
+    type: keyvalue
+    varname: parameters
+    translate: false
+    atomicclass: numeric
+    ranges:
+      - A2:B5
+
+  # Plate data: fluorescence measurements
+  - sheet: _data
+    type: platedata
+    varname: fluorescence
+    translate: false
+    atomicclass: numeric
+    ranges:
+      - A1:M9
+      - A11:M19
+
+  # Quality control: derived metrics from analysis sheet
+  - sheet: analysis
+    type: cells
+    varname: qc
+    translate: false
+    atomicclass: numeric
+    variables:
+      - name: plate_uniformity
+        cell: D3
+      - name: signal_range
+        cell: D4
+
+translations:
+  - long: "Experimenter name"
+    short: experimenter
+  - long: "Date of experiment"
+    short: date
+  - long: "Temperature (°C)"
+    short: temperature
+  - long: "Sample concentration (µM)"
+    short: concentration
+```
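+
+Two of the rules from the steps above (a `.template` location must exist, and
+`plate.format` is needed whenever a `platedata` location is present) can be
+checked quickly in R. This is a minimal sketch assuming the `yaml` package;
+`check_guide()` is a hypothetical helper, not part of `excelDataGuide`, and it
+does not replace schema validation.
+
+``` r
+library(yaml)
+
+# Hypothetical helper: returns a character vector of problems (empty if none)
+check_guide <- function(guide) {
+  types    <- vapply(guide$locations, function(loc) loc$type, character(1))
+  varnames <- vapply(guide$locations, function(loc) loc$varname, character(1))
+  problems <- character(0)
+  if (!".template" %in% varnames) {
+    problems <- c(problems, "no .template location")
+  }
+  if ("platedata" %in% types && is.null(guide$plate.format)) {
+    problems <- c(problems, "platedata present but plate.format missing")
+  }
+  problems
+}
+
+# Deliberately incomplete guide, for illustration only
+bad_guide <- yaml.load("
+guide.version: '1.0'
+locations:
+  - sheet: _data
+    type: platedata
+    varname: plate
+    ranges:
+      - A1:M9
+")
+
+check_guide(bad_guide)  # reports both problems
+```
+
+For a guide stored on disk, `check_guide(read_yaml("path/to/guide.yml"))` applies
+the same two checks.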
+
+## Validation with CUE
+
+Before using your guide in your scripts, validate it against the CUE schema to
+catch syntax and structure errors early:
+
+``` bash
+# From the data-raw folder where the schema is located:
+cd data-raw
+./validate_and_sign.sh path/to/your/guide.yml
+```
+
+This command will:
+
+1. Check your guide against `excelguide_schema.cue`
+2. Abort with a detailed error message if validation fails
+3. Embed a SHA256 hash (`cue.verified` field) if validation succeeds
+
+The hash acts as proof that the guide was validated; you can verify it later
+with:
+
+``` bash
+./verify_guide.sh path/to/your/guide.yml
+```
+
+In R, you can also check the hash when loading a guide:
+
+``` r
+guide <- read_guide("path/to/guide.yml", verify_hash = TRUE)
+```
+
+This warns if the hash is missing and aborts if the hash is invalid.