| 1 | +--- |
| 2 | +title: "Writing data guides" |
| 3 | +output: rmarkdown::html_vignette |
| 4 | +vignette: > |
| 5 | + %\VignetteIndexEntry{Writing data guides} |
| 6 | + %\VignetteEngine{knitr::rmarkdown} |
| 7 | + %\VignetteEncoding{UTF-8} |
| 8 | +--- |
| 9 | + |
| 10 | +```{r, include = FALSE} |
| 11 | +knitr::opts_chunk$set( |
| 12 | + collapse = TRUE, |
| 13 | + comment = "#>", |
| 14 | + echo = FALSE |
| 15 | +) |
| 16 | +``` |
| 17 | + |
| 18 | +```{r setup} |
| 19 | +library(excelDataGuide) |
| 20 | +``` |
| 21 | + |
| 22 | +# Writing a guide |
| 23 | + |
| 24 | +Every spreadsheet template should be accompanied by a data guide — a YAML file |
| 25 | +that documents where data are located within the template and how they should be |
| 26 | +interpreted. This guide serves as the blueprint that tells the `excelDataGuide` |
| 27 | +package exactly where to find and how to read your data. |
| 28 | + |
| 29 | +A data guide is a human-editable, computer-readable YAML file. YAML syntax is |
| 30 | +simple: use colons (`:`) for key-value pairs, hyphens (`-`) for lists, and |
| 31 | +indentation for nesting. This section walks you through constructing a guide |
| 32 | +step by step. |
| 33 | + |
| 34 | +## Step 1: Define the header |
| 35 | + |
| 36 | +Every guide begins with up to five header fields that describe the guide and the template it applies to: |
| 37 | + |
| 38 | +``` yaml |
| 39 | +guide.version: '1.0' |
| 40 | +template.name: myassay |
| 41 | +template.min.version: '1.0' |
| 42 | +template.max.version: '2.0' |
| 43 | +plate.format: 96 |
| 44 | +``` |
| 45 | + |
| 46 | +- **`guide.version`** (*required*): The version of the guide itself (not the template). |
| 47 | + Use `major.minor` format, e.g., `'1.0'` or `'2.1'`. This allows you to track |
| 48 | + changes to the guide independently from the template. |
| 49 | + |
| 50 | +- **`template.name`** (*required*): A short, descriptive name for the template |
| 51 | + (e.g., `'myassay'`, `'fluorescence_assay'`). This is used to identify which |
| 52 | + template a guide belongs to and is checked at runtime if you enable template |
| 53 | + name validation. |
| 54 | + |
| 55 | +- **`template.min.version`** (*required*): The lowest template version with which |
| 56 | + this guide is compatible. Use `major.minor` format (e.g., `'1.0'`). |
| 57 | + |
| 58 | +- **`template.max.version`** (*required*, can be null): The highest template version |
| 59 | + with which this guide is compatible. If there is no upper limit, use `~` (YAML |
| 60 | + null) to indicate "any newer version is OK". For example, `template.max.version: ~` |
| 61 | + means the guide works with all template versions from `template.min.version` |
| 62 | + onwards. |
| 63 | + |
| 64 | +- **`plate.format`** (*conditionally required*): The microplate format used in your |
| 65 | + experiments. Valid values are `24`, `48`, `96`, or `384` (referring to the |
| 66 | + number of wells). This is **required only if your guide includes any |
| 67 | + `platedata` locations** (see below). The guide uses this to validate that |
| 68 | + your platedata ranges have the correct dimensions. |
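
The header rules above can be expressed as a short base R check. This is only an illustrative sketch: `check_header()` is a hypothetical helper, not a function of `excelDataGuide`, and it assumes the guide has already been parsed into a nested list. The authoritative check remains the schema validation described at the end of this vignette.

```r
# Hypothetical helper (not part of excelDataGuide): collect header problems
# from a guide that has already been parsed into a nested list.
check_header <- function(guide) {
  problems <- character(0)

  # These four keys must be present; template.max.version may be NULL.
  required <- c("guide.version", "template.name",
                "template.min.version", "template.max.version")
  missing_keys <- setdiff(required, names(guide))
  if (length(missing_keys) > 0) {
    problems <- c(problems, paste("missing key:", missing_keys))
  }

  # plate.format is only needed when at least one location is platedata.
  types <- vapply(guide$locations, function(loc) loc$type, character(1))
  if ("platedata" %in% types) {
    if (is.null(guide$plate.format)) {
      problems <- c(problems, "plate.format is required with platedata locations")
    } else if (!guide$plate.format %in% c(24, 48, 96, 384)) {
      problems <- c(problems, "plate.format must be 24, 48, 96, or 384")
    }
  }
  problems
}

guide <- list(
  guide.version = "1.0",
  template.name = "myassay",
  template.min.version = "1.0",
  template.max.version = NULL,   # corresponds to `~` in YAML
  locations = list(list(type = "cells"), list(type = "platedata"))
)
check_header(guide)
```

For this list the function reports that `plate.format` is missing, because one of the locations has type `platedata`.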
| 69 | + |
| 70 | +## Step 2: Add your first location — template version cell |
| 71 | + |
| 72 | +The `.template` location is a special, **required** location that records the |
| 73 | +template version from a single cell. This allows the package to check at runtime |
| 74 | +that the data file was created with a compatible template version. |
| 75 | + |
| 76 | +``` yaml |
| 77 | +locations: |
| 78 | + - sheet: description |
| 79 | + type: cells |
| 80 | + varname: .template |
| 81 | + translate: false |
| 82 | + variables: |
| 83 | + - name: version |
| 84 | + cell: B2 |
| 85 | +``` |
| 86 | + |
| 87 | +**What this does:** |
| 88 | +- Reads cell `B2` from the sheet named `description` |
| 89 | +- Stores its value in a variable called `version` within the `.template` group |
| 90 | +- `translate: false` means no translation is applied (the cell value is used as-is) |
| 91 | + |
| 92 | +This location **must be present in every guide**. The package uses the version |
| 93 | +value to check compatibility with `template.min.version` and `template.max.version`. |
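
To make the compatibility rule concrete, here is a minimal base R sketch of such a check. `is_template_compatible()` is a hypothetical name used for illustration; the package's own implementation may compare versions differently. The sketch assumes that a null `template.max.version` reaches R as `NULL`.

```r
# Hypothetical helper (not part of excelDataGuide): is a template version
# within the range [min_version, max_version]? max_version = NULL means
# "no upper limit".
is_template_compatible <- function(version, min_version, max_version = NULL) {
  v <- numeric_version(version)
  ok <- v >= numeric_version(min_version)
  if (!is.null(max_version)) {
    ok <- ok && v <= numeric_version(max_version)
  }
  ok
}

is_template_compatible("1.5", "1.0", "2.0")  # TRUE
is_template_compatible("2.1", "1.0", "2.0")  # FALSE: newer than the maximum
is_template_compatible("3.2", "1.0")         # TRUE: no upper limit
```

`numeric_version()` compares versions component by component, so `'1.10'` counts as newer than `'1.9'`. Quoting version numbers in the guide, as in the header example, keeps YAML from reading them as decimal numbers.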
| 94 | + |
| 95 | +## Step 3: Add metadata as key-value pairs |
| 96 | + |
| 97 | +Metadata such as experimenter names, dates, and study IDs are typically stored as |
| 98 | +key-value pairs in your template. One location entry can index multiple ranges. |
| 99 | + |
| 100 | +``` yaml |
| 101 | + - sheet: description |
| 102 | + type: keyvalue |
| 103 | + varname: metadata |
| 104 | + translate: true |
| 105 | + atomicclass: |
| 106 | + - character |
| 107 | + - character |
| 108 | + - date |
| 109 | + - numeric |
| 110 | + ranges: |
| 111 | + - A5:B8 |
| 112 | + - A10:B12 |
| 113 | +``` |
| 114 | + |
| 115 | +**What this does:** |
| 116 | +- Reads two ranges (`A5:B8` and `A10:B12`) from the `description` sheet |
| 117 | +- The left column of each range contains keys (variable names); the right column contains values |
| 118 | +- `translate: true` means the keys will be translated from long names (in the |
| 119 | + spreadsheet) to short names (used in your script) using the `translations` |
| 120 | + section defined later |
| 121 | +- `atomicclass` is a list of data types that is applied to the key-value pairs in order. In this |
| 122 | + example, the first four key-value pairs are coerced to character, character, date, |
| 123 | + and numeric respectively. If there are more pairs than classes, coercion cycles |
| 124 | + through the list again from the start. If `atomicclass` is omitted, all values are |
| 125 | + coerced to character. |
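
The cycling of `atomicclass` can be pictured with base R's `rep_len()`. This sketch only shows which class each pair would receive; it assumes that the pairs of both ranges are counted together, which may not match how the package counts them.

```r
# Classes as listed under `atomicclass` in the example above.
classes <- c("character", "character", "date", "numeric")

# A5:B8 holds 4 key-value pairs and A10:B12 holds 3
# (assumption: both ranges are counted as one sequence).
n_pairs <- 4 + 3

# Recycle the class list to the number of pairs.
assigned <- rep_len(classes, n_pairs)
data.frame(pair = seq_len(n_pairs), class = assigned)
```

Under this assumption, pairs five to seven are coerced to character, character, and date.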
| 126 | + |
| 127 | +**Combining multiple ranges into one variable:** Multiple ranges are combined into |
| 128 | +a single named list (`metadata` in this example). This is convenient when metadata |
| 129 | +is scattered across several ranges but logically belongs together. |
| 130 | + |
| 131 | +## Step 4: Add experiment parameters |
| 132 | + |
| 133 | +Parameters that are constant across all experiments (e.g., acceptance criteria, |
| 134 | +standard concentrations defined in SOPs) are best stored in a dedicated `_parameters` |
| 135 | +sheet. They are indexed like metadata: |
| 136 | + |
| 137 | +``` yaml |
| 138 | + - sheet: _parameters |
| 139 | + type: keyvalue |
| 140 | + varname: parameters |
| 141 | + translate: false |
| 142 | + atomicclass: numeric |
| 143 | + ranges: |
| 144 | + - A2:B5 |
| 145 | + - A8:B10 |
| 146 | +``` |
| 147 | + |
| 148 | +Here, all values will be coerced to numeric. Use `translate: false` if your |
| 149 | +parameter names are already in the format you want in the script. |
| 150 | + |
| 151 | +## Step 5: Add measured plate data |
| 152 | + |
| 153 | +If your template records plate data (e.g., fluorescence measurements from a 96-well |
| 154 | +plate), use the `platedata` type: |
| 155 | + |
| 156 | +``` yaml |
| 157 | + - sheet: _data |
| 158 | + type: platedata |
| 159 | + varname: plate |
| 160 | + translate: false |
| 161 | + atomicclass: |
| 162 | + - character |
| 163 | + - numeric |
| 164 | + - numeric |
| 165 | + ranges: |
| 166 | + - A1:M9 |
| 167 | + - A11:M19 |
| 168 | +``` |
| 169 | + |
| 170 | +**What this does:** |
| 171 | +- Reads data in plate format (e.g., well positions and measurements) |
| 172 | +- Each range is a block of data in the same row-column layout as the physical plate |
| 173 | +- Multiple ranges are stacked vertically into a single data frame |
| 174 | +- The resulting data frame includes columns `row` and `col` (well positions) plus |
| 175 | + one column per measured variable, each coerced to its corresponding `atomicclass` |
| 176 | + |
| 177 | +In this example, the first three variables will be character, numeric, and numeric |
| 178 | +respectively. The first range (`A1:M9`) spans 9 rows and 13 columns: presumably one row of column |
| 179 | +labels and one column of row labels, plus the 8 rows (A–H) and 12 columns of a 96-well plate. |
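
The dimension check that `plate.format` enables can be sketched in base R as follows. All function names here are hypothetical and not part of `excelDataGuide`. The sketch assumes that every `platedata` range contains one extra row and one extra column for labels, as the ranges in this example suggest.

```r
# Hypothetical helpers (not part of excelDataGuide).

# Convert Excel column letters to a column number ("A" -> 1, "AA" -> 27).
col_to_num <- function(letters_str) {
  chars <- utf8ToInt(toupper(letters_str)) - utf8ToInt("A") + 1
  sum(chars * 26^(rev(seq_along(chars)) - 1))
}

# Number of rows and columns covered by a range such as "A1:M9".
range_dims <- function(range) {
  pattern <- "^([A-Za-z]+)([0-9]+):([A-Za-z]+)([0-9]+)$"
  m <- regmatches(range, regexec(pattern, range))[[1]]
  stopifnot(length(m) == 5)
  c(rows = as.integer(m[5]) - as.integer(m[3]) + 1,
    cols = col_to_num(m[4]) - col_to_num(m[2]) + 1)
}

# Standard well layouts for the supported plate formats.
plate_dims <- list(
  "24"  = c(rows = 4,  cols = 6),
  "48"  = c(rows = 6,  cols = 8),
  "96"  = c(rows = 8,  cols = 12),
  "384" = c(rows = 16, cols = 24)
)

# Assumption: a range holds the wells plus one label row and one label column.
range_fits_plate <- function(range, plate_format) {
  expected <- plate_dims[[as.character(plate_format)]] + 1
  all(range_dims(range) == expected)
}

range_fits_plate("A1:M9", 96)  # TRUE
range_fits_plate("A1:L8", 96)  # FALSE: no room for the labels
```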
| 180 | + |
| 181 | +## Step 6: Add tabular data |
| 182 | + |
| 183 | +Tables with column headers and multiple rows of data are indexed using the `table` |
| 184 | +type: |
| 185 | + |
| 186 | +``` yaml |
| 187 | + - sheet: results |
| 188 | + type: table |
| 189 | + varname: userresults |
| 190 | + translate: false |
| 191 | + atomicclass: numeric |
| 192 | + ranges: |
| 193 | + - C3:E10 |
| 194 | +``` |
| 195 | + |
| 196 | +**What this does:** |
| 197 | +- Reads a table from range `C3:E10` |
| 198 | +- The first row contains column names (variable names) |
| 199 | +- All values are coerced to the specified `atomicclass` |
| 200 | +- The result is a data frame, not a list |
| 201 | + |
| 202 | +## Step 7: Add single-cell values |
| 203 | + |
| 204 | +Sometimes you need to read isolated values from single cells and give them names |
| 205 | +in the guide (rather than relying on names in the spreadsheet): |
| 206 | + |
| 207 | +``` yaml |
| 208 | + - sheet: analysis |
| 209 | + type: cells |
| 210 | + varname: qc_checks |
| 211 | + translate: false |
| 212 | + atomicclass: numeric |
| 213 | + variables: |
| 214 | + - name: spread_well_1 |
| 215 | + cell: G6 |
| 216 | + - name: spread_well_2 |
| 217 | + cell: G33 |
| 218 | +``` |
| 219 | + |
| 220 | +**What this does:** |
| 221 | +- Reads individual cells and assigns them names specified in the guide |
| 222 | +- `spread_well_1` gets the value from cell `G6`, `spread_well_2` from `G33` |
| 223 | +- All values are coerced to numeric |
| 224 | +- The result is a named list |
| 225 | + |
| 226 | +This is useful when values are scattered or when the spreadsheet's cell-naming |
| 227 | +scheme doesn't align with your script's variable-naming scheme. |
| 228 | + |
| 229 | +## Step 8: Add translations |
| 230 | + |
| 231 | +If your spreadsheet uses long, user-friendly variable names (e.g., "Date of experiment") |
| 232 | +but your script prefers short, code-friendly names (e.g., `date`), define a |
| 233 | +`translations` section: |
| 234 | + |
| 235 | +``` yaml |
| 236 | +translations: |
| 237 | + - long: "Date of experiment" |
| 238 | + short: date |
| 239 | + - long: "Experimenter name" |
| 240 | + short: experimenter |
| 241 | + - long: "Study identifier" |
| 242 | + short: studyID |
| 243 | + - long: "Plate identifier" |
| 244 | + short: plateID |
| 245 | +``` |
| 246 | + |
| 247 | +**When to use translations:** |
| 248 | +- Set `translate: true` in a location to apply these translations to its variable |
| 249 | + names |
| 250 | +- Translations are optional; if you don't define any, set `translate: false` in |
| 251 | + all locations |
| 252 | +- The package provides `long_to_shortnames()` and `short_to_longnames()` functions |
| 253 | + to convert between formats |
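
Conceptually, a translation is a lookup from long names to short names. The base R sketch below illustrates the idea without calling the package's `long_to_shortnames()` function, whose arguments are not covered here. `translate_keys()` is a hypothetical helper, and leaving unknown keys unchanged is an assumption of the sketch, not documented package behaviour.

```r
# Translations as they would look after parsing the YAML section above.
translations <- list(
  list(long = "Date of experiment", short = "date"),
  list(long = "Experimenter name",  short = "experimenter")
)

# Build a named lookup vector: names are long names, values are short names.
lookup <- setNames(
  vapply(translations, function(x) x$short, character(1)),
  vapply(translations, function(x) x$long, character(1))
)

# Hypothetical helper: replace known long names, keep other keys as they are.
translate_keys <- function(keys, lookup) {
  hit <- keys %in% names(lookup)
  keys[hit] <- unname(lookup[keys[hit]])
  keys
}

translate_keys(c("Experimenter name", "Date of experiment", "Notes"), lookup)
```

Here the first two keys become `experimenter` and `date`, while `Notes` has no translation and is returned unchanged.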
| 254 | + |
| 255 | +## Complete example guide |
| 256 | + |
| 257 | +Below is a complete, working guide for a small fluorescence assay template: |
| 258 | + |
| 259 | +``` yaml |
| 260 | +guide.version: '1.0' |
| 261 | +template.name: simple_fluorescence |
| 262 | +template.min.version: '1.0' |
| 263 | +template.max.version: ~ |
| 264 | +plate.format: 96 |
| 265 | + |
| 266 | +locations: |
| 267 | + # Required: template version |
| 268 | + - sheet: description |
| 269 | + type: cells |
| 270 | + varname: .template |
| 271 | + translate: false |
| 272 | + variables: |
| 273 | + - name: version |
| 274 | + cell: B2 |
| 275 | + |
| 276 | + # Metadata: user and experiment information |
| 277 | + - sheet: description |
| 278 | + type: keyvalue |
| 279 | + varname: metadata |
| 280 | + translate: true |
| 281 | + atomicclass: |
| 282 | + - character |
| 283 | + - character |
| 284 | + - date |
| 285 | + - numeric |
| 286 | + ranges: |
| 287 | + - A5:B8 |
| 288 | + |
| 289 | + # Parameters: SOP-defined constants |
| 290 | + - sheet: _parameters |
| 291 | + type: keyvalue |
| 292 | + varname: parameters |
| 293 | + translate: false |
| 294 | + atomicclass: numeric |
| 295 | + ranges: |
| 296 | + - A2:B5 |
| 297 | + |
| 298 | + # Plate data: fluorescence measurements |
| 299 | + - sheet: _data |
| 300 | + type: platedata |
| 301 | + varname: fluorescence |
| 302 | + translate: false |
| 303 | + atomicclass: numeric |
| 304 | + ranges: |
| 305 | + - A1:M9 |
| 306 | + - A11:M19 |
| 307 | + |
| 308 | + # Quality control: derived metrics from analysis sheet |
| 309 | + - sheet: analysis |
| 310 | + type: cells |
| 311 | + varname: qc |
| 312 | + translate: false |
| 313 | + atomicclass: numeric |
| 314 | + variables: |
| 315 | + - name: plate_uniformity |
| 316 | + cell: D3 |
| 317 | + - name: signal_range |
| 318 | + cell: D4 |
| 319 | + |
| 320 | +translations: |
| 321 | + - long: "Experimenter name" |
| 322 | + short: experimenter |
| 323 | + - long: "Date of experiment" |
| 324 | + short: date |
| 325 | + - long: "Temperature (°C)" |
| 326 | + short: temperature |
| 327 | + - long: "Sample concentration (µM)" |
| 328 | + short: concentration |
| 329 | +``` |
| 330 | + |
| 331 | +## Validation with CUE |
| 332 | + |
| 333 | +Before using your guide in your scripts, validate it against the CUE schema to |
| 334 | +catch syntax and structure errors early: |
| 335 | + |
| 336 | +``` bash |
| 337 | +# From the data-raw folder where the schema is located: |
| 338 | +cd data-raw |
| 339 | +./validate_and_sign.sh path/to/your/guide.yml |
| 340 | +``` |
| 341 | + |
| 342 | +This command will: |
| 343 | +1. Check your guide against `excelguide_schema.cue` |
| 344 | +2. Abort with a detailed error message if validation fails |
| 345 | +3. Embed a SHA256 hash (`cue.verified` field) if validation succeeds |
| 346 | + |
| 347 | +The hash acts as proof that the guide was validated; you can verify it later |
| 348 | +with: |
| 349 | + |
| 350 | +``` bash |
| 351 | +./verify_guide.sh path/to/your/guide.yml |
| 352 | +``` |
| 353 | + |
| 354 | +In R, you can also check the hash when loading a guide: |
| 355 | + |
| 356 | +``` r |
| 357 | +guide <- read_guide("path/to/guide.yml", verify_hash = TRUE) |
| 358 | +``` |
| 359 | + |
| 360 | +With `verify_hash = TRUE`, `read_guide()` warns if the hash is missing and aborts if the hash is invalid. |