Commit ab19d38

split into two documents
1 parent 7ce4cd8 commit ab19d38

2 files changed

Lines changed: 362 additions & 342 deletions

vignettes/writing_data_guides.Rmd

Lines changed: 360 additions & 0 deletions
@@ -0,0 +1,360 @@
---
title: "Writing data guides"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{writing_data_guides}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = FALSE
)
```

```{r setup}
library(excelDataGuide)
```

# Writing a guide

Every spreadsheet template should be accompanied by a data guide: a YAML file
that documents where data are located within the template and how they should be
interpreted. This guide serves as the blueprint that tells the `excelDataGuide`
package exactly where to find and how to read your data.

A data guide is a human-editable, computer-readable YAML file. YAML syntax is
simple: use colons (`:`) for key-value pairs, hyphens (`-`) for lists, and
indentation for nesting. This section walks you through constructing a guide
step by step.
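To make the YAML rules above concrete, the following minimal sketch parses a small guide fragment into an R list. It assumes the `yaml` package is installed and says nothing about how `excelDataGuide` reads guides internally; the fragment itself is illustrative only.

```r
# Assumes the 'yaml' package is installed; the fragment is illustrative only.
library(yaml)

snippet <- "
guide.version: '1.0'
template.max.version: ~
plate.format: 96
locations:
  - sheet: description
    type: cells
"

guide <- yaml.load(snippet)

class(guide$guide.version)            # quoted value stays a character string
is.null(guide$template.max.version)   # `~` is read as NULL
guide$locations[[1]]$sheet            # first list item nested under `locations`
```

Quoting version numbers keeps them as text, so a version such as `'1.10'` is not silently read as the number 1.1.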

## Step 1: Define the header

Every guide begins with five pieces of metadata that describe the template itself:

``` yaml
guide.version: '1.0'
template.name: myassay
template.min.version: '1.0'
template.max.version: '2.0'
plate.format: 96
```

- **`guide.version`** (*required*): The version of the guide itself (not the template).
  Use `major.minor` format, e.g., `'1.0'` or `'2.1'`. This allows you to track
  changes to the guide independently from the template.

- **`template.name`** (*required*): A short, descriptive name for the template
  (e.g., `'myassay'`, `'fluorescence_assay'`). This is used to identify which
  template a guide belongs to and is checked at runtime if you enable template
  name validation.

- **`template.min.version`** (*required*): The minimum template version with which
  this guide is compatible. Use `major.minor` format (e.g., `'1.0'`).

- **`template.max.version`** (*required*, can be null): The maximum template version
  with which this guide is compatible. If there is no upper limit, use `~` (YAML
  null) to indicate "any newer version is OK". For example, `template.max.version: ~`
  means the guide works with all template versions from `template.min.version`
  onwards.

- **`plate.format`** (*conditionally required*): The microplate format used in your
  experiments. Valid values are `24`, `48`, `96`, or `384` (referring to the
  number of wells). This is **required only if your guide includes any
  `platedata` locations** (see below). The package uses this to validate that
  your platedata ranges have the correct dimensions.
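The version window described by `template.min.version` and `template.max.version` can be sketched in base R as follows. The helper `is_compatible()` is hypothetical and not part of `excelDataGuide`; treating both bounds as inclusive is an assumption.

```r
# Hypothetical helper, not part of excelDataGuide.
# Assumption: both bounds are inclusive; NULL means "no upper limit".
is_compatible <- function(version, min_version, max_version = NULL) {
  v <- numeric_version(version)
  ok_min <- v >= numeric_version(min_version)
  ok_max <- is.null(max_version) || v <= numeric_version(max_version)
  ok_min && ok_max
}

is_compatible("1.5", "1.0", "2.0")   # inside the window
is_compatible("2.1", "1.0", "2.0")   # above the maximum
is_compatible("3.0", "1.0", NULL)    # no upper limit (YAML `~`)
is_compatible("1.10", "1.9")         # compared as versions, not as decimals
```

Comparing with `numeric_version()` rather than as plain numbers means `1.10` counts as newer than `1.9`, which is one reason to keep versions quoted in the guide.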

## Step 2: Add your first location, the template version cell

The `.template` location is a special, **required** location that records the
template version from a single cell. This allows the package to check at runtime
that the data file was created with a compatible template version.

``` yaml
locations:
  - sheet: description
    type: cells
    varname: .template
    translate: false
    variables:
      - name: version
        cell: B2
```

**What this does:**

- Reads cell `B2` from the sheet named `description`
- Stores its value in a variable called `version` within the `.template` group
- `translate: false` means no translation is applied (the cell value is used as-is)

This location **must be present in every guide**. The package uses the version
value to check compatibility with `template.min.version` and `template.max.version`.

## Step 3: Add metadata as key-value pairs

Metadata such as experimenter names, dates, and study IDs are typically stored as
key-value pairs in your template. One location entry can index multiple ranges.

``` yaml
  - sheet: description
    type: keyvalue
    varname: metadata
    translate: true
    atomicclass:
      - character
      - character
      - date
      - numeric
    ranges:
      - A5:B8
      - A10:B12
```

**What this does:**

- Reads two ranges (`A5:B8` and `A10:B12`) from the `description` sheet
- The left column of each range contains keys (variable names); the right column contains values
- `translate: true` means the keys will be translated from long names (in the
  spreadsheet) to short names (used in your script) using the `translations`
  section defined later
- `atomicclass` is a list of data types, one per key-value pair in order. In this
  example, the first four key-value pairs are coerced to character, character,
  date, and numeric respectively; if there are more pairs than listed classes,
  coercion cycles through the list again. If `atomicclass` is omitted, all values
  are coerced to character.
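The way a class list might be matched to key-value pairs can be sketched in base R. This assumes that cycling through the list behaves like R's own recycling (`rep_len()`); the package's actual rule may differ, and `coerce_value()` and the sample values are hypothetical.

```r
# Assumption: classes are recycled like rep_len(); the package may differ.
classes <- c("character", "character", "date", "numeric")
n_pairs <- 7  # A5:B8 holds 4 pairs, A10:B12 holds 3

assigned <- rep_len(classes, n_pairs)
assigned

# Hypothetical coercion helper for the three classes used in this example.
coerce_value <- function(x, class) {
  switch(class,
    character = as.character(x),
    numeric   = as.numeric(x),
    date      = as.Date(x),
    stop("unknown class: ", class)
  )
}

# Made-up raw cell contents, all read as text.
raw <- list("Ann", "S-12", "2024-03-01", "37.5", "P-007")
values <- Map(coerce_value, raw, rep_len(classes, length(raw)))
str(values)
```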

**Combining multiple ranges into one variable:** Multiple ranges are combined into
a single named list (`metadata` in this example). This is convenient when metadata
is scattered across multiple sheets or ranges but logically belongs together.
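As an illustration of that combination, the sketch below turns two key-value blocks into one named list. The data frames stand in for ranges already read from a spreadsheet; `to_named_list()` and the sample values are hypothetical and do not reflect the package's internals.

```r
# Stand-ins for two key-value ranges read from a sheet (made-up values).
range1 <- data.frame(
  key   = c("Experimenter name", "Study identifier"),
  value = c("Ann", "S-12"),
  stringsAsFactors = FALSE
)
range2 <- data.frame(
  key   = "Plate identifier",
  value = "P-007",
  stringsAsFactors = FALSE
)

# Hypothetical helper: left column becomes names, right column becomes values.
to_named_list <- function(kv) setNames(as.list(kv$value), kv$key)

# Both ranges end up in a single named list.
metadata <- c(to_named_list(range1), to_named_list(range2))
names(metadata)
metadata[["Plate identifier"]]
```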

## Step 4: Add experiment parameters

Parameters that are constant across all experiments (e.g., acceptance criteria,
standard concentrations defined in SOPs) are best stored in a dedicated `_parameters`
sheet. They are indexed like metadata:

``` yaml
  - sheet: _parameters
    type: keyvalue
    varname: parameters
    translate: false
    atomicclass: numeric
    ranges:
      - A2:B5
      - A8:B10
```

Here, all values will be coerced to numeric. Use `translate: false` if your
parameter names are already in the format you want in the script.

## Step 5: Add measured plate data

If your template records plate data (e.g., fluorescence measurements from a 96-well
plate), use the `platedata` type:

``` yaml
  - sheet: _data
    type: platedata
    varname: plate
    translate: false
    atomicclass:
      - character
      - numeric
      - numeric
    ranges:
      - A1:M9
      - A11:M19
```

**What this does:**

- Reads data in plate format (e.g., well positions and measurements)
- Each range is a block of data in the same row-column layout as the physical plate
- Multiple ranges are stacked vertically into a single data frame
- The resulting data frame includes columns `row` and `col` (well positions) plus
  one column per measured variable, each coerced to its corresponding `atomicclass`

In this example, the first three variables will be character, numeric, numeric
respectively. The first range (`A1:M9`) spans 9 rows and 13 columns: rows A–H of
the plate (8 rows) with 12 columns of data, plus one extra row and one extra
column that presumably hold the labels.
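The dimension check implied by `plate.format` can be sketched in base R. The sketch assumes that each platedata range holds the wells plus one label row and one label column, which is what `A1:M9` (9 rows by 13 columns) suggests for a 96-well plate; all function names are hypothetical and the package's own validation may work differently.

```r
# Standard well layouts (rows x columns) for the supported plate formats.
plate_dims <- list(`24` = c(4, 6), `48` = c(6, 8),
                   `96` = c(8, 12), `384` = c(16, 24))

# Convert a column label such as "M" or "AA" to its column number.
col_to_num <- function(label) {
  digits <- utf8ToInt(toupper(label)) - utf8ToInt("A") + 1
  Reduce(function(acc, d) acc * 26 + d, digits, 0)
}

# Number of rows and columns covered by a range such as "A1:M9".
range_dims <- function(range) {
  m <- regmatches(range, regexec("^([A-Z]+)([0-9]+):([A-Z]+)([0-9]+)$", range))[[1]]
  if (length(m) != 5) stop("not a valid range: ", range)
  c(rows = as.integer(m[5]) - as.integer(m[3]) + 1,
    cols = col_to_num(m[4]) - col_to_num(m[2]) + 1)
}

# Assumption: a range is the plate plus one label row and one label column.
fits_plate <- function(range, plate_format) {
  wells <- plate_dims[[as.character(plate_format)]]
  if (is.null(wells)) stop("unsupported plate format: ", plate_format)
  all(range_dims(range) == wells + 1)
}

range_dims("A1:M9")
fits_plate("A1:M9", 96)
fits_plate("A11:M19", 96)
fits_plate("A1:M8", 96)
```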

## Step 6: Add tabular data

Tables with column headers and multiple rows of data are indexed using the `table`
type:

``` yaml
  - sheet: results
    type: table
    varname: userresults
    translate: false
    atomicclass: numeric
    ranges:
      - C3:E10
```

**What this does:**

- Reads a table from range `C3:E10`
- The first row contains column names (variable names)
- All values are coerced to the specified `atomicclass`
- The result is a data frame, not a list
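The shape of a `table` result can be illustrated with a small base R sketch. The raw matrix stands in for cell contents read as text, and the column names and values are made up; this is not the package's implementation.

```r
# Stand-in for a range read as text: first row holds the column names.
raw <- rbind(
  c("dose", "response", "sd"),
  c("0.1",  "12.5",     "0.8"),
  c("1",    "48.2",     "2.1")
)

# Use the first row as header, the remaining rows as data.
tbl <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(tbl) <- raw[1, ]

# Coerce every column to numeric, mirroring `atomicclass: numeric`.
tbl[] <- lapply(tbl, as.numeric)
str(tbl)
```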

## Step 7: Add single-cell values

Sometimes you need to read isolated values from single cells and give them names
in the guide (rather than relying on names in the spreadsheet):

``` yaml
  - sheet: analysis
    type: cells
    varname: qc_checks
    translate: false
    atomicclass: numeric
    variables:
      - name: spread_well_1
        cell: G6
      - name: spread_well_2
        cell: G33
```

**What this does:**

- Reads individual cells and assigns them names specified in the guide
- `spread_well_1` gets the value from cell `G6`, `spread_well_2` from `G33`
- All values are coerced to numeric
- The result is a named list

This is useful when values are scattered or when the spreadsheet's cell-naming
scheme doesn't align with your script's variable-naming scheme.

## Step 8: Add translations

If your spreadsheet uses long, user-friendly variable names (e.g., "Date of experiment")
but your script prefers short, code-friendly names (e.g., `date`), define a
`translations` section:

``` yaml
translations:
  - long: "Date of experiment"
    short: date
  - long: "Experimenter name"
    short: experimenter
  - long: "Study identifier"
    short: studyID
  - long: "Plate identifier"
    short: plateID
```

**When to use translations:**

- Set `translate: true` in a location to apply these translations to its variable
  names
- Translations are optional; if you don't define any, set `translate: false` in
  all locations
- The package provides `long_to_shortnames()` and `short_to_longnames()` functions
  to convert between formats
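The effect of a translation table on variable names can be sketched in base R. The sketch deliberately does not call the package's own conversion functions, whose arguments are not documented here; `translate_names()` is hypothetical, and leaving unmatched names untouched is an assumption.

```r
# Translations as they would look after parsing the YAML section above.
translations <- list(
  list(long = "Date of experiment", short = "date"),
  list(long = "Experimenter name",  short = "experimenter")
)

# Lookup vector: names are long names, values are short names.
lookup <- setNames(
  vapply(translations, function(tr) tr$short, character(1)),
  vapply(translations, function(tr) tr$long,  character(1))
)

# Hypothetical helper; assumption: names without a translation stay as they are.
translate_names <- function(x, lookup) {
  hit <- names(x) %in% names(lookup)
  names(x)[hit] <- unname(lookup[names(x)[hit]])
  x
}

metadata <- list(
  "Date of experiment" = "2024-03-01",
  "Experimenter name"  = "Ann",
  "Batch"              = "B1"
)
names(translate_names(metadata, lookup))
```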

## Complete example guide

Below is a complete, working guide for a small fluorescence assay template:

``` yaml
guide.version: '1.0'
template.name: simple_fluorescence
template.min.version: '1.0'
template.max.version: ~
plate.format: 96

locations:
  # Required: template version
  - sheet: description
    type: cells
    varname: .template
    translate: false
    variables:
      - name: version
        cell: B2

  # Metadata: user and experiment information
  - sheet: description
    type: keyvalue
    varname: metadata
    translate: true
    atomicclass:
      - character
      - character
      - date
      - numeric
    ranges:
      - A5:B8

  # Parameters: SOP-defined constants
  - sheet: _parameters
    type: keyvalue
    varname: parameters
    translate: false
    atomicclass: numeric
    ranges:
      - A2:B5

  # Plate data: fluorescence measurements
  - sheet: _data
    type: platedata
    varname: fluorescence
    translate: false
    atomicclass: numeric
    ranges:
      - A1:M9
      - A11:M19

  # Quality control: derived metrics from analysis sheet
  - sheet: analysis
    type: cells
    varname: qc
    translate: false
    atomicclass: numeric
    variables:
      - name: plate_uniformity
        cell: D3
      - name: signal_range
        cell: D4

translations:
  - long: "Experimenter name"
    short: experimenter
  - long: "Date of experiment"
    short: date
  - long: "Temperature (°C)"
    short: temperature
  - long: "Sample concentration (µM)"
    short: concentration
```

## Validation with CUE

Before using your guide in your scripts, validate it against the CUE schema to
catch syntax and structure errors early:

``` bash
# From the data-raw folder where the schema is located:
cd data-raw
./validate_and_sign.sh path/to/your/guide.yml
```

This command will:

1. Check your guide against `excelguide_schema.cue`
2. Abort with a detailed error message if validation fails
3. Embed a SHA256 hash (`cue.verified` field) if validation succeeds

The hash acts as proof that the guide was validated; you can verify it later
with:

``` bash
./verify_guide.sh path/to/your/guide.yml
```

In R, you can also check the hash when loading a guide:

``` r
guide <- read_guide("path/to/guide.yml", verify_hash = TRUE)
```

This warns if the hash is missing and aborts if the hash is invalid.
