Commit 92a0b5e

WIP improving the vignette
1 parent ab1b296 commit 92a0b5e

1 file changed

Lines changed: 85 additions & 75 deletions

vignettes/writing_templates_and_data_guides.Rmd

@@ -21,63 +21,73 @@ library(excelDataGuide)
2121

2222
## Introduction
2323

24-
Spreadsheets are widely used in biochemical laboratories for both recording and
25-
analyzing experiments. When experiments become routine, spreadsheet templates
26-
are often created to streamline workflows and ensure consistency.
27-
28-
The goal of the **excelDataGuide** package is to enable the use of Excel
29-
spreadsheets alongside scripting environments as effective data analysis tools.
30-
While scripting languages offer more flexibility and power, especially for
31-
analyzing large datasets across multiple workbooks, the spreadsheet remains the
24+
Spreadsheets are used in biochemical laboratories for both recording and
25+
analyzing experiments. For routine experiments, spreadsheet templates
26+
are often used to streamline workflows and ensure consistency.
27+
28+
The goal of the **excelDataGuide** package is to enable the use of Excel
29+
spreadsheets alongside scripting environments as effective data analysis tools.
30+
While scripting languages offer more flexibility and power, especially for
31+
analyzing large datasets across multiple workbooks, the spreadsheet is the
3232
**primary source of all data**.
3333

3434
This **"single-source-of-truth"** approach ensures that both spreadsheet-based
35-
and script-based analyses rely on the same underlying data and parameters.
35+
and script-based analyses rely on the same underlying data and parameters.
3636
This includes:
3737

3838
- **Metadata**
3939
- **Experimental parameters** (e.g., acceptance criteria, concentrations)
4040
- **Measured data**
4141

42-
Parameters such as acceptance criteria are typically defined by standard
43-
operating procedures (SOPs) and fixed in the spreadsheet templates. Other
44-
values, such as experimental measurements or fitted parameters, vary per
42+
Parameters such as acceptance criteria are typically defined by standard
43+
operating procedures (SOPs) and can be stored in single cells in a
44+
spreadsheet and referred to by absolute referencing or by using named cells.
45+
Other values, such as experimental measurements or fitted parameters, vary per
4546
experiment and are entered by the user.
4647

47-
In some cases, it may be beneficial for the script to also use
48+
In some cases, it may be beneficial for the script to also use
4849
**calculated data** from the spreadsheet—especially when those calculations are
4950
automatically triggered upon user input. This decision depends on the specific
5051
analysis needs and the reliability of spreadsheet-based computations.
5152

52-
## Structuring a template
53+
## Spreadsheet templates and data guides
5354

54-
To provide a link between the data structures of programming languages and
55+
The solution that we propose here is to use a data guide, a file that describes
56+
in a concise manner where the data can be found in a template. Furthermore, it
57+
also describes which type of data it is (e.g., numeric, text, or date) so that
58+
the data can be read in the correct format. The data guide is a YAML file. We
59+
choose this format because it has a simple structure, and can be easily read and
60+
edited in a text editor.
61+
62+
## Structuring a spreadsheet template
63+
64+
To provide a link between the data structures of programming languages and
5565
those in a spreadsheet, we consider the following four types of data structures
5666
in a template:
5767

5868
- **keyvalue**: a key-value pair, where the key is a variable name and the value
59-
is the value of that variable. The key and value are placed in horizontally
69+
is the value of that variable. The key and value are placed in horizontally
6070
adjacent cells (columns). The key, or its translated short name (see below)
61-
is to be used as the parameter name in the scripts and should conform to
62-
variable naming rules for the scripting language used. The key is found in
63-
the left-most cell of a cell range. The value can be a single value (one cell)
71+
is to be used as the parameter name in the scripts and should conform to
72+
variable naming rules for the scripting language used. The key is found in
73+
the left-most cell of a cell range. The value can be a single value (one cell)
6474
or a vector of values (multiple cells).
6575
- **cells**: occasionally it may be more convenient to read values from single
66-
cells and provide the keys (names) of the corresponding variables in the data
67-
guide. These data will be stored as key-value pairs, but in contrast to the
76+
cells and provide the keys (names) of the corresponding variables in the data
77+
guide. These data will be stored as key-value pairs, but in contrast to the
6878
**keyvalue** data type, where a variable name is provided in the template,
6979
the data guide must provide a variable name.
7080
- **table**: tabular data where columns represent variables and rows represent
71-
items in which these variables are assessed. Column names are written in the
81+
items in which these variables are assessed. Column names are written in the
7282
first row and are used as variable names.
7383
- **platedata**: data are registered in the same row-column format as the
74-
microplate in which the experiment was performed. The first row contains the
75-
variable name in its left-most cell, and is followed by (integer) column names.
76-
Every subsequent row contains the row name (in capital letters) followed by the
77-
values for each well. Both variable name and data are read by the script. The
78-
column and row names are ignored. Therefore, the first row and column in the
84+
microplate in which the experiment was performed. The first row contains the
85+
variable name in its left-most cell, and is followed by (integer) column names.
86+
Every subsequent row contains the row name (in capital letters) followed by the
87+
values for each well. Both variable name and data are read by the script. The
88+
column and row names are ignored. Therefore, the first row and column in the
7989
range could also be empty, except for the variable name. Plate data are stored
80-
as tables in which, apart from the variables provided in the template, two
90+
as tables in which, apart from the variables provided in the template, two
8191
additional columns are added, namely row and column, corresponding to the row
8292
and column in a microplate.
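
The **platedata** layout described above can be illustrated with a minimal base R sketch. This is not the package's own reader: the function name `plate_to_long` and the example values are made up for illustration, and we assume the range has already been read into a character matrix.

```r
# Hypothetical corner of a plate range as read from a sheet: the first row
# holds the variable name followed by column names, the first column holds
# the row names. All values here are invented for illustration.
raw <- matrix(
  c("conc", "1",   "2",   "3",
    "A",    "0.1", "0.2", "0.3",
    "B",    "0.4", "0.5", "0.6"),
  nrow = 3, byrow = TRUE
)

# Illustrative helper (an assumption, not part of excelDataGuide).
plate_to_long <- function(raw) {
  variable <- raw[1, 1]                  # variable name in the top-left cell
  values <- raw[-1, -1, drop = FALSE]    # header row and row-name column are ignored
  n_row <- nrow(values)
  n_col <- ncol(values)
  long <- data.frame(
    row = rep(LETTERS[seq_len(n_row)], times = n_col),
    col = rep(seq_len(n_col), each = n_row),
    stringsAsFactors = FALSE
  )
  # as.numeric() flattens the matrix column by column, matching row/col above
  long[[variable]] <- as.numeric(values)
  long
}

long <- plate_to_long(raw)
```

Because row and column labels are derived from the position in the range, the labels typed in the template play no role, which is consistent with the statement that they are ignored.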
8393

@@ -90,58 +100,58 @@ knitr::include_graphics("images/template_frontpage.png")
90100

91101
### A template has a version number
92102

93-
Unique template version numbers are a way to prevent misunderstandings
94-
between users and are also needed here to check whether a data guide is
95-
compatible with the template version.
103+
Unique template version numbers are a way to prevent misunderstandings
104+
between users and are also needed here to check whether a data guide is
105+
compatible with the template version.
96106

97-
**Version numbering rules**. We follow the R-package version rules. A version
107+
**Version numbering rules**. We follow the R-package version rules. A version
98108
number has the structure <kbd>major.minor</kbd> or <kbd>major.minor.patch</kbd>,
99-
where <kbd>major</kbd>, <kbd>minor</kbd> and <kbd>patch</kbd> are each
100-
integer values. A version consisting of only a major number is invalid, but
109+
where <kbd>major</kbd>, <kbd>minor</kbd> and <kbd>patch</kbd> are each
110+
integer values. A version consisting of only a major number is invalid, but
101111
will be interpreted as having a minor version <kbd>0</kbd>, *i.e.* a version
102112
"<kbd>2</kbd>" will be interpreted as "<kbd>2.0</kbd>".
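
As a minimal sketch of this rule, a bare major number can be padded before it is handed to base R's `package_version()`, which itself requires at least a major and a minor component. The helper name `normalize_version` is an assumption for illustration and is not the function used by the package.

```r
# Hypothetical helper, not the package's own implementation.
normalize_version <- function(x) {
  x <- trimws(as.character(x))
  # A bare major number such as "2" is padded to "2.0"
  if (grepl("^[0-9]+$", x)) {
    x <- paste0(x, ".0")
  }
  package_version(x)
}

v_major_only <- normalize_version("2")      # interpreted as 2.0
v_numeric    <- normalize_version(2.1)      # content of a number-formatted cell
v_patch      <- normalize_version("1.4.2")  # major.minor.patch
```

Note that a number-formatted cell cannot distinguish 2.10 from 2.1, which is one reason to prefer the text format for version cells.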
103113

104114
In practice this means that the format of the cell in which the version number
105-
is recorded should formally be *text*, and not *general* or *number*. However,
106-
in the package we do provide functionality to interpret these fields as version
115+
is recorded should formally be *text*, and not *general* or *number*. However,
116+
in the package we do provide functionality to interpret these fields as version
107117
numbers even if the cells in the template have *general* or *number* format.
108118

109119
In the guide this is referred to as <kbd>guide.version</kbd>.
110120

111121
**A template name is optional**. Preferably, a template also has a name as a way
112-
for users to refer to it. Note that the example in the figure above doesn't
122+
for users to refer to it. Note that the example in the figure above doesn't
113123
have a name.
114124

115125
In the guide this field is referred to as <kbd>template.name</kbd>.
116126

117-
**Checking compatibility of template versions and a guide version**. We use
127+
**Checking compatibility of template versions and a guide version**. We use
118128
template version numbers to check compatibility with a guide. In principle
119-
the same guide can be used for multiple versions of a template as long as the
120-
locations and names of variables indexed in the guide did not change in new
121-
template versions. This is the case when, for example, only explanatory texts
129+
the same guide can be used for multiple versions of a template as long as the
130+
locations and names of variables indexed in the guide did not change in new
131+
template versions. This is the case when, for example, only explanatory texts
122132
or calculations or data validity
123-
checks have changed in the template. When checking version compatibility we
133+
checks have changed in the template. When checking version compatibility we
124134
assume that a guide is compatible with a consecutive range of template versions
125135
between a minimal and a maximal version number.
126136

127-
In the guide these version numbers are referred to as <kbd>template.min.version</kbd>
137+
In the guide these version numbers are referred to as <kbd>template.min.version</kbd>
128138
and <kbd>template.max.version</kbd>.
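
The range check can be sketched with base R's `package_version()` comparisons. The function name `is_compatible` is hypothetical and only illustrates the idea; the package may implement the check differently.

```r
# Hypothetical check: is the template version inside the closed range
# [min_version, max_version] declared in the guide?
is_compatible <- function(template_version, min_version, max_version) {
  v <- package_version(template_version)
  v >= package_version(min_version) && v <= package_version(max_version)
}

inside  <- is_compatible("1.2", "1.0", "1.4")
outside <- is_compatible("2.0", "1.0", "1.4")
# Components are compared as integers, so 1.10 is newer than 1.4
minor_ten <- is_compatible("1.10", "1.0", "1.4")
```

Comparing versions component by component, rather than as decimal numbers, avoids treating version 1.10 as equal to 1.1.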
129139

130140
### All cells are protected except those for data entry
131141

132-
Data entry cells have a distinct background color, here "marker yellow". All
142+
Data entry cells have a distinct background color, here "marker yellow". All
133143
other cells have protected status to prevent users from inadvertently changing
134144
them.
135145

136146
### Include comments
137147

138-
Refer to the SOP+ version
148+
Refer to the SOP+ version
139149

140150
### Built-in data entry checks
141151

142-
The validity of data entered by the users should be checked by validity checks,
152+
The validity of data entered by the users should be checked by validity checks,
143153
especially when misunderstandings are likely to happen. The validity checking
144-
capability of Excel is limited. In cases where the data structure cannot be
154+
capability of Excel is limited. In cases where the data structure cannot be
145155
properly described by a validity rule we add a comment next to the cell in which
146156
the data is entered.
147157

@@ -152,11 +162,11 @@ knitr::include_graphics("images/parameters.png")
152162
```
153163

154164

155-
Parameters needed for calculations, for example for acceptance criteria of
165+
Parameters needed for calculations, for example for acceptance criteria of
156166
measurements, are best entered on a separate sheet, and referred to by absolute
157-
references in calculations. In the case of the example we have a separate
158-
hidden sheet called *_parameters* for this purpose. The information in this
159-
sheet is indexed in the data guide, and therefore available to R-scripts as
167+
references in calculations. In the case of the example we have a separate
168+
hidden sheet called *_parameters* for this purpose. The information in this
169+
sheet is indexed in the data guide, and therefore available to R-scripts as
160170
well.
161171

162172
### Use of hidden worksheets for data transfer
@@ -169,17 +179,17 @@ knitr::include_graphics("images/data.png")
169179

170180
## What else?
171181

172-
The keyvalue format will be mostly used for metadata and parameters. All keyvalue
182+
The keyvalue format will be mostly used for metadata and parameters. All keyvalue
173183
pairs will be aggregated in a single named list called "keyvalue".
174184

175-
The platedata format will be used for measured data and data concerning
176-
concentrations in the plate wells. All ranges will be aggregated in a single
177-
data frame with reported variables as column names, including the column names
185+
The platedata format will be used for measured data and data concerning
186+
concentrations in the plate wells. All ranges will be aggregated in a single
187+
data frame with reported variables as column names, including the column names
178188
"row" and "col", corresponding to the row and column names of the plate.
179189

180190
## Constructing a guide
181191

182-
Every spreadsheet template should be accompanied by a data guide, an index
192+
Every spreadsheet template should be accompanied by a data guide, an index
183193
registering the location of different data structures in the template. This
184194
guide is a YAML file, a human-editable and machine-readable file format.
185195

@@ -230,30 +240,30 @@ A guide must contain the following elements:
230240

231241
- <kbd>guide.version</kbd>: the version of the guide
232242
- <kbd>template.name</kbd>: a name for the template
233-
- <kbd>template.min.version</kbd>: The minimal version of the template for which the guide
234-
can be used
243+
- <kbd>template.min.version</kbd>: The minimal version of the template for which the guide
244+
can be used.
236-
- <kbd>template.max.version</kbd>: The maximal version of the template for which the guide
246+
- <kbd>template.max.version</kbd>: The maximal version of the template for which the guide
237247
can be used.
238248
- <kbd>locations</kbd>: the object containing the data locations
239-
- <kbd>translations</kbd>: the object containing the translations of variable names.
240-
Translations can be used both from extended ('long') format to short format
249+
- <kbd>translations</kbd>: the object containing the translations of variable names.
250+
Translations can be used both from extended ('long') format to short format
241251
and from short to long format. Two functions are provided by the package to
242252
perform these translations *vice versa*.
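
A quick way to see which of the elements listed above are absent is to compare them with the names of the guide once it has been read into an R list. The sketch below builds such a list by hand; the helper `missing_elements` and the example values are assumptions for illustration, not part of the package.

```r
# Element names as listed above
required_elements <- c(
  "guide.version", "template.name", "template.min.version",
  "template.max.version", "locations", "translations"
)

# Hypothetical helper: which required elements are absent from a guide?
missing_elements <- function(guide) {
  setdiff(required_elements, names(guide))
}

# Hand-built stand-in for a guide read from YAML; all values are made up
guide <- list(
  guide.version = "1.0",
  template.name = "example_template",
  template.min.version = "1.0",
  template.max.version = "1.4",
  locations = list()
)

absent <- missing_elements(guide)
```

Here `absent` names the one element that the example guide lacks. Validation against the JSON schema remains the more complete check, since it also covers the structure inside each element.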
243253

244254
### Conditionally required element:
245255

246-
- <kbd>plate.format</kbd> the format of the microplates used for the experiments. This
247-
must be either of '24', '48', '96', or '384'. This is required when a
256+
- <kbd>plate.format</kbd>: the format of the microplates used for the experiments. This
257+
must be one of '24', '48', '96', or '384'. This is required when a
248258
**platedata** element occurs in the **locations**. This plate format is used
249-
to check the correctness of dimensions of the ranges of **platedata**
259+
to check the correctness of dimensions of the ranges of **platedata**
250260
elements.
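
A dimension check of this kind could look like the sketch below. The row and column counts are the standard microplate layouts; the function name `range_fits_plate` is hypothetical, and we assume that a **platedata** range includes one extra row and one extra column for the header row and the row names.

```r
# Standard microplate layouts (rows x columns)
plate_dims <- list(
  "24"  = c(rows = 4,  cols = 6),
  "48"  = c(rows = 6,  cols = 8),
  "96"  = c(rows = 8,  cols = 12),
  "384" = c(rows = 16, cols = 24)
)

# Hypothetical check, assuming the range includes the header row and the
# row-name column in addition to the wells.
range_fits_plate <- function(n_rows, n_cols, plate_format) {
  dims <- plate_dims[[as.character(plate_format)]]
  if (is.null(dims)) {
    stop("plate.format must be one of 24, 48, 96 or 384")
  }
  n_rows == dims[["rows"]] + 1 && n_cols == dims[["cols"]] + 1
}

ok_96    <- range_fits_plate(9, 13, "96")   # 8 x 12 wells plus headers
wrong_96 <- range_fits_plate(8, 12, "96")   # headers missing from the range
```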
251261

252-
The elements in **locations** indicate where data are to be found, whereas the translation
253-
part contains translations between long and short names for variables. Short
254-
names are used as variable names in the scripts, whereas long names may be used
255-
in the spreadsheet, in particular when these are visible to the user. In that
256-
case the names should be translated before using them in the script. Reverse
262+
The elements in **locations** indicate where data are to be found, whereas the **translations**
263+
part contains translations between long and short names for variables. Short
264+
names are used as variable names in the scripts, whereas long names may be used
265+
in the spreadsheet, in particular when these are visible to the user. In that
266+
case the names should be translated before using them in the script. Reverse
257267
translations may be used by the script in the output document.
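
The two directions of translation can be sketched with a named character vector as lookup table. This representation, the helper names `long_to_short` and `short_to_long`, and the example names are all assumptions for illustration; the functions provided by the package may differ in name and behaviour.

```r
# Assumed representation: names are long names, values are short names.
# The entries are invented examples.
translations <- c(
  "Incubation time (min)" = "incubation_time",
  "Plate identifier"      = "plate_id"
)

# Hypothetical helpers; names without a translation are returned unchanged
long_to_short <- function(x, translations) {
  out <- unname(translations[x])
  ifelse(is.na(out), x, out)
}

short_to_long <- function(x, translations) {
  lookup <- setNames(names(translations), translations)
  out <- unname(lookup[x])
  ifelse(is.na(out), x, out)
}

short <- long_to_short(c("Plate identifier", "untranslated"), translations)
long_name <- short_to_long("incubation_time", translations)
```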
258268

259269
## Locations
@@ -268,16 +278,16 @@ translations may be used by the script in the output document.
268278

269279
### Optional element
270280

271-
- <kbd>atomicclass</kbd>: the class of the data in the ranges. Can have values "character",
272-
"numeric", "integer", "logical" or "date". It can be either a singleton or an array of classes of the same
273-
length as the number of ranges. If a singleton, then by default all values are converted to character.
281+
- <kbd>atomicclass</kbd>: the class of the data in the ranges. Can have values "character",
282+
"numeric", "integer", "logical" or "date". It can be either a singleton or an array of classes of the same
283+
length as the number of ranges. If a singleton, then by default all values are converted to character.
274284
If an atomicclass is given then values are coerced. Coercion is performed by the functions `as.character`, `as.numeric`,
275285
`as.integer`, `as.logical`, respectively, or in case of a date, by a function that produces a Date object.
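
The coercion step can be sketched as a simple dispatch on the class name. The function `coerce_values` is hypothetical. We assume here that "character" applies when no class is given, and we use `as.Date()` for dates, which is only one possible way to produce a Date object and expects ISO-formatted strings by default.

```r
# Hypothetical helper illustrating coercion by atomicclass
coerce_values <- function(values, atomicclass = "character") {
  switch(atomicclass,
    character = as.character(values),
    numeric   = as.numeric(values),
    integer   = as.integer(values),
    logical   = as.logical(values),
    date      = as.Date(values),  # assumption: values are ISO date strings
    stop("unknown atomicclass: ", atomicclass)
  )
}

as_num  <- coerce_values(c("1.5", "2"), "numeric")
as_date <- coerce_values("2024-03-01", "date")
as_chr  <- coerce_values(3L)  # falls back to character
```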
276286

277287
### Checking against the excelDataGuide JSON schema
278288

279-
Correctness of the structure and syntax of a YAML file like a data guide can be
280-
checked against a JSON schema (See [json-schema-everywhere](https://json-schema-everywhere.github.io/yaml)).
289+
Correctness of the structure and syntax of a YAML file like a data guide can be
290+
checked against a JSON schema (See [json-schema-everywhere](https://json-schema-everywhere.github.io/yaml)).
281291
We provide a JSON schema called <kbd>excelguide_schema.json</kbd> in the folder
282-
<kbd>data-raw</kbd>. We use the [Polyglottal JSON Schema Validator](https://www.npmjs.com/package/pajv)
292+
<kbd>data-raw</kbd>. We use the [Polyglottal JSON Schema Validator](https://www.npmjs.com/package/pajv)
283293
to validate guides against this schema.
