@@ -21,63 +21,73 @@ library(excelDataGuide)
2121
2222## Introduction
2323
24- Spreadsheets are widely used in biochemical laboratories for both recording and
25- analyzing experiments. When experiments become routine, spreadsheet templates
26- are often created to streamline workflows and ensure consistency.
27-
28- The goal of the ** excelDataGuide** package is to enable the use of Excel
29- spreadsheets alongside scripting environments as effective data analysis tools.
30- While scripting languages offer more flexibility and power— especially for
31- analyzing large datasets across multiple workbooks— the spreadsheet remains the
24+ Spreadsheets are used in biochemical laboratories for both recording and
25+ analyzing experiments. For routine experiments, spreadsheet templates
26+ are often used to streamline workflows and ensure consistency.
27+
28+ The goal of the **excelDataGuide** package is to enable the use of Excel
29+ spreadsheets alongside scripting environments as effective data analysis tools.
30+ While scripting languages offer more flexibility and power — especially for
31+ analyzing large datasets across multiple workbooks — the spreadsheet is the
3232**primary source of all data**.
3333
3434This **"single-source-of-truth"** approach ensures that both spreadsheet-based
35- and script-based analyses rely on the same underlying data and parameters.
35+ and script-based analyses rely on the same underlying data and parameters.
3636This includes:
3737
3838- **Metadata**
3939- **Experimental parameters** (e.g., acceptance criteria, concentrations)
4040- **Measured data**
4141
42- Parameters such as acceptance criteria are typically defined by standard
43- operating procedures (SOPs) and fixed in the spreadsheet templates. Other
44- values, such as experimental measurements or fitted parameters, vary per
42+ Parameters such as acceptance criteria are typically defined by standard
43+ operating procedures (SOPs) and can be stored in single cells in a
44+ spreadsheet and referred to by absolute referencing or by using named cells.
45+ Other values, such as experimental measurements or fitted parameters, vary per
4546experiment and are entered by the user.
4647
47- In some cases, it may be beneficial for the script to also use
48+ In some cases, it may be beneficial for the script to also use
4849**calculated data** from the spreadsheet, especially when those calculations are
4950automatically triggered upon user input. This decision depends on the specific
5051analysis needs and the reliability of spreadsheet-based computations.
5152
52- ## Structuring a template
53+ ## Spreadsheet templates and data guides
5354
54- To provide a link between the data structures of programming languages and
55+ The solution that we propose here is to use a data guide, a file that describes
56+ in a concise manner where the data can be found in a template. Furthermore, it
57+ also describes which type of data it is (e.g., numeric, text or date) so that
58+ the data can be read in the correct format. The data guide is a YAML file. We
59+ choose this format because it has a simple structure, and can be easily read and
60+ edited in a text editor.
61+
62+ ## Structuring a spreadsheet template
63+
64+ To provide a link between the data structures of programming languages and
5565those in a spreadsheet we consider the following four types of data structures
5666in a template:
5767
5868- **keyvalue**: a key-value pair, where the key is a variable name and the value
59- is the value of that variable. The key and value are placed in horizontally
69+ is the value of that variable. The key and value are placed in horizontally
6070 adjacent cells (columns). The key, or its translated short name (see below)
61- is to be used as the parameter name in the scripts and should conform to
62- variable naming rules for the scripting language used. The key is found in
63- the left-most cell of a cell range. The value can be a single value (one cell)
71+ is to be used as the parameter name in the scripts and should conform to
72+ variable naming rules for the scripting language used. The key is found in
73+ the left-most cell of a cell range. The value can be a single value (one cell)
6474 or a vector of values (multiple cells).
6575- **cells**: occasionally it may be more convenient to read values from single
66- cells and provide the keys (names) of the corresponding variables in the data
67- guide. These data will be stored as key-value pairs, but in contrast to the
76+ cells and provide the keys (names) of the corresponding variables in the data
77+ guide. These data will be stored as key-value pairs, but in contrast to the
6878 **keyvalue** data type, where a variable name is provided in the template,
6979 the data guide must provide a variable name.
7080- **table**: tabular data where columns represent variables and rows represent
71- items in which these variables are assessed. Column names are written in the
81+ items in which these variables are assessed. Column names are written in the
7282 first row and are used as variable names.
7383- **platedata**: data are registered in the same row-column format as the
74- microplate in which the experiment was performed. The first row contains the
75- variable name in its left-most cell, and is followed by (integer) column names.
76- Every subsequent row contains the row name (in capital letters) followed by the
77- values for each well. Both variable name and data are read by the script. The
78- column and row names are ignored. Therefore, the first row and column in the
84+ microplate in which the experiment was performed. The first row contains the
85+ variable name in its left-most cell, and is followed by (integer) column names.
86+ Every subsequent row contains the row name (in capital letters) followed by the
87+ values for each well. Both variable name and data are read by the script. The
88+ column and row names are ignored. Therefore, the first row and column in the
7989 range could also be empty, except for the variable name. Plate data are stored
80- as tables in which, apart from the variables provided in the template two
90+ as tables in which, apart from the variables provided in the template, two
8191 additional columns are added, namely row and column, corresponding to the row
8292 and column in a microplate.
8393
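The **platedata** layout described above can be illustrated with a minimal base R sketch. It is not the package's own implementation: the function name `plate_to_long`, the variable name `od450` and the example values are hypothetical, and the range is assumed to arrive as a character matrix.

```r
# Hypothetical corner of a plate range as read from a sheet (assumption):
# first row = variable name followed by column names, first column = row names.
plate_range <- matrix(
  c("od450", "1",    "2",    "3",
    "A",     "0.11", "0.12", "0.13",
    "B",     "0.21", "0.22", "0.23"),
  nrow = 3, byrow = TRUE
)

plate_to_long <- function(rng) {
  varname <- rng[1, 1]                  # variable name sits in the top-left cell
  values  <- rng[-1, -1, drop = FALSE]  # header row and label column are ignored
  n_row <- nrow(values)
  n_col <- ncol(values)
  out <- data.frame(
    row = rep(LETTERS[seq_len(n_row)], times = n_col),
    col = rep(seq_len(n_col), each = n_row),
    stringsAsFactors = FALSE
  )
  # as.numeric() flattens the matrix column by column, matching row/col above
  out[[varname]] <- as.numeric(values)
  out
}

long <- plate_to_long(plate_range)
```

Row and column labels are derived from the position in the range rather than read from the sheet, which is why the first row and column may be left empty apart from the variable name.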
@@ -90,58 +100,58 @@ knitr::include_graphics("images/template_frontpage.png")
90100
91101### A template has a version number
92102
93- Unique template version numbers are a way to prevent misunderstandings
94- between users and are also needed here to check whether a data guide is
95- compatible with the template version.
103+ Unique template version numbers are a way to prevent misunderstandings
104+ between users and are also needed here to check whether a data guide is
105+ compatible with the template version.
96106
97- ** Version numbering rules** . We follow the R-package version rules. A version
107+ **Version numbering rules**. We follow the R-package version rules. A version
98108number has the structure <kbd >major.minor</kbd > or <kbd >major.minor.patch</kbd >,
99- where <kbd >major</kbd >, <kbd >minor</kbd > and <kbd >patch</kbd > are each
100- integer values. A version consisting of only a major number is invalid, but
109+ where <kbd >major</kbd >, <kbd >minor</kbd > and <kbd >patch</kbd > are each
110+ integer values. A version consisting of only a major number is invalid, but
101111will be interpreted as having a minor version <kbd >0</kbd >, *i.e.* a version
102112"<kbd >2</kbd >" will be interpreted as "<kbd >2.0</kbd >".
103113
104114In practice this means that the format of the cell in which the version number
105- is recorded should formally be * text* , and not * general* or * number* . However,
106- in the package we do provide functionality to interpret these fields as version
115+ is recorded should formally be *text*, and not *general* or *number*. However,
116+ in the package we do provide functionality to interpret these fields as version
107117numbers even if the cells in the template have *general* or *number* format.
108118
109119In the guide this is referred to as <kbd >guide.version</kbd >.
110120
111121**A template name is optional**. Preferably, a template also has a name as a way
112- for users to refer to it. Note that the example in the figure above doesn't
122+ for users to refer to it. Note that the example in the figure above doesn't
113123have a name.
114124
115125In the guide this field is referred to as <kbd >template.name</kbd >.
116126
117- ** Checking compatibilty of template versions and a guide version** . We use
127+ **Checking compatibility of template versions and a guide version**. We use
118128template version numbers to check compatibility with a guide. In principle
119- the same guide can be used for multiple versions of a template as long as the
120- locations and names of variables indexed in the guide did not change in new
121- template versions. This is the case when, for example, only explanatory texts
129+ the same guide can be used for multiple versions of a template as long as the
130+ locations and names of variables indexed in the guide did not change in new
131+ template versions. This is the case when, for example, only explanatory texts
122132or calculations or data validity
123- checks have changed in the template. When checking version compatibility we
133+ checks have changed in the template. When checking version compatibility we
124134assume that a guide is compatible with a consecutive range of template versions
125135between a minimal and a maximal version number.
126136
127- In the guide these version numbers are referred to as <kbd >template.min.version</kbd >
137+ In the guide these version numbers are referred to as <kbd >template.min.version</kbd >
128138and <kbd >template.max.version</kbd >.
129139
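The version rules and the compatibility check can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's API: the helper names `as_template_version` and `is_compatible` are hypothetical, and only base R's `package_version()` is used.

```r
# Hypothetical helper: accepts "2", 2, "2.1" or "2.1.3" and returns a version object.
as_template_version <- function(x) {
  x <- trimws(as.character(x))
  if (grepl("^[0-9]+$", x)) {
    x <- paste0(x, ".0")   # a major-only version is read as major.0
  }
  package_version(x)       # errors on anything that is not a valid version
}

# A guide is taken to be compatible with every template version in [min, max].
is_compatible <- function(template_version, min_version, max_version) {
  v <- as_template_version(template_version)
  v >= as_template_version(min_version) && v <= as_template_version(max_version)
}

ok      <- is_compatible("2", "1.3", "2.1")
too_new <- is_compatible(2.2, "1.3", "2.1")
```

A cell stored as a number loses trailing zeros (a version typed as 2.10 arrives as 2.1), which is one reason the text format is preferred for the version cell.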
130140### All cells are protected except those for data entry
131141
132- Data entry cells have a distinct background color, here "marker yellow". All
142+ Data entry cells have a distinct background color, here "marker yellow". All
133143other cells have protected status to prevent users from inadvertently changing
134144them.
135145
136146### Include comments
137147
138- Refer to the SOP+ version
148+ Refer to the SOP+ version
139149
140150### Built-in data entry checks
141151
142- The validity of data entered by the users should be checked by validity checks,
152+ The validity of data entered by the users should be checked by validity checks,
143153especially when misunderstandings are likely to happen. The validity checking
144- capability by excel is limited. In cases where the data structure can not be
154+ capability of Excel is limited. In cases where the data structure cannot be
145155properly described by a validity rule we add a comment next to the cell in which
146156the data is entered.
147157
@@ -152,11 +162,11 @@ knitr::include_graphics("images/parameters.png")
152162```
153163
154164
155- Parameters needed for calculations, for example for acceptance criteria of
165+ Parameters needed for calculations, for example for acceptance criteria of
156166measurements, are best entered on a separate sheet, and referred to by absolute
157- references in calculations. In the case of the example we have a separate
158- hidden sheet called * _ parameters* for this purpose. The information in this
159- sheet is indexed in the data guide, and therefore available to R-scripts as
167+ references in calculations. In the case of the example we have a separate
168+ hidden sheet called *_parameters* for this purpose. The information in this
169+ sheet is indexed in the data guide, and therefore available to R-scripts as
160170well.
161171
162172### Use of hidden worksheets for data transfer
@@ -169,17 +179,17 @@ knitr::include_graphics("images/data.png")
169179
170180## What else?
171181
172- The keyvalue format will be mostly used for metadata and parameters. All keyvalue
182+ The keyvalue format will be mostly used for metadata and parameters. All keyvalue pairs
173183will be aggregated in a single named list called "keyvalue".
174184
175- The platedata format will be used for measured data and data concerning
176- concentrations in the plate wells. All ranges will be aggregated in a single
177- data frame with reported variables as column names, including the column names
185+ The platedata format will be used for measured data and data concerning
186+ concentrations in the plate wells. All ranges will be aggregated in a single
187+ data frame with reported variables as column names, including the column names
178188"row" and "col", corresponding to the row and column names of the plate.
179189
180190## Constructing a guide
181191
182- Every spreadsheet template should be accompanied by a data guide, and index
192+ Every spreadsheet template should be accompanied by a data guide, an index
183193registering the location of different data structures in the template. This
184194guide is a YAML file, a human-editable and computer-readable file format.
185195
@@ -230,30 +240,30 @@ A guide must contain the following elements:
230240
231241- <kbd >guide.version</kbd >: the version of the guide
232242- <kbd >template.name</kbd >: a name for the template
233- - <kbd >template.min.version</kbd >: The minimal version of the template for which the guide
234- can be used
243+ - <kbd >template.min.version</kbd >: The minimal version of the template for which the guide
244+ can be used.
236- - <kbd >template.max.version</kbd >: The maximal version of the template for which the guide
246+ - <kbd >template.max.version</kbd >: The maximal version of the template for which the guide
237247 can be used.
238248- <kbd >locations</kbd >: the object containing the data locations
239- - <kbd >translations</kbd >: the object containing the translations of variable names.
240- Translations can be used both from extended ('long') format to short format
249+ - <kbd >translations</kbd >: the object containing the translations of variable names.
250+ Translations can be used both from extended ('long') format to short format
241251 and from short to long format. Two functions are provided by the package to
242252 perform these translations *vice versa*.
243253
244254### Conditionally required element:
245255
246- - <kbd >plate.format</kbd > the format of the microplates used for the experiments. This
247- must be either of '24', '48', '96', or '384'. This is required when a
256+ - <kbd >plate.format</kbd >: the format of the microplates used for the experiments. This
257+ must be one of '24', '48', '96', or '384'. This is required when a
248258 **platedata** element occurs in the **locations**. This plate format is used
249- to check the correctness of dimensions of the ranges of ** platedata**
259+ to check the correctness of the dimensions of the ranges of **platedata**
250260 elements.
251261
252- The elements in ** locations** indicate where data are to be found, whereas the translation
253- part contains translations between long and short names for variables. Short
254- names are used as variable names in the scripts, whereas long names may be used
255- in the spreadsheet, in particular when these are visible to the user. In that
256- case the names should be translated before using them in the script. Reverse
262+ The elements in **locations** indicate where data are to be found, whereas the translations
263+ part contains translations between long and short names for variables. Short
264+ names are used as variable names in the scripts, whereas long names may be used
265+ in the spreadsheet, in particular when these are visible to the user. In that
266+ case the names should be translated before using them in the script. Reverse
257267translations may be used by the script in the output document.
258268
259269## Locations
@@ -268,16 +278,16 @@ translations may be used by the script in the output document.
268278
269279### Optional element
270280
271- - <kbd >atomicclass</kbd >: the class of the data in the ranges, Can have values "character",
272- "numeric", "integer", "logical" or "date"., It can.be either a singleton or an array of class of the same
273- length as the number of ranges. If a singleton then by default all values are converted to character.
281+ - <kbd >atomicclass</kbd >: the class of the data in the ranges. It can have the values "character",
282+ "numeric", "integer", "logical" or "date". It can be either a singleton or an array of classes of the same
283+ length as the number of ranges. If a singleton then by default all values are converted to character.
274284 If an atomicclass is given then values are coerced. Coercion is performed by the functions ` as.character ` , ` as.numeric ` ,
275285 ` as.integer ` , ` as.logical ` , respectively, or in case of a date, by a function that produces a Date object.
276286
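The coercion step can be sketched as a small dispatcher over the listed classes. The function name `coerce_atomic` is hypothetical, and the date branch assumes ISO-formatted strings; values stored as Excel serial numbers would need a different conversion.

```r
# Hypothetical sketch of coercion by atomicclass; not the package's implementation.
coerce_atomic <- function(x, atomicclass = "character") {
  switch(
    atomicclass,
    character = as.character(x),
    numeric   = as.numeric(x),
    integer   = as.integer(x),
    logical   = as.logical(x),
    # Assumption: dates arrive as ISO strings such as "2024-03-01".
    date      = as.Date(x),
    stop("unsupported atomicclass: ", atomicclass)
  )
}

num <- coerce_atomic(c("1", "2.5"), "numeric")
day <- coerce_atomic("2024-03-01", "date")
```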
277287### Checking against the excelDataGuide JSON schema
278288
279- Correctness of the structure and syntax of a YAML file like a data guide can be
280- checked against a JSON schema (See [ json-schema-everywhere] ( https://json-schema-everywhere.github.io/yaml ) ).
289+ Correctness of the structure and syntax of a YAML file like a data guide can be
290+ checked against a JSON schema (see [json-schema-everywhere](https://json-schema-everywhere.github.io/yaml)).
281291We provide a JSON schema called <kbd >excelguide_schema.json</kbd > in the folder
282- <kbd >data-raw</kbd >. We use the [ Polyglottal JSON Schema Validator] ( https://www.npmjs.com/package/pajv )
292+ <kbd >data-raw</kbd >. We use the [Polyglottal JSON Schema Validator](https://www.npmjs.com/package/pajv)
283293to validate guides against this schema.