Under the hood, the coidb package runs a Snakemake workflow that filters the BOLD public data to the COI-5P marker gene, strips leading and trailing gaps, and removes sequences with internal gaps or ambiguous nucleotides. It also applies a length filter and keeps only records assigned to a BOLD BIN. Additional steps ensure that taxonomic lineages are unique, either by prefixing duplicated taxonomic labels or by removing BOLD BINs with unassigned records. The filtered sequences are then dereplicated by clustering sequences within each BOLD BIN using vsearch, and a consensus taxonomy is calculated using an 80% consensus threshold, starting from species and moving up the taxonomy tree.
Finally, fasta and tab-separated files compatible with SINTAX, DADA2 and QIIME2 are generated.
Check if you have pixi installed on your system by running:

```
pixi --version
```

If pixi is not installed, run the following:

```
curl -fsSL https://pixi.sh/install.sh | sh
```

- Clone the GitHub repository and change into the `coidb` directory:

  ```
  git clone git@github.com:insect-biome-atlas/coidb.git
  cd coidb
  ```

- Run the following to install the environment and the `coidb` package:

  ```
  pixi run install
  ```

- Activate an interactive shell with the environment and package installed:

  ```
  pixi shell
  ```

If everything worked you should be able to run `coidb run -h`. Proceed to the Running coidb section to see usage information.
You can pull a Docker image for the latest version of coidb by running:

```
docker pull ghcr.io/insect-biome-atlas/coidb
```

We recommend running coidb on a system with at least 4 cores and 16 GB RAM.
During runs, roughly 75-100 GB of disk space will be used, which is reduced
to ~6 GB upon completion. A full run takes roughly 3 hours on a MacBook Pro
laptop with 4 cores.
The coidb package uses public barcode reference libraries from BOLD to build reference fasta files compatible with tools like SINTAX, QIIME2 and DADA2. The first thing you need to do is get your hands on a BOLD Public Data Package:
- Go to the BOLD systems data package page.
- Login is required to access files on this page, so either login or sign up if you don't already have an account.
- On the data package page, click the Data package (tar.gz compressed) download button, accept the terms and click Download to obtain a temporary download link.
- Use the link to download the data package, which will be named `BOLD_Public.<dd>-<Mmm>-<YYYY>.tar.gz`, for example `BOLD_Public.20-Jun-2025.tar.gz`.
Tip
To download via the command line you can copy the Download link instead of
clicking it, then use wget or curl to download directly to a file of your
choice. For example, to download the data package to a directory called
data/ you could run:
```
mkdir data
wget -O data/BOLD_Public.20-Jun-2025.tar.gz <copied download link>
```

or with curl:

```
mkdir data
curl -o data/BOLD_Public.20-Jun-2025.tar.gz <copied download link>
```

The downloaded file can now be used as input to coidb by pointing to it with
the `--input-file` argument, or by setting `input_file: <path-to-downloaded-tar.gz file>` in a configuration
file.
BOLD citation:
Ratnasingham, Sujeevan, and Paul D N Hebert. “bold: The Barcode of Life Data System (http://www.barcodinglife.org).” Molecular ecology notes vol. 7,3 (2007): 355-364. doi:10.1111/j.1471-8286.2007.01678.x
The general syntax for running coidb is:

```
coidb run <arguments>
```

A typical run could look like this:

```
coidb run -i data/BOLD_Public.04-Jul-2025.tar.gz -o results -c 4
```

In this example, the file `data/BOLD_Public.04-Jul-2025.tar.gz` was downloaded
from boldsystems.org (read more about how to obtain
the input data under Obtain data), output is stored in the
`results` directory, and 4 threads are used for running coidb.
To see a list of all arguments, run `coidb run -h`. The available arguments are listed below:

```
--input-file           -i  PATH         Input tar.gz archive downloaded from BOLD.
--output-dir           -o  PATH         Folder to store database files in [default: results]
--account              -A  TEXT         SLURM compute account [default: None]
--temp-dir                 PATH         Folder for temporary files [default: tmp]
--gbif-backbone                         Use GBIF backbone to infer consensus taxonomy for BOLD BINs
--consensus-threshold      INTEGER      Threshold (in %) when calculating consensus taxonomy [default: 80]
--consensus-method         [rank|full]  Method to use when calculating consensus [default: rank]
--vsearch-identity         FLOAT        Identity at which to cluster sequences per BIN [default: 1.0]
--ranks                    TEXT         Ranks to use for calculating consensus and generating fastas [default: kingdom, phylum, class, order, family, genus, species]
--min-len                  INTEGER      Minimum length of sequences to include [default: 500]
--batch-size               INTEGER      Number of BOLD BINs per batch for running vsearch [default: 50000]
```

- The `--input-file` or `-i` argument must point to a BOLD tar archive that you have downloaded from boldsystems.org (see Obtain data below).
- The `--output-dir` or `-o` argument is a directory in which the output from coidb will be stored (see details under Output below).
- The `--account` or `-A` argument sets a compute account for running on SLURM clusters (see Cluster execution below).
- The `--temp-dir` argument sets a directory to use for storing temporary output. This directory can be deleted once coidb finishes.
- The `--gbif-backbone` argument instructs coidb to use the GBIF backbone taxonomy to infer taxonomic information for BOLD BINs. Note that this option is currently not reliable because of outdated GBIF data.
- The `--consensus-threshold` argument specifies a threshold in percent when calculating consensus taxonomies for BOLD BINs.
- The `--consensus-method` argument specifies how the consensus taxonomy is calculated. With `rank` (default), a consensus is calculated at each rank separately, starting from species and moving up the hierarchy (genus, family etc.). If a consensus above the threshold is found at any rank, the taxonomy at that rank and its parent lineage is used as the taxonomy for the BOLD BIN. With `full`, the consensus takes the parent lineages at each rank into account, starting with all labels from kingdom->species, then kingdom->genus, and so on.
- The `--vsearch-identity` argument specifies the identity threshold to use when clustering sequences with vsearch. The default is `1.0`, meaning sequences are clustered at 100% identity.
- The `--ranks` argument specifies what taxonomic ranks to use. This applies both to what ranks are included in the final output and what ranks are used to calculate the consensus taxonomy.
- The `--min-len` argument sets a minimum length for sequences to include in the final output.
- The `--batch-size` argument sets the number of BOLD BINs to process with vsearch in parallel. This reduces the size of the workflow graph by splitting the input sequences into batches with `batch-size` BOLD BINs per file.
In addition to these, some arguments define how coidb runs on your system; these are similar to how you typically interact with Snakemake workflows:

```
--config               FILE     Path to snakemake config file. Overrides existing workflow configuration. [default: None]
--resource         -r  PATH     Additional resources to copy from workflow directory at run time.
--profile          -p  TEXT     Name of profile to use for configuring Snakemake. [default: None]
--dry              -n           Do not execute anything, and display what would be done.
--lock             -l           Lock the working directory.
--dag              -d  PATH     Save directed acyclic graph to file. Must end in .pdf, .png or .svg [default: None]
--cores            -c  INTEGER  Set the number of cores to use. If None will use all cores. [default: None]
--no-conda                      Do not use conda environments.
--keep-resources                Keep resources after pipeline completes.
--keep-snakemake                Keep .snakemake folder after pipeline completes.
--verbose          -v           Run workflow in verbose mode.
--help-snakemake   -hs          Print the snakemake help and exit.
--help             -h           Show this message and exit.
```

- The `--config` argument lets you pass a configuration file in YAML format, as an alternative to specifying arguments directly on the command line.
- The `--profile` argument specifies a configuration profile to use for running coidb.
You can generate a default configuration file by running:

```
coidb config > config.yml
```

This creates a new file config.yml with the following default parameters:

```
account: ''
batch_size: 50000
consensus_method: rank
consensus_threshold: 80
gbif_backbone: false
input_file: null
min_len: 500
output_dir: results
ranks:
- kingdom
- phylum
- class
- order
- family
- genus
- species
temp_dir: tmp
vsearch_identity: 1.0
```

You can then edit this file and use it with coidb like so:

```
coidb run --config config.yml <additional arguments>
```

To run coidb using the Docker image (see Install with
Docker) you must mount the directory containing the
downloaded BOLD Data Package file (see Obtain data) as well as
the output directory where the resulting database files will be stored.
As an example, say we have downloaded the BOLD Data Package file
`BOLD_Public.04-Jul-2025.tar.gz` into a directory called `data/` and we want the
resulting files produced by coidb to be placed under `releases/04-Jul-2025`.
Then we can run:

```
docker run \
  -v $(pwd)/data:/data \
  -v $(pwd)/releases/04-Jul-2025:/releases/04-Jul-2025 \
  ghcr.io/insect-biome-atlas/coidb \
  run \
  -i /data/BOLD_Public.04-Jul-2025.tar.gz \
  -o /releases/04-Jul-2025 \
  -c 4 \
  --temp-dir /releases/04-Jul-2025/tmp
```

In this example, the line `-v $(pwd)/data:/data` mounts the `data/` directory in
the current folder into `/data` in the Docker container, and the line
`-v $(pwd)/releases/04-Jul-2025:/releases/04-Jul-2025` creates a folder
`releases/04-Jul-2025` on your system and mounts it into `/releases/04-Jul-2025`
in the container.
The file structure on your system will be:

```
$(pwd)   # your current directory
├── data
│   └── BOLD_Public.04-Jul-2025.tar.gz
└── releases
    └── 04-Jul-2025
```

and inside the container:

```
/        # container root
├── data
│   └── BOLD_Public.04-Jul-2025.tar.gz
└── releases
    └── 04-Jul-2025
```
The line with `ghcr.io/insect-biome-atlas/coidb` refers to the Docker image that
you will use to run the container.
The line with `run` is the command you will run inside the container. The image
entrypoint is `coidb`, so `run` is appended to this command and what follows are
command line arguments passed to coidb:

- `-i /data/BOLD_Public.04-Jul-2025.tar.gz` instructs coidb to use the Data Package file as input (the path is the one mounted inside the container)
- `-o /releases/04-Jul-2025` sets the output directory (again, this is the path inside the container; on your system the results will be in `releases/04-Jul-2025` inside your current directory)
- `-c 4` sets the maximum number of CPUs to use to 4
- `--temp-dir /releases/04-Jul-2025/tmp` sets the temporary directory. Since this path is inside the mounted output directory, temporary files are kept after the container stops.
Important
When running on a compute cluster we advise NOT to use the coidb
container via Docker or (more commonly on compute clusters) Apptainer. Instead
install coidb as described under Install with
pixi and use a configuration profile (see
below) so that jobs are submitted to the cluster workload manager.
Tip
When running on compute clusters it's good practice to use a terminal
multiplexer such as screen or tmux so that your running processes are not
interrupted by connection failure.
To run coidb on a compute cluster you must set the compute account to use with
the `--account` or `-A` argument. In addition, you should use one of the
pre-defined configuration profiles. To see available profiles, run:

```
coidb profile list
```

The profiles can be used as-is by adding `--profile <name of profile>` to the
command line call, or you can output the settings for a profile to a file and
modify it to fit your needs. For example, to output the generic SLURM profile
settings, run:

```
mkdir my-slurm-profile
coidb profile show slurm > my-slurm-profile/config.yaml
```

Then you can edit `my-slurm-profile/config.yaml`. Once you're done you can
use this profile by passing `--profile my-slurm-profile` to the `coidb run`
command.
...
The primary outputs from a run are placed in the directory set by the `--output-dir` command line argument (default: `results/`). These include:

- `coidb.clustered.fasta.gz`: A fasta file with sequences clustered at the threshold set in the config file (default is 1.0, which means 100% identity). Sequence ids in this file correspond to process ids, e.g. `BPALB370-17`, and can be looked up in the BOLD portal. The fasta header also includes the corresponding BOLD BIN id of the sequence, e.g. `bin_uri:BOLD:AAF7702`.
- `coidb.info.tsv.gz`: This TSV file contains sequence and taxonomic information for all records kept after filtering.
- `coidb.BOLD_BIN.consensus_taxonomy.exclNA.tsv.gz`: This TSV file contains the calculated consensus taxonomy of BOLD BINs. If a consensus could not be reached at a certain taxonomic rank, the taxonomic label at that rank is prefixed with 'unresolved.' followed by the label of the lowest consensus rank. The `exclNA` part of the filename means that taxonomic labels corresponding to missing data (those suffixed with `_X`) were ignored when calculating the consensus.
- `coidb.BOLD_BIN.consensus_taxonomy.inclNA.tsv.gz`: Same as above, but here all taxonomic labels were taken into account when calculating the consensus (even labels corresponding to missing data).
Important
The consensus taxonomy files described above are used to create the SINTAX,
DADA2 and QIIME2 compatible reference files. This means that there are two
versions of each of these files, one tagged with `exclNA` and one with
`inclNA`. Of these two versions, the `exclNA` file contains more resolved
taxonomies and allows the taxonomic classifiers to assign sequences at
higher resolution. However, there is a higher risk that these taxonomies
contain errors due to incorrectly added taxonomic information in the BOLD
database. As such, the `inclNA` version represents a more conservative (but
less resolved) version of the database.
Log files from a coidb run are stored under `_logs/` in your `--output-dir` directory (default: `results/`).
Temporary files are stored under the directory defined by `--temp-dir` (default:
`tmp/`). This entire directory can be removed after a successful run of coidb,
but it can also be used to get an idea of what has happened to the raw data along
the way.
For example, the `_extract/` directory contains the raw TSV file extracted from
the input file you used, while the `_processed/` directory contains e.g. the
`data.filtered.tsv` file, which is the output from the filtering step of coidb.
- `sintax/coidb.sintax.{exclNA,inclNA}.fasta.gz`: These fasta files are compatible with the SINTAX classification tool implemented in vsearch and have headers with the format:

  ```
  >BPALB370-17;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Lepidoptera,f:Lycaenidae,g:Thersamonia,s:Thersamonia_X,t:BOLD:AAF7702
  ```

- `dada2/coidb.dada2.toGenus.{exclNA,inclNA}.fasta.gz`, `dada2/coidb.dada2.toSpecies.{exclNA,inclNA}.fasta.gz` and `dada2/coidb.dada2.addSpecies.{exclNA,inclNA}.fasta.gz`: These fasta files are compatible with the `assignTaxonomy` and `addSpecies` functions from DADA2.
- `qiime2/coidb.qiime2.info.{exclNA,inclNA}.tsv.gz`: These TSV files can be used with QIIME2 to create a taxonomy artifact for use with the feature-classifier plugin. Unzip the file, then run `qiime tools import --type 'FeatureData[Taxonomy]' --input-format TSVTaxonomyFormat --input-path coidb.qiime2.inclNA.info.tsv --output-path taxonomy.qza`. The `coidb.clustered.fasta.gz` file can be used to import sequences with `qiime tools import --type 'FeatureData[Sequence]' --input-path coidb.clustered.fasta --output-path seqs.qza`.
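As an illustration, a SINTAX header in the format shown above can be split into its parts with a few lines of Python (a sketch; `parse_sintax_header` is a hypothetical helper, not part of coidb):

```python
def parse_sintax_header(header: str):
    """Split a SINTAX-formatted fasta header into the sequence id and a
    dict mapping rank prefixes (k, p, c, o, f, g, s, t) to labels."""
    seqid, tax = header.lstrip(">").split(";tax=")
    # split(":", 1) keeps colons inside labels such as BOLD:AAF7702 intact
    return seqid, dict(field.split(":", 1) for field in tax.split(","))
```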
Firstly, the input file is extracted and the TSV file with taxonomic information and sequence data for each record is identified. This TSV file is then filtered by:

- Selecting a useful subset of columns.
- Only keeping records with `COI-5P` in the `marker_code` column.
- Only keeping records assigned to a proper BOLD BIN (with the exception of prokaryotic sequences, which are all kept).
- Removing records with sequences that are too short (as defined by the `--min-len` argument).
- Stripping any leading and trailing gap (`-`) characters.
- Removing sequences with remaining gaps.
- Removing sequences with non-DNA characters.
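The per-record filtering can be sketched roughly as follows (a simplified illustration, not the actual coidb implementation; it ignores the prokaryote exception mentioned above, and the function name and argument names are assumptions):

```python
import re

def filter_record(seq: str, marker: str, bin_uri: str, min_len: int = 500):
    """Return a cleaned sequence if the record passes all filters, else None.

    Simplified sketch of the filtering logic described above.
    """
    if marker != "COI-5P":
        return None          # keep only the COI-5P marker
    if not bin_uri:
        return None          # require a BOLD BIN assignment
    seq = seq.strip("-")     # strip leading/trailing gap characters
    if "-" in seq:
        return None          # drop sequences with remaining (internal) gaps
    if re.search(r"[^ACGT]", seq):
        return None          # drop sequences with ambiguous/non-DNA characters
    if len(seq) < min_len:
        return None          # enforce the minimum length
    return seq
```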
The filtered TSV file is then processed to fill in missing values for taxonomic
ranks. Ranks with missing data are filled with the lowest assigned taxonomic
label, suffixed with `_X`. For example:
| processid | kingdom | phylum | class | order | family | genus | species | bin_uri |
|---|---|---|---|---|---|---|---|---|
| DUTCH124-19 | Animalia | Platyhelminthes | None | Polycladida | None | None | None | BOLD:ACC8697 |
| AACTA1367-20 | Animalia | Arthropoda | None | None | None | None | None | BOLD:AED1280 |
becomes:
| processid | kingdom | phylum | class | order | family | genus | species | bin_uri |
|---|---|---|---|---|---|---|---|---|
| DUTCH124-19 | Animalia | Platyhelminthes | Platyhelminthes_X | Polycladida | Polycladida_X | Polycladida_XX | Polycladida_XXX | BOLD:ACC8697 |
| AACTA1367-20 | Animalia | Arthropoda | Arthropoda_X | Arthropoda_XX | Arthropoda_XXX | Arthropoda_XXXX | Arthropoda_XXXXX | BOLD:AED1280 |
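The fill-in logic shown in the tables above can be sketched like this (an illustration, not the coidb source; it assumes at least the kingdom rank is always assigned):

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def fill_missing_ranks(lineage: dict) -> dict:
    """Fill unassigned ranks with the lowest assigned label suffixed with _X,
    adding one extra X per additional unassigned step down the hierarchy."""
    filled = {}
    last_label, x_count = None, 0
    for rank in RANKS:
        label = lineage.get(rank)
        if label:
            filled[rank] = label
            last_label, x_count = label, 0   # reset at each assigned rank
        else:
            x_count += 1                     # fill down from the last label
            filled[rank] = f"{last_label}_{'X' * x_count}"
    return filled
```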
Some records in the BOLD data may have the same taxonomic labels at a specific
rank, but with different labels for parent ranks. Take for example the
Hemineura genus where records may have these conflicting labels for higher
ranks:
| kingdom | phylum | class | order | family | genus |
|---|---|---|---|---|---|
| Animalia | Arthropoda | Insecta | Psocodea | Elipsocidae | Hemineura |
| Protista | Rhodophyta | Florideophyceae | Ceramiales | Delesseriaceae | Hemineura |
This is dealt with by either removing BOLD BINs that lack taxonomic information for higher ranks, or by prefixing the non-unique rank with the label of the higher taxonomic rank. In the example above, this would generate:
| kingdom | phylum | class | order | family | genus |
|---|---|---|---|---|---|
| Animalia | Arthropoda | Insecta | Psocodea | Elipsocidae | Elipsocidae_Hemineura |
| Protista | Rhodophyta | Florideophyceae | Ceramiales | Delesseriaceae | Delesseriaceae_Hemineura |
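A minimal sketch of this disambiguation, shown here for the genus rank only (coidb applies the idea across ranks; this helper is illustrative):

```python
from collections import defaultdict

def disambiguate_genus(records):
    """Prefix a genus label with its family whenever the same genus name
    occurs under more than one family (sketch of the disambiguation idea).

    records: list of dicts with at least 'family' and 'genus' keys.
    """
    parents = defaultdict(set)
    for rec in records:
        parents[rec["genus"]].add(rec["family"])
    out = []
    for rec in records:
        genus = rec["genus"]
        if len(parents[genus]) > 1:          # genus name is ambiguous
            genus = f"{rec['family']}_{genus}"
        out.append({**rec, "genus": genus})
    return out
```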
After these steps, a consensus taxonomy is calculated for BOLD BINs by taking into account the taxonomic information for all records in each BIN. For example, a BOLD BIN with records labelled:
| kingdom | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|
| K | P | C | O | F | G | S |
| K | P | C | O | F | G | S |
| K | P | C | O | F | G | S |
| K | P | C | O | F | G | S |
| K | P | C | O | F | G | S2 |
| K | P | C | O | F | G | G_X |
| K | P | C | O | F | G | G_X |
| K | P | C | O | F | G | G_X |
| K | P | C | O | F | G | G_X |
This BIN has 4 records labelled species S, 1 record labelled species S2 and
4 records with missing species labels (these were given the genus label suffixed
with `_X`). Starting from species, roughly 44% of records are labelled species S, 11%
are labelled S2 and 44% have ambiguous labels. Using a consensus threshold of
80% (the default) we see that no consensus can be reached for this BIN at
species level. Moving one step up in the hierarchy, however, gets us 100% of
records labelled genus G. Consequently, this BIN will receive the consensus
taxonomy:
| kingdom | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|
| K | P | C | O | F | G | unresolved.G |
The consensus generated with this approach is tagged with inclNA in the output
files described above (see Output).
Ignoring all records with G_X at rank=species means that there are 80% records
with species S and 20% with species S2. The consensus taxonomy for the BIN
would then become:
| kingdom | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|
| K | P | C | O | F | G | S |
This is what the exclNA tag refers to in the output files described above (see
Output).
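Both consensus variants can be sketched with a single function that walks from species up through the ranks (an illustration of the rank-wise method described above, not the coidb source; the treatment of `_X` labels and tie-breaking is simplified):

```python
from collections import Counter

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def consensus_taxonomy(records, threshold=80, exclude_na=False):
    """Rank-wise consensus for one BOLD BIN (illustrative sketch).

    records: list of dicts mapping rank -> label, where labels ending in
    _X, _XX etc. mark filled-in missing data. Starting from species and
    moving up, the first rank where one label reaches the threshold (in %)
    sets the consensus; lower ranks become 'unresolved.<label>'.
    """
    for i in range(len(RANKS) - 1, -1, -1):
        rank = RANKS[i]
        labels = [r[rank] for r in records]
        if exclude_na:
            # drop filled-in labels such as G_X, G_XX when exclNA is used
            labels = [l for l in labels if not l.rstrip("X").endswith("_")]
        if not labels:
            continue
        label, count = Counter(labels).most_common(1)[0]
        if 100 * count / len(labels) >= threshold:
            rec = next(r for r in records if r[rank] == label)
            lineage = {RANKS[j]: rec[RANKS[j]] for j in range(i + 1)}
            for lower in RANKS[i + 1:]:
                lineage[lower] = f"unresolved.{label}"
            return lineage
    return None
```

Running this on the example BIN above yields species `unresolved.G` with `exclude_na=False` (the inclNA case) and species `S` with `exclude_na=True` (the exclNA case).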
Note
In previous versions of coidb the GBIF backbone taxonomy
was used to set the taxonomy of BOLD BINs. However, because the backbone data is
not up to date, we do not recommend using this option at the moment. The
functionality is kept, so you can still run coidb with the command line flag
`--gbif-backbone` if you wish.
For BOLD BINs with more than one record after filtering, sequences are clustered
using vsearch at the identity threshold specified with `--vsearch-identity`.
BINs with only one record are then added to the clustered results to generate the
final `coidb.clustered.fasta.gz` output file.
Finally, the consensus taxonomy and the filtered sequences are used to generate
reference files compatible with the SINTAX algorithm (e.g. `vsearch --sintax queries.fasta --db coidb.sintax.inclNA.fasta.gz ...`), DADA2 taxonomic
assignment (using the files under
`<results-dir>/dada2`) and the QIIME2 feature-classifier
plugin.