Conversation
Rename the original cleanData to cleanMetaData and add roxygen skeleton. Introduce a writeCompressedParquet helper and export cleaned metadata, AMR phenotype, genome data and original metadata to compressed Parquet files, then create a separate DuckDB (parquet-backed) with views for metadata, amr_phenotype, genome_data and original_metadata. Reintroduce a new cleanData function focused on feature matrices (genes/proteins/domains/etc.) that writes feature tables to Parquet and creates corresponding views; remove duplicated metadata parquet exports from the feature-matrix flow. Minor whitespace and path-handling adjustments to normalize paths and ensure output directories exist.
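The Parquet export and parquet-backed DuckDB views described above can be sketched roughly as follows. `writeCompressedParquet` here is a hypothetical reconstruction (the real helper's signature and behavior may differ), and the sketch only constructs the DuckDB SQL; executing it would additionally require the DBI and duckdb packages.

```r
# Sketch: write a table to ZSTD-compressed Parquet and register a DuckDB view
# over the file. Only the SQL strings are built here; with a live connection
# you would run DBI::dbExecute(con, sql$copy) and DBI::dbExecute(con, sql$view).
writeCompressedParquet <- function(tbl_name, out_dir) {
  out_dir <- normalizePath(out_dir, mustWork = FALSE)          # normalize path
  if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)  # ensure dir
  parquet_path <- file.path(out_dir, paste0(tbl_name, ".parquet"))
  copy_sql <- sprintf(
    "COPY %s TO '%s' (FORMAT PARQUET, COMPRESSION ZSTD)",
    tbl_name, parquet_path
  )
  view_sql <- sprintf(
    "CREATE OR REPLACE VIEW %s AS SELECT * FROM read_parquet('%s')",
    tbl_name, parquet_path
  )
  list(copy = copy_sql, view = view_sql)
}

sql <- writeCompressedParquet("amr_phenotype", tempdir())
cat(sql$copy, "\n")
cat(sql$view, "\n")
```

The same pattern would be repeated for metadata, genome_data and original_metadata, giving a DuckDB file that is a thin layer of views over the Parquet files.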
Fixed trailing zero bug, fixed FTP timeout bug (?), fixed empty files hanging downloads, fixed imbalanced genome data sets (e.g., no .fna, yes .faa, yes .gff)
message(" amr_phenotype: ", n_amr_ids)
message(" genomes with 0 AMR rows: ", length(ids_zero_amr))
if (length(ids_zero_amr)) {
message(" e.g.: ", paste(utils::head(ids_zero_amr, 10), collapse = ", "))
It printed
Initial summary: targets=24421 | AMR genomes=24422 | genome_data genomes=24423
Final summary:
targets : 24421
bac_data : 24421
genome_data : 24423
amr_phenotype: 24422
genomes with 0 AMR rows: 2
e.g.: , genome.genome_id
So I am guessing it is counting the column name as a genome id!
We can strip this out. It was originally for debugging, which is why it's messy. Alternatively, if a summary is worth keeping, I can work on some improvements for it!
I didn't find any logical impact from this misprint. We can come back to it later; I have opened an issue.
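One plausible mechanism for the misprint, sketched in base R with a made-up file layout (the real code may collect ids differently): if the id vector is built from a raw file read, the header line and any blank line both get counted as "genomes", which would match the `e.g.: , genome.genome_id` output above.

```r
# Hypothetical reproduction: collecting genome ids with readLines() from a
# file whose first line is the column header.
f <- tempfile()
writeLines(c("genome.genome_id", "562.1", "562.2", ""), f)

ids_raw <- readLines(f)                      # naive: keeps header + blank line
ids_ok  <- setdiff(trimws(ids_raw[-1]), "")  # drop header row and blanks

length(ids_raw)  # 4 -> summary inflated by the two bogus "ids"
length(ids_ok)   # 2
```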
message("FTPS pass 1 (30s timeout)")
future::plan(future::multisession, workers = max(1, workers_first))
res1 <- future.apply::future_lapply(
  genome_ids,
  function(gid) {
    ok <- .ftpes_download_one(gid, out_dir,
      connect_timeout = 10L, max_time = 30L,
      speed_time = 30L, speed_limit = 2048L
Suggested change (extend the pass 1 timeout from 30s to 45s):

message("FTPS pass 1 (45s timeout)")
future::plan(future::multisession, workers = max(1, workers_first))
res1 <- future.apply::future_lapply(
  genome_ids,
  function(gid) {
    ok <- .ftpes_download_one(gid, out_dir,
      connect_timeout = 10L, max_time = 45L,
      speed_time = 30L, speed_limit = 2048L
message("FTPS pass 2 (60s timeout) for failed genomes")
future::plan(future::multisession, workers = max(1, workers_second))
res2 <- future.apply::future_lapply(
  fail_ids,
  function(gid) {
    ok <- .ftpes_download_one(gid, out_dir,
      connect_timeout = 10L, max_time = 60L,
      speed_time = 30L, speed_limit = 2048L
    )
    list(gid = gid, ok = ok)
  },
  future.seed = TRUE
)
Suggested change (extend the pass 2 timeout from 60s to 120s):

message("FTPS pass 2 (120s timeout) for failed genomes")
future::plan(future::multisession, workers = max(1, workers_second))
res2 <- future.apply::future_lapply(
  fail_ids,
  function(gid) {
    ok <- .ftpes_download_one(gid, out_dir,
      connect_timeout = 10L, max_time = 120L,
      speed_time = 30L, speed_limit = 2048L
    )
    list(gid = gid, ok = ok)
  },
  future.seed = TRUE
)
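The two passes above follow a retry-with-escalating-timeouts pattern: attempt everything with a short timeout, then re-attempt only the failures with a longer one. A generic base-R sketch (sequential rather than future-parallel, with a fake downloader standing in for `.ftpes_download_one`):

```r
# Retry only the ids that failed, escalating the timeout each pass.
retry_with_escalating_timeouts <- function(ids, download_one,
                                           timeouts = c(45L, 120L)) {
  failed <- ids
  for (max_time in timeouts) {
    if (!length(failed)) break
    ok <- vapply(failed, function(gid) download_one(gid, max_time), logical(1))
    failed <- failed[!ok]  # only re-attempt the failures
  }
  list(succeeded = setdiff(ids, failed), failed = failed)
}

# Fake downloader: "slow.3" only succeeds when given the longer timeout.
fake_dl <- function(gid, max_time) gid != "slow.3" || max_time >= 120L
res <- retry_with_escalating_timeouts(c("562.1", "562.2", "slow.3"), fake_dl)
res$failed  # character(0): everything recovered on the second pass
```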
Added a function to parse CD-HIT .clstr output into a long-format mapping of clusters to member feature ids. Updated database writing logic to include the new protein members table.
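The .clstr parsing can be illustrated with a minimal base-R sketch. The real function likely differs in details (column names, pseudo handling, use of data.table), and the sample cluster file below is fabricated in CD-HIT's documented format.

```r
# Parse CD-HIT .clstr output into a long cluster -> member-feature-id table.
parse_clstr <- function(path) {
  lines <- readLines(path)
  is_hdr <- startsWith(lines, ">Cluster")
  cluster_id <- cumsum(is_hdr) - 1L          # 0-based cluster index per line
  members <- lines[!is_hdr]
  # Member id sits between '>' and '...', e.g. "0\t100aa, >protein_1... *"
  feature_id <- sub("^.*>", "", sub("\\.\\.\\..*$", "", members))
  data.frame(
    cluster_id = cluster_id[!is_hdr],
    feature_id = feature_id,
    stringsAsFactors = FALSE
  )
}

f <- tempfile(fileext = ".clstr")
writeLines(c(">Cluster 0",
             "0\t100aa, >protein_1... *",
             "1\t99aa, >protein_2... at 98.99%",
             ">Cluster 1",
             "0\t50aa, >protein_3... *"), f)
parse_clstr(f)  # 3 rows: protein_1/protein_2 in cluster 0, protein_3 in 1
```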
Minor regex change in cleanData.
Updated resistance summary calculation to read from the database and include a count of resistant classes.
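The "count of resistant classes" part can be sketched on a toy in-memory table rather than the database; the column names here are assumptions, not the package's actual schema.

```r
# Toy AMR phenotype table (hypothetical columns).
amr <- data.frame(
  genome_id        = c("562.1", "562.1", "562.1", "562.2", "562.2"),
  antibiotic_class = c("beta-lactam", "beta-lactam", "aminoglycoside",
                       "beta-lactam", "aminoglycoside"),
  phenotype        = c("Resistant", "Resistant", "Resistant",
                       "Resistant", "Susceptible"),
  stringsAsFactors = FALSE
)

# Distinct (genome, class) pairs with a Resistant call...
resistant <- unique(amr[amr$phenotype == "Resistant",
                        c("genome_id", "antibiotic_class")])
# ...then count distinct resistant classes per genome.
n_resistant_classes <- table(resistant$genome_id)
n_resistant_classes  # 562.1 -> 2, 562.2 -> 1
```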
dplyr::distinct() |>
dplyr::filter(!is.na(protein_ids)) |>
tidyr::separate_rows(protein_ids, sep = ";") |>
dplyr::filter(!stringr::str_detect(protein_ids, "_pseudo")) |>
Suggested change (strip the "_pseudo" suffix instead of dropping those features):

# dplyr::filter(!stringr::str_detect(protein_ids, "_pseudo")) |>
dplyr::mutate(protein_ids = gsub("_pseudo", "", protein_ids)) |>
This is one of the reasons I found that a lot of group ids end up without mapped BV-BRC feature ids.
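The difference the suggestion makes, in a tiny base-R sketch (using grepl in place of stringr::str_detect; the feature ids are made up):

```r
ids <- c("fig|562.1.peg.10", "fig|562.1.peg.11_pseudo")

# Old behaviour: the filter drops the pseudo feature entirely,
# losing its group-id -> feature-id mapping.
kept_filter <- ids[!grepl("_pseudo", ids)]

# Suggested behaviour: strip the suffix and keep the feature.
kept_strip <- gsub("_pseudo", "", ids)

kept_filter  # 1 id remains
kept_strip   # both ids remain, suffix removed
```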
names_to = "genome_ids",
values_to = "protein_ids"
) |>
dplyr::mutate(genome_ids = gsub(".PATRIC", "", genome_ids)) |>
Suggested change (escape the dots and anchor the pattern):

dplyr::mutate(genome_ids = sub("\\.PATRIC\\.\\.\\..*$", "", genome_ids)) |>
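Why the escaped, anchored pattern matters, shown on a hypothetical pivoted column name (the exact column format in the real data may differ, but it contains ".PATRIC..."):

```r
col <- "562.1.PATRIC...cds.tab"  # hypothetical column name after pivoting

# Original: "." is a regex metacharacter and the pattern is unanchored, so
# only ".PATRIC" itself is removed and the trailing junk survives.
loose <- gsub(".PATRIC", "", col)
loose     # "562.1...cds.tab"

# Suggested: escape the dots and anchor to end-of-string, stripping the
# whole ".PATRIC...<suffix>" tail and leaving just the genome id.
anchored <- sub("\\.PATRIC\\.\\.\\..*$", "", col)
anchored  # "562.1"
```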
Ran PR locally for Shigella flexneri and it worked as expected once I made the code changes suggested in this review (see inline comments below).
All expected tables present at the end of the run:
Metadata: metadata, amr_phenotype, genome_data, original_metadata
Genes: gene_count, gene_names, gene_seqs, struct
Proteins: protein_count, protein_names, protein_seqs, protein_members, genome_gene_protein
Domains: domain_count, domain_names
#'
#' @keywords internal
.buildProtMatrices <- function(cluster_map) {
cluster_map[, count := 1]
The := operator is used here, but data.table is not imported into the package namespace. Currently, it is only listed under Imports: in DESCRIPTION.
When I ran runDataProcessing() locally, this resulted in an error:
[ was called on a data.table in an environment that is not data.table-aware (cedta())
CD-HIT runs, but the error comes immediately after, when R tries to parse the CD-HIT outputs into a cluster matrix.
To fix, add the import to any one R/*.R file with:
#' @importFrom data.table :=
NULL
Then run devtools::document() to regenerate NAMESPACE. One import covers every := call in the package, so no need to repeat it per file.
@@ -1627,6 +1792,8 @@
}
if (isTRUE(verbose)) message("Cleaning metadata and exporting Parquet-backed views.")
cleanData(duckdb_path = duckdb_path, path = out_dir, ref_file_path = ref_file_path)
This code is leftover from the cleanData -> cleanMetaData rename. The new cleanData() signature no longer accepts ref_file_path, so this line errors with unused argument (ref_file_path = ref_file_path). The two new calls on lines 1795–1796 already handle this correctly — this line should be deleted.
cleanData(duckdb_path = duckdb_path, path = out_dir, ref_file_path = ref_file_path)