fix: prevent nan persisting to web vep output #838
Ok I think I opened a can of worms here, sorry!
If we use the default output format, the data are simply missing from the output: when a score variable has no value, it just isn't there.
The same goes for JSON output, and probably VCF.
The method needs to allow for no value, and should probably populate it with N/A or similar.
E.g.
## EVE_CLASS : Classification (Benign, Uncertain, or Pathogenic) when setting 50% as uncertain
## EVE_SCORE : Score from EVE model
## popEVE_ESM1v : Raw ESM1v model score (log-likelihood ratio from protein language model)
## popEVE_EVE : Raw EVE model score (unsupervised variant effect prediction)
## popEVE_SCORE : Score from popEVE model
## popEVE_gap_frequency : Fraction of sequences with a gap at this alignment position in the MSA used for model inference - filter anything above 0.5
## popEVE_gene : Gene symbol corresponding to the variant
## popEVE_mutant : Protein-level variant in [WILDTYPE_AA][AA_POSITION][VARIANT_AA] format (e.g. A123T)
## popEVE_pop_adjusted_ESM1v : ESM1v log-likelihood ratio adjusted for population variation using the popEVE framework
## popEVE_pop_adjusted_EVE : EVE score adjusted for population variation using the popEVE framework
## popEVE_protein : RefSeq identifier associated with the variant
## VEP command-line: vep --force_overwrite --plugin [PATH]/grch38_popEVE_ukbb_20250715.vcf.gz
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
rs699 1:230710048 G ENSG00000135744 ENST00000366667 Transcript missense_variant 843 776 259 M/T aTg/aCg - IMPACT=MODERATE;STRAND=-1;popEVE_EVE=-3.8606567;popEVE_SCORE=-2.5551376;popEVE_gap_frequency=0.0376106194690265;popEVE_gene=AGT;popEVE_mutant=M259T;popEVE_pop_adjusted_EVE=-2.5551376;popEVE_protein=NP_001369746.2
rs699 1:230710048 G ENSG00000244137 ENST00000412344 Transcript downstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=650;STRAND=-1
rs699 1:230710048 G ENSG00000135744 ENST00000679684 Transcript missense_variant 1287 776 259 M/T aTg/aCg - IMPACT=MODERATE;STRAND=-1;popEVE_EVE=-3.8606567;popEVE_SCORE=-2.5551376;popEVE_gap_frequency=0.0376106194690265;popEVE_gene=AGT;popEVE_mutant=M259T;popEVE_pop_adjusted_EVE=-2.5551376;popEVE_protein=NP_001369746.2
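To make the behaviour in this example concrete: keys with no usable source value are left out of the per-transcript result, so they never reach the Extra column. Below is a minimal sketch of that filtering in Python. It is an illustration only, not the code in EVE.pm (which is Perl). The function names and the sample record are hypothetical, and the set of missing markers (undefined, empty string, `.`, `nan`) is taken from the PR description.

```python
# Illustrative sketch only: the real plugin is Perl and may differ in detail.

# String markers treated as "no value" (assumed from the PR description).
MISSING_MARKERS = {"", ".", "nan"}


def is_missing(value):
    """Return True when a source field carries no usable value."""
    if value is None:  # stands in for Perl's undef
        return True
    return str(value) in MISSING_MARKERS


def drop_missing(annotations):
    """Return a copy of the annotation dict without missing entries."""
    return {
        key: value
        for key, value in annotations.items()
        if not is_missing(value)
    }


# Hypothetical parsed record: the two popEVE values are copied from the
# rs699 output above, the "hypothetical_*" fields are made up to exercise
# each missing marker.
record = {
    "popEVE_SCORE": "-2.5551376",
    "popEVE_gene": "AGT",
    "hypothetical_field_a": "nan",
    "hypothetical_field_b": ".",
    "hypothetical_field_c": "",
    "hypothetical_field_d": None,
}

print(drop_missing(record))
# {'popEVE_SCORE': '-2.5551376', 'popEVE_gene': 'AGT'}
```

Returning only the surviving keys leaves the choice of placeholder (for example `-` in tab output) to whatever formats the output, rather than hard-coding N/A in the plugin.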
This PR updates `EVE.pm` so popEVE source missing values are treated as missing data, not as output values. Literal `nan` values from the data file were persisting into the VEP output file/web view. The plugin previously treated these as values and passed them through to VEP output, meaning that they appeared as `nan` instead of `-`.
The plugin now filters missing values out of the returned hash. It treats undefined, empty string, `.`, and `nan` as missing. The headers remain the same so that VEP knows these cols still exist.
What happens for each output format:
This matches existing plugin behaviour. Plugins generally return only the available annotation keys and leave the representation of missing values to the VEP formatters (i.e. tab/json/vcf); see `dbNSFP.pm` L381, for example.
Testing
Download and prepare EVE and popEVE data as per `EVE.pm` header info.
in.vcf:
Command:
Before this fix, running rs699 with the EVE plugin resulted in literal `nan` values for the following cols: