fix: prevent nan persisting to web vep output #838

Open: ainefairbrother wants to merge 2 commits into Ensembl:postreleasefix/116 from ainefairbrother:eve-nan-patch

@ainefairbrother ainefairbrother commented May 18, 2026

This PR updates EVE.pm so that missing values in the popEVE source data are treated as missing data rather than as output values. Literal nan values from the data file were persisting into the VEP output file and web view: the plugin previously treated them as real values and passed them through to VEP output, so they appeared as nan instead of -.

The plugin now filters missing values out of the returned hash. It treats undefined, the empty string, ., and nan as missing. The headers remain the same so that VEP knows these columns still exist.
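
The missing-value rule above can be illustrated outside the plugin. This is only a shell sketch of the rule as described in this PR, not the Perl code in EVE.pm: the function names `is_missing` and `filter_pairs` are hypothetical, the match on nan is assumed to be exact and lowercase, and because shell cannot distinguish undefined from empty, both are represented by the empty string.

```shell
# is_missing VALUE: succeed if VALUE is one of the tokens treated as missing
# (empty string standing in for undefined/empty, ".", or "nan").
is_missing() {
  case "$1" in
    ''|.|nan) return 0 ;;
    *) return 1 ;;
  esac
}

# filter_pairs: read key=value lines on stdin and print only the pairs
# whose value is not missing, mirroring "filter out of the returned hash".
filter_pairs() {
  while IFS='=' read -r key value; do
    is_missing "$value" || printf '%s=%s\n' "$key" "$value"
  done
}

# Example input: one nan, one real score, one "." placeholder.
printf '%s\n' 'popEVE_ESM1v=nan' 'popEVE_EVE=-3.8606567' 'popEVE_gene=.' | filter_pairs
```

Only the `popEVE_EVE=-3.8606567` pair survives; the other two keys are dropped, and how their absence is displayed is then left to the output formatter.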

What happens for each output format:

  • TSV/tab: VEP keeps the columns and renders missing values as -.
  • JSON: VEP omits the missing keys.
  • VCF: VEP keeps the fields in the CSQ header and writes empty CSQ subfields.

This matches existing plugin behaviour. Plugins generally return only the annotation keys that are available and leave the representation of missing values to the VEP formatters (i.e. tab/json/vcf); see dbNSFP.pm L381, for example.

Testing

Download and prepare EVE and popEVE data as per EVE.pm header info.

in.vcf:

##fileformat=VCFv4.2
##source=vep-repro
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	230710048	rs699	A	G	.	PASS	.

Command:

# tsv output
./vep \
  -i in.vcf \
  -o out.tsv \
  --dir_cache tabixconverted \
  --cache_version 116 \
  --format vcf --symbol --tab --offline --cache \
  --fasta Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
  --assembly GRCh38 \
  --force_overwrite \
  --no_escape \
  --show_ref_allele \
  --no_stats \
  --plugin EVE,file=$EVE_file,class_number=50,popeve_file=$POPEVE_file 

# vcf output
./vep \
  -i in.vcf \
  -o out.vcf \
  --dir_cache tabixconverted \
  --cache_version 116 \
  --format vcf --symbol --vcf --offline --cache \
  --fasta Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
  --assembly GRCh38 \
  --force_overwrite \
  --no_escape \
  --show_ref_allele \
  --no_stats \
  --plugin EVE,file=$EVE_file,class_number=50,popeve_file=$POPEVE_file 

# json output
./vep \
  -i in.vcf \
  -o out.json \
  --dir_cache tabixconverted \
  --cache_version 116 \
  --format vcf --symbol --json --offline --cache \
  --fasta Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
  --assembly GRCh38 \
  --force_overwrite \
  --no_escape \
  --show_ref_allele \
  --no_stats \
  --plugin EVE,file=$EVE_file,class_number=50,popeve_file=$POPEVE_file 

Before this fix, running rs699 with the EVE plugin resulted in literal nan values for the following columns:

ESM1v=nan
pop-adjusted_ESM1v=nan
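
After re-running the commands above with the patched plugin, one way to check the result is to search each output file for a literal nan value. This is a minimal sketch, assuming the output file names used in the commands above; the helper name `has_nan_value` and the set of delimiters in the pattern are assumptions, not part of VEP.

```shell
# has_nan_value FILE: succeed (exit 0) if FILE contains "nan" as a whole value,
# i.e. bounded by line start/end or a delimiter (=, |, ;, comma, quote, whitespace).
has_nan_value() {
  grep -Eq '(^|[=|;,"[:space:]])nan($|[|;,"[:space:]])' "$1"
}

# Report on whichever of the three output files exist in the current directory.
for f in out.tsv out.vcf out.json; do
  [ -f "$f" ] || continue
  if has_nan_value "$f"; then
    echo "FAIL: literal nan found in $f"
  else
    echo "OK: no literal nan in $f"
  fi
done
```

The pattern requires a delimiter on both sides so that strings merely containing the letters nan are not reported.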

@jamie-m-a jamie-m-a self-requested a review May 19, 2026 08:58
@jamie-m-a jamie-m-a self-assigned this May 19, 2026

@jamie-m-a jamie-m-a left a comment

Ok I think I opened a can of worms here, sorry!

So if we use the default output format, the data are just missing from the output: if a score variable has no value, it simply doesn't appear.

Same for JSON output and probably VCF.

The method needs to allow for no value and probably populate it with N/A or similar.

E.g.

## EVE_CLASS : Classification (Benign, Uncertain, or Pathogenic) when setting 50% as uncertain
## EVE_SCORE : Score from EVE model
## popEVE_ESM1v : Raw ESM1v model score (log-likelihood ratio from protein language model)
## popEVE_EVE : Raw EVE model score (unsupervised variant effect prediction)
## popEVE_SCORE : Score from popEVE model
## popEVE_gap_frequency : Fraction of sequences with a gap at this alignment position in the MSA used for model inference - filter anything above 0.5
## popEVE_gene : Gene symbol corresponding to the variant
## popEVE_mutant : Protein-level variant in [WILDTYPE_AA][AA_POSITION][VARIANT_AA] format (e.g. A123T)
## popEVE_pop_adjusted_ESM1v : ESM1v log-likelihood ratio adjusted for population variation using the popEVE framework
## popEVE_pop_adjusted_EVE : EVE score adjusted for population variation using the popEVE framework
## popEVE_protein : RefSeq identifier associated with the variant
## VEP command-line: vep --force_overwrite --plugin [PATH]/grch38_popEVE_ukbb_20250715.vcf.gz
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
rs699 1:230710048 G ENSG00000135744 ENST00000366667 Transcript missense_variant 843 776 259 M/T aTg/aCg - IMPACT=MODERATE;STRAND=-1;popEVE_EVE=-3.8606567;popEVE_SCORE=-2.5551376;popEVE_gap_frequency=0.0376106194690265;popEVE_gene=AGT;popEVE_mutant=M259T;popEVE_pop_adjusted_EVE=-2.5551376;popEVE_protein=NP_001369746.2
rs699 1:230710048 G ENSG00000244137 ENST00000412344 Transcript downstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=650;STRAND=-1
rs699 1:230710048 G ENSG00000135744 ENST00000679684 Transcript missense_variant 1287 776 259 M/T aTg/aCg - IMPACT=MODERATE;STRAND=-1;popEVE_EVE=-3.8606567;popEVE_SCORE=-2.5551376;popEVE_gap_frequency=0.0376106194690265;popEVE_gene=AGT;popEVE_mutant=M259T;popEVE_pop_adjusted_EVE=-2.5551376;popEVE_protein=NP_001369746.2

@ainefairbrother ainefairbrother requested a review from jamie-m-a May 20, 2026 11:11