Critical Assessment of Genome Interpretation

Challenge: Annotate all missense

Variant data: public

Last updated: 1 October 2025

This challenge closed on September 30, 2025.

How to participate in CAGI7? Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy

Summary

The challenge is to predict the effect of every missense variant (a variant that results in a single amino acid substitution) listed in dbNSFP, a database that currently describes 82,198,516 single nucleotide variants in the human genome that change a protein’s amino acid sequence. The effect of the vast majority of missense variants is currently unknown, but experimental and clinical evidence is accruing rapidly. Rather than drawing upon a single dataset, as is typical of most CAGI challenges, predictions will be assessed against clinical (pathogenicity) and experimental annotations made available after the prediction submission date, and on an ongoing basis. If predictors assent, their predictions will also be incorporated into dbNSFP.

Background

Hundreds of in silico methods for predicting variant effects have been published (Lin et al., 2024). In many cases, different methods give opposite predictions for the same variant. dbNSFP is a database of human nonsynonymous single nucleotide variants (nsSNVs) and their functional predictions and annotations (Liu et al., 2011, 2013, 2016, 2020). It compiles a number of functional prediction and conservation scores, as well as related information including allele frequencies observed in different large datasets, gene IDs from different databases, functional descriptions of genes, and gene expression and gene interaction information.

Prediction challenge

A list of all possible nsSNVs based on the human reference sequence was created from dbNSFP v5.1 (Liu et al., 2020). Participants are asked to predict the functional effect of each missense variant. Since the vast majority of these nsSNVs do not have experimental information, this challenge will assess in silico predictions with new experimental or clinical annotations as they appear in the literature and accumulate in databases. We anticipate making regular evaluations at the time of each CAGI experiment. An assessment of the CAGI6 Annotate All Missense challenge was reported by Rastogi et al. (2025).

Test file format

chr: Chromosome number
pos(1-based): Physical position on the chromosome as to hg38 (1-based coordinate). For mitochondrial SNV, this position refers to the rCRS (GenBank: NC_012920).
Ref: Reference nucleotide allele (as on the + strand)
Alt: Alternative nucleotide allele (as on the + strand)
aaref: Reference amino acid. "X" if the variant is a stop-loss
aaalt: Alternative amino acid. "X" if the variant is a stop-gain
hg19_chr: Chromosome as to hg19, "." means missing
hg19_pos (1-based): Physical position on the chromosome as to hg19 (1-based coordinate). For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
hg18_chr: Chromosome as to hg18, "." means missing
hg18_pos (1-based): Physical position on the chromosome as to hg18 (1-based coordinate). For mitochondrial SNV, this position refers to a YRI sequence (GenBank: AF347015)
Genename: Gene name; if the nsSNV can be assigned to multiple genes, gene names are separated by ";"
Cds_strand: Coding sequence (CDS) strand (+ or -)
Refcodon: Reference codon
Codonpos: Position on the codon (1, 2 or 3)
Ensembl_geneid: Ensembl gene id
Ensembl_transcriptid: Ensembl transcript ids (multiple entries separated by ";")
Ensembl_proteinid: Ensembl protein ids. Multiple entries separated by ";", corresponding to Ensembl_transcriptids
AApos: Amino acid position with respect to protein. Multiple entries separated by ";", corresponding to Ensembl_proteinid.
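
As an illustration of how the multi-valued fields line up, the following is a minimal sketch of reading the test file in Python. It assumes the file is tab-delimited with a header row whose column names match the field names listed above; neither is stated on this page, so check the downloaded files first. The function names and the sample row are hypothetical and contain made-up placeholder values, not real variants.

```python
import csv
import io

# Fields that the description above says may hold several ";"-separated entries.
MULTI_VALUE_FIELDS = ("Genename", "Ensembl_transcriptid", "Ensembl_proteinid", "AApos")


def parse_test_rows(handle):
    """Yield one dict per variant, splitting ';'-separated fields into lists.

    Assumption: tab-delimited input with a header row using the field names above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for field in MULTI_VALUE_FIELDS:
            if row.get(field) is not None:
                row[field] = row[field].split(";")
        yield row


def per_transcript(row):
    """Pair each transcript id with its protein id and amino acid position."""
    return list(zip(row["Ensembl_transcriptid"],
                    row["Ensembl_proteinid"],
                    row["AApos"]))


# Made-up placeholder row, for illustration only.
sample = (
    "chr\tpos(1-based)\tEnsembl_transcriptid\tEnsembl_proteinid\tAApos\n"
    "1\t100\tT1;T2\tP1;P2\t10;12\n"
)
rows = list(parse_test_rows(io.StringIO(sample)))
print(per_transcript(rows[0]))
```

Positions are kept as strings here so that any non-numeric placeholder in the real files survives unchanged.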

Prediction submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. In addition, a validation script is provided, and predictors should check the correctness of the format before submitting their predictions.

In the submitted file, each row must include the following tab-separated fields:

chr: chromosome number
pos(1-based): physical position on the chromosome as to hg38 (1-based coordinate). For mitochondrial SNV, this position refers to the rCRS (GenBank: NC_012920).
ref: reference nucleotide allele (as on the + strand)
alt: alternative nucleotide allele (as on the + strand)
Prediction score: annotation score for each SNV from 0 (benign) to 1 (deleterious)
SD: standard deviation of the prediction in column 5 indicating confidence
Pred: Based on the score in column 5, indicate whether the SNV is "D(amaging)", "T(olerated)" or "U(nknown)"
Comments: optional brief comment on the basis of the prediction

In the template file, cells in columns 5-8 are marked with a "*". Submit your predictions by replacing the "*" with your value. No empty cells are allowed in the submission. You must submit a prediction and a standard deviation for every variant; if you are not confident in a prediction for a variant, enter an appropriately large standard deviation. Optionally, enter a brief comment on the basis of the prediction. If you do not enter a comment on a prediction, leave the "*" in that cell. Please follow the submission guidelines strictly.
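
The rules above can be sketched as a small row formatter. This is not the organizers' validation script, which remains the authoritative check; it only illustrates the stated constraints. It assumes that Pred is written as the single letter D, T or U and that scores are written with four decimal places, neither of which is specified on this page. The function name format_row is hypothetical.

```python
import math

# Assumption: the Pred column holds the single letter, not the full word.
ALLOWED_PRED = {"D", "T", "U"}


def format_row(chrom, pos, ref, alt, score, sd, pred, comment=None):
    """Return one tab-separated submission line with eight non-empty cells."""
    if not (0.0 <= score <= 1.0):  # also rejects NaN
        raise ValueError("score must lie between 0 (benign) and 1 (deleterious)")
    if math.isnan(sd) or sd < 0:
        raise ValueError("standard deviation must be a non-negative number")
    if pred not in ALLOWED_PRED:
        raise ValueError("pred must be one of D, T or U")
    cells = [
        str(chrom), str(pos), ref, alt,
        f"{score:.4f}", f"{sd:.4f}", pred,
        comment if comment else "*",  # keep the "*" when no comment is given
    ]
    if any(c == "" or "\t" in c or "\n" in c for c in cells):
        raise ValueError("cells must be non-empty and free of tabs and newlines")
    return "\t".join(cells)


# Made-up example values, for illustration only.
line = format_row("1", 100, "A", "C", 0.8, 0.1, "D")
print(line)
```

A low-confidence prediction would be expressed by passing a larger sd rather than by leaving a cell empty.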

In addition, your submission must include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information will be submitted as a separate file.

If predictors assent, predictions will also be incorporated into dbNSFP. This must be explicitly specified in the document describing the method.

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description of your method, use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.
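
The naming rules can be checked before upload with a short sketch such as the one below. The page does not say which characters a team name may contain, so the pattern assumes any non-empty string without whitespace or path separators; the function name classify_filename is hypothetical.

```python
import re

# Assumption: team names contain no whitespace and no path separators.
TEAM = r"(?P<team>[^\s/\\]+)"
PREDICTION_NAME = re.compile(TEAM + r"_model_(?P<model>[1-6])\.(?P<ext>tsv|txt)")
DESCRIPTION_NAME = re.compile(TEAM + r"_desc\..+")


def classify_filename(name):
    """Return (kind, team, model) for a valid filename, or None otherwise."""
    match = PREDICTION_NAME.fullmatch(name)
    if match:
        return ("prediction", match.group("team"), int(match.group("model")))
    match = DESCRIPTION_NAME.fullmatch(name)
    if match:
        return ("description", match.group("team"), None)
    return None


print(classify_filename("bestincagi_model_3.txt"))
print(classify_filename("bestincagi_model_7.txt"))
```

The first call recognizes the example filename given above; the second returns None because only models 1 through 6 are allowed.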

References

Lin YJ, et al. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics (2024) 18(1):90. PubMed

Liu X, et al. dbNSFP: A lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat (2011) 32:894-899. PubMed

Liu X, et al. dbNSFPv2.0: a database of human non-synonymous SNVs and their functional predictions and annotations. Hum Mutat (2013) 34:E2393-2402. PubMed

Liu X, et al. dbNSFPv3.0: a one-stop database of functional predictions and annotations for human nonsynonymous and splice-site SNVs. Hum Mutat (2016) 37:235-241. PubMed

Liu X, et al. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med (2020) 12:103. PubMed

Rastogi R, et al. Critical assessment of missense variant effect predictors on disease-relevant variant data. Hum Genet (2025) 144(2-3):281-293. PubMed

Download data

Divided by chromosomes: dbNSFP5.1_nsSNV.zip (860MB).

Download submission template extract: annotateallmissensetemplate.zip

Download submission validation script: annotateallmissensevalidation.py

Dataset provided by