Critical Assessment of Genome Interpretation

Annotate all in-frame indels

Challenge: Annotate all in-frame indels

Variant data: public

Last updated: 28 July 2025

This challenge is open and closes on 15 September 2025.

How to participate in CAGI7? Download data & submit predictions on Synapse

Make sure you understand our Data Use Agreement and Anonymity Policy

Summary

Short in-frame insertion and deletion (indel) variants add or remove one or more amino acids without disrupting the open reading frame. Although these variants do not cause frameshifts, they can still have profound consequences for protein structure, stability, and function, and are implicated in a variety of genetic diseases. The challenge is to predict the pathogenicity of all single, double, and triple amino acid deletions and all single amino acid insertions in human protein-coding, Mendelian disease-associated genes. For the vast majority of these variants, experimental or clinical functional data are not currently available, but such evidence is accumulating rapidly. Predictions will be assessed against new experimental or clinical annotations as they become available, with regular evaluations in line with CAGI standards.

Background

An average human has between 130 and 178 short (≤50 amino acids) in-frame indels (Marian, 2020). A growing number of in silico tools exist for predicting the impact of missense and nonsense variants (Lin et al., 2024), but far fewer focus on in-frame indels; see, for example, Douville et al. (2016), Pagel et al. (2019), or Wei et al. (2024). In-frame indels can disrupt critical protein domains, interfaces, or regulatory sites, and their effects are often context-dependent. A comprehensive blind assessment of the quality of these models will help fill the annotation gap for in-frame indels, supporting both research and clinical variant interpretation.

Prediction challenge

Participants are provided with a comprehensive list of all possible in-frame indels (insertions and deletions in multiples of three nucleotides) in human protein-coding genes, based on the latest genome build. For each variant, participants must predict its functional effect, expressed as a score from 0 (benign/no effect) to 1 (deleterious/complete loss of function), along with a standard deviation indicating the confidence in that score. Predictions will be evaluated as new experimental or clinical data become available, similar to the Annotate All Missense challenges (Rastogi et al., 2025).

Key considerations. (1) Variant context: effects may vary depending on protein domain, structural region, or evolutionary conservation. (2) Indel length and position: longer indels or those in critical regions may have greater impact. (3) Data limitations: most in-frame indels lack direct experimental evidence; predictions should be robust to uncertainty.

Submission format

The prediction submission is a tab-delimited text file. Organizers provide a file template, which should be used for submission. Predictors should check the correctness of the format before submitting their predictions; no validation script is provided, but Mutalyzer can be used to check the variant nomenclature. The amino acid sequences used for this challenge can be obtained from RefSeq MANE Select release v1.4.

In the submitted file, each row must include the following tab-separated fields:

Variant: protein change in HGVS protein-level notation; e.g., NP_001596.2:p.Met100_Leu101del (deletion of two amino acids), NP_001596.2:p.Leu5_Thr6insPro (insertion of one amino acid)
Prediction score: real-valued score for each indel from 0 (benign) to 1 (deleterious)
Standard deviation: standard deviation of the prediction in column 2, indicating confidence (must be a positive number)
Classification: based on the score in column 2, indicate whether the indel is "D(amaging)", "T(olerated)", or "U(nknown)"
Comment: optional brief comment on the basis of the prediction in columns 2-4
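
As an illustration of the variant notation in column 1, the following Python sketch parses names of the two forms shown above and checks that they fall within the scope of the challenge (deletions of one to three residues, insertions of one residue). It is not an official tool. The function name parse_variant is hypothetical, and the sketch assumes RefSeq NP_ accessions with a version suffix and only the plain "del" and "ins" forms; it does not handle other HGVS forms such as "dup" or "delins", which the template may or may not contain.

```python
import re

# Three-letter codes of the 20 standard amino acids.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Assumed shape: accession:p.<Aa><pos>[_<Aa><pos>] followed by del or ins<Aa...>
VARIANT_RE = re.compile(
    rf"^(?P<acc>NP_\d+\.\d+):p\."
    rf"(?P<aa1>{AA3})(?P<pos1>\d+)"
    rf"(?:_(?P<aa2>{AA3})(?P<pos2>\d+))?"
    rf"(?P<kind>del|ins)(?P<inserted>(?:{AA3})*)$"
)


def parse_variant(name):
    """Return (accession, kind, length in residues) or raise ValueError."""
    m = VARIANT_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    pos1 = int(m["pos1"])
    pos2 = int(m["pos2"]) if m["pos2"] else pos1
    if m["kind"] == "del":
        if m["inserted"]:
            raise ValueError("a deletion must not list inserted residues")
        length = pos2 - pos1 + 1
        if not 1 <= length <= 3:
            raise ValueError("only deletions of 1-3 residues are in scope")
    else:
        # An insertion is written between two adjacent flanking residues.
        if m["pos2"] is None or pos2 != pos1 + 1:
            raise ValueError("an insertion needs two adjacent flanking residues")
        length = len(m["inserted"]) // 3  # each residue is a 3-letter code
        if length != 1:
            raise ValueError("only single-residue insertions are in scope")
    return m["acc"], m["kind"], length
```

For the two example names given in the field description, this sketch returns a deletion of length 2 and an insertion of length 1, respectively.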

In the template file, some cells in columns 2-5 are marked with a "*". Include only the variants for which you make predictions, and leave the "*" in any field you do not use; no empty cells are allowed in the submission. Unlike in the Annotate All Missense challenge, you do not have to submit predictions and standard deviations for all variants. If you are not confident in a prediction for a variant, enter an appropriately large standard deviation for it. Comments are optional; if you do not enter a comment on a prediction, leave the "*" in that cell. Please follow the submission guidelines strictly.
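
The field rules above can be checked before upload with a few lines of Python. The following is a minimal sketch, not the organizers' validator. The function name check_row is hypothetical, and the sketch assumes that the classification column holds the one-letter labels D, T, or U and that "*" is acceptable in any of columns 2-5; consult the template file for the authoritative conventions.

```python
import math

PLACEHOLDER = "*"          # marks an unused field
LABELS = {"D", "T", "U"}   # assumed one-letter classification labels


def _as_float(text):
    """Convert text to float, returning None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def check_row(line):
    """Return a list of problems found in one submission row (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return [f"expected 5 tab-separated fields, found {len(fields)}"]
    if any(cell.strip() == "" for cell in fields):
        return ["empty cells are not allowed; use '*' for unused fields"]

    variant, score, sd, label, comment = fields
    problems = []
    if variant == PLACEHOLDER:
        problems.append("variant name is required")
    if score != PLACEHOLDER:
        value = _as_float(score)
        if value is None or not 0.0 <= value <= 1.0:
            problems.append("score must be a number in [0, 1]")
    if sd != PLACEHOLDER:
        value = _as_float(sd)
        if value is None or not math.isfinite(value) or value <= 0.0:
            problems.append("standard deviation must be a positive number")
    if label != PLACEHOLDER and label not in LABELS:
        problems.append("classification must be D, T, or U")
    return problems
```

Running check_row over every line of a prediction file and printing the line numbers of any non-empty results gives a quick format check before submission.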

In addition, your submission must include a detailed description of the method used to make the predictions, similar in style to the Methods section in a scientific article. This information should be submitted as a separate file and contain (1) algorithms and features used; e.g., structural modeling, conservation, machine learning; (2) training data sources; e.g., ClinVar, gnomAD; (3) any assumptions or limitations.

File naming

CAGI allows submission of up to six models per team, of which model 1 is considered primary. You can upload predictions for each model multiple times; for each model, the last submission before the deadline will be evaluated. If you are submitting a single file with all predictions combined, please use the format below.

Use the following format for your submissions: <teamname>_model_(1|2|3|4|5|6).(tsv|txt)

To include a description of your method, use the following filename: <teamname>_desc.*

Example: if your team’s name is “bestincagi” and you are submitting predictions for your model number 3, your filename should be bestincagi_model_3.txt.

If, however, your files are large, you can split your predictions into four files, one per variant type. For example, for single amino acid deletions, please use <teamname>_del1_model_(1|2|3|4|5|6).(tsv|txt). For deletions of length two and three, please replace “del1” with “del2” or “del3”. For insertions, please use “ins1” in place of “del1”.
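
The naming patterns above can be checked mechanically. The sketch below is illustrative only: the function name classify_filename is hypothetical, and it assumes that team names consist of letters, digits, and hyphens (no underscores), which the challenge description does not specify.

```python
import re

# Assumption: team names contain only letters, digits, and hyphens.
TEAM = r"[A-Za-z0-9-]+"

COMBINED_RE = re.compile(rf"^{TEAM}_model_[1-6]\.(tsv|txt)$")
SPLIT_RE = re.compile(rf"^{TEAM}_(del1|del2|del3|ins1)_model_[1-6]\.(tsv|txt)$")
DESC_RE = re.compile(rf"^{TEAM}_desc\..+$")


def classify_filename(name):
    """Return 'split', 'combined', 'description', or None if no pattern fits."""
    if SPLIT_RE.match(name):
        return "split"        # one file per variant type (del1, del2, del3, ins1)
    if COMBINED_RE.match(name):
        return "combined"     # single file with all predictions for one model
    if DESC_RE.match(name):
        return "description"  # method description, any extension
    return None
```

With these patterns, bestincagi_model_3.txt is classified as a combined prediction file, while a model number outside 1-6 matches none of them.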

Related challenges

Download data

Variant data: cagi7-inframe-indel-files.zip (354MB).

Download submission template file: annotatealinframeindelstemplate.tsv

We do not provide a validation script. Participants are encouraged to use Mutalyzer to validate their variant nomenclature.

References

Douville C, et al. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel). Hum Mutat (2016) 37(1):28-35.

Lin YJ, et al. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics (2024) 18(1):90.

Marian AJ. Clinical interpretation and management of genetic variants. JACC Basic Transl Sci (2020) 5(10):1029-1042.

Pagel KA, et al. Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome. PLoS Comput Biol (2019) 15(6):e1007112.

Rastogi R, et al. Critical assessment of missense variant effect predictors on disease-relevant variant data. Hum Genet (2025) 144(2-3):281-293.

Wei Y, et al. INDELpred: Improving the prediction and interpretation of indel pathogenicity within the clinical genome. HGG Adv (2024) 5(4):100325.

Revision history

4 June 2025: challenge preview posted

22 June 2025: challenge opens

28 July 2025: submission template file added

Center for Critical Assessment of Genome Interpretation
