CAGI7 Challenge

Summer-Winter 2025

Eleven challenges announced, seven challenges released. 

Clinical Genomes

1. Rare Genomes Project announced

Identify diagnostic variants in children with rare disease from the Rare Genomes Project

The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery. The study is led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. Research subjects are consented for genomic sequencing and the sharing of their sequence and phenotype information with researchers working to understand the molecular causes of rare disease. When a candidate disease variant believed to be related to the phenotype is identified, the variant is confirmed with Sanger sequencing in a clinical setting and returned to the participant via his or her local physician. In this challenge, whole genome sequence data and phenotype data from a subset of the solved and unsolved RGP families will be provided. Participants in the challenge will try to identify the causative variant(s) in each case. For the unsolved cases, prioritized variants from the participating teams will be examined to see if additional diagnoses can be made.

Data provided by: Heidi Rehm, Anne O’Donnell-Luria, Melanie O’Leary, Broad Institute of MIT and Harvard

Deep Mutational Scanning Challenges 

1. TSC2 to open soon

Predict the effect of missense variants on TSC2 protein stability

TSC2 encodes tuberin, a tumor suppressor protein involved in regulating cell growth and proliferation. Variants that affect TSC2 function are associated with Tuberous Sclerosis Complex (TSC) and Lymphangioleiomyomatosis (LAM). In this challenge, two libraries of TSC2 missense variants (one within the tuberin domain and another in the RapGAP domain) have been assessed for their effects on protein stability using a high-throughput multiplexed variant stability profiling assay. The challenge is to predict the quantitative impact of these variants on TSC2 stability, as measured by the assay.

Data provided by: Doug Fowler, University of Washington.

2. BARD1 to open soon

Predict the effect of BARD1 single nucleotide variants on RNA abundance and cell survival

BARD1 (BRCA1-Associated RING Domain 1) forms a heterodimer with BRCA1, which is critical for DNA double-strand break repair and tumor suppression. In this challenge, all possible BARD1 single nucleotide variants were assessed for their effects on RNA abundance and cell survival using Saturation Genome Editing in haploid human cells. Participants are asked to predict two separate function scores for each variant, reflecting experimental measurements of RNA abundance and cellular fitness.

Data provided by: Lea Starita, University of Washington

3. LPL to open soon 

Predict the effect of lipoprotein lipase (LPL) variants on cell-surface abundance in mammalian cells

Lipoprotein lipase (LPL) is a key enzyme in lipid metabolism, hydrolyzing triglycerides in triglyceride-rich lipoproteins to release free fatty acids to surrounding tissue. LPL dysfunction can cause familial hypertriglyceridemia and familial chylomicronemia and can increase the risk of cardiometabolic disease. The impact of a comprehensive set of LPL coding variants on LPL cell-surface abundance has been assessed in mammalian cells. The challenge is to predict the functional consequence of these variants.

Data provided by: Fritz Roth, University of Pittsburgh

4. ATP7B to open soon 

Predict the effect of copper-transporting ATPase 2 (ATP7B) variants in a yeast growth assay

ATP7B, a copper-transporting P-type ATPase, is essential for copper homeostasis and predominantly expressed in the liver. Variants associated with ATP7B dysfunction cause Wilson disease, an autosomal recessive disorder characterized by toxic copper accumulation in the liver, brain, and other tissues. A large library of ATP7B missense variants was assessed with respect to their effects on protein function using a high-throughput yeast complementation assay. The challenge is to predict the functional effects of these variants.

Data provided by: Fritz Roth, University of Pittsburgh

Non-Coding Variant Interpretation 

1. lentiMPRA to open soon

Predicting variant effects in functional regulatory elements using lentiMPRA

The challenge is to predict the functional impact of genetic variants on regulatory element activity. For this purpose, a subset of functionally validated regulatory elements from a large-scale lentiMPRA study was selected, and existing single nucleotide variants (SNVs) in these elements were added. SNVs were chosen from the 1000 Genomes Project with a focus on variants with diverse allele frequency distributions and proximity to known genes. Each SNV was tested in both reference and alternative allele contexts using lentiMPRA in HepG2 cells across three biological replicates. Sequences were cloned upstream of a minimal promoter in a barcode-tagged reporter construct. Reporter gene expression was measured relative to the plasmid DNA using short-read sequencing of barcodes from the reporter libraries to determine the activity of the sequences. Variant effects were determined as the difference between the activities of paired reference and alternative sequences.

Data provided by: Arjun Devadas, University Medical Center Schleswig-Holstein; Ryan Hernandez, University of California San Francisco; Nadav Ahituv, University of California San Francisco; Martin Kircher, University Medical Center Schleswig-Holstein, Berlin Institute of Health at Charité-Universitätsmedizin Berlin.
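
As a rough illustration of the measurement described for this challenge, the sketch below computes a per-replicate activity as a log2 RNA/DNA ratio and a variant effect as the paired difference averaged over replicates. The log2 transform, the pseudocount, the alternative-minus-reference sign convention, and all counts are assumptions made for the example; the actual processing pipeline is not specified here.

```python
import math
from statistics import mean


def activity(rna_count, dna_count, pseudocount=1.0):
    """log2 RNA/DNA ratio for one sequence in one replicate.

    The pseudocount is an assumption to avoid division by zero.
    """
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))


def variant_effect(ref_reps, alt_reps):
    """Mean over replicates of (alt activity - ref activity).

    Each argument is a list of (rna_count, dna_count) tuples,
    one per replicate, in matching replicate order.
    """
    if len(ref_reps) != len(alt_reps):
        raise ValueError("ref and alt need the same number of replicates")
    diffs = [activity(*alt) - activity(*ref)
             for ref, alt in zip(ref_reps, alt_reps)]
    return mean(diffs)


# Hypothetical barcode-summed counts for three biological replicates.
ref = [(199, 99), (399, 199), (99, 49)]
alt = [(399, 99), (799, 199), (199, 49)]
effect = variant_effect(ref, alt)
print(f"variant effect (log2 scale): {effect:.2f}")
```

A positive value under this convention means the alternative allele drives higher reporter activity than the reference allele.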

Annotation Accumulation Accuracy Assessments

1. Annotate All Missense to open soon

Predict pathogenicity of all missense variants

The challenge is to predict the effect of every missense variant (a variant that results in a single amino acid substitution) listed in dbNSFP, a database that currently describes 82,198,516 single nucleotide variants in the human genome that change a protein’s amino acid sequence. The effect of the vast majority of missense variants is currently unknown, but experimental and clinical evidence is accruing rapidly. Rather than drawing upon a single dataset, as is typical of most CAGI challenges, predictions will be assessed by comparing them with clinical (pathogenicity) and experimental annotations made available after the prediction submission date, and on an ongoing basis. If predictors assent, their predictions will also be incorporated into dbNSFP.

Data provided by: Xiaoming Liu, University of South Florida
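
The assessment idea for this challenge, scoring predictions only against annotations that appear after the submission date, can be sketched as follows. The metric (a rank-based AUROC), the submission date, the variant identifiers, and every value are hypothetical; the assessors' actual metrics are not stated here.

```python
from datetime import date


def auroc(scores, labels):
    """Rank-based AUROC: chance that a pathogenic variant outscores
    a benign one, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def assess(predictions, annotations, submission_date):
    """Score predictions only against annotations released
    after submission_date."""
    fresh = [(v, y) for v, y, released in annotations
             if released > submission_date and v in predictions]
    scores = [predictions[v] for v, _ in fresh]
    labels = [y for _, y in fresh]
    return auroc(scores, labels)


# Hypothetical predictions: variant id -> predicted pathogenicity score.
predictions = {"var_a": 0.9, "var_b": 0.2, "var_c": 0.95, "var_d": 0.4}
# Hypothetical annotations: (variant id, 1 = pathogenic / 0 = benign, release date).
annotations = [
    ("var_a", 1, date(2025, 9, 1)),
    ("var_b", 0, date(2025, 10, 1)),
    ("var_c", 0, date(2025, 11, 1)),
    ("var_d", 1, date(2024, 1, 1)),  # predates submission, so excluded
]
result = assess(predictions, annotations, date(2025, 8, 1))
print(f"AUROC on post-submission annotations: {result}")
```

Because the filter depends only on release dates, the same function can be rerun as new annotations accumulate.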

2. Annotate All In-Frame Indels to open soon 

Predict pathogenicity of in-frame indel variants

Short in-frame insertion and deletion (indel) variants add or remove one or more amino acids without disrupting the open reading frame. While these variants do not cause frameshifts, they can still have profound consequences on protein structure, stability, and function, and are implicated in a variety of genetic diseases. The challenge is to predict the pathogenicity of all single, double, and triple amino acid deletions and all single amino acid insertions in human protein-coding, Mendelian disease-associated genes. For the vast majority of these variants, experimental or clinical functional data are not currently available, but such evidence is accumulating rapidly. Predictions will be assessed against new experimental or clinical annotations as they become available, with regular evaluations in line with CAGI standards.

Data provided by: CAGI organizers

Additional challenges in the queue

1. ARSA announced

Predict pathogenicity of a large number of missense variants in ARSA.

2. Splicing announced

Predict splicing events from a high-throughput splicing assay.

3. Annotate All Loss-of-Function Variants announced

Predict pathogenicity of frameshift indels, stop-gain variants, and stop-loss variants.


Last updated: June 4, 2025