CAGI7 Challenge
Summer-Winter 2025
Eleven challenges announced, seven challenges released.
Clinical Genomes
1. Rare Genomes Project — announced
Identify diagnostic variants in children with rare disease from the Rare Genomes Project
The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery. The study is led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. Research subjects are consented for genomic sequencing and the sharing of their sequence and phenotype information with researchers working to understand the molecular causes of rare disease. When a candidate disease variant believed to be related to the phenotype is identified, the variant is confirmed with Sanger sequencing in a clinical setting and returned to the participant via his or her local physician. In this challenge, whole genome sequence data and phenotype data from a subset of the solved and unsolved RGP families will be provided. Participants in the challenge will try to identify the causative variant(s) in each case. For the unsolved cases, prioritized variants from the participating teams will be examined to see if additional diagnoses can be made.
Data provided by: Heidi Rehm, Anne O’Donnell-Luria, Melanie O’Leary, Broad Institute of MIT and Harvard
Deep Mutational Scanning Challenges
1. TSC2 — to open soon
Predict the effect of missense variants on TSC2 protein stability
TSC2 encodes tuberin, a tumor suppressor protein involved in regulating cell growth and proliferation. Variants that affect TSC2 function are associated with Tuberous Sclerosis Complex (TSC) and Lymphangioleiomyomatosis (LAM). In this challenge, two libraries of TSC2 missense variants, one within the tuberin domain and another in the RapGAP domain, have been assessed for their effects on protein stability using a high-throughput multiplexed variant stability profiling assay. The challenge is to predict the quantitative impact of these variants on TSC2 stability, as measured by the assay.
Data provided by: Doug Fowler, University of Washington
2. BARD1 — to open soon
Predict the effect of BARD1 single nucleotide variants on RNA abundance and cell survival
BARD1 (BRCA1-Associated RING Domain 1) forms a heterodimer with BRCA1, which is critical for DNA double-strand break repair and tumor suppression. In this challenge, all possible BARD1 single nucleotide variants were assessed for their effects on RNA abundance and cell survival using Saturation Genome Editing in haploid human cells. Participants are asked to predict two separate function scores for each variant, reflecting experimental measurements of RNA abundance and cellular fitness.
Data provided by: Lea Starita, University of Washington
3. LPL — to open soon
Predict the effect of lipoprotein lipase (LPL) variants on cell-surface abundance in mammalian cells
Lipoprotein lipase (LPL) is a key enzyme in lipid metabolism, hydrolyzing triglycerides in triglyceride-rich lipoproteins to release free fatty acids to surrounding tissue. Dysfunction in LPL can cause familial hypertriglyceridemia and familial chylomicronemia and can increase the risk of cardiometabolic disease. We have assessed the impact of a comprehensive set of LPL coding variants on LPL cell-surface abundance in mammalian cells: the challenge is to predict the functional consequence of these variants.
Data provided by: Fritz Roth, University of Pittsburgh
4. ATP7B — to open soon
Predict the effect of copper-transporting ATPase 2 (ATP7B) variants in a yeast growth assay
ATP7B, a copper-transporting P-type ATPase, is essential for copper homeostasis and predominantly expressed in the liver. Variants associated with ATP7B dysfunction cause Wilson disease, an autosomal recessive disorder characterized by toxic copper accumulation in the liver, brain, and other tissues. A large library of ATP7B missense variants was assessed with respect to their effects on protein function using a high-throughput yeast complementation assay. The challenge is to predict the functional effects of these variants.
Data provided by: Fritz Roth, University of Pittsburgh
Non-Coding Variant Interpretation
1. lentiMPRA — to open soon
Predict variant effects in functional regulatory elements using lentiMPRA
The challenge is to predict the functional impact of genetic variants on regulatory element activity. For this purpose, a subset of functionally validated regulatory elements was selected from a large-scale lentiMPRA study, and existing single nucleotide variants (SNVs) were introduced into these elements. SNVs were chosen from the 1000 Genomes Project with a focus on variants with diverse allele frequency distributions and proximity to known genes. Each SNV was tested in both reference and alternative allele contexts using lentiMPRA in HepG2 cells across three biological replicates. Sequences were cloned upstream of a minimal promoter in a barcode-tagged reporter construct. To determine the activity of each sequence, reporter gene expression was measured relative to the plasmid DNA by short-read sequencing of barcodes from the reporter libraries. Variant effects were determined as the difference in activity between paired reference and alternative sequences.
Data provided by: Arjun Devadas, University Medical Center Schleswig-Holstein; Ryan Hernandez, University of California San Francisco; Nadav Ahituv, University of California San Francisco; Martin Kircher, University Medical Center Schleswig-Holstein, Berlin Institute of Health at Charité-Universitätsmedizin Berlin
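The lentiMPRA variant-effect calculation described above can be illustrated with a minimal sketch. It is not the data providers' pipeline: the log2 RNA/DNA ratio, the pseudocount, the alternative-minus-reference direction, and the function names `activity` and `variant_effect` are all assumptions made for illustration. Real analyses typically also aggregate over barcodes and normalize for sequencing depth, which is omitted here.

```python
import math
from statistics import mean


def activity(rna_count, dna_count, pseudocount=1.0):
    """Activity of one sequence in one replicate.

    Assumption: activity is the log2 ratio of RNA barcode counts to
    plasmid DNA barcode counts, with a pseudocount to avoid log(0).
    """
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))


def variant_effect(ref_replicates, alt_replicates):
    """Effect of one SNV, averaged over paired replicates.

    Each argument is a list of (rna_count, dna_count) tuples, one per
    biological replicate, in the same replicate order.
    Assumption: effect = alternative activity minus reference activity.
    """
    diffs = [
        activity(*alt) - activity(*ref)
        for ref, alt in zip(ref_replicates, alt_replicates)
    ]
    return mean(diffs)


# Hypothetical counts for three biological replicates of one SNV.
ref = [(15, 7), (31, 15), (7, 3)]
alt = [(31, 7), (63, 15), (15, 3)]
effect = variant_effect(ref, alt)
```

With these made-up counts the alternative allele is about twice as active as the reference in every replicate, so the effect is positive under the sign convention assumed here.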
Annotation Accumulation Accuracy Assessments
1. Annotate All Missense — to open soon
Predict pathogenicity of all missense variants
The challenge is to predict the effect of every missense variant (a variant that results in a single amino acid substitution) listed in dbNSFP, a database that currently describes 82,198,516 single nucleotide variants in the human genome that change a protein’s amino acid sequence. The effect of the vast majority of missense variants is currently unknown, but experimental and clinical evidence is accruing rapidly. Rather than drawing upon a single dataset, as is typical of most CAGI challenges, predictions will be assessed by comparing them with clinical (pathogenicity) and experimental annotations made available after the prediction submission date, and on an ongoing basis. If predictors assent, their predictions will also be incorporated into dbNSFP.
Data provided by: Xiaoming Liu, University of South Florida
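As a rough illustration of assessing predictions against annotations that appear after the submission date, the sketch below scores only variants annotated later than a given date. The challenge description does not name an evaluation metric; AUROC is used here purely as an assumed example, and the variant identifiers, dates, and function names are hypothetical.

```python
from datetime import date


def auroc(pos_scores, neg_scores):
    """Probability that a pathogenic variant outscores a benign one
    (ties count as half), computed over all positive/negative pairs."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def assess(predictions, annotations, submitted_on):
    """Score predictions using only annotations newer than the submission.

    predictions: dict mapping variant id -> predicted pathogenicity score
    annotations: iterable of (variant id, is_pathogenic, annotation date)
    Returns None when either class has no newly annotated variants.
    """
    pos, neg = [], []
    for variant, is_pathogenic, annotated_on in annotations:
        if annotated_on <= submitted_on or variant not in predictions:
            continue  # skip annotations available at submission time
        (pos if is_pathogenic else neg).append(predictions[variant])
    if not pos or not neg:
        return None
    return auroc(pos, neg)


# Hypothetical predictions and annotations.
predictions = {"v1": 0.9, "v2": 0.2, "v3": 0.6, "v4": 0.4, "v5": 0.5}
annotations = [
    ("v1", True, date(2025, 9, 1)),
    ("v2", False, date(2025, 10, 1)),
    ("v3", False, date(2025, 1, 1)),  # predates submission, ignored
    ("v4", True, date(2025, 11, 1)),
    ("v5", False, date(2025, 12, 1)),
]
score = assess(predictions, annotations, submitted_on=date(2025, 8, 1))
```

Re-running `assess` as new annotations accumulate gives a running evaluation without changing the submitted predictions.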
2. Annotate All In-Frame Indels — to open soon
Predict pathogenicity of in-frame indel variants
Short in-frame insertion and deletion (indel) variants add or remove one or more amino acids without disrupting the open reading frame. While these variants do not cause frameshifts, they can still have profound consequences on protein structure, stability, and function, and are implicated in a variety of genetic diseases. The challenge is to predict the pathogenicity of all single, double, and triple amino acid deletions and all single amino acid insertions in human protein-coding, Mendelian disease-associated genes. For the vast majority of these variants, experimental or clinical functional data are not currently available, but such evidence is accumulating rapidly. Predictions will be assessed against new experimental or clinical annotations as they become available, with regular evaluations in line with CAGI standards.
Data provided by: CAGI organizers
Additional challenges in the queue
1. ARSA — announced
Predict pathogenicity of a large number of missense variants in ARSA.
2. Splicing — announced
Predict splicing events from a high-throughput splicing assay.
3. Annotate All Loss-of-Function Variants — announced
Predict pathogenicity of frameshifting indels, stop-gain variants, and stop-loss variants.
Last updated: June 4, 2025