Critical Assessment of Genome Interpretation

CAGI7 Challenge

Summer-Winter 2025

Fourteen challenges released and completed.

Clinical Genomes

Identify diagnostic variants in children with rare disease from the Rare Genomes Project

The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery, led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. In this challenge, variants from short-read genome sequencing data and phenotype data from a subset of the solved and unsolved RGP families are provided. Participants in the challenge (predictors) will try to identify the causal variant(s) in each proband. For the unsolved probands, prioritized variants from the participating teams will be examined to see if additional genetic diagnoses can be made.

Data provided by: Heidi Rehm, Anne O’Donnell-Luria, Melanie O’Leary, Stephanie DiTroia, Broad Institute of MIT and Harvard

2. Rare Genomes Project CRAM — closed

Call and identify diagnostic variants in children with rare disease from the Rare Genomes Project

The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing for rare disease diagnosis and gene discovery led by genomics experts and clinicians at the Broad Institute of MIT and Harvard. In this challenge, mapped reads from short-read genome sequence data and phenotype data from a subset of the solved and unsolved RGP probands will be provided. Participants in the challenge (predictors) will try to identify the causative variant(s) in each proband. For the unsolved probands, prioritized variants from the participating teams will be examined to see if additional genetic diagnoses can be made.

Data provided by: Heidi Rehm, Anne O’Donnell-Luria, Melanie O’Leary, Stephanie DiTroia, Broad Institute of MIT and Harvard

Polygenic Risk Scores

1. Polygenic Risk Scores — closed

Predict common disease phenotypes from individuals’ genotypes

Polygenic risk scores (PRS) have potential clinical utility for risk surveillance, prevention and personalized medicine. Participants will be provided with datasets of four real phenotypes (Type 2 Diabetes, Breast Cancer, Inflammatory Bowel Disease and Coronary Artery Disease) and of thirty simulated phenotypes representing a range of genetic architectures of common polygenic diseases. The challenge is to predict the disease outcomes of individuals in held-out validation cohorts.
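As a rough illustration of the underlying idea, a polygenic risk score is commonly computed as a weighted sum of an individual's effect-allele dosages. The sketch below assumes this additive form; the weights, dosages, and function name are hypothetical and are not taken from the challenge data or any required method.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-variant effect weights (for example, log odds ratios).
weights = [0.5, -0.25, 1.0]

# Hypothetical individuals: effect-allele counts at the same three variants.
individuals = {"ind_A": [0, 1, 2], "ind_B": [2, 2, 0]}

scores = {name: polygenic_risk_score(d, weights) for name, d in individuals.items()}
```

Ranking individuals in a held-out cohort by such a score is one simple way to turn genotypes into a disease-outcome prediction; real methods differ mainly in how the weights are estimated.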

Data provided by: Sung Chun and Shamil Sunyaev, Harvard Medical School

Deep Mutational Scanning Challenges

1. TSC2 — closed

Predict the effect of missense variants on the TSC2 protein stability

TSC2 encodes tuberin, a tumor suppressor protein involved in regulating cell growth and proliferation. Variants that affect TSC2 function are associated with Tuberous Sclerosis Complex (TSC) and Lymphangioleiomyomatosis (LAM). In this challenge, two libraries of TSC2 missense variants (one within the tuberin domain and another in the RapGAP domain) have been assessed for their effects on protein stability using a high-throughput multiplexed variant stability profiling assay. The challenge is to predict the quantitative impact of these variants on TSC2 stability, as measured by the assay.

Data provided by: Doug Fowler, University of Washington.

2. BARD1 — closed

Predict the effect of BARD1 single nucleotide variants on RNA abundance and cell survival

BARD1 (BRCA1-Associated RING Domain 1) forms a heterodimer with BRCA1, which is critical for DNA double-strand break repair and tumor suppression. In this challenge, all possible BARD1 single nucleotide variants were assessed for their effects on RNA abundance and cell survival using Saturation Genome Editing in haploid human cells. Participants are asked to predict two separate function scores for each variant, reflecting experimental measurements of RNA abundance and cellular fitness.

Data provided by: Lea Starita, University of Washington

3. LPL — closed

Predict the effect of lipoprotein lipase (LPL) variants from a surface abundance assay in mammalian cells

Lipoprotein lipase (LPL) is a key enzyme in lipid metabolism, hydrolyzing triglycerides in triglyceride-rich lipoproteins to release free fatty acids to surrounding tissue. Dysfunction in LPL can cause familial hypertriglyceridemia and familial chylomicronemia and can increase the risk of cardiometabolic disease. We have assessed the impact of a comprehensive set of LPL coding variants on LPL cell-surface abundance in mammalian cells: the challenge is to predict the functional consequence of these variants.

Data provided by: Fritz Roth, University of Pittsburgh

4. ATP7B — closed

Predict the effect of copper-transporting ATPase 2 (ATP7B) variants in a yeast growth assay

ATP7B, a copper-transporting P-type ATPase, is essential for copper homeostasis and predominantly expressed in the liver. Variants associated with ATP7B dysfunction cause Wilson disease, an autosomal recessive disorder characterized by toxic copper accumulation in the liver, brain, and other tissues. A large library of ATP7B missense variants was assessed with respect to their effects on protein function using a high-throughput yeast complementation assay. The challenge is to predict the functional effects of these variants.

Data provided by: Fritz Roth, University of Pittsburgh

5. ARSA — closed

Predict the effect of missense mutations on protein stability in Arylsulfatase A

Metachromatic Leukodystrophy (MLD) is an autosomal recessive lysosomal storage disease caused by mutations in Arylsulfatase A (ARSA) and toxic accumulation of sulfatide substrate. Genome sequencing has revealed hundreds of protein-altering ARSA missense variants, but the functional effect of most variants remains unknown. ARSA protein stability was measured using a high-throughput cellular degradation assay for a large set of variants. The challenge is to predict the fractional protein stability of each of the 8,867 missense mutant proteins at 48 hours post-expression.

Data provided by: Michael H. Gelb, University of Washington

6. FGFR — closed

Predict gain-of-function variants in the fibroblast growth factor receptors

Aberrantly activated fibroblast growth factor receptors (FGFRs) frequently drive tumorigenesis via activating, gain-of-function (GoF) mutations. The challenge involves predicting the functional impact of all possible missense variants (derived from single-nucleotide variants) in the kinase domains of human FGFR1, FGFR2, FGFR3, and FGFR4. These variants pose significant challenges for variant interpretation and precision oncology due to the lack of functional and clinical data, as well as the currently limited repertoire of approaches for predicting GoF. In addition to predicting whether variants cause activation or inactivation, the challenge also optionally involves predicting drug resistance. Predictions will be assessed against a high-throughput functional genomics saturation mutational scanning dataset.

Data provided by: Sven Diederichs, University of Freiburg & German Cancer Consortium (DKTK)

Non-Coding Variant Interpretation

1. lentiMPRA — closed

Predicting variant effects in functional regulatory elements using lentiMPRA

The challenge is to predict the functional impact of genetic variants on regulatory element activity. For this purpose, a subset of functionally validated regulatory elements from a large-scale lentiMPRA study was selected, and existing single nucleotide variants (SNVs) in these elements were added. SNVs were chosen from the 1000 Genomes Project with a focus on variants with diverse allele frequency distributions and proximity to known genes. Each SNV was tested in both reference and alternative allele contexts using lentiMPRA in HepG2 cells across three biological replicates. Sequences were cloned upstream of a minimal promoter in a barcode-tagged reporter construct. Reporter gene expression was measured relative to the plasmid DNA using short-read sequencing of barcodes from the reporter libraries to determine the activity of the sequences. Variant effects were determined as the difference between paired reference and alternative sequences.
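The measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions: activity is taken as the log2 ratio of RNA to DNA barcode counts, replicates are averaged, a pseudocount of 1 is added, and the effect is computed as alternative minus reference. None of these choices, nor the function names or counts, is confirmed by the challenge description, and library-size normalization is omitted for brevity.

```python
import math
from statistics import mean


def log2_activity(rna_counts, dna_counts, pseudocount=1.0):
    """Mean log2(RNA/DNA) across replicates; the pseudocount avoids log(0)."""
    return mean(
        math.log2((r + pseudocount) / (d + pseudocount))
        for r, d in zip(rna_counts, dna_counts)
    )


def variant_effect(ref_rna, ref_dna, alt_rna, alt_dna):
    """Alternative-allele activity minus reference-allele activity (assumed sign)."""
    return log2_activity(alt_rna, alt_dna) - log2_activity(ref_rna, ref_dna)


# Hypothetical barcode counts summed per allele, one value per biological replicate.
ref_rna, ref_dna = [199, 399, 799], [99, 199, 399]
alt_rna, alt_dna = [399, 799, 1599], [99, 199, 399]

effect = variant_effect(ref_rna, ref_dna, alt_rna, alt_dna)
```

With these made-up counts the alternative allele is twice as active as the reference, so the effect on the log2 scale is positive.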

Data provided by: Arjun Devadas, University Medical Center Schleswig-Holstein; Ryan Hernandez, University of California San Francisco; Nadav Ahituv, University of California San Francisco; Martin Kircher, University Medical Center Schleswig-Holstein, Berlin Institute of Health at Charité-Universitätsmedizin Berlin.

Splicing

1. Splicing Mini Gene — closed

Predict aberrant splicing for peri-exonic and deep intronic variants

A high-throughput splicing assay was applied to assess the effect of 9,133 single nucleotide variants or small indels on splicing in a mini-gene construct transfected into HEK293 landing pad cells. The results are expressed as ΔAbS (delta Aberrant Splicing), the difference between the variant AbS and the reference AbS. The challenge is to predict ΔAbS for each variant.
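The ΔAbS quantity can be illustrated with a short sketch. It assumes the sign convention variant minus reference and treats AbS as a fraction of aberrantly spliced transcripts; both are assumptions for illustration, and the variant names and values are hypothetical.

```python
def delta_abs(variant_abs: float, reference_abs: float) -> float:
    """Change in aberrant splicing attributed to the variant (assumed sign)."""
    return variant_abs - reference_abs


# Hypothetical (variant AbS, reference AbS) pairs, expressed as fractions.
examples = {"var_1": (0.75, 0.25), "var_2": (0.25, 0.25)}

deltas = {name: delta_abs(v, r) for name, (v, r) in examples.items()}
```

Under this convention a positive value means the variant increases aberrant splicing relative to the reference construct, and a value near zero means no detectable change.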

Data provided by: Kinga Bujakowska, Harvard Medical School

Annotation Accumulation Accuracy Assessments

1. Annotate All Missense — closed

Predict pathogenicity of all missense variants

The challenge is to predict the effect of every missense variant (a variant that results in a single amino acid substitution) listed in dbNSFP, a database that currently describes 82,198,516 single nucleotide variants in the human genome that change a protein’s amino acid sequence. The effect of the vast majority of missense variants is currently unknown, but experimental and clinical evidence is accruing rapidly. Rather than drawing upon a single dataset, as is typical of most CAGI challenges, predictions will be assessed by comparing with clinical (pathogenicity) and experimental annotations made available after the prediction submission date, and on an ongoing basis. If predictors assent, their predictions will also be incorporated into dbNSFP.

Data provided by: Xiaoming Liu, University of South Florida

2. Annotate All In-Frame Indels — closed

Predict pathogenicity of in-frame indel variants

Short in-frame insertion and deletion (indel) variants add or remove one or more amino acids without disrupting the open reading frame. While these variants do not cause frameshifts, they can still have profound consequences on protein structure, stability, and function, and are implicated in a variety of genetic diseases. The challenge is to predict the pathogenicity of all single, double, and triple amino acid deletions and all single amino acid insertions in human protein-coding, Mendelian disease-associated genes. For the vast majority of these variants, experimental or clinical functional data are not currently available, but such evidence is accumulating rapidly. Predictions will be assessed against new experimental or clinical annotations as they become available, with regular evaluations in line with CAGI standards.
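The distinction between in-frame and frameshifting indels comes down to whether the length change is a multiple of three nucleotides. The sketch below assumes VCF-style reference and alternative allele strings and an indel lying entirely within coding sequence; the function names are hypothetical, and splice-site or boundary effects are ignored.

```python
def is_in_frame(ref: str, alt: str) -> bool:
    """True if the indel changes coding length by a non-zero multiple of 3."""
    diff = len(alt) - len(ref)
    return diff != 0 and diff % 3 == 0


def residues_changed(ref: str, alt: str) -> int:
    """Number of amino acids added or removed by an in-frame indel."""
    if not is_in_frame(ref, alt):
        raise ValueError("not an in-frame indel")
    return abs(len(alt) - len(ref)) // 3


# Hypothetical alleles: a 3 bp deletion, a 6 bp deletion, and a 1 bp insertion.
single_deletion = is_in_frame("ACTG", "A")
double_deletion_size = residues_changed("ACTGCTG", "A")
frameshift = is_in_frame("A", "AT")
```

A length change of 3, 6, or 9 nucleotides corresponds to the single, double, and triple amino acid deletions named in the challenge.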

Data provided by: CAGI organizers

3. Annotate All Loss-of-Function Variants — closed

Predict pathogenicity of frameshifting indels, stop gain and stop loss variants

Loss-of-function variants represent critical classes of genetic variation that can significantly alter protein function and are implicated in a variety of genetic diseases. Two major sources contribute to these variants: (1) frameshifting mutations from out-of-frame insertions or deletions that shift the reading frame and create new downstream stop codons and (2) single nucleotide substitutions that directly create premature stop codons or eliminate natural stop codons. These variants often result in complete loss of function, dominant-negative effects, or gain of toxic function. The challenge is to predict the functional impact of all 1bp and 2bp frameshift insertions/deletions and single nucleotide substitutions that result in stop gain or stop loss across human protein-coding, Mendelian disease-associated genes. For frameshifting variants leading to the same stop gain or stop loss, only the one located at the most 5’ end is included. Predictions will be assessed against new experimental or clinical annotations as they become available, with regular evaluations in line with CAGI standards.
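To make the stop gain and stop loss categories concrete, the sketch below classifies a single-nucleotide change within one codon using the standard genetic code. It is an illustration only: the function name and category labels are hypothetical, codons are assumed to be given on the coding strand, and frameshifts are not handled.

```python
BASES = "TCAG"
# Standard genetic code, with codons ordered by first, second, third base in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label a codon substitution by its effect on the encoded residue."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"  # includes stop-to-stop changes
    if alt_aa == "*":
        return "stop_gain"  # premature stop codon created
    if ref_aa == "*":
        return "stop_loss"  # natural stop codon eliminated
    return "missense"


stop_gain = classify_codon_change("CGA", "TGA")  # Arg -> stop
stop_loss = classify_codon_change("TAA", "CAA")  # stop -> Gln
```

The same lookup also separates missense from synonymous substitutions, which is the distinction relevant to the missense annotation challenge.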

Data provided by: CAGI organizers

Last updated: June 22, 2026

Center for Critical Assessment of Genome Interpretation
