CAGI5 Challenge

Challenges released!

Regulatory variants

1. Regulation saturation

Predict the effect of all variants in 10 disease-associated promoters and 11 enhancers in an MPRA.

17,500 single nucleotide variants and small indels in 11 human disease-associated enhancers (including IRF4, IRF6, MYC, SORT1) and 10 promoters (including TERT, LDLR, F9, HBG1) were assessed in a saturation mutagenesis massively parallel reporter assay (MPRA). Promoters were cloned into a plasmid upstream of a barcoded reporter, whose expression was measured relative to the plasmid DNA to determine the impact of promoter variants. Enhancers were assayed similarly, placed upstream of a minimal promoter. The challenge is to predict the functional effects of these regulatory variants on barcoded reporter expression.
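
As a rough illustration of "expression measured relative to the plasmid DNA", the sketch below computes a log2 ratio of barcode RNA counts to plasmid DNA counts and compares a variant with its reference. The log2 scale, the pseudocount, and the function names are assumptions made for this example; the text above does not specify how the assay data were normalized or scored.

```python
import math


def log2_expression_ratio(rna_count, dna_count, pseudocount=1.0):
    """Reporter expression as log2(RNA barcode counts / plasmid DNA counts).

    The pseudocount (an assumption of this sketch) avoids division by zero
    and log(0) for barcodes with no reads.
    """
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))


def variant_effect(ref_rna, ref_dna, alt_rna, alt_dna, pseudocount=1.0):
    """Difference in log2 expression between the variant and reference alleles.

    Negative values mean the variant lowers reporter expression,
    positive values mean it raises expression.
    """
    alt = log2_expression_ratio(alt_rna, alt_dna, pseudocount)
    ref = log2_expression_ratio(ref_rna, ref_dna, pseudocount)
    return alt - ref


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    print(variant_effect(ref_rna=199, ref_dna=99, alt_rna=99, alt_dna=99))
```

With these made-up counts the reference allele has twice as much RNA per unit of plasmid DNA as the variant, so the effect is one log2 unit in the negative direction.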

Data provided by: Martin Kircher, Translational Genomics Center, Berlin Institute of Health, Berlin, Germany & Department of Genome Sciences, University of Washington

Nonsynonymous variants

1. CALM1

Predict the effect of calmodulin variants in a yeast growth assay.

Calmodulin is a calcium-sensing protein that modulates the activity of a large number of proteins in the cell. It is involved in many cellular processes, and is especially important for neuron and muscle cell function. Variants that affect calmodulin function have been found to be causally associated with cardiac arrhythmias. A large library of calmodulin missense variants was assessed with respect to their effects on protein function using a high-throughput yeast complementation assay. The challenge is to predict the functional effects of these calmodulin variants on competitive growth in a high-throughput yeast complementation assay. 

Data provided by: Frederick "Fritz" Roth, University of Toronto

2. PCM1

Predict whether missense mutations within the PCM1 gene impact zebrafish ventricular area development. 

The PCM1 (Pericentriolar Material 1) gene encodes a component of centriolar satellites, which occur around centrosomes in vertebrate cells. Several studies have implicated PCM1 variants as a risk factor for schizophrenia. Ventricular enlargement is one of the most consistent abnormal structural brain findings in schizophrenia. Therefore, 38 transgenic human PCM1 missense mutations implicated in schizophrenia were assayed in a zebrafish model to determine their impact on the posterior ventricle area. The challenge is to predict whether variants implicated in schizophrenia impact zebrafish ventricular area.

Data provided by: Nicholas Katsanis, Duke University

3. Frataxin

Predict the impact of variants of Frataxin protein on thermodynamic stability. 

Frataxin is a highly conserved protein found in prokaryotes and eukaryotes that is required for efficient regulation of cellular iron homeostasis. Humans with a frataxin deficiency have the cardio- and neurodegenerative disorder Friedreich's ataxia. A library of eight missense variants was assessed by near- and far-UV circular dichroism and intrinsic fluorescence spectra to determine thermodynamic stability at different concentrations of denaturant. These were used to calculate a ΔΔGH2O value, the difference in unfolding free energy (ΔGH2O) between the mutant and wild-type proteins for each variant. The challenge is to predict ΔΔGH2O for each frataxin variant.
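
The calculation described above can be sketched as follows. This example assumes the commonly used linear extrapolation model, in which the unfolding free energy decreases linearly with denaturant concentration and the intercept at zero denaturant is the free energy in water; the text does not state which model the data providers used. The sign convention (mutant minus wild type), the function names, and all numbers are illustrative assumptions, not challenge data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def unfolding_dg_in_water(denaturant_molar, dg_unfolding):
    """Extrapolate the unfolding free energy to zero denaturant (assumed linear)."""
    _, intercept = fit_line(denaturant_molar, dg_unfolding)
    return intercept


def ddg(dg_mutant, dg_wild_type):
    """Free energy difference, assumed here to be mutant minus wild type.

    Under this convention a negative value means the variant is less
    stable than the wild-type protein.
    """
    return dg_mutant - dg_wild_type


if __name__ == "__main__":
    # Made-up denaturant concentrations (M) and free energies (kcal/mol).
    conc = [1.0, 2.0, 3.0]
    wild_type = unfolding_dg_in_water(conc, [6.0, 4.0, 2.0])
    mutant = unfolding_dg_in_water(conc, [3.0, 1.0, -1.0])
    print(ddg(mutant, wild_type))
```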

Data provided by: Roberta Chiaraluce and Valerio Consalvi, Sapienza University, Rome

4. TPMT and p10

Predict the effect of variants on TPMT and p10 protein stability.

The gene p10 encodes PTEN (Phosphatase and TEnsin Homolog), a phosphatase that regulates cell growth and survival through signaling cascades, including those controlled by AKT and mTOR. Thiopurine S-methyl transferase (TPMT) is a key enzyme involved in the metabolism of thiopurine drugs and functions by catalyzing the S-methylation of aromatic and heterocyclic sulfhydryl groups. A library of thousands of PTEN and TPMT mutations was assessed to measure the stability of the variant protein using a multiplexed variant stability profiling (VSP) assay, which detects the presence of EGFP fused to the mutated PTEN or TPMT protein. The stability of the variant protein dictates the abundance of the fusion protein and thus the EGFP level of the cell. The challenge is to predict the effect of each variant on TPMT and/or PTEN protein stability.

Data provided by: Kenneth Matreyek, Lea Starita, and Doug Fowler, University of Washington

5. Annotate all nonsynonymous variants

Predict the impact of all nonsynonymous variants in the genome.

dbNSFP describes 81,084,849 possible protein-altering variants in the human genome. The challenge is to predict the functional effect of every such variant. For the vast majority of these missense variants, the functional impact is not currently known, but experimental and clinical evidence is accruing rapidly. Rather than drawing upon a single discrete dataset, as is typical with CAGI, predictions will be assessed on an ongoing basis by comparing them with experimental or clinical annotations made available after the prediction submission date. If predictors assent, predictions will also be incorporated into dbNSFP.

Data provided by: Xiaoming Liu, University of Texas School of Public Health

6. GAA

Predict the impact of nonsynonymous variants in the GAA protein.

Acid alpha-glucosidase (GAA) is a lysosomal alpha-glucosidase. Some mutations in GAA cause a rare disorder, Pompe disease (Glycogen Storage Disease II). Rare GAA missense variants found in a human population sample have been assayed for enzymatic activity in transfected cell lysates. The assessment of this challenge will include evaluations that recognize novelty of approach. The challenge is to predict the fractional enzyme activity of each mutant protein compared to the wild-type enzyme.

Data provided by: Wyatt Clark, Kevin Ru, Karen Yu, Jonathan H. LeBowitz, BioMarin Pharmaceutical

Classification of variants in breast cancer cases and controls

1. CHEK2

Predict the probability that an individual with a given CHEK2 gene variant is in the case (breast cancer) cohort rather than the control cohort.

Variants in the CHEK2 gene are associated with breast cancer. This challenge includes CHEK2 gene variants from approximately 1200 Latino breast cancer cases and 1200 ethnically matched controls. The challenge is to estimate the probability of each gene variant occurring in an individual from the cancer-affected cohort.
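
To make the target quantity concrete, the sketch below estimates, from carrier counts, the probability that a carrier of a variant belongs to the case cohort. It assumes the two cohorts are the same size (roughly 1200 each, as stated above), so no reweighting is applied, and it adds a small symmetric pseudocount so that variants seen only once or twice are not assigned probabilities of exactly 0 or 1. The function name, the pseudocount, and the counts are assumptions for illustration; the text does not describe how submissions are scored.

```python
def case_probability(case_carriers, control_carriers, pseudocount=1.0):
    """Estimated probability that a carrier of the variant is a case.

    Assumes equally sized case and control cohorts. The pseudocount is
    added to both cohorts; with pseudocount=0 this is the raw fraction
    of carriers who are cases (undefined if there are no carriers).
    """
    numerator = case_carriers + pseudocount
    denominator = case_carriers + control_carriers + 2.0 * pseudocount
    if denominator == 0:
        raise ValueError("no carriers observed and no pseudocount given")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical carrier counts for three variants.
    print(case_probability(3, 1, pseudocount=0.0))  # raw fraction of carriers in cases
    print(case_probability(4, 0))                   # smoothed, seen in cases only
    print(case_probability(0, 0))                   # no observations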

Data provided by: Elad Ziv, University of California, San Francisco


Predict which variants are associated with increased risk for breast cancer.

Breast cancer is the most prevalent cancer among women worldwide. The association between germline mutations in the BRCA1 and BRCA2 genes and the development of cancer is well established: mutations in these two autosomal dominant genes are the most common high-risk mutations associated with breast cancer and are found in 1-3% of breast cancer cases. The challenge is to predict which variants are associated with increased risk for breast cancer.

Data provided by: Amanda Spurdle, QIMR Berghofer Medical Research Institute (Australia), and the ENIGMA consortium


1. MaPSy

Identify the alleles causing splicing defects and estimate their effects on splicing in a Massively Parallel Splicing Assay.

The Massively Parallel Splicing Assay (MaPSy) approach was used to screen 797 reported exonic disease mutations using a mini-gene system, assaying both in vivo via transfection in tissue culture, and in vitro via incubation in cell nuclear extract. The challenge is to predict the degree to which a given variant causes changes in splicing. 

Data provided by: Will Fairbrother, Brown University

2. Vex-seq

Predict the effect of variants on exon splicing in a high-throughput assay.

A barcoding approach called Variant exon sequencing (Vex-seq) was applied to assess the effect of 2,059 natural single nucleotide variants and short indels on splicing of a globin mini-gene construct transfected into HepG2 cells. The effect is reported as ΔΨ (delta PSI, or Percent Spliced In), the difference between the variant Ψ and the reference Ψ. The challenge is to predict ΔΨ for each variant.
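
The ΔΨ quantity can be illustrated with a short sketch. Here Ψ is taken as the percentage of reads supporting exon inclusion, and ΔΨ as the variant Ψ minus the reference Ψ; the read-count inputs, the function names, and the sign convention are assumptions for this example rather than details given above.

```python
def percent_spliced_in(inclusion_reads, skipping_reads):
    """PSI: percentage of reads in which the exon is included."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        raise ValueError("no reads observed for this construct")
    return 100.0 * inclusion_reads / total


def delta_psi(variant_inclusion, variant_skipping,
              reference_inclusion, reference_skipping):
    """Variant PSI minus reference PSI (assumed sign convention).

    Negative values mean the variant reduces exon inclusion.
    """
    variant_psi = percent_spliced_in(variant_inclusion, variant_skipping)
    reference_psi = percent_spliced_in(reference_inclusion, reference_skipping)
    return variant_psi - reference_psi


if __name__ == "__main__":
    # Hypothetical read counts: the reference includes the exon in 80% of
    # reads, the variant in 50% of reads.
    print(delta_psi(50, 50, 80, 20))
```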

Data provided by: Brenton R. Graveley, UConn Health, Farmington

Clinical genomes

1. SickKids clinical genomes

Match the patients’ genomes to their clinical descriptions and predict the causal pathogenic variants.

This challenge involves 30 children with suspected genetic disorders who were referred for clinical genome sequencing. Predictors are given the 30 genome sequences, and are also provided with the phenotypic descriptions as shared with the diagnostic laboratory. The challenge is to predict what class of disease is associated with each genome, and which genome corresponds to which clinical description. Predictors may additionally identify the diagnostic variant(s) underlying the predictions, and identify predictive secondary variants conferring high risk of other diseases whose phenotypes are not reported in the clinical descriptions.

Data provided by: Stephen Meyn & colleagues, SickKids

2. ID Panel

Predict patients’ clinical descriptions and pathogenic variants from gene panel sequences.

The challenge presented here is to use computational methods to predict a patient’s clinical phenotype and the causal variant(s) based on analysis of their gene panel sequence data. Sequence data for 74 genes associated with intellectual disability (ID) and/or autism spectrum disorders (ASD) from a cohort of 150 patients with a range of neurodevelopmental presentations (ID, autism, epilepsy, etc.) have been made available for this challenge. For each patient, predictors must report the causative variants and which of seven phenotypes are present.

Data provided by: Emanuela Leonardi, Alessandra Murgia, Neurodevelopmental Molecular Genetics Laboratory, Department of Women’s and Children’s Health, University of Padua – Hospital of Padua.

Research trial exomes, complex disease

1. Clotting disease (DVT or PE) exomes

Distinguish between exomes of individuals who developed a clotting disorder and those who did not.

African Americans have a higher incidence of venous thromboembolism (VTE), which includes deep vein thrombosis (DVT) and pulmonary embolism (PE), than people of European ancestry. Participants are provided with exome data and clinical covariates for a cohort of African Americans who were prescribed warfarin either because they had experienced a VTE event or because they had been diagnosed with atrial fibrillation (which predisposes to clotting). The challenge is to distinguish between these two conditions. At present, in contrast to the situation for people of European ancestry, there are no genetic methods for anticipating which African Americans are most at risk of a venous thromboembolism, and the results of this challenge may contribute to the development of such tools.

Data provided by: Roxana Daneshjou and Russ Altman, Stanford University School of Medicine