

The goal of the breast cancer prognosis Challenge is to assess the accuracy of computational models designed to predict breast cancer survival, based on clinical information about the patient's tumor as well as genome-wide molecular profiling data including gene expression and copy number profiles.


Molecular diagnostics for cancer therapeutic decision-making are among the most promising applications of genomic technology. Several diagnostic tests have gained regulatory approval in recent years. Molecular profiles have proved particularly powerful in adding prognosis information to standard clinical practice in breast cancer, using gene-expression-based diagnostic tests such as MammaPrint [1] and Oncotype Dx [2].

Following promising early clinical results, computational approaches for inferring molecular predictors of cancer clinical phenotypes have become one of the most active areas of research in both industrial and academic institutions, leading to a flood of published reports of signatures predictive of cancer phenotypes. Several trends have emerged from these studies: 1) genes defining predictive signatures of the same phenotype often do not overlap across studies; 2) predictive signatures reported by one group may not prove robust in other studies; 3) there is no consensus regarding the most accurate signatures or computational methods for inferring them; 4) there is no consensus regarding the added value of incorporating molecular data in addition to, or instead of, traditionally used clinical covariates.

There is a critical need to assess objectively and systematically whether genomic data currently provide value beyond classic TNM staging and other clinical covariates. For instance, the UK’s NICE (National Institute for Health and Clinical Excellence) has issued initial guidance that the genomic prognostic signatures currently being marketed do not supplant clinical measures in a cost-effective manner. The emergence of datasets that combine clinical measurements with genome-wide molecular profiles of large breast cancer patient cohorts now allows prognostic models to be evaluated systematically. Given the complexity of the data and the plethora of possible modeling approaches, we believe the most powerful way to determine the optimal use of genomic and clinical information in breast cancer prognosis is a community-based effort that evaluates the accuracy of many different modeling approaches on a common dataset and analytical platform, using a blind methodology to avoid the biases of self-assessment.

The Challenge

This Challenge will create a community-based effort to provide an unbiased assessment of models and methodologies for the prediction of breast cancer survival. A common dataset will be provided to all participants, with a validation dataset held out for model evaluation. A novel dataset will be generated at the end of the Challenge and used to provide a final, unbiased score for each model.

Resources provided:

  1. Full-time use of a large computing resource is being donated by Google Inc. for the duration of the Challenge. Each participant has been provisioned a dedicated 16 GB, 8-core machine running RStudio for the duration of the competition. We are working on an experimental system to support large-scale parallel computation, which will be provided to participants at an alpha level a few months after the start of the competition. We will begin by supporting parallel execution on the multiple cores of user machines and then expand the system to larger-scale parallel computation.
    The availability of these resources to all participants will allow for a democratization of computational resources and will empower participants to apply their best ideas in a high performance compute environment. Google’s donation of a common compute space also allows all models, including computationally intensive ones, to be shared and re-run on a common platform, enabling transparency of the process, and future work to evolve and extend components of promising models, either in breast cancer prognosis or other applications.
  2. The dataset will come from the METABRIC cohort of 2,000 breast cancer samples and will include detailed clinical annotations, 10-year median survival time, gene expression, and copy number data [3]. Of these, 1,000 samples will be provided for model training, 500 will be used for real-time model evaluation, and 500 will be used for final scoring of all models.
  3. Additional breast cancer datasets, curated by Sage Bionetworks, will provide information on several thousand additional patients that Challenge participants can use in their model development.
  4. A novel dataset is being generated from 350 fresh frozen primary tumors with the same clinical annotations and survival data as the METABRIC cohort. Gene expression and copy number data will be generated for this cohort using the same molecular profiling platforms as were used to generate the METABRIC data. This will provide a truly novel validation set for the scoring of predictive models.
  5. A web-based platform called Synapse will be provided by Sage Bionetworks: the Synapse platform will enable transparent, reproducible model building and analysis workflows, as well as the sharing of data, tools, and models with the Challenge community, the model evaluators and the publication reviewers.
  6. For questions about the challenge please contact

Challenge timeline:

    • June-July 17th, 2012. Sign up for the Sage Bionetworks-DREAM Breast Cancer Prognosis Challenge. Registered participants will be notified by email about the initiation of the Challenge.
    • July 17th-October 15th, 2012. A live demo call was held on July 17th to help participants get started with the Challenge. A step-by-step guide is available here, with additional details about the Challenge available at Breast Cancer Challenge: Detailed Description. Data from 1,000 samples will be provided to participants for training of models. An additional 500 samples will be used to provide real-time evaluation of all submitted models. The remaining 500 samples will be used for final scoring of all models (taking place after October 15th).
    • October 15th, 2012. Final submission of all models, to be scored against the 500 METABRIC data samples not used in the previous phase. The deadline for submitting models for the Breast Cancer Prognosis Challenge is 5 PM EST on October 15th, and the best performers will be announced at the DREAM 7 Conference taking place in San Francisco on November 12 to 16.
    • Late 2012. Final assessment of all models in newly generated data. For the new validation data set, molecular and clinical data on approximately 350 breast cancer samples (with archived fresh frozen tumor samples) is being provided by the group of Anne-Lise Borresen-Dale with the help of a donation from AVON. We are currently curating the clinical records of this patient cohort to harmonize with the current METABRIC dataset and working on generating the genomic profiling data for these samples. We aim to generate these data by the November 12 DREAM conference and announce initiation of the final evaluation to be performed on this data set. We will keep participants informed on progress in generating these data.

Data for Phase 1: the information below has been superseded by Phase 2. Please click here for details.

Starting in early July, all data for the Challenge will be accessible through Sage Bionetworks’ Synapse software platform and loaded into R objects via simple function calls through the Synapse R client. The data will comprise the following information:

Survival data 

  • Survival data is loaded into R as a Surv object, as defined in the R survival package. This object is simply a two-column matrix with sample names as row names and the following columns:
    • time – time from diagnosis to last follow-up.
    • status – whether the patient was alive at last follow-up time.
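
As a rough illustration of the structure described above, the sketch below builds a small Surv object from made-up values. The data are not Challenge data, and the status coding (1 = died, 0 = alive at last follow-up, i.e. censored) is an assumption that follows the usual convention of the survival package; the coding in the released data should be checked before use.

```r
library(survival)  # provides Surv() and is.Surv()

# Toy follow-up data for five hypothetical patients (not Challenge data).
time <- c(5.2, 10.1, 3.4, 12.0, 7.8)  # time from diagnosis to last follow-up
died <- c(1, 0, 1, 0, 0)              # assumed coding: 1 = died, 0 = alive (censored)

surv_obj <- Surv(time = time, event = died)

# For right-censored data the object behaves like a two-column matrix.
follow_up <- surv_obj[, "time"]
status    <- surv_obj[, "status"]

print(surv_obj)  # censored observations are printed with a trailing "+"
```

Extracting the columns by name, as in the last lines, is usually enough to pass the survival information to custom scoring code.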

Feature data

  • Gene expression data.
    • Performed on the Illumina HT 12v3 platform.
    • Loaded as Bioconductor ExpressionSet object.
    • Data normalized as described in [3].
  • Copy number data.
    • Performed on the Affymetrix SNP 6.0 platform.
    • Loaded as Bioconductor ExpressionSet object.
    • Data normalized as described in [3].
  • Clinical covariates. For a detailed explanation of the clinical data and how it is currently used in breast cancer prognosis and treatment, see Breast Cancer Challenge clinical background.
    • Loaded as a data.frame object with the following features. We note that in the initial data release on 7/17/2012, factor data is encoded as characters. We are working on a new data release that encodes these variables as factors with pre-specified factor levels, and we expect to update the data in the coming weeks.

| variable name | type | description | factor levels |
| --- | --- | --- | --- |
| age_at_diagnosis | numeric | age of patient at diagnosis of disease | |
| group | factor | disease and treatment group | 1 = Lymph Node negative and have not received chemotherapy; 2 = ER positive, Lymph Node positive, have received hormone therapy but no chemotherapy; 3 = ER negative, Lymph Node positive, have received chemotherapy; 4 = all others |
| grade | integer | grade of disease (1, 2, 3) | |
| size | integer | size of tumor in cm | |
| lymph_nodes_positive | factor | lymph node assessment | positive; negative |
| histological_type | factor | tumor histology | IDC; ILC |
| ER_IHC_status | factor | ER status | pos; neg; null |
| cellularity | factor | tumor cellularity | low; moderate; high; undef |
| | factor | Pam50 subtype by expression clustering | LumA; LumB; Her2; Normal; Basal; NC |
| Treatment | factor | treatment received | HT/RT = hormone / radiation therapy; CT/HT/RT = chemo / hormone / radiation therapy; NONE = none; CT/RT = chemo / radiation therapy; HT = hormone therapy; RT = radiation therapy; CT/HT = chemo / hormone therapy; CT = chemotherapy |
| Site | factor | site of data collection | 1; 2; 3; 4; 5 |
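
Until the data release with factor encoding is available, character columns can be converted by hand. The following is a minimal sketch in base R, using made-up values and the factor levels listed for two of the variables above; the object names clinical and factor_levels are illustrative and not part of the Challenge data.

```r
# Toy clinical data with factor variables stored as characters, as in the
# initial data release. Values are made up (not Challenge data).
clinical <- data.frame(
  age_at_diagnosis = c(45.2, 61.0, 58.7),
  ER_IHC_status    = c("pos", "neg", "pos"),
  cellularity      = c("high", "undef", "moderate"),
  stringsAsFactors = FALSE
)

# Pre-specified levels, so that training and validation data share one encoding
# even when a level happens to be absent from a given subset.
factor_levels <- list(
  ER_IHC_status = c("pos", "neg", "null"),
  cellularity   = c("low", "moderate", "high", "undef")
)

for (name in names(factor_levels)) {
  clinical[[name]] <- factor(clinical[[name]], levels = factor_levels[[name]])
}

str(clinical)
```

Fixing the levels up front keeps model matrices the same shape across data subsets, which matters when a model trained on one set of samples is applied to another.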

Submission of Predictions and Write-up

A primary goal of this Challenge is to promote transparent, reusable models that can be assessed and extended by the community. To this end, models built for this Challenge will be constructed using the R programming language and uploaded to a common platform (Synapse) provided by Sage Bionetworks. Models will be uploaded as R objects implementing a function called customPredict() that returns a vector of risk predictors when given a set of feature data as input. The customPredict() function will be run by a validation script for each submitted model and resulting predictions scored as described in the Scoring section below.
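
To make the idea concrete, here is a minimal sketch of a model object exposing a customPredict() function. Everything except the function name is an assumption made for illustration: the object is a plain R list, the function takes a single data.frame of clinical features, the model is a Cox proportional hazards fit on two covariates, and the data are simulated. The interface actually required by the validation script may differ.

```r
library(survival)  # provides Surv() and coxph()

set.seed(1)
n <- 200

# Simulated clinical features and survival outcomes (not Challenge data).
train <- data.frame(
  age_at_diagnosis = runif(n, 30, 85),
  size             = runif(n, 0.5, 6)
)
# In this simulation, older age and larger tumors shorten survival.
hazard      <- exp(0.03 * train$age_at_diagnosis + 0.3 * train$size - 3)
death_time  <- rexp(n, rate = hazard)
censor_time <- runif(n, 0, 15)
train_surv  <- Surv(pmin(death_time, censor_time),
                    as.numeric(death_time <= censor_time))

# Hypothetical model object: a list holding the fit and a customPredict() function.
make_model <- function(clinical, surv) {
  fit <- coxph(surv ~ age_at_diagnosis + size, data = clinical)
  list(
    fit = fit,
    customPredict = function(clinicalFeatures) {
      # Linear predictor of the Cox model: higher value = higher predicted risk.
      as.numeric(predict(fit, newdata = clinicalFeatures, type = "lp"))
    }
  )
}

model <- make_model(train, train_surv)

new_patients <- data.frame(age_at_diagnosis = c(40, 80), size = c(1, 5))
risk <- model$customPredict(new_patients)
print(risk)
```

Returning one numeric risk value per patient, in the order of the input rows, is what lets a validation script score the predictions without knowing anything about the model internals.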

The Challenge supports source code submissions allowing the validation script to also reproduce the training of the model. At various times throughout the competition the creator of the best performing model on the leaderboard may be asked to write a short description of their approach to be posted on the discussion forum.


Scoring

The Challenge models will be scored by calculating the concordance index between the predicted survival and the true survival information in the validation dataset (accounting for the censoring variable indicating whether the patient was alive at last follow-up). In addition, other scoring metrics will be considered depending on the suggestions of the community throughout the Challenge.
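
For readers unfamiliar with the metric, the sketch below computes a concordance index for right-censored data in base R, in the style of Harrell's C. It is an illustration rather than the Challenge's scoring code. It assumes that event = 1 means death was observed and 0 means censored, that higher risk scores should correspond to shorter survival, and it simply skips pairs with tied follow-up times. The function name concordance_index and the toy data are made up.

```r
# Concordance index for right-censored data (illustrative, base R only).
# time:  follow-up time
# event: 1 = death observed, 0 = censored (assumed coding)
# risk:  predicted risk score, higher = shorter expected survival
concordance_index <- function(time, event, risk) {
  stopifnot(length(time) == length(event), length(time) == length(risk))
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    if (event[i] != 1) next  # a censored patient cannot be the earlier member of a pair
    for (j in seq_len(n)) {
      if (time[i] < time[j]) {  # patient i died before patient j's last follow-up
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1    # earlier death was given the higher risk
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5  # tied predictions count as half
        }
      }
    }
  }
  if (comparable == 0) return(NA_real_)
  concordant / comparable
}

# Toy example with five hypothetical patients.
time  <- c(2, 4, 6, 8, 10)
event <- c(1, 1, 0, 1, 0)
risk  <- c(0.9, 0.4, 0.5, 0.6, 0.1)

ci <- concordance_index(time, event, risk)
print(ci)  # 6 of the 8 comparable pairs are ordered correctly: 0.75
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering; reversing the sign of the risk scores turns a score of c into 1 - c when there are no tied predictions.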


The high-impact journal Science Translational Medicine (STM) has agreed that, as a "prize" for best performance, the best-performing individual or team in the final evaluation (using the newly generated data) may publish their results in the journal, provided their score is better than that of a pre-defined set of baseline models. STM representatives agreed that having an evaluation committee re-run and compare all models in a transparent environment can serve the role of peer review (Challenge-assisted peer review), allowing the results from the winning individual or team to be published without additional review. Furthermore, the lead author of the best-performing submission in the Challenge phase ending in November will receive a speaking invitation to the DREAM 7 Conference taking place in San Francisco on November 12 to 16.


This Challenge is fueled by the generous donation of clinical study data on 2,000 breast cancer patients obtained by Samuel Aparicio of the BC Cancer Research Centre, Carlos Caldas of Cancer Research UK, and Anne-Lise Borresen-Dale of Oslo University Hospital. The Challenge was organized by Adam Margolin, Erhan Bilal, Mike Kellen, Brian Bot, Brig Mecham, Erich Huang, Andrew Trister, Charles Ferte, Gustavo Stolovitzky and Stephen Friend, who benefited from many discussions with Laura van’t Veer.


[1]        L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend, “Gene expression profiling predicts clinical outcome of breast cancer,” Nature, vol. 415, no. 6871, pp. 530–536, Jan. 2002.

[2]        S. Paik, S. Shak, G. Tang, C. Kim, J. Baker, M. Cronin, F. L. Baehner, M. G. Walker, D. Watson, T. Park, W. Hiller, E. R. Fisher, D. L. Wickerham, J. Bryant, and N. Wolmark, “A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer,” N. Engl. J. Med., vol. 351, no. 27, pp. 2817–2826, Dec. 2004.

[3]        C. Curtis, S. P. Shah, S.-F. Chin, G. Turashvili, O. M. Rueda, M. J. Dunning, D. Speed, A. G. Lynch, S. Samarajiwa, Y. Yuan, S. Gräf, G. Ha, G. Haffari, A. Bashashati, R. Russell, S. McKinney, METABRIC Group, A. Langerød, A. Green, E. Provenzano, G. Wishart, S. Pinder, P. Watson, F. Markowetz, L. Murphy, I. Ellis, A. Purushotham, A.-L. Børresen-Dale, J. D. Brenton, S. Tavaré, C. Caldas, and S. Aparicio, “The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups,” Nature, 2012.
