Computational Reproducibility

There exist many ways to make the computational analyses of a given study fully reproducible. Size and accessibility of the data, software tools and computing resource requirements are among the factors that will define how an analysis can be made fully reproducible.

Following its mission, the Society asks its members to share examples of manuscripts that can be fully reproduced. In particular, the data must be freely accessible, the code must be open-source and documented, and a tutorial describing how to rerun the analyses to generate the figures and tables of the manuscript must be provided with the submission. The board members of the Society will select pipelines using different data and tools to best represent the diversity of reproducibility frameworks, and will work with the authors to reproduce the pipeline independently.

The ultimate goal of this project is to provide practical examples and guidelines that help scientists make their own research fully reproducible.

Inaugural Society Project

An Internationally-conducted External Quality Control scheme on Machine Learning Algorithms to assess Tumor Infiltrating Lymphocytes in Breast Cancer

Project Coordinator: Roberto Salgado

How to participate: all groups with documented, analytically validated machine learning tools are welcome to participate. Interested groups need to send the coordinators a motivated request to participate, with documentation of the analytical validity of the method they intend to apply in this program. The information provided by the groups will be treated as strictly confidential. The method to be applied in this assessment must be locked, may not be changed during the assessment, and should be described in detail to avoid implicit overfitting.
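One way a group could demonstrate that a method stayed locked is to register a cryptographic fingerprint of its model artefacts before the assessment and recompute it afterwards. The sketch below is only an illustration of that idea: the program description does not prescribe any locking mechanism, and the function names `fingerprint` and `is_unchanged` are hypothetical.

```python
import hashlib
import hmac


def fingerprint(artefact: bytes) -> str:
    """Return the SHA-256 hex digest of a model artefact (e.g. serialized weights)."""
    return hashlib.sha256(artefact).hexdigest()


def is_unchanged(artefact: bytes, registered_digest: str) -> bool:
    """Check an artefact against the digest registered before the assessment."""
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(fingerprint(artefact), registered_digest)


# Hypothetical artefact contents, standing in for a real weights file.
locked_model = b"model-weights-v1"
registered = fingerprint(locked_model)

print(is_unchanged(locked_model, registered))         # same bytes as registered
print(is_unchanged(b"model-weights-v2", registered))  # any change alters the digest
```

A digest of this kind only shows that the registered files were not modified; it does not replace the detailed method description requested above.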


  • To set quality standards and performance metrics on machine learning algorithms before introduction in a clinical trial setting and/or daily practice setting.
  • To set quality standards and performance metrics that can be used by regulatory agencies to certify machine learning algorithms for use in patient management.
  • To develop a framework for comparison of machine learning algorithms to determine precise quantitative metrics of other breast cancer biomarkers, like Ki67.


At present, clinico-pathological risk stratification in early-stage disease is performed using a limited set of features such as tumor size and lymph node status. Very large adjuvant trials such as ALTTO and APHINITY that have applied these stratification schemes have illustrated the key problem with the current classification scheme: it does not stratify patients with sufficient granularity to permit selection for clinical trials. The current scheme also takes the approach of placing patients on a continuum of risk. This is at odds with results from high-throughput technologies such as gene expression profiling and genomic assays, which focus on identifying individual patient groups with particular clinical behaviour. Several results in this area have identified genomic, transcriptomic or proteomic features which in hindsight are associated with particular histological features. This suggests that the histological appearance of a tumor represents a useful cancer phenotype which can be further explored and can contribute to staging and stratification.

Machine learning refers to the general computational approach whereby data is used by algorithms to develop predictive models. These models are finely tuned to optimize accuracy and generalizability as applied to new data. Although machine learning has existed for some time, recent advances in algorithm development and hardware infrastructure have enabled ‘deep learning’ approaches. Deep learning was originally designed to mimic the neural architecture of the human brain, and conceptually uses a series of connected nodes (neural nets) which respond to input in a way that is tuned with repeated cycles of learning. Neural nets have the ability to learn rich representations of complex data, which may contain hierarchical and non-linear relationships. These abilities make neural nets well suited to image classification. They have exhibited spectacular results in this area, often matching or exceeding the performance of experts in the field (superhuman capabilities).

Recent advances in deep learning provide a path forward for numerous applications in digital pathology. On one level, the robust performance and training characteristics of deep learning allow us to develop accurate automated assays for pathological features such as grade and lymphocyte infiltration. These have the potential to be ‘learn once, apply everywhere’. This is in contrast to existing imaging methods, which lack the precision and robustness to be used in the clinical setting. If the promise of deep learning can be validated, the use of digital pathology would aid pathologists in routine reporting, and could be expected to improve the validity of current pathology-based clinico-pathological features. In the short term, digital pathology would also help standardize pathology results within and across trials, given the time required for pathology-assessed quantitative metrics.

TILs have been shown to be a reliable and reproducible marker of tumor immunogenicity in breast cancer. Higher levels of TILs are associated with improved prognosis in early-stage TNBC and HER2-positive breast cancer, as well as with a higher probability of achieving pCR in the neoadjuvant setting. Analysis of TILs in residual disease specimens after neoadjuvant therapy has also been shown to have prognostic value. The evaluation of TILs as a biomarker in breast cancer is likely to be extended from the research domain to the clinical setting in the near future. The assessment of TILs by digital image analysis might be useful for standardization in the future, since this approach has the potential, for example, to determine the number of TILs per mm² of stromal tissue as an exact measurement, in contrast to the approximate semi-quantitative evaluation currently recommended. In the first International Guidelines on TIL-assessment in breast cancer we proposed to develop an inter-laboratory Ring study to assess the reproducibility and clinical validity of TILs assessment, including machine learning algorithms. While TILs have been measured morphologically and have been shown to add predominantly prognostic information, open methodological questions in the morphological evaluation of TILs remain, for example the assessment and importance of spatial TIL-heterogeneity. The measurement on H&E-stained slides most likely represents the beginning of the efforts to use infiltrating cell properties as companion diagnostic tests. Thus, as a field, we should be open to the introduction of molecular methods, most likely in situ, that can classify the TILs-component and bring higher levels of information to the patient sample. However, at this time, these deep learning approaches are still experimental and not sufficiently documented for introduction into standard practice.
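The per-area measurement mentioned above (TILs per mm² of stromal tissue) reduces to a unit conversion once a cell count and a stromal area are available. A minimal sketch in Python, assuming both values come from an upstream image-analysis step that is not specified here; the function name and the example numbers are hypothetical and not taken from any study:

```python
def til_density_per_mm2(til_count: int, stromal_area_um2: float) -> float:
    """Return TILs per mm^2 of stroma from a count and a stromal area in um^2."""
    if til_count < 0:
        raise ValueError("TIL count cannot be negative")
    if stromal_area_um2 <= 0:
        raise ValueError("stromal area must be positive")
    um2_per_mm2 = 1_000_000.0  # 1 mm = 1000 um, so 1 mm^2 = 10^6 um^2
    return til_count / (stromal_area_um2 / um2_per_mm2)


# Hypothetical region: 450 detected TILs in 2.5 mm^2 of segmented stroma.
density = til_density_per_mm2(til_count=450, stromal_area_um2=2_500_000.0)
print(density)  # 180.0
```

The result depends entirely on how the stromal compartment is delineated, which is one reason the segmentation step needs to be described alongside the count.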

On another level, however, deep learning also permits discovery of image-based features which may be very difficult for current approaches to identify, particularly if they only exist in small groups of patients. The key benefit of deep learning here is to rapidly identify, in a standardized way, pathological features in clinical trials that are predictive of treatment or prognostic of outcome. This is an essential first step in deciding if previously undescribed pathological features are clinically relevant, and is largely infeasible using current approaches. Deep learning also permits modification and retraining of the feature set to optimize accuracy and interpretability, which is again infeasible with current methods.

The Working Group is therefore proposing a collaboration with the Massive Analysis and Quality Control Consortium to characterize tumor infiltrating lymphocytes using machine learning algorithms. Developing a machine-learning-based assay for tumor infiltrating lymphocytes would enable rapid wider adoption of this promising pathological feature and, by providing an adjunct to human pathologists, enhance its validity and robustness for prognosis and prediction.

Specific Aims:

  • Comparing the machine learning image classification metrics with those of pathologists in the RING-study which the Working Group has published (Carsten Denkert et al., Mod. Pathol. 2016).
  • Comparing automated TILs scoring with pathologist scoring results in different settings, namely core biopsies, full sections, pre-invasive (DCIS), untreated and treated tumors.
  • Comparing the performance of deep learning approaches in identifying complex features, such as clustering/spatial statistics including proximity of TILs to cancer cells, that are prognostic of outcome or predictive of treatment.
  • Comparing the clinical validity of different machine learning algorithms, and assessing the utility of combining models to improve accuracy and to identify possible false positives and false negatives.
  • Combining annotated training data from different sites to create a comprehensive breast cancer ML training and validation database hosted by the consortium.
  • Developing a framework to facilitate automated testing, validation and certification of image-classification-derived pathology metrics that can improve the standard of care.
  • Developing, together with both groups, a review, perspective or opinion paper on the use of artificial intelligence/machine learning tools in oncology, focusing on but not limited to TILs, and including the quality requirements for use of these technologies in clinical trial and daily practice settings. The paper would be similar in kind to the paper by Lisa McShane et al., "Criteria for the use of omics-based predictors in clinical trials" (Nature 2013, doi: 10.1038/nature12564), and may be very useful for regulatory agencies (FDA, EMA). If this idea is pursued, the target should be a high-level journal such as Nature, Nature Biotechnology, Nature Reviews Clinical Oncology or Nature Reviews Drug Discovery.
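The aim concerning spatial statistics, in particular the proximity of TILs to cancer cells, can be made concrete with a small example. The sketch below assumes that cell centroids (in micrometres) have already been detected and classified by some upstream method; the function names, the 10 µm radius and the coordinates are hypothetical, and the brute-force search is chosen for clarity rather than speed.

```python
from math import dist


def nearest_tumor_distances(til_xy, tumor_xy):
    """For each TIL centroid, return the distance to the closest tumor cell centroid."""
    if not tumor_xy:
        raise ValueError("at least one tumor cell centroid is required")
    return [min(dist(til, cell) for cell in tumor_xy) for til in til_xy]


def fraction_within(til_xy, tumor_xy, radius_um):
    """Return the fraction of TILs lying within radius_um of any tumor cell."""
    distances = nearest_tumor_distances(til_xy, tumor_xy)
    if not distances:
        raise ValueError("at least one TIL centroid is required")
    return sum(d <= radius_um for d in distances) / len(distances)


# Hypothetical centroids in micrometres.
tils = [(0.0, 0.0), (3.0, 4.0), (100.0, 100.0)]
tumor_cells = [(0.0, 0.0), (200.0, 200.0)]

print(nearest_tumor_distances(tils, tumor_cells))
print(fraction_within(tils, tumor_cells, radius_um=10.0))
```

For whole-slide images with many thousands of cells, a spatial index would be a more practical choice than comparing every pair of cells.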

Study Design:

Breast cancer slide-sets in different settings (invasive, DCIS, residual disease) with known TIL-assessment by pathologists will be posted on the website of the International Immuno-Oncology Biomarker Working Group.

  • A clinical trial slide-set, with clinical annotation, will be hosted on the website of the International Immuno-Oncology Biomarker Working Group.
  • All these slides can then be assessed by all participating groups.
  • The metrics assessed will be reported to the coordinators in pre-specified formats.
  • A systematic comparison of the output of the machine learning/deep learning approaches with the pathologists’ score will be performed in all datasets, and any clinical validity added to the pathologists’ TIL-score will be evaluated using the clinical trial datasets.
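The comparison of algorithm output with pathologists' scores could take many forms, and the study design above does not name a specific agreement statistic. As one illustrative option, the sketch below computes Lin's concordance correlation coefficient, which, unlike a plain correlation, also penalizes systematic offsets between the two sets of scores. The function name and the example scores are hypothetical.

```python
from statistics import fmean


def concordance_ccc(scores_a, scores_b):
    """Lin's concordance correlation coefficient between two paired score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired lists with at least two cases")
    n = len(scores_a)
    mean_a, mean_b = fmean(scores_a), fmean(scores_b)
    # Population (divide-by-n) variances and covariance, as in Lin's definition.
    var_a = sum((a - mean_a) ** 2 for a in scores_a) / n
    var_b = sum((b - mean_b) ** 2 for b in scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b)) / n
    denom = var_a + var_b + (mean_a - mean_b) ** 2
    if denom == 0:
        raise ValueError("both score lists are constant and equal; CCC is undefined")
    return 2 * cov / denom


# Hypothetical stromal TIL percentages for three slides.
pathologist = [10.0, 20.0, 30.0]
algorithm = [20.0, 30.0, 40.0]  # perfectly correlated, but shifted by 10 points

print(concordance_ccc(pathologist, pathologist))
print(concordance_ccc(pathologist, algorithm))
```

In this example the shifted scores have a Pearson correlation of 1 but a concordance well below 1, which shows why an agreement measure rather than a correlation may be preferable when an algorithm is meant to reproduce the pathologists' scale.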


  • The project is expected to start in 2019 and to be completed within 1 year of the start of the program.
  • Results will be presented at the annual meeting of the International Immuno-Oncology Biomarker Working Group held at the San Antonio Breast Cancer Symposium and at the annual MAQC-Conference.
  • Publication in a high-level journal is planned within 6 months after completion of the program.