Integrated Center for Oncology

Breast Cancer Gene-Expression Miner v4.3
(bc-GenExMiner v4.3)


Published annotated data:

bc-GenExMiner version: v4.3
Data type shown: Microarray

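The columns for nodal status, ER status 1, PR status 1, HER2 status 1, SBR, age at diagnosis and NPI held availability icons that are not reproduced in this text version. A "2" in the NPI column refers to footnote 2. MR and AE are the event status counts.

# | Reference | No. patients | NPI | MR | AE
1 | Van de Vijver et al., 2002 | 295 | | 101 | 122
2 | Sotiriou et al., 2003 | 99 | | 30 | 53
3 | Ma et al., 2004 | 59 | | | 27
4 | expO et al., 2005 | 307 | | |
5 | Minn et al., 2005 | 82 | | 27 | 27
6 | Pawitan et al., 2005 | 159 | | 40 | 50
7 | Wang et al., 2005 | 286 | | 107 | 107
8 | Weigelt et al., 2005 | 50 | 2 | 13 | 13
9 | Bild et al., 2006 | 158 | | | 50
10 | Chin et al., 2006 | 112 | 2 | 21 | 42
11 | Ivshina et al., 2006 | 249 | 2 | | 89
12 | Chin et al., 2007 | 171 | | 38 | 56
13 | Desmedt et al., 2007 | 198 | | 62 | 91
14 | Loi et al., 2007 | 267 | 2 | 66 | 88
15 | Minn et al., 2007 | 58 | | 11 | 11
16 | Naderi et al., 2007 | 135 | | | 65
17 | Yau et al., 2007 | 47 | | |
18 | Zhou et al., 2007 | 54 | | 9 | 9
19 | Anders et al., 2008 | 75 | | 14 | 14
20 | Chanrion et al., 2008 | 151 | | 46 | 55
21 | Loi et al., 2008 | 77 | 2 | 10 | 13
22 | Schmidt et al., 2008 | 200 | | 46 | 46
23 | Calabrò et al., 2009 | 139 | | | 96
24 | Desmedt et al., 2009 | 55 | | | 55
25 | Jézéquel et al., 2009 | 252 | | 65 | 68
26 | Zhang et al., 2009 | 136 | | 20 | 20
27 | Jönsson et al., 2010 | 346 | | | 151
28 | Li et al., 2010 | 115 | 2 | 14 | 14
29 | Parris et al., 2010 | 94 | | | 45
30 | Sircoulomb et al., 2010 | 55 | | 17 | 17
31 | Symmans et al., 2010 | 43 | | 71 | 71
32 | Buffa et al., 2011 | 216 | | 82 | 82
33 | Dedeurwaerder et al., 2011 | 85 | | | 36
34 | Filipits et al., 2011 | 277 | | 58 | 58
35 | Hatzis et al., 2011 | 309 | | 65 | 65
36 | Heikkinen et al., 2011 | 174 | | 34 | 34
37 | Kao et al., 2011 | 296 | | 63 | 73
38 | Sabatier et al., 2011 | 71 | | |
39 | Sabatier et al., 2011 | 239 | | | 74
40 | Wang et al., 2011 | 149 | | | 10
41 | Curtis et al., 2012 | 1 980 | | | 1 143
42 | Kuo et al., 2012 | 51 | | 12 | 12
43 | Servant et al., 2012 | 343 | | 119 | 119
44 | Clarke et al., 2013 | 104 | | 48 | 48
45 | Guedj et al., 2013 | 536 | | 119 | 119
46 | Larsen et al., 2013 | 183 | | |
47 | Nagalla et al., 2013 | 41 | | 14 | 14
48 | Castagnoli et al., 2014 | 53 | | 23 | 23
49 | Fumagalli et al., 2014 | 57 | | |
50 | Merdad et al., 2014 | 45 | | |
51 | Terunuma et al., 2014 | 55 | | | 19
52 | Burstein et al., 2015 | 67 | | |
53 | Michaut et al., 2016 | 104 | | 26 | 32
54 | Biermann et al., 2017 | 53 | | |
Total | | 10 012 | | 1 491 | 3 526

Totals of the annotation columns (run together in the source): 4147322639411985150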

1 ER, PR and HER2 status determined by immunohistochemistry
2 NPI score could be computed only for node negative patients

Legend

 No.: number of
 ER: oestrogen receptor by IHC
 PR: progesterone receptor by IHC
 HER2: HER2 receptor by IHC
 IHC: ImmunoHistoChemistry
 SBR: Scarff Bloom and Richardson grade
 NPI: Nottingham prognostic index
 AOL: Adjuvant! Online
 SSPs: Single Sample Predictors (Sorlie, Hu and PAM50)
 SCMs: Subtype Clustering Models (SCMOD1, SCMOD2, SCMGENE)
 MR: metastatic relapse
 AE: any event (any pejorative event: local relapse, metastatic relapse or death.)


Published transcriptomic data:

The following inclusion criteria for selection of transcriptomic data were used:
- macrodissection only (no microdissection, no biopsy),
- no neoadjuvant therapy before tumour collection,
- minimum number of patients: 40,
- no duplicate samples within or between datasets, filtered:
. by sample ID, and
. by Pearson correlation (samples must have a correlation < 0.99 with every other sample, to avoid duplicate data),
- only female breast cancer.

bc-GenExMiner version: v4.3
Data type shown: Microarray

# | Reference | No. patients | Study code | Platform origin | Platform code | DNA chip | No. unique genes (2018) | Processing * | bc-GenExMiner version
1 | Van de Vijver et al., 2002 | 295 | Rosetta2002 | Agilent | | 25k oligo custom | 14 853 | log2 ratio | 1.0
2 | Sotiriou et al., 2003 | 99 | PNAS1732912100 | NCI | | 8k cDNA custom | 4 345 | log2 ratio | 1.0
3 | Ma et al., 2004 | 59 | GSE1378 | Arcturus | GPL1223 | 22k oligo custom | 14 839 | log2 ratio | 1.0
4 | expO et al., 2005 | 307 | GSE2109 | Affymetrix | GPL570 | HG-U133P2 | 20 796 | MAS5 and log2 | 4.3
5 | Minn et al., 2005 | 82 | GSE2603 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 1.0
6 | Pawitan et al., 2005 | 159 | GSE1456 | Affymetrix | GPL96 - GPL97 | HG-U133A + B | 18 432 | MAS5 and log2 | 1.0
7 | Wang et al., 2005 | 286 | GSE2034 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 1.0
8 | Weigelt et al., 2005 | 50 | GSE2741 | Agilent | GPL1390 | Human 1A oligo UNC custom | 13 980 | log2 ratio | 1.0
9 | Bild et al., 2006 | 158 | GSE3143 | Affymetrix | GPL91 | HG-U95A v2 | 8 767 | MAS5 and log2 | 1.0
10 | Chin et al., 2006 | 112 | E_TABM_158 | Affymetrix | A-AFFY-76 | HG-U133A v2 | 12 662 | MAS5 and log2 | 1.0
11 | Ivshina et al., 2006 | 249 | GSE4922 | Affymetrix | GPL96 - GPL97 | HG-U133A + B | 18 432 | MAS5 and log2 | 1.0
12 | Chin et al., 2007 | 171 | GSE8757 | VUMC Microarray | GPL5737 | Human 30K 60-mer oligo array | 17 783 | log2 ratio | 3.1
13 | Desmedt et al., 2007 | 198 | GSE7390 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 1.0
14 | Loi et al., 2007 | 267 | GSE6532 | Affymetrix | GPL96 - GPL97 - GPL570 | HG U133A + B + P2 | 20 545 | MAS5 and log2 | 1.0
15 | Minn et al., 2007 | 58 | GSE5327 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 1.0
16 | Naderi et al., 2007 | 135 | E_UCON_1 | Agilent | A-AGIL-14 | Human 1A oligo G4110A | 14 258 | log2 ratio | 1.0
17 | Yau et al., 2007 | 47 | GSE8193 | Affymetrix | GPL96 | HG-U133A | 13 064 | MAS5 and log2 | 4.3
18 | Zhou et al., 2007 | 54 | GSE7378 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 3.1
19 | Anders et al., 2008 | 75 | GSE7849 | Affymetrix | GPL91 | HG-U95A v2 | 8 767 | MAS5 and log2 | 1.0
20 | Chanrion et al., 2008 | 151 | GSE9893 | MLRG | GPL5049 | Human 21k v12.0 | 15 016 | MAS5 and log2 | 1.0
21 | Loi et al., 2008 | 77 | GSE9195 | Affymetrix | GPL570 | HG-U133P2 | 20 545 | MAS5 and log2 | 1.0
22 | Schmidt et al., 2008 | 200 | GSE11121 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 1.1
23 | Calabrò et al., 2009 | 139 | GSE10510 | DKFZ | GPL6486 | 35k oligo | 17 809 | log2 ratio | 1.0
24 | Desmedt et al., 2009 | 55 | GSE16391 | Affymetrix | GPL570 | HG-U133P2 | 20 545 | MAS5 and log2 | 3.1
25 | Jézéquel et al., 2009 | 252 | GSE11264 | UMGC-IRCNA | GPL4819 | 9k cDNA custom | 1 808 | log2 ratio | 1.0
26 | Zhang et al., 2009 | 136 | GSE12093 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 1.1
27 | Jönsson et al., 2010 | 346 | GSE22133 | SweGene | GPL5345 | H_v2.1.1 55K | 9 236 | log2 ratio | 3.1
28 | Li et al., 2010 | 115 | GSE19615 | Affymetrix | GPL570 | HG-U133P2 | 20 545 | MAS5 and log2 | 3.1
29 | Parris et al., 2010 | 94 | GSE20462 | Illumina | GPL6947 | HumanHT-12 V3.0 | 19 019 | Quantile norm. and l | 4.3
30 | Sircoulomb et al., 2010 | 55 | GSE17907 | Affymetrix | GPL570 | HG-U133P2 | 20 545 | MAS5 and log2 | 3.1
31 | Symmans et al., 2010 | 43 | GSE17705 | Affymetrix | GPL96 | HG-U133A | 13 064 | MAS5 and log2 | 4.3
32 | Buffa et al., 2011 | 216 | GSE22219 | Illumina | GPL6098 | HumanRef-8 v1.0 expr-bc | 15 757 | log2 ratio | 3.1
33 | Dedeurwaerder et al., 2011 | 85 | GSE20711 | Affymetrix | GPL570 | HG-U133P2 | 20 545 | MAS5 and log2 | 3.1
34 | Filipits et al., 2011 | 277 | GSE26971 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 3.1
35 | Hatzis et al., 2011 | 309 | GSE25055 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 3.1
36 | Heikkinen et al., 2011 | 174 | GSE24450 | Illumina | GPL6947 | HumanHT-12 V3.0 | 19 019 | Quantile norm. and l | 4.3
37 | Kao et al., 2011 | 296 | GSE20685 | Affymetrix | GPL570 | HG-U133P2 | 20 545 | MAS5 and log2 | 3.1
38 | Sabatier et al., 2011 | 71 | GSE31448 | Affymetrix | GPL570 | HG-U133P2 | 20 796 | MAS5 and log2 | 4.3
39 | Sabatier et al., 2011 | 239 | GSE21653 | Affymetrix | GPL570 | HG-U133P2 | 20 545 | MAS5 and log2 | 3.1
40 | Wang et al., 2011 | 149 | GSE16987 | Illumina | GPL6104 | HumanRef-8 v2.0 expr-bc | 16 769 | log2 ratio | 3.1
41 | Curtis et al., 2012 | 1 980 | METABRIC | Illumina HT-12 v3 | | HumanHT-12 V3.0 | 18 027 | non Affy | 4.3
42 | Kuo et al., 2012 | 51 | GSE33926 | Agilent | GPL7264 | Human 1A Microarray (V2) G4110B | 16 641 | log2 ratio | 3.1
43 | Servant et al., 2012 | 343 | GSE30682 | Illumina HumanWG-6 v | GPL6884 | HumanWG-6 v3.0 | 19 019 | Quantile norm. and l | 4.3
44 | Clarke et al., 2013 | 104 | GSE42568 | Affymetrix | GPL570 | HG-U133P2 | 20 796 | MAS5 and log2 | 4.3
45 | Guedj et al., 2013 | 536 | E_MTAB_365 | Affymetrix | GPL570 | HG-U133P2 | 20 796 | MAS5 and log2 | 4.3
46 | Larsen et al., 2013 | 183 | GSE40115 | Agilent | GPL15931 | SurePrint G3 Human GE 8x60K | 20 121 | log2 ratio | 4.3
47 | Nagalla et al., 2013 | 41 | GSE45255 | Affymetrix | GPL96 | HG-U133A | 12 662 | MAS5 and log2 | 3.1
48 | Castagnoli et al., 2014 | 53 | GSE55348 | Illumina | GPL14951 | HumanHT-12 WG-DASL V4.0 R2 | 19 019 | Quantile norm. and l | 4.3
49 | Fumagalli et al., 2014 | 57 | GSE43358 | Affymetrix | GPL570 | HG-U133P2 | 20 796 | MAS5 and log2 | 4.3
50 | Merdad et al., 2014 | 45 | GSE36295 | Affymetrix | GPL6244 | Gene 1.0 ST | 20 253 | rma-gene-level | 4.3
51 | Terunuma et al., 2014 | 55 | GSE37751 | Affymetrix | GPL6244 | Gene 1.0 ST | 20 253 | rma-gene-level | 4.3
52 | Burstein et al., 2015 | 67 | GSE76274 | Affymetrix | GPL570 | HG-U133P2 | 20 796 | MAS5 and log2 | 4.3
53 | Michaut et al., 2016 | 104 | GSE68057 | Agilent | GPL20078 | Agendia32627_DPv1.14_SCFGplus | 20 212 | Quantile norm. and l | 4.3
54 | Biermann et al., 2017 | 53 | GSE97177 | Illumina | GPL6947 | HumanHT-12 V3.0 | 19 019 | Quantile norm. and l | 4.3
Total | | 10 012

* Data have been converted to a common scale (median equal to 0 and standard deviation equal to 1).


Data pre-processing:

1 DNA microarrays data

1.1 Affymetrix pre-processing:

Before being log2-transformed, Affymetrix raw CEL data were MAS5.0-normalised using the Affymetrix Expression Console.

1.2 Non-Affymetrix pre-processing:

Data have been downloaded as they were deposited in the public databases. When patient to reference ratio and its log2-transformation were not already calculated, we performed the complete process.

1.3 All DNA microarrays data:

Finally, in order to merge all studies data and create pooled cohorts, we converted studies data to a common scale (median equal to 0 and standard deviation equal to 1 a).
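
As an illustration of the common scale (median 0, standard deviation 1), here is a minimal sketch applied to one vector of values. Whether the conversion is applied per gene, per sample or per study, and whether the sample or population standard deviation is used, is not stated here; both are assumptions in the code, and `to_common_scale` is a hypothetical name.

```python
from statistics import median, stdev

def to_common_scale(values):
    """Shift values to median 0 and rescale them to standard deviation 1."""
    centre = median(values)
    spread = stdev(values)  # sample standard deviation (assumption)
    return [(v - centre) / spread for v in values]

scaled = to_common_scale([2.0, 4.0, 4.0, 5.0, 10.0])
```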

2 RNA-seq data

2.1 TCGA pre-processing:

RNA-seq datasets were downloaded from the TCGA database (Genomic Data Commons Data Portal). We used the RNA-seq read-count expression data produced by HTSeq and normalised with the FPKM method b. FPKM values were log2-transformed using an offset of 0.1 in order to avoid undefined values.
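
The role of the offset can be seen in a short sketch: log2(0) is undefined, so a gene with an FPKM of 0 needs the offset to receive a finite value. The function name `log2_fpkm` is hypothetical.

```python
import math

def log2_fpkm(fpkm, offset=0.1):
    """log2-transform an FPKM value after adding a small offset."""
    return math.log2(fpkm + offset)

zero_expression = log2_fpkm(0.0)  # finite, equal to log2(0.1)
unit_total = log2_fpkm(0.9)       # 0.9 + 0.1 = 1, so the result is about 0
```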

2.2 GSE81540 pre-processing:

We used the Sweden Cancerome Analysis Network Breast (SCAN-B) c database. RNA-seq reads were mapped to the hg19 human genome with tophat2 and normalised to FPKM with the cufflinks2 pipeline. FPKM values were then log2-transformed with an offset of 0.1.

2.3 All RNA-seq data:

Finally, in order to merge all studies data and create pooled cohorts, we converted studies data to a common scale (median equal to 0 and standard deviation equal to 1 a).

a Shabalin et al. Bioinformatics. 2008; 24,1154-1160
b Expression mRNA pipeline
c Saal et al. Genome Medicine 2015 7:20.


Molecular subtype classification:

Table 1: Molecular subtyping methods

Columns: Molecular subtype predictor (MSP) | No. genes in MSP | Reference | Platform correspondence | R script reference | Statistics | Subtypes

Single sample predictor (SSP):
Sorlie's SSP | 500 | Sorlie et al, 2003
Hu's SSP | 306 | Hu et al, 2006
PAM50 SSP | 50 | Parker et al, 2009
Platform correspondence: gene symbols; probes median (if multiple probes for a same gene)
R script reference: Weigelt et al, 2010
Statistics: nearest centroid classifier; highest correlation coefficient between patient profile and the 5 centroids
Subtypes: Luminal A, Luminal B, Normal breast-like

Subtype clustering model (SCM):
SCMOD1 | 726 | Desmedt et al, 2008; Wirapati et al, 2008
SCMOD2 | 663 |
R script reference: subtype.cluster function, R package genefu
Statistics: mixture of three gaussians; use of ESR1, ERBB2 and AURKA modules
Subtypes: ER+/HER2- low proliferation, ER+/HER2- high proliferation

Table 2: Molecular subtyping of 14 725 breast cancer patients included in bc-GenExMiner v4.3 according to 6 molecular subtype predictors. (A. DNA microarrays [n = 10 012], B. RNA-seq [n = 4 713])

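Cells give the number of patients, followed by the percentage in parentheses where available. For the SCM rows, the Luminal A and Luminal B columns are labelled "low proliferation" and "high proliferation".

A. DNA microarrays (n = 10 012)

MSP | Basal-like | HER2-E | Luminal A | Luminal B | Normal breast-like | unclassified
Sorlie's SSP | 1 457 (14.6) | 1 182 (11.8) | 2 937 (29.3) | 1 144 (11.4) | 1 315 (13.1) | 1 977 (19.7)
Hu's SSP | 2 282 (22.8) | 879 (8.8) | 2 421 (24.2) | 1 844 (18.4) | 1 501 (15.0) | 1 085 (10.8)
PAM50 SSP | 1 958 (19.6) | 1 478 (14.8) | 2 814 (28.1) | 1 947 (19.4) | 1 257 (12.6) | 558 (5.6)
RSSPC | 1 310 | 389 | 1 435 | 367 | 651 |
SCMOD1 | 1 870 (18.7) | 1 158 (11.6) | 3 109 (31.1) | 2 810 (28.1) | | 1 065 (10.6)
SCMOD2 | 1 972 (19.7) | 1 119 (11.2) | 2 965 (29.6) | 2 685 (26.8) | | 1 271 (12.7)
SCMGENE | 2 803 (28.0) | 1 403 (14.0) | 2 583 (25.8) | 2 200 (22.0) | | 1 023 (10.2)
RSCMC | 1 294 | 692 | 1 827 | 1 490 | |
RMSPC | 1 060 | 231 | 827 | 243 | |

B. RNA-seq (n = 4 713)

MSP | Basal-like | HER2-E | Luminal A | Luminal B | Normal breast-like | unclassified
Sorlie's SSP | 624 (13.2) | 642 (13.6) | 1 596 (33.9) | 663 (14.1) | 841 (17.8) | 347 (7.4)
Hu's SSP | 1 023 (21.7) | 421 (8.9) | 1 198 (25.4) | 1 001 (21.2) | 926 (19.6) | 144 (3.1)
PAM50 SSP | 832 (17.7) | 734 (15.6) | 1 432 (30.4) | 1 031 (21.9) | 640 (13.6) | 44 (0.9)
RSSPC | 583 | 208 | 747 | 224 | 437 |
SCMOD1 | 630 (13.4) | 366 (7.8) | 1 986 (42.1) | 1 731 (36.7) | | 0 (0.0)
SCMOD2 | 666 (14.1) | 416 (8.8) | 1 915 (40.6) | 1 716 (36.4) | | 0 (0.0)
SCMGENE | 781 (16.6) | 2 365 (50.2) | 836 (17.7) | 731 (15.5) | | 0 (0.0)
RSCMC | 551 | 152 | 660 | 465 | |
RMSPC | 513 | 71 | 191 | 64 | |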

Legend

 MSP: molecular subtype predictor (SSPs + SCMs)
 No: number of patients
 SSP: single sample predictor
 RSSPC: robust SSP classification based on patients classified in the same subtype with the three SSPs
 SCM: subtype clustering model
 RSCMC: robust SCM classification based on patients classified in the same subtype with the three SCMs
 RMSPC: robust molecular subtype predictors classification
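
The robust classifications (RSSPC, RSCMC, RMSPC) keep only patients for whom the predictors agree. A minimal sketch of that rule, with hypothetical patient IDs and subtype calls, could look as follows; how "unclassified" calls are handled is not stated here and is left out.

```python
def robust_classification(*calls):
    """Return the shared subtype if all predictors agree, else None."""
    return calls[0] if len(set(calls)) == 1 else None

# Hypothetical calls from three predictors (e.g. Sorlie's, Hu's and PAM50 SSPs)
patients = {
    "P1": ("Basal-like", "Basal-like", "Basal-like"),
    "P2": ("Luminal A", "Luminal B", "Luminal A"),
}
rsspc = {pid: robust_classification(*calls) for pid, calls in patients.items()}
```

Passing the six calls of all SSPs and SCMs to the same function would give the RMSPC under the same assumption.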


Statistical analyses:

Several types of analyses are available: prognostic analyses, correlation analyses and expression analyses, all of which have different subtypes.


Targeted expression analysis:

Once the analysis criteria have been chosen (gene(s) to be tested, clinical criterion (criteria) to test the gene against), the distribution of the gene in the available population (all cohorts with availability of required information pooled together) according to the clinical criterion (criteria) is illustrated by box and whiskers plots. To assess the significance of the difference in gene distributions between the different groups, Welch's test is performed, as well as Dunnett-Tukey-Kramer's tests when appropriate.

Exhaustive expression analysis:

Box and whiskers plots are displayed, along with Welch's (and Dunnett-Tukey-Kramer's) tests, for every available clinical criterion for a single gene.

Customised expression analysis:

Similarly to targeted analysis, distribution of a chosen gene is compared in between groups, but here, the groups are defined based on another gene: the population (all cohorts with both gene values available pooled together) is split according to the median of the latter gene, resulting in 2 groups.
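
A minimal sketch of the median split: the values of the studied gene are divided into two groups according to whether the splitting gene is at or below, or above, its median. Assigning ties to the lower group is an assumption, and all names and values are hypothetical.

```python
from statistics import median

def split_by_median(target, splitter):
    """Split target values into (low, high) groups by the median of splitter."""
    cutoff = median(splitter)
    low = [t for t, s in zip(target, splitter) if s <= cutoff]
    high = [t for t, s in zip(target, splitter) if s > cutoff]
    return low, high

# Hypothetical paired values for four patients
low_group, high_group = split_by_median([10, 20, 30, 40], [1, 2, 3, 4])
```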


Targeted prognostic analysis:

Once the analysis criteria have been chosen (gene / Probe Set to be tested, nodal and oestrogen receptor status of the cohorts to be explored, event, on which survival analysis will be based, and splitting criterion for the gene), the prognostic impact of the gene is evaluated on all cohorts pooled by means of univariate Cox proportional hazards model, stratified by cohort, and illustrated with a Kaplan-Meier curve.
Cox results are displayed on the curve. In case of more than 2 groups, detailed Cox results (pairwise comparisons) are given in a separate table.
In order to minimize unreliability at the end of the curve, the 15% of patients with the longest follow-up are not plotted a.
To evaluate the independent prognostic impact of gene(s) relative to the well-established clinical markers NPI b and AOL c (10-year overall survival) and to the proliferation score d, adjusted Cox proportional hazards models are fitted on the patients of the pool with available data.

Exhaustive prognostic analysis:

Univariate Cox proportional hazards models and Kaplan-Meier curves are computed for each of the 18 possible pools corresponding to every combination of population (nodal and oestrogen receptor status) and event criteria (metastatic relapse [MR], any event [AE]) to assess the prognostic impact of the chosen gene / Probe Set, discretised according to the splitting criterion selected. Results are displayed by population and event criteria and are ordered by p-value (smallest to largest).

Molecular subtype prognostic analysis:

Patients are pooled according to their molecular subtypes, based on three single sample predictors (SSPs) and three subtype clustering models (SCMs), and on three supplementary robust molecular subtype classifications consisting of the intersections of the 3 SSPs and/or of the 3 SCMs classifications: only patients with concordant molecular subtype assignment for the 3 SSPs (RSSPC), for the 3 SCMs (RSCMC), or for all predictors (RMSPC), are kept. Univariate Cox proportional hazards analyses and Kaplan-Meier curves are computed for the chosen gene / Probe Set, discretised according to the splitting criterion selected, for each of the different molecular subtype populations.

Basal-like/TNBC prognostic analysis:

Univariate Cox proportional hazards analyses and Kaplan-Meier curves are performed, for the chosen gene / Probe Set, discretised according to the splitting criterion selected, on Basal-like (BL) patients (as defined by PAM50), on Triple-Negative breast cancer (TNBC) patients (as defined by immunohistochemistry [IHC]) and on patients both BL and TNBC.


Gene correlation targeted analysis:

Pearson's correlation coefficient is computed with associated p-value for each pair of genes based on ten different populations: all patients pooled together, patients with positive oestrogen receptor status, patients with negative oestrogen receptor status, Basal-like patients, HER2-E patients, Luminal A patients and Luminal B patients (the last 4 subgroups being determined by the RMSPC), Basal-like (PAM50) patients, Triple-Negative (IHC) patients and the intersection of the 2 latter populations.
Results are displayed in a correlation map, where each cell corresponds to a pairwise correlation and is coloured according to the correlation coefficient value, from dark blue (coefficient = -1) to dark red (coefficient = 1).
Pearson's pairwise correlation plots are also computed to illustrate each pairwise correlation.

Gene correlation exhaustive analysis:

Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and all other genes that are present in the database, based on different populations: all patients pooled together, Basal-like patients, HER2-E patients, Luminal A patients and Luminal B patients, the last 4 subgroups being determined by the RMSPC.
Genes with correlation above 0.40 in absolute value and with associated p-value less than 0.05 are retained and the genes with best correlation coefficients are displayed in two different tables: one for the first 50 (or less) positive correlations, one for the first 50 (or less) negative ones.
The lists with all genes fulfilling criteria of correlation coefficient above 0.40 in absolute value and associated p-value less than 0.05 can be downloaded from the results page.
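
The selection rule (absolute correlation above 0.40, p-value below 0.05, then the 50 strongest positive and negative correlations) can be sketched as follows. The gene names, values and the function name `select_correlated` are hypothetical.

```python
def select_correlated(results, r_min=0.40, p_max=0.05, top=50):
    """Filter (gene, r, p) tuples and return the top positive and negative lists."""
    kept = [(g, r, p) for g, r, p in results if abs(r) > r_min and p < p_max]
    positive = sorted((x for x in kept if x[1] > 0), key=lambda x: -x[1])[:top]
    negative = sorted((x for x in kept if x[1] < 0), key=lambda x: x[1])[:top]
    return positive, negative

# Hypothetical correlation results against a chosen gene
results = [
    ("GENE_A", 0.85, 0.001),
    ("GENE_B", 0.45, 0.2),    # dropped: p-value too large
    ("GENE_C", -0.62, 0.01),
    ("GENE_D", 0.30, 0.001),  # dropped: correlation too weak
    ("GENE_E", 0.55, 0.03),
]
positive, negative = select_correlated(results)
```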

Gene Ontology analysis:

As a complement to this "screening" analysis, an analysis is performed to find Gene Ontology enrichment terms. This analysis focuses on significantly under- or over-represented terms present in the list of genes most positively correlated with the chosen gene, including itself, in the list of genes most negatively correlated with the chosen gene and in the union of these two lists.
For each term of each of the Gene Ontology trees (biological process, molecular function and cellular component), comparison is done between the number of occurrences of this term in the "target list", i.e. the number of times this term is directly linked to a gene, and the number of occurrences of this term in the "gene universe" (all of the genes that are expressed in the database) by means of Fisher's exact test. Terms with associated p-values less than 0.01 are kept.
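
As a sketch of the enrichment test, a two-sided Fisher's exact test on a 2x2 table can be written with the standard library only. The table layout is an assumption: a = genes of the target list annotated with the term, b = genes of the target list without it, c and d = the same counts for the rest of the gene universe. The counts in the example are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x annotated genes in the target list
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

With the threshold given above, a term would be kept when its p-value is below 0.01.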

Gene correlation analysis by chromosomal location:

Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and genes located around the chosen gene (up to 15 up and 15 down) on the same chromosome, based on seven different populations: all patients pooled together, patients with positive oestrogen receptor status, patients with negative oestrogen receptor status, Basal-like patients, HER2-E patients, Luminal A patients and Luminal B patients, the last 4 subgroups being determined by the RMSPC.
Detailed results are displayed in a table for each population. Pearson's pairwise correlation plots are also performed to illustrate correlation of each gene with the chosen one.

Targeted correlation analysis (TCA):

As a complement, results of gene correlation analysis for genes selected via the "TCA" column can be displayed.
Targeted correlation analysis ("TCA" button), which aims at evaluating the robustness of clusters, is proposed: correlation analyses are automatically computed between all possible pairs of genes that compose a selected cluster.

a Pocock et al. Lancet. 2002; 359(9318):1686-9
b Galea et al. Breast Cancer Res Treat. 1982; 45(3):361-6.
c Adjuvant! Online
d Dexter et al. BMC Syst Biol. 2010; 4:127.

Nota bene:
  • When working with gene symbols, if there are multiple probesets for the same gene, the median of the probeset values is taken as the unique value for the gene.
  • Kaplan-Meier curves will not be computed in populations with fewer than 5 patients.
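
The probeset rule can be sketched for a single patient as follows. The probeset IDs, gene names and the function name `collapse_probesets` are hypothetical.

```python
from collections import defaultdict
from statistics import median

def collapse_probesets(probe_values, probe_to_gene):
    """Return one value per gene: the median of its probeset values."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        by_gene[probe_to_gene[probe]].append(value)
    return {gene: median(vals) for gene, vals in by_gene.items()}

# Hypothetical probeset values for one patient
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 9.0, "p4": 1.0}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_A", "p4": "GENE_B"}
gene_values = collapse_probesets(probe_values, probe_to_gene)
```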


Statistical tests:

  Survival statistical tests
Optimal Discretisation

In prognostic analyses, when "optimal" is chosen as the splitting criterion for discretisation, the gene / Probe Set is split at every percentile from the 20th to the 80th, in steps of 5, and the cutoff giving the best (smallest) Cox model p-value is kept.
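
The search over cutoffs can be sketched as a simple loop. Fitting the Cox model is outside the scope of this sketch, so the p-value is supplied by a caller-provided function; `p_value_for_split` and the toy scoring function in the example are placeholders, not a Cox model, and the nearest-rank percentile rule is an assumption.

```python
def optimal_cutoff(values, p_value_for_split, percentiles=range(20, 85, 5)):
    """Return (percentile, cutoff, p) for the split with the smallest p-value."""
    ordered = sorted(values)
    best = None
    for pct in percentiles:
        # Nearest-rank percentile (assumption)
        index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
        cutoff = ordered[index]
        p = p_value_for_split(cutoff)
        if best is None or p < best[2]:
            best = (pct, cutoff, p)
    return best

# Toy placeholder standing in for "fit a Cox model at this cutoff, return its p-value"
best = optimal_cutoff(list(range(1, 21)), lambda cutoff: abs(cutoff - 6) / 10)
```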

Cox model

  - Aim of the Cox model:
The Cox model is a regression model that expresses the relation between a covariate, either continuous (e.g. gene G) or ordered discrete (e.g. SBR grade), and the risk of occurrence of a certain event (e.g. metastatic relapse).
Its simplified formula for gene G can be written as follows:
h(t,g) = h0(t)*exp(β.g), where h is the hazard function of the event occurrence at time t, dependent on the value g of G, and h0(t) is the positive baseline hazard function, shared by all patients.
β is the regression coefficient associated with G, the parameter one wants to evaluate.

  - Interpretation of Cox model results:
There are two particularly interesting results when building a Cox model: the p-value associated with β, which tells us whether the covariate (e.g. gene) has a significant impact on the event-free survival (if the p-value is less than a certain threshold, usually 5%), and the hazard ratio (HR) (equal to exp(β)), sometimes summed up by its direction (the sign of β).

The HR, which is mainly of interest when the p-value is significant, is a risk ratio of event occurrence between patients with regard to their relative measurements for the gene under study. To be more specific, the HR corresponds to the factor by which the risk of occurrence of the event is multiplied when the risk factor increases by one unit: h(t,g+1) = h(t,g)*exp(β).
The direction of the HR therefore indicates how the gene will generally affect the patients' event-free survival.
For example, saying that the parameter β associated with the gene G under study is negative (thus exp(β) < 1) means that the greater the value of G, the lower the risk of event: if A and B are two patients such that A's G value gA is greater than B's G value gB, then one can say that patient A has a lower risk of metastatic relapse than patient B:
    gA > gB, β < 0
 ⇒ β.gA < β.gB
 ⇒ exp(β.gA) < exp(β.gB)
 ⇒ h0(t)*exp(β.gA) < h0(t)*exp(β.gB), that is, h(t, gA) < h(t, gB).
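
A small numerical sketch of the hazard ratio: for a regression coefficient β, the hazard is multiplied by exp(β) for each one-unit increase of the covariate, and by exp(β.Δ) for an increase of Δ units. The value of β below is hypothetical.

```python
import math

def hazard_ratio(beta, delta=1.0):
    """Factor by which the hazard is multiplied when the covariate rises by delta."""
    return math.exp(beta * delta)

beta = -0.5                           # hypothetical coefficient for gene G
hr_one_unit = hazard_ratio(beta)      # below 1: higher G, lower risk
hr_two_units = hazard_ratio(beta, 2)  # equals hr_one_unit squared
```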

Kaplan-Meier curves

  - The Kaplan-Meier estimator:
Kaplan-Meier method, also known as the product-limit method, is a non-parametric method to estimate the survival function S(t) (= Pr(T > t): probability of having a survival time T longer than time t) of a given population. It is based on the idea that being alive at time t means being alive just before t and staying alive at t.
Suppose we have a population of n patients, among whom k patients have experienced an event (metastatic relapse or death for instance) at distinct times t1 < t2 < ... < tm (m=k if all events occurred at different times). For each time ti, let ni denote the number of patients still at risk just before ti, that is, patients who have not yet experienced the event and are not censored, and let ei denote the number of events that occurred at ti. The event-free survival probability at time ti, S(ti), is then the probability S(ti-1) of not experiencing the event before time ti (at time ti-1) multiplied by the probability (ni-ei)/ni of not experiencing the event at time ti (which by definition of ti corresponds to the probability of not experiencing the event during the interval between ti-1 and ti): S(ti) = S(ti-1) x (ni-ei)/ni, starting from S(t0) = 1.
The Kaplan-Meier estimator of the survival function S(t) is thus the cumulative product:

S(t) = ∏ (ni-ei)/ni, where the product is taken over all event times ti ≤ t.
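
The product-limit recursion S(ti) = S(ti-1) x (ni-ei)/ni can be sketched with the standard library only. Patients censored at the same time as an event are counted as still at risk at that time, which is the usual convention but an assumption here; the data in the example are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] for each distinct event time.

    times: follow-up times; events: 1 if the event occurred, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_i, e_i, leaving = at_risk, 0, 0
        # Group all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            e_i += data[i][1]
            leaving += 1
            i += 1
        if e_i > 0:
            survival *= (n_i - e_i) / n_i
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up: events at times 2, 3 and 5; censoring at 3 and 8
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```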

  - The curve:
The Kaplan-Meier survival curve, i.e. the plot of the survival function, shows how the estimated survival function evolves over time. The curve is shaped like a staircase, with a step corresponding to events at the end of each [ti-1, ti) interval.
The illustration of the Kaplan-Meier survival estimator by the Kaplan-Meier survival curve becomes especially interesting when there are different groups of patients (e.g. according to different treatments or different values of biological markers) and one wants to compare their relative event-free survival. The different survival curves are then plotted together and can be visually compared.
The colour palette used for the curves comes from the R package viridis a; it preserves colour differences when converted to greyscale and is designed to be perceived by readers with the most common form of colour blindness.

  - Reliability of the estimation:
Caution must be taken when interpreting the survival curve, especially at its end: censored patients induce a loss of information and reduce the sample size, making the survival curve less reliable, and the end of the curve is particularly affected. For our analyses, in order to minimize unreliability at the end of the curve, the 15% of patients with the longest event-free survival or follow-up are not plotted b.

a R package viridis: default color maps from 'matplotlib'
b Pocock et al. Lancet. 2002; 359(9318):1686-9


  Gene expression correlations
Pearson correlation

  - The coefficient:
The Pearson correlation coefficient, also known as Pearson's product-moment correlation coefficient and denoted by r, measures the linear dependence (correlation) between two variables (e.g. genes).
It is obtained by the formula r = cov(G1,G2) / (std(G1)*std(G2)), where cov(G1,G2) is the covariance between the variables G1 and G2 and std denotes the standard deviation of each variable.
r values can vary from -1 to 1. A negative r means that when the first variable increases, the second one decreases; a positive r means that both variables increase or decrease simultaneously. The greater the r in absolute value, the stronger the linear dependence between the two variables, with the extreme values of -1 or 1 meaning a perfect linear dependence between the two variables, in which case, if the two variables are plotted, all data points lie on a line.

  - The associated p-value:
Along with the Pearson correlation coefficient, one can test whether this coefficient is different from 0, knowing that the statistic
t = r*√(n-2)/√(1-r²) follows a Student distribution with (n-2) degrees of freedom, n being the number of values.
The p-value associated with the Pearson correlation coefficient thus indicates whether a linear dependence exists between the two variables.
Note that one has to be careful when interpreting the p-value associated with a Pearson correlation coefficient: a significant p-value means that a linear dependence exists between two variables but does not mean that this linear dependence is strong. For example, a coefficient of 0.05 with 1600 data points is associated with a significant p-value (p = 0.046), but one can certainly not conclude that there is a strong linear dependence between the two variables.
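
The example with r = 0.05 and n = 1600 can be checked with a short sketch. The standard library has no Student distribution, so the p-value is approximated with the normal distribution, which is close to the Student distribution for 1598 degrees of freedom; this substitution is an assumption of the sketch, and the function name `pearson_t` is hypothetical.

```python
import math
from statistics import NormalDist

def pearson_t(r, n):
    """Test statistic t = r*sqrt(n-2)/sqrt(1-r^2) for a Pearson coefficient."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

t = pearson_t(0.05, 1600)  # about 2.0
# Two-sided p-value, normal approximation to Student's t with n-2 degrees of freedom
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
```

The approximate p-value falls just below 0.05, in line with the value quoted above, even though r itself is very small.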

Correlation map

A correlation map illustrates pairwise correlations among a given group of genes.
A correlation map is a square table where each line and each column represent a gene. Each cell represents an "interaction" between two genes and is coloured according to the value of the Pearson correlation coefficient between these two genes, from dark blue (coefficient = -1) to dark red (coefficient = 1).
Cells on the diagonal of the correlation map represent the "interaction" of a gene with itself and are coloured in black.

Pairwise correlation plot

On a correlation plot, the least-squares regression line is plotted along with the data points to illustrate the correlation between two given genes.


  Gene expression analyses
Box and whiskers plots

Box and whiskers plots graphically represent descriptive statistics of a continuous variable (e.g. gene): the box goes from the lower quartile (Q1) to the upper quartile (Q3), with a horizontal line marking the median. Below and above the box, whiskers extend from Q1 and Q3 respectively by 1.5 times the interquartile range, that is, to Q1-1.5*(Q3-Q1) and Q3+1.5*(Q3-Q1). Finally, stars indicate outliers, if there are any, that is, patients with values below or above the ends of the whiskers.
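
The quantities drawn on a box and whiskers plot can be computed as in the sketch below. The quartile interpolation method differs between software packages; the "inclusive" method of Python's `statistics.quantiles` is an assumption here, and the values are hypothetical.

```python
from statistics import quantiles

def box_stats(values):
    """Return quartiles, whisker limits and outliers for a list of values."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": med, "q3": q3,
            "low": low, "high": high, "outliers": outliers}

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```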

Box and whiskers plots make it possible to visually compare the distributions of a gene among the different population groups. When there is more than one group, Welch's test is used to evaluate the difference in the gene's expression between the groups. Moreover, when there are at least three different groups and Welch's p-value is significant (indicating that the gene's expression differs between at least two subpopulations), Dunnett-Tukey-Kramer's test is used for pairwise comparisons (this test gives the significance level but not a precise p-value).
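
For the two-group case, Welch's test reduces to Welch's t-test, which does not assume equal variances. The sketch below computes its statistic and the Welch-Satterthwaite degrees of freedom; the p-value would then be read from a Student distribution with that number of degrees of freedom. The form of the test used for more than two groups is not detailed here, and the values are hypothetical.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical expression values of one gene in two groups of patients
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```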