Breast Cancer Gene-Expression Miner v4.4 (bc-GenExMiner v4.4)
Glossary
Published annotated data:

The following inclusion criteria for selection of transcriptomic data were used:
 invasive carcinomas,
 tumour macrodissection (no microdissection, no biopsy),
 no neoadjuvant therapy before tumour collection,
 minimum number of patients: 40,
 no duplicate sample inside and between datasets, filtering:
. by sample ID and,
. by a threshold of Pearson correlation < 0.99 which is used to avoid duplicate data,
 female breast cancer.

# | Reference | No. patients | Footnote | Event status: MR | Event status: OS
1 | Van de Vijver et al., 2002 | 295 | ^{3} | 101 | 79
2 | Sotiriou et al., 2003 | 99 | | 30 | 45
3 | Ma et al., 2004 | 59 | | |
4 | expO et al., 2005 | 298 | ^{3} | |
5 | Minn et al., 2005 | 82 | | 27 |
6 | Pawitan et al., 2005 | 159 | ^{2} | 40 | 40
7 | Wang et al., 2005 | 286 | | 107 |
8 | Weigelt et al., 2005 | 50 | ^{3} | 13 | 10
9 | Bild et al., 2006 | 158 | ^{2} | | 50
10 | Chin et al., 2006 | 112 | ^{3} | 21 | 35
11 | Ivshina et al., 2006 | 249 | ^{3} | |
12 | Chin et al., 2007 | 171 | | 38 | 57
13 | Desmedt et al., 2007 | 198 | | 62 | 56
14 | Loi et al., 2007 | 267 | ^{3} | 66 |
15 | Minn et al., 2007 | 58 | | 11 |
16 | Naderi et al., 2007 | 135 | | | 47
17 | Yau et al., 2007 | 47 | | |
18 | Zhou et al., 2007 | 54 | | 9 |
19 | Anders et al., 2008 | 75 | | 14 |
20 | Chanrion et al., 2008 | 151 | | 46 | 41
21 | Loi et al., 2008 | 77 | ^{3} | 10 |
22 | Schmidt et al., 2008 | 200 | ^{2} | 46 |
23 | Calabrò et al., 2009 | 139 | | | 63
24 | Desmedt et al., 2009 | 55 | | |
25 | Jézéquel et al., 2009 | 252 | | 65 | 47
26 | Zhang et al., 2009 | 136 | | 20 |
27 | Jönsson et al., 2010 | 346 | | | 151
28 | Li et al., 2010 | 115 | ^{3} | 14 |
29 | Parris et al., 2010 | 94 | | | 44
30 | Sircoulomb et al., 2010 | 55 | | 17 |
31 | Symmans et al., 2010 | 43 | | 71 |
32 | Buffa et al., 2011 | 216 | | 82 |
33 | Dedeurwaerder et al., 2011 | 85 | ^{3} | |
34 | Filipits et al., 2011 | 277 | | 58 |
35 | Hatzis et al., 2011 | 309 | | 65 |
36 | Heikkinen et al., 2011 | 174 | | 34 | 27
37 | Kao et al., 2011 | 296 | ^{2} | 63 | 62
38 | Sabatier et al., 2011 | 71 | | |
39 | Sabatier et al., 2011 | 239 | | |
40 | Wang et al., 2011 | 149 | | |
41 | Curtis et al., 2012 | 1 980 | | | 1 143
42 | Kuo et al., 2012 | 51 | | 12 |
43 | Servant et al., 2012 | 343 | | 119 |
44 | Clarke et al., 2013 | 104 | ^{3} | 48 | 35
45 | Guedj et al., 2013 | 536 | | 119 |
46 | Larsen et al., 2013 | 183 | | |
47 | Nagalla et al., 2013 | 41 | | 14 | 10
48 | Castagnoli et al., 2014 | 53 | | 23 |
49 | Fumagalli et al., 2014 | 56 | | |
50 | Merdad et al., 2014 | 45 | | |
51 | Terunuma et al., 2014 | 55 | ^{3} | | 19
52 | Burstein et al., 2015 | 66 | | |
53 | Michaut et al., 2016 | 104 | | 26 | 20
54 | Biermann et al., 2017 | 53 | | |
Total | | 10 001 | | 1 491 | 2 081

Total number of patients for each annotation column:
Annotation | Total
ER status^{1} | 7 243
PR status^{1} | 3 989
HER2 status^{1} | 3 026
Nodal status | 8 240
Histo. type^{a} | 4 699
SBR status | 7 325
NPI status | 4 379
AOL status | 948
Age diagn.^{b} | 7 857
P53 (IHC) | 922
P53 (seq) | 1 980
SSPs status | 9 657
SCMs status | 9 403
^{1} ER, PR and HER2 status determined by immunohistochemistry (IHC)
^{2} ER status determined by means of genomics data (Affymetrix™ probe: 205225_at) in case of a lack of IHC data. See Kenn et al.
^{3} NPI score could be computed only for node negative patients
^{a} Histological types
^{b} Age at diagnosis
Published transcriptomic data:

The following inclusion criteria for selection of transcriptomic data were used:
 invasive carcinomas,
 tumour macrodissection (no microdissection, no biopsy),
 no neoadjuvant therapy before tumour collection,
 minimum number of patients: 40,
 no duplicate sample inside and between datasets, filtering:
. by sample ID and,
. by a threshold of Pearson correlation < 0.99 which is used to avoid duplicate data,
 female breast cancer.

# | Reference | No. patients | Study code | Platform origin | Platform code | DNA chip | No. unique genes (2019) | Processing * | bc-GenExMiner version
1 | Van de Vijver et al., 2002 | 295 | Rosetta2002 | Agilent | | 25k oligo custom | 14 853 | log2 ratio | 1.0
2 | Sotiriou et al., 2003 | 99 | PNAS1732912100 | NCI | | 8k cDNA custom | 4 345 | log2 ratio | 1.0
3 | Ma et al., 2004 | 59 | GSE1378 | Arcturus | GPL1223 | 22k oligo custom | 14 839 | log2 ratio | 1.0
4 | expO et al., 2005 | 298 | GSE2109 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 4.3
5 | Minn et al., 2005 | 82 | GSE2603 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 1.0
6 | Pawitan et al., 2005 | 159 | GSE1456 | Affymetrix™ | GPL96, GPL97 | HGU133A + B | 18 430 | MAS5 and log2 | 1.0
7 | Wang et al., 2005 | 286 | GSE2034 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 1.0
8 | Weigelt et al., 2005 | 50 | GSE2741 | Agilent | GPL1390 | Human 1A oligo UNC custom | 13 980 | log2 ratio | 1.0
9 | Bild et al., 2006 | 158 | GSE3143 | Affymetrix™ | GPL91 | HGU95A v2 | 8 767 | MAS5 and log2 | 1.0
10 | Chin et al., 2006 | 112 | E_TABM_158 | Affymetrix™ | AAFFY76 | HGU133A v2 | 12 262 | MAS5 and log2 | 1.0
11 | Ivshina et al., 2006 | 249 | GSE4922 | Affymetrix™ | GPL96, GPL97 | HGU133A + B | 18 430 | MAS5 and log2 | 1.0
12 | Chin et al., 2007 | 171 | GSE8757 | VUMC Microarray | GPL5737 | Human 30K 60mer oligo array | 17 782 | log2 ratio | 3.1
13 | Desmedt et al., 2007 | 198 | GSE7390 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 1.0
14 | Loi et al., 2007 | 267 | GSE6532 | Affymetrix™ | GPL96, GPL97, GPL570 | HG U133A + B + P2 | 20 542 | MAS5 and log2 | 1.0
15 | Minn et al., 2007 | 58 | GSE5327 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 1.0
16 | Naderi et al., 2007 | 135 | E_UCON_1 | Agilent | AAGIL14 | Human 1A oligo G4110A | 14 258 | log2 ratio | 1.0
17 | Yau et al., 2007 | 47 | GSE8193 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 4.3
18 | Zhou et al., 2007 | 54 | GSE7378 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 3.1
19 | Anders et al., 2008 | 75 | GSE7849 | Affymetrix™ | GPL91 | HGU95A v2 | 8 767 | MAS5 and log2 | 1.0
20 | Chanrion et al., 2008 | 151 | GSE9893 | MLRG | GPL5049 | Human 21k v12.0 | 15 014 | MAS5 and log2 | 1.0
21 | Loi et al., 2008 | 77 | GSE9195 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 1.0
22 | Schmidt et al., 2008 | 200 | GSE11121 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 1.1
23 | Calabrò et al., 2009 | 139 | GSE10510 | DKFZ | GPL6486 | 35k oligo | 17 807 | log2 ratio | 1.0
24 | Desmedt et al., 2009 | 55 | GSE16391 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 3.1
25 | Jézéquel et al., 2009 | 252 | GSE11264 | UMGCIRCNA | GPL4819 | 9k cDNA custom | 1 808 | log2 ratio | 1.0
26 | Zhang et al., 2009 | 136 | GSE12093 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 1.1
27 | Jönsson et al., 2010 | 346 | GSE22133 | SweGene | GPL5345 | H_v2.1.1 55K | 9 236 | log2 ratio | 3.1
28 | Li et al., 2010 | 115 | GSE19615 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 3.1
29 | Parris et al., 2010 | 94 | GSE20462 | Illumina | GPL6947 | HumanHT12 V3.0 | 19 016 | Quantile norm. and log2 | 4.3
30 | Sircoulomb et al., 2010 | 55 | GSE17907 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 3.1
31 | Symmans et al., 2010 | 43 | GSE17705 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 4.3
32 | Buffa et al., 2011 | 216 | GSE22219 | Illumina | GPL6098 | HumanRef8 v1.0 exprbc | 15 757 | log2 ratio | 3.1
33 | Dedeurwaerder et al., 2011 | 85 | GSE20711 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 3.1
34 | Filipits et al., 2011 | 277 | GSE26971 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 3.1
35 | Hatzis et al., 2011 | 309 | GSE25055 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 3.1
36 | Heikkinen et al., 2011 | 174 | GSE24450 | Illumina | GPL6947 | HumanHT12 V3.0 | 19 016 | Quantile norm. and log2 | 4.3
37 | Kao et al., 2011 | 296 | GSE20685 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 3.1
38 | Sabatier et al., 2011 | 71 | GSE31448 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 4.3
39 | Sabatier et al., 2011 | 239 | GSE21653 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 3.1
40 | Wang et al., 2011 | 149 | GSE16987 | Illumina | GPL6104 | HumanRef8 v2.0 exprbc | 17 132 | log2 ratio | 3.1
41 | Curtis et al., 2012 | 1 980 | METABRIC | Illumina | GPL6947 | HumanHT12 V3.0 | 18 025 | Quantile norm. and log2 | 4.3
42 | Kuo et al., 2012 | 51 | GSE33926 | Agilent | GPL7264 | Human 1A Microarray (V2) G4110B | 16 641 | log2 ratio | 3.1
43 | Servant et al., 2012 | 343 | GSE30682 | Illumina | GPL6884 | HumanWG6 v3.0 | 19 016 | Quantile norm. and log2 | 4.3
44 | Clarke et al., 2013 | 104 | GSE42568 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 4.3
45 | Guedj et al., 2013 | 536 | E_MTAB_365 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 4.3
46 | Larsen et al., 2013 | 183 | GSE40115 | Agilent | GPL15931 | SurePrint G3 Human GE 8x60K | 20 118 | log2 ratio | 4.3
47 | Nagalla et al., 2013 | 41 | GSE45255 | Affymetrix™ | GPL96 | HGU133A | 12 262 | MAS5 and log2 | 3.1
48 | Castagnoli et al., 2014 | 53 | GSE55348 | Illumina | GPL14951 | HumanHT12 WGDASL V4.0 R2 | 19 459 | Quantile norm. and log2 | 4.3
49 | Fumagalli et al., 2014 | 56 | GSE43358 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 4.3
50 | Merdad et al., 2014 | 45 | GSE36295 | Affymetrix™ | GPL6244 | Gene 1.0 ST | 20 251 | rmagenelevel | 4.3
51 | Terunuma et al., 2014 | 55 | GSE37751 | Affymetrix™ | GPL6244 | Gene 1.0 ST | 20 251 | rmagenelevel | 4.3
52 | Burstein et al., 2015 | 66 | GSE76274 | Affymetrix™ | GPL570 | HGU133P2 | 20 542 | MAS5 and log2 | 4.3
53 | Michaut et al., 2016 | 104 | GSE68057 | Agilent | GPL20078 | Agendia32627 DPv1.14 SCFGplus | 20 209 | Quantile norm. and log2 | 4.3
54 | Biermann et al., 2017 | 53 | GSE97177 | Illumina | GPL6947 | HumanHT12 V3.0 | 19 016 | Quantile norm. and log2 | 4.3
Total | | 10 001 | | | | | | |
* Data have been converted to a common scale (median equal to 0 and standard deviation equal to 1). 
Intrinsic molecular subtypes classification:
Table 1: Intrinsic molecular subtyping methods
(values shared by several predictors are given on the first row of each group)

Molecular subtypes predictor (MSP) | No. genes in MSP | Reference | Platform correspondence | R script reference | Statistics | Subtypes
Single sample predictor (SSP)
Sorlie's SSP | 500 | Sorlie et al, 2003 | Gene symbols; probes median (if multiple probes for a same gene) | Weigelt et al, 2010 | Nearest centroid classifier; highest correlation coefficient between patient profile and the 5 centroids | Basal-like, HER2-E, Luminal A, Luminal B, Normal breast-like
Hu's SSP | 306 | Hu et al, 2006 | | | |
PAM50 SSP | 50 | Parker et al, 2009 | | | |
Subtype clustering model (SCM)
SCMOD1 | 726 | Desmedt et al, 2008; Wirapati et al, 2008 | | subtype.cluster function, R package genefu | Mixture of three gaussians; use of ESR1, ERBB2 and AURKA modules | ER-/HER2-, HER2-E, ER+/HER2- low proliferation, ER+/HER2- high proliferation
SCMOD2 | 663 | | | | |
SCMGENE | 3 | | | | |
Table 2: Intrinsic molecular subtyping of 14 713 breast cancer patients
included in bc-GenExMiner v4.4 according to 6 molecular subtype predictors.
A DNA microarrays (n = 10 001). B RNA-seq (n = 4 712).

A
MSP | Basal-like, No (%) | HER2-E, No (%) | Luminal A, No (%) | Luminal B, No (%) | Normal breast-like, No (%) | unclassified, No (%)
Sorlie's SSP | 1 453 (14.5) | 1 182 (11.8) | 2 934 (29.3) | 1 142 (11.4) | 1 313 (13.1) | 1 977 (19.8)
Hu's SSP | 2 277 (22.8) | 879 (8.8) | 2 422 (24.2) | 1 840 (18.4) | 1 500 (15) | 1 083 (10.8)
PAM50 SSP | 1 954 (19.5) | 1 477 (14.8) | 2 811 (28.1) | 1 944 (19.4) | 1 257 (12.6) | 558 (5.6)
RSSPC | 1 306 | 388 | 1 439 | 366 | 651 |

MSP | ER-/HER2-, No (%) | HER2-E, No (%) | ER+/HER2- low proliferation, No (%) | ER+/HER2- high proliferation, No (%) | unclassified, No (%)
SCMOD1 | 1 867 (18.7) | 1 156 (11.6) | 3 104 (31) | 2 809 (28.1) | 1 065 (10.6)
SCMOD2 | 1 965 (19.6) | 1 117 (11.2) | 2 966 (29.7) | 2 682 (26.8) | 1 271 (12.7)
SCMGENE | 2 790 (27.9) | 1 400 (14) | 2 586 (25.9) | 2 202 (22) | 1 023 (10.2)
RSCMC | 1 288 | 690 | 1 827 | 1 490 |
RIMSPC | 1 055 | 231 | 828 | 242 |

B
MSP | Basal-like, No (%) | HER2-E, No (%) | Luminal A, No (%) | Luminal B, No (%) | Normal breast-like, No (%) | unclassified, No (%)
Sorlie's SSP | 625 (13.3) | 641 (13.6) | 1 595 (33.8) | 667 (14.2) | 839 (17.8) | 345 (7.3)
Hu's SSP | 1 022 (21.7) | 421 (8.9) | 1 200 (25.5) | 1 001 (21.2) | 923 (19.6) | 145 (3.1)
PAM50 SSP | 832 (17.7) | 736 (15.6) | 1 433 (30.4) | 1 029 (21.8) | 639 (13.6) | 43 (0.9)
RSSPC | 583 | 208 | 748 | 226 | 435 |

MSP | ER-/HER2-, No (%) | HER2-E, No (%) | ER+/HER2- low proliferation, No (%) | ER+/HER2- high proliferation, No (%) | unclassified, No (%)
SCMOD1 | 630 (13.4) | 365 (7.7) | 1 986 (42.2) | 1 731 (36.7) | 0 (0.0)
SCMOD2 | 667 (14.2) | 416 (8.8) | 1 913 (40.6) | 1 716 (36.4) | 0 (0.0)
SCMGENE | 781 (16.6) | 2 360 (50.1) | 838 (17.7) | 733 (15.6) | 0 (0.0)
RSCMC | 551 | 150 | 661 | 465 |
RIMSPC | 513 | 72 | 192 | 66 |
Figure 1: Intrinsic molecular subtyping of 14 713 breast cancer patients
included in bc-GenExMiner v4.4 according to 6 intrinsic molecular subtype predictors by comparison of source of data: DNA microarrays (outer circles) vs. RNA-seq (inner circles).
A 3 single sample predictors and the robust SSP classification (intersection): Sorlie's SSP, Hu's SSP, PAM50 SSP, RSSPC. Legend: Basal-like, HER2-E, Luminal A, Luminal B, Normal breast-like, unclassified.
B 3 subtype clustering models and the robust SCM classification (intersection): SCMOD1, SCMOD2, SCMGENE, RSCMC. Legend: ER-/HER2-, HER2-E, ER+/HER2- low prolif., ER+/HER2- high prolif.
C Robust RIMSPC classification (robust intrinsic molecular subtype predictors classification based on patients classified in the same subtype with the six MSPs). Legend: Basal-like, HER2-E, Luminal A, Luminal B.

Legend

MSP: molecular subtype predictor (SSPs + SCMs)
No: number of patients
SSP: single sample predictor
RSSPC: robust SSP classification based on patients classified in the same subtype with the three SSPs
SCM: subtype clustering model
RSCMC: robust SCM classification based on patients classified in the same subtype with the three SCMs
RIMSPC: robust intrinsic molecular subtype predictors classification


Data preprocessing:
1 DNA microarrays data
1.1 Affymetrix® preprocessing:
Before being log2-transformed, Affymetrix™ raw CEL data were MAS5.0-normalised (Microarray Affymetrix™ Suite 5.0)
using the Affymetrix Expression Console™,
except for Affymetrix™ Gene 1.0 ST data, which were preprocessed using the robust multi-array analysis (RMA) algorithm
from the Affy Bioconductor package^{a}.
1.2 Non-Affymetrix preprocessing:
Data have been downloaded as they were deposited in the public databases.
When the patient-to-reference ratio and its log2-transformation were not already calculated,
we performed the complete process.
1.3 All DNA microarrays data:
Finally, in order to merge all studies data and create pooled cohorts,
we converted studies data to a common scale (median equal to 0
and standard deviation equal to 1^{b}).
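The common-scale conversion can be illustrated with a minimal Python sketch: values are shifted so that their median is 0 and rescaled so that their standard deviation is 1. The function name `to_common_scale`, the example values and the use of the sample standard deviation are assumptions made for illustration; the exact procedure is described in reference ^{b}.

```python
import statistics

def to_common_scale(values):
    """Shift values to a median of 0 and scale them to a standard deviation of 1."""
    med = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("cannot rescale constant values")
    return [(v - med) / sd for v in values]

# Hypothetical log2 expression values from one study
scaled = to_common_scale([5.1, 6.3, 7.8, 6.9, 5.6])
print(statistics.median(scaled), statistics.stdev(scaled))
```

After this conversion, values coming from different studies share the same centre and spread, which is what allows them to be pooled.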


2 RNA-seq data
2.1 TCGA preprocessing:
RNA-seq datasets were downloaded from the TCGA database (Genomic Data Commons Data Portal).
We used the RNA-seq expression level read counts data produced by HTSeq and normalized using the FPKM normalization method ^{c}.
FPKM values were log2-transformed using an offset of 0.1 in order to avoid undefined values.
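The role of the offset can be shown with a short Python sketch: without it, a gene with an FPKM of 0 would have an undefined log2 value. The function name `log2_fpkm` is hypothetical; only the offset of 0.1 comes from the text.

```python
import math

OFFSET = 0.1  # offset stated in the text

def log2_fpkm(fpkm, offset=OFFSET):
    """log2-transform an FPKM value after adding a small offset."""
    return math.log2(fpkm + offset)

# An FPKM of 0 now maps to a finite value instead of being undefined
print(log2_fpkm(0.0))
print(log2_fpkm(12.5))
```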
2.2 GSE81540 preprocessing:
We used the Sweden Cancerome Analysis Network – Breast (SCAN-B) ^{d} database.
RNA-seq reads were mapped to the hg19 human genome with tophat2 and normalized in FPKM with the cufflinks2 pipeline.
FPKM values were then log2-transformed with an offset of 0.1.
2.3 All RNA-seq data:
Finally, in order to merge all studies data and create pooled cohorts,
we converted studies data to a common scale (median equal to 0
and standard deviation equal to 1 ^{b}).

Statistical analyses:
Several types of analyses are available: correlation analyses, expression analyses and prognostic analyses,
all of which have different subtypes.

Correlation analyses


Gene correlation targeted analysis:
Pearson's correlation coefficient is computed with associated p-value for each pair of genes based on ten different populations:
all patients pooled together, patients with positive oestrogen receptor status, patients with negative oestrogen receptor status, Basal-like patients,
HER2-E patients, Luminal A patients and Luminal B patients (the last 4 subgroups being determined by the RIMSPC),
Basal-like (as defined by PAM50) patients, Triple-Negative (as defined by immunohistochemistry [IHC]) patients and the intersection of the 2 latter populations.
Results are displayed in a correlation map, where each cell corresponds to a pairwise correlation
and is coloured according to the correlation coefficient value, from dark blue (coefficient = -1) to dark red (coefficient = 1).
Pearson's pairwise correlation plots are also computed to illustrate each pairwise correlation.
Gene correlation exhaustive analysis:
Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and all other genes that are present in the database,
based on different populations: all patients pooled together, Basal-like patients, HER2-E patients, Luminal A patients and Luminal B patients,
the last 4 subgroups being determined by the RIMSPC.
Genes with correlation above 0.40 in absolute value and with associated p-value less than 0.05 are retained and the genes with best correlation coefficients are displayed
in two different tables: one for the first 50 (or fewer) positive correlations, one for the first 50 (or fewer) negative ones.
The lists with all genes fulfilling criteria of correlation coefficient above 0.40 in absolute value and associated p-value less than 0.05 can be downloaded from the results page.
Gene Ontology analysis:
As a complement to this "screening" analysis, an analysis is performed to find Gene Ontology enrichment terms.
This analysis focuses on significantly under- or over-represented terms present in the list of genes most positively correlated with the chosen gene, including itself,
in the list of genes most negatively correlated with the chosen gene and in the union of these two lists.

For each term of each of the Gene Ontology trees (biological process, molecular function and cellular component), a comparison is made between
the number of occurrences of this term in the "target list", i.e. the number of times this term is directly linked to a gene,
and the number of occurrences of this term in the "gene universe" (all of the genes that are expressed in the database) by means of Fisher's exact test.
Terms with associated p-values less than 0.01 are kept.
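To make the enrichment test more concrete, here is a minimal Python sketch of a two-sided Fisher's exact test on a 2x2 table. The table layout (genes linked or not linked to the term, in the target list versus in the rest of the gene universe) and all counts are assumptions for illustration; the way the counts are built in the application is not detailed here.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a: target genes linked to the term      b: target genes not linked to it
    c: other universe genes linked to it    d: other universe genes not linked
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x term-linked genes in the target list
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: 8 of 50 target genes carry the term, 40 of 1000 others do
p = fisher_exact_two_sided(8, 42, 40, 960)
print(p, "kept" if p < 0.01 else "discarded")
```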
Gene correlation analysis by chromosomal location:
Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and genes located around the chosen gene (up to 15 up and 15 down) on the same chromosome,
based on seven different populations: all patients pooled together, patients with positive oestrogen receptor status, patients with negative oestrogen receptor status, Basal-like patients,
HER2-E patients, Luminal A patients and Luminal B patients, the last 4 subgroups being determined by the RIMSPC.
Detailed results are displayed in a table for each population.
Pearson's pairwise correlation plots are also produced to illustrate the correlation of each gene with the chosen one.
Targeted correlation analysis (TCA):
As a complement, results of gene correlation analysis for genes selected via the "TCA" column can be displayed.
Targeted correlation analysis ("TCA" button), which aims at evaluating the robustness of clusters, is proposed:
correlation analyses are automatically computed between all possible pairs of genes that compose a selected cluster.

Expression analyses


Targeted expression analysis:
Once the analysis criteria have been chosen (gene(s) to be tested, clinical criterion (criteria) to test the gene against),
the distribution of the gene in the available population (all cohorts with availability of required information pooled together)
according to the clinical criterion (criteria) is illustrated by box and whisker, beeswarm, violin and raincloud plots.
To assess the significance of the difference in gene distributions between the different groups, a Welch's test is performed,
as well as Dunnett-Tukey-Kramer's tests when appropriate.


Exhaustive expression analysis:
box and whisker, beeswarm, violin and raincloud plots are displayed, along with Welch's (and Dunnett-Tukey-Kramer's) tests,
for every available clinical criterion for a single gene.
Customised expression analysis:
Similarly to targeted analysis, distribution of a chosen gene is compared in between groups, but here, the groups are defined based on another gene:
the population (all cohorts with both gene values available pooled together) is split according to the median of the latter gene, resulting in 2 groups.

Prognostic analyses


Targeted prognostic analysis:
Once the analysis criteria have been chosen (gene / Probe Set to be tested,
nodal and oestrogen receptor status of the cohorts to be explored, event on which survival analysis will be based, and splitting criterion for the gene),
the prognostic impact of the gene is evaluated on all cohorts pooled by means of a univariate
Cox proportional hazards model, stratified by cohort,
and illustrated with a Kaplan-Meier curve.
Cox results are displayed on the curve. In case of more than 2 groups, detailed Cox results (pairwise comparisons) are given in a separate table.
In order to minimize unreliability at the end of the curve, the 15% of patients with the longest follow-up are not plotted ^{a}.
To evaluate the independent prognostic impact of gene(s) relative to
the well-established clinical markers NPI ^{b} and AOL ^{c} (10-year overall survival) and to proliferation score ^{d},
adjusted Cox proportional hazards models are performed on the pooled patients with available data.
Exhaustive prognostic analysis:
Univariate Cox proportional hazards models and Kaplan-Meier curves
are computed on each of the 18 possible pools corresponding to every combination of population
(nodal and oestrogen receptor status) and event criteria (metastatic relapse [MR], overall survival [OS]) to assess
the prognostic impact of the chosen gene / Probe Set, discretised according to the splitting criterion selected.
Results are displayed by population and event criteria and are ordered by p-value (smallest to largest).


Molecular subtype prognostic analysis:
Patients are pooled according to their molecular subtypes, based on three single sample predictors (SSPs)
and three subtype clustering models (SCMs), and on three supplementary robust molecular subtype classifications
consisting of the intersections of the 3 SSPs and/or of the 3 SCMs classifications:
only patients with concordant molecular subtype assignment for the 3 SSPs (RSSPC),
for the 3 SCMs (RSCMC), or for all predictors (RIMSPC), are kept. Univariate Cox proportional hazards analyses
and Kaplan-Meier curves are performed for the chosen gene / Probe Set,
discretised according to the splitting criterion selected,
for each of the different molecular subtype populations.
Basal-like/TNBC prognostic analysis:
Univariate Cox proportional hazards analyses and Kaplan-Meier curves
are performed, for the chosen gene / Probe Set, discretised according to the splitting criterion selected, on Basal-like (BL) patients (PAM50),
on Triple-Negative breast cancer (TNBC) patients (IHC) and on patients both BL and TNBC.

Nota bene:
 When working with gene symbols and in case of multiple probesets for
the same gene, the median of the probeset values is taken as the unique value for the gene.
 Kaplan-Meier curves will not be computed in populations with fewer than 5 patients.

Statistical tests:
Correlation statistical tests


Pearson correlation
 The coefficient:
Pearson correlation coefficient, also known as the Pearson's product moment correlation coefficient and denoted by r, measures the linear dependence (correlation)
between two variables (e.g. genes).
It is obtained by the formula r = cov(G_{1},G_{2}) / (std(G_{1})*std(G_{2})),
where cov(G_{1},G_{2}) is the covariance between the variables G_{1} and G_{2} and std denotes the standard deviation of each variable.
r values can vary from -1 to 1. A negative r means that when the first variable increases, the second one decreases,
a positive r means that both variables increase or decrease simultaneously.
The greater the r in absolute value, the stronger the linear dependence between the two variables, with the extreme values of -1 or 1 meaning a perfect linear dependence
between the two variables, in which case, if the two variables are plotted, all data points lie on a line.


 The associated p-value:
Along with the Pearson correlation coefficient, one can test if this coefficient is different from 0, knowing that the statistic
t = r*√(n-2)/√(1-r^{2}) follows a Student distribution with (n-2) degrees of freedom, n being the number of values.
The p-value associated with the Pearson correlation coefficient thus indicates whether a linear dependence exists between the two variables.
Note that one has to be careful when interpreting the p-value associated with a Pearson correlation coefficient: a significant p-value means that a linear dependence
exists between two variables but does not mean that this linear dependence is strong; for example, a coefficient of 0.05 with 1600 data points is associated
with a significant p-value (p = 0.046) but one can certainly not conclude that there is a strong linear dependence between the two variables!
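The coefficient and its test statistic can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and the p-value uses a normal approximation to the Student distribution (an assumption that is reasonable only for large n, as in the example with 1600 data points), so the last digits may differ slightly from an exact Student computation.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation coefficient: cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # The 1/(n-1) factors of the covariance and standard deviations cancel out
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def approx_two_sided_p(t):
    """Normal approximation of the two-sided p-value (large n only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Example from the text: r = 0.05 with 1600 data points
t = t_statistic(0.05, 1600)
p = approx_two_sided_p(t)
print(t, p)
```

With these inputs t is close to 2 and the approximate p-value falls just below 0.05, although r itself indicates a very weak linear dependence.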

Expression statistical tests


To evaluate the difference in a gene's expression among the different population groups, Welch's test is used between the groups.
Moreover, when there are at least three different groups and Welch's p-value is significant (indicating that the gene's expression
is different between at least two subpopulations), Dunnett-Tukey-Kramer's test is used for two-by-two comparisons
(this test gives the significance level but does not give a precise p-value).
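For the simplest case of two groups, Welch's test reduces to a t statistic that does not assume equal variances, together with the Welch-Satterthwaite approximation of the degrees of freedom. The Python sketch below covers only this two-group case; the function name and the data are hypothetical, and the form of the test used by the application for three or more groups is not specified here.

```python
import math
import statistics

def welch_t(group1, group2):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical expression values of one gene in two patient groups
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t, df)
```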

Prognostic statistical tests


Optimal Discretisation
In prognostic analyses, when choosing "optimal" as the splitting criterion for discretisation,
gene / Probe Set is split according to all percentiles from the 20th to the 80th, with a step of 5, and
the cutoff giving the best p-value (Cox model) is kept.
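The search over candidate cutoffs can be sketched in Python as follows. The Cox model itself is not implemented: `p_value_for_cutoff` is a hypothetical callable standing in for "fit a Cox model on the two groups defined by this cutoff and return its p-value", and the percentile convention (`method="inclusive"`) is an assumption.

```python
import statistics

def candidate_cutoffs(values):
    """Expression values at the 20th, 25th, ..., 80th percentiles."""
    # quantiles(n=100) returns 99 cut points; pct[k - 1] is the k-th percentile
    pct = statistics.quantiles(values, n=100, method="inclusive")
    return {k: pct[k - 1] for k in range(20, 85, 5)}

def optimal_cutoff(values, p_value_for_cutoff):
    """Return (percentile, cutoff value) giving the smallest p-value."""
    cutoffs = candidate_cutoffs(values)
    best = min(cutoffs, key=lambda k: p_value_for_cutoff(cutoffs[k]))
    return best, cutoffs[best]

# Toy example: a made-up p-value function that is smallest near a cutoff of 55
expression = list(range(101))
print(optimal_cutoff(expression, lambda c: abs(c - 55) / 100))
```

Thirteen cutoffs are tested in total, so the retained p-value is the best of several attempts and should be read with that in mind.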


Cox model
 Aim of the Cox model:
Cox model is a regression model to express the relation between a covariate,
either continuous (e.g. G gene) or ordered discrete (e.g. SBR grade), and the risk
of occurrence of a certain event (e.g. metastatic relapse).
Its simplified formula for G gene can be written as follows:
h(t,g) = h_{0}(t)*exp(β.g), where h is the hazard function of the event occurrence at time t,
dependent on the value g of G and h_{0}(t) is the positive baseline hazard function,
shared by all patients.
β is the regression coefficient associated with G, the parameter one wants to evaluate.
 Interpretation of Cox model results:
There are two particularly interesting results when building a Cox model: the p-value
associated with β, which tells us whether the covariate (e.g. gene) has a significant
impact on the event-free survival (if the p-value is less than a certain threshold,
usually 5%) and the hazard ratio (HR) (equal to exp(β)), sometimes summed up by its direction
(sign of β).

The HR, which is really interesting when the p-value is significant,
is actually a risk ratio of an event occurrence between patients with regard
to their relative measurements for the gene under study. To be more specific,
the HR corresponds to the factor by which the risk of occurrence of
the event is multiplied when the risk factor increases by one unit:
h(t,g+1) = h(t,g)*exp(β).
The direction of this HR therefore indicates how the gene will generally affect
the patients' event-free survival.
For example, saying that parameter β associated with the gene G under study is negative
(thus exp(β) < 1) means that the greater the value of G, the lower the risk of event:
if A and B are two patients such that A's G value g_{A} is greater than B's G value g_{B},
then one can say that patient A has a lower risk of metastatic relapse than patient B:
g_{A} > g_{B}, β < 0
⇒ β.g_{A} < β.g_{B}
⇒ exp(β.g_{A}) < exp(β.g_{B})
⇒ h_{0}(t)*exp(β.g_{A}) < h_{0}(t)*exp(β.g_{B}), that is, h(t, g_{A}) < h(t, g_{B}).
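The reasoning above can be checked numerically with a small Python sketch. The regression coefficient and the gene values are made up for illustration; the point is only that the baseline hazard cancels out when two patients are compared, so their relative risk depends on the coefficient and on the difference between their gene values.

```python
import math

def hazard_ratio(beta):
    """Hazard ratio for a one-unit increase of the covariate: exp(beta)."""
    return math.exp(beta)

def relative_hazard(beta, g_a, g_b):
    """h(t, g_a) / h(t, g_b); the baseline hazard h0(t) cancels out."""
    return math.exp(beta * (g_a - g_b))

beta = -0.5            # hypothetical negative coefficient
g_a, g_b = 2.0, 1.0    # patient A has the higher gene value

print(hazard_ratio(beta))               # below 1 because beta is negative
print(relative_hazard(beta, g_a, g_b))  # below 1: A is at lower risk than B
```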

Kaplan-Meier curves
 The Kaplan-Meier estimator:
Kaplan-Meier method, also known as the product-limit method, is a non-parametric method
to estimate the survival function S(t) (= Pr(T > t): probability of having a survival
time T longer than time t) of a given population. It is based on the idea that being alive
at time t means being alive just before t and staying alive at t.
Suppose we have a population of n patients, among whom k patients have experienced
an event (metastatic relapse or death for instance) at distinct times
t_{1} < t_{2} < ... < t_{m}
(m=k if all events occurred at different times). For each time t_{i}, let n_{i} denote
the number of patients still at risk just before t_{i}, that is patients who have not
yet experienced the event and are not censored, and let e_{i} denote the number of
events that occurred at t_{i}. The event-free survival probability at time t_{i}, S(t_{i}),
is then the probability S(t_{i-1}) of not experiencing the event before time t_{i}
(at time t_{i-1}) multiplied by the probability (n_{i}-e_{i})/n_{i} of not experiencing the event
at time t_{i} (which by definition of t_{i} corresponds to the probability of not experiencing
the event during the interval between t_{i-1} and t_{i}): S(t_{i}) = S(t_{i-1}) x (n_{i}-e_{i})/n_{i}.
The Kaplan-Meier estimator of the survival function S(t) is thus the cumulative product:
S(t) = ∏_{i: t_{i} ≤ t} (n_{i}-e_{i})/n_{i}
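The product-limit computation can be written as a short Python function. This is a minimal sketch with hypothetical names and data, not the code used by the application: it returns the estimated survival probability at each distinct event time, treating patients censored at a given time as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] at each distinct event time.

    times:  follow-up time of each patient
    events: 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        same_time = [e for tt, e in data if tt == t]
        n_events = sum(same_time)
        if n_events:
            # Multiply by the probability of surviving past t
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        # Patients with an event or censored at t leave the risk set
        at_risk -= len(same_time)
        i += len(same_time)
    return curve

# Six hypothetical patients, two of them censored (event = 0)
curve = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
for t, s in curve:
    print(t, round(s, 3))
```

Each step of the resulting list corresponds to one step of the staircase-shaped survival curve.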


 The curve:
The Kaplan-Meier survival curve, i.e. the plot of the survival function, makes it possible to
visualise the evolution of the survival function (estimate). The curve is shaped like
a staircase, with a step corresponding to events at the end of each [t_{i-1}; t_{i}[ interval.
Tick marks on each curve indicate censored observations.
The illustration of the Kaplan-Meier survival estimator by the Kaplan-Meier survival
curve becomes especially interesting when there are different groups of patients
(e.g. according to different treatments or different values of biological markers)
and one wants to compare their relative event-free survival. The different survival
curves are then plotted together and can be visually compared.
The colour palette used for the curve is from the R package viridis ^{a};
it keeps the colour difference when converted to a black and white scale
and is designed to be perceived by readers with the most common form of colour blindness.
 Reliability of the estimation:
Caution must be taken concerning the interpretation of the survival curve,
especially at the end of the survival curve: the censored patients induce a loss
of information and reduce the sample size, making the survival curve less reliable;
the end of the curve is particularly affected. For our analyses, in order
to minimize unreliability at the end of the curve, the 15% of patients with
the longest event-free survival or follow-up are not plotted ^{a}.

Graphic illustrations:
Correlation graphic illustrations


Correlation map
A correlation map illustrates pairwise correlations among a given group of genes.
A correlation map is a square table where each line and each column represent a gene.
Each cell represents an "interaction" between two genes and is coloured according to the value of the Pearson correlation coefficient between these two genes,
from dark blue (coefficient = -1) to dark red (coefficient = 1).
Cells on the diagonal of the correlation map represent the "interaction" of a gene with itself and are coloured in black.


Pairwise correlation plot
On a correlation plot, the least-squares regression line is plotted along with the data points to illustrate the correlation between two given genes.


Expression graphic illustrations


Box and whisker, beeswarm, violin and raincloud plots
Box and whisker plots graphically represent descriptive statistics of a continuous variable (e.g. gene):
the box goes from the lower quartile (Q1) to the upper quartile (Q3), with a horizontal line marking the median.
At the bottom and the top of the box, whiskers indicate the distance between Q1, respectively Q3,
and 1.5 times the interquartile range, that is: Q1-1.5*(Q3-Q1) and Q3+1.5*(Q3-Q1).
Beeswarm is a one-dimensional scatter plot similar to a stripchart, except that would-be overlapping points are separated such that each is visible
(package beeswarm^{a}).
Violin plot combines the kernel probability density plot and the box and whisker plot.
Density curves are plotted symmetrically on both sides of the box and whisker plot.

Raincloud plot is a combination of a split-half violin, raw jittered data points, and a box and whisker plot ^{b}.
Box and whisker, beeswarm, violin and raincloud plots make it possible to visually compare distributions of a gene among the different population groups.
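The quantities drawn by a box and whisker plot, as defined above, can be computed with the following Python sketch. The function name and the data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since several conventions exist and the one used for the plots is not stated here.

```python
import statistics

def box_plot_limits(values):
    """Quartiles and whisker limits Q1 - 1.5*(Q3 - Q1) and Q3 + 1.5*(Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower": q1 - 1.5 * iqr,
        "upper": q3 + 1.5 * iqr,
    }

# Hypothetical expression values of one gene in one group
limits = box_plot_limits([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(limits)
```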



