Breast Cancer Gene-Expression Miner v4.0
Published annotated data
Published genomic data
Data pre-processing
Molecular subtype classification
Statistical analyses
Survival statistical tests
Gene expression
Correlation map
Biological validation
Published annotated data:
|ER status||PR status||HER2 status||SBR|
|Age at diagnosis||NPI|
|Event status |
|MR ||AE |
|1||Van de Vijver et al., 2002||295 ||101 ||122 |
|2||Sotiriou et al., 2003||99 ||30 ||53 |
|3||Ma et al., 2004||59 || ||27 |
|4||Minn et al., 2005||82 ||27 ||27 |
|5||Pawitan et al., 2005||159 ||1||40 ||50 |
|6||Wang et al., 2005||286 ||107 ||107 |
|7||Weigelt et al., 2005||50 ||2||13 ||13 |
|8||Bild et al., 2006||158 ||1|| ||50 |
|9||Chin et al., 2006||112 ||2||21 ||42 |
|10||Ivshina et al., 2006||249 ||2|| ||89 |
|11||Chin et al., 2007||171 ||38 ||56 |
|12||Desmedt et al., 2007||198 ||62 ||91 |
|13||Loi et al., 2007||401 ||2||101 ||139 |
|14||Minn et al., 2007||58 ||11 ||11 |
|15||Naderi et al., 2007||135 || ||65 |
|16||Zhou et al., 2007||54 ||9 ||9 |
|17||Anders et al., 2008||75 ||14 ||14 |
|18||Chanrion et al., 2008||155 ||48 ||57 |
|19||Loi et al., 2008||77 ||2||10 ||13 |
|20||Schmidt et al., 2008||200 ||1||46 ||46 |
|21||Calabrò et al., 2009||139 || ||96 |
|22||Desmedt et al., 2009||55 || ||55 |
|23||Jézéquel et al., 2009||252 ||65 ||68 |
|24||Zhang et al., 2009||136 ||20 ||20 |
|25||Jönsson et al., 2010||346 || ||151 |
|26||Li et al., 2010||115 ||2||14 ||14 |
|27||Sircoulomb et al., 2010||55 ||17 ||17 |
|28||Buffa et al., 2011||216 ||82 ||82 |
|29||Dedeurwaerder et al., 2011||85 || ||36 |
|30||Filipits et al., 2011||277 ||58 ||58 |
|31||Hatzis et al., 2011||309 ||65 ||65 |
|32||Kao et al., 2011||296 ||1||63 ||73 |
|33||Sabatier et al., 2011||266 || ||83 |
|34||Wang et al., 2011||149 || ||10 |
|35||Kuo et al., 2012||51 ||12 ||12 |
|36||Nagalla et al., 2013||41 ||14 ||14 |
|Total||5 861 ||29||36||19||15||26||26||17||8||33||32||1 088 ||1 935 |
1 ER status was determined based on the 205225_at Affymetrix probe (HG-U133) or on the median value of the Affymetrix probes representing ESR1 (HG-U95A v2), using a 2-component Gaussian mixture distribution model (Lehmann et al. J Clin Invest. 2011; 121(7):2750-67).
2 NPI score could be computed only for node-negative patients.
| No.: ||number of|
| ER: ||oestrogen receptor by IHC|
| PR: ||progesterone receptor by IHC|
| HER2: ||HER2 receptor by IHC|
| IHC: ||ImmunoHistoChemistry|
| SBR: ||Scarff Bloom and Richardson grade|
| NPI: ||Nottingham prognostic index|
| AOL: ||Adjuvant! Online|
| SSPs: ||Single Sample Predictors (Sorlie, Hu and PAM50)|
| SCMs: ||Subtype Clustering Models (SCMOD1, SCMOD2, SCMGENE)|
| MR: ||metastatic relapse|
| AE: ||any event (any pejorative event: local relapse, metastatic relapse or death.)|
| : ||available information|
| : ||unavailable information|
Published genomic data:
|#||Reference||No. patients||Study code||Platform origin||Platform code||DNA chip||No. unique genes (2015)||Processing *||bc-GenExMiner version|
|1||Van de Vijver et al., 2002||295 ||Rosetta2002||Agilent||25k oligo custom||15 031 ||log2 ratio||1.0|
|2||Sotiriou et al., 2003||99 ||PNAS1732912100||NCI||8k cDNA custom||4 368 ||log2 ratio||1.0|
|3||Ma et al., 2004||59 ||GSE1378||Arcturus||GPL1223||22k oligo custom||15 558 ||log2 ratio||1.0|
|4||Minn et al., 2005||82 ||GSE2603||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||1.0|
|5||Pawitan et al., 2005||159 ||GSE1456||Affymetrix||GPL96 - GPL97||HG-U133A + B||19 894 ||MAS5 and log2||1.0|
|6||Wang et al., 2005||286 ||GSE2034||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||1.0|
|7||Weigelt et al., 2005||50 ||GSE2741||Agilent||GPL1390||Human 1A oligo UNC custom||13 927 ||log2 ratio||1.0|
|8||Bild et al., 2006||158 ||GSE3143||Affymetrix||GPL91||HG-U95A v2||9 076 ||MAS5 and log2||1.0|
|9||Chin et al., 2006||112 ||E_TABM_158||Affymetrix||A-AFFY-76||HG-U133A v2||13 226 ||MAS5 and log2||1.0|
|10||Ivshina et al., 2006||249 ||GSE4922||Affymetrix||GPL96 - GPL97||HG-U133A + B||19 894 ||MAS5 and log2||1.0|
|11||Chin et al., 2007||171 ||GSE8757||VUMC Microarray||GPL5737||Human 30K 60-mer oligo array||18 363 ||log2 ratio||3.1|
|12||Desmedt et al., 2007||198 ||GSE7390||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||1.0|
|13||Loi et al., 2007||401 ||GSE6532||Affymetrix||GPL96 - GPL97 - GPL570||HG U133A + B + P2||22 847 ||MAS5 and log2||1.0|
|14||Minn et al., 2007||58 ||GSE5327||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||1.0|
|15||Naderi et al., 2007||135 ||E_UCON_1||Agilent||A-AGIL-14||Human 1A oligo G4110A||14 268 ||log2 ratio||1.0|
|16||Zhou et al., 2007||54 ||GSE7378||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||3.1|
|17||Anders et al., 2008||75 ||GSE7849||Affymetrix||GPL91||HG-U95A v2||9 076 ||MAS5 and log2||1.0|
|18||Chanrion et al., 2008||155 ||GSE9893||MLRG||GPL5049||Human 21k v12.0||15 184 ||MAS5 and log2||1.0|
|19||Loi et al., 2008||77 ||GSE9195||Affymetrix||GPL570||HG-U133P2||22 847 ||MAS5 and log2||1.0|
|20||Schmidt et al., 2008||200 ||GSE11121||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||1.1|
|21||Calabrò et al., 2009||139 ||GSE10510||DKFZ||GPL6486||35k oligo||16 536 ||log2 ratio||1.0|
|22||Desmedt et al., 2009||55 ||GSE16391||Affymetrix||GPL570||HG-U133P2||22 847 ||MAS5 and log2||3.1|
|23||Jézéquel et al., 2009||252 ||GSE11264||UMGC-IRCNA||GPL4819||9k cDNA custom||1 814 ||log2 ratio||1.0|
|24||Zhang et al., 2009||136 ||GSE12093||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||1.1|
|25||Jönsson et al., 2010||346 ||GSE22133||SweGene||GPL5345||H_v2.1.1 55K||9 281 ||log2 ratio||3.1|
|26||Li et al., 2010||115 ||GSE19615||Affymetrix||GPL570||HG-U133P2||22 847 ||MAS5 and log2||3.1|
|27||Sircoulomb et al., 2010||55 ||GSE17907||Affymetrix||GPL570||HG-U133P2||22 847 ||MAS5 and log2||3.1|
|28||Buffa et al., 2011||216 ||GSE22219||Illumina||GPL6098||HumanRef-8 v1.0 expr-bc||15 623 ||log2 ratio||3.1|
|29||Dedeurwaerder et al., 2011||85 ||GSE20711||Affymetrix||GPL570||HG-U133P2||22 847 ||MAS5 and log2||3.1|
|30||Filipits et al., 2011||277 ||GSE26971||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||3.1|
|31||Hatzis et al., 2011||309 ||GSE25055||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||3.1|
|32||Kao et al., 2011||296 ||GSE20685||Affymetrix||GPL570||HG-U133P2||22 847 ||MAS5 and log2||3.1|
|33||Sabatier et al., 2011||266 ||GSE21653||Affymetrix||GPL570||HG-U133P2||22 847 ||MAS5 and log2||3.1|
|34||Wang et al., 2011||149 ||GSE16987||Illumina||GPL6104||HumanRef-8 v2.0 expr-bc||16 976 ||log2 ratio||3.1|
|35||Kuo et al., 2012||51 ||GSE33926||Agilent||GPL7264||Human 1A Microarray (V2) G4110B||16 754 ||log2 ratio||3.1|
|36||Nagalla et al., 2013||41 ||GSE45255||Affymetrix||GPL96||HG-U133A||13 226 ||MAS5 and log2||3.1|
|Total ||5861 |
* Data have been converted to a common scale (median equal to 0 and standard deviation equal to 1).
Data pre-processing:
1.1 Affymetrix pre-processing:
Affymetrix raw CEL data were MAS5.0-normalised using the Affymetrix Expression Console,
then log2-transformed.
1.2 Non-Affymetrix pre-processing:
Data were downloaded as they were deposited in the public databases.
When the patient-to-reference ratio and its log2 transformation had not already been calculated,
we performed the complete process ourselves.
2 All data:
Finally, in order to merge the data of all studies and create pooled cohorts,
we converted each study's data to a common scale (median equal to 0
and standard deviation equal to 1) (a).
a Shabalin et al.
Bioinformatics. 2008; 24:1154-1160
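As an illustration of the common-scale conversion described above, here is a minimal Python sketch. It assumes the conversion amounts to subtracting the median and dividing by the standard deviation of a vector of log2 values; the function name `to_common_scale`, the use of the sample standard deviation, and the example values are assumptions made for illustration, not the tool's actual implementation.

```python
import statistics


def to_common_scale(values):
    """Centre values on their median and divide by their standard deviation.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the source does not say which estimator was applied.
    """
    med = statistics.median(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot rescale a constant vector")
    return [(v - med) / sd for v in values]


# Hypothetical log2 expression values from one study.
log2_values = [7.1, 8.4, 6.9, 9.3, 8.0]
scaled = to_common_scale(log2_values)

# After conversion the median is 0 and the standard deviation is 1.
print(statistics.median(scaled), statistics.stdev(scaled))
```

Once every study is on this scale, values from different platforms can be pooled, although this sketch does not address any further cross-platform adjustment.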
Molecular subtype classification:
RMSPC (Robust Molecular Subtype Predictors Classification): patients classified in the same molecular subtype with the six molecular subtype predictors (MSP).
Table 1: Molecular subtyping methods
|Molecular subtype predictor (MSP)
||No. genes in MSP
||R script reference
|Single sample predictor (SSP)
||Sorlie et al, 2003
||Gene symbols; probes median (if multiple probes for a same gene)
||Weigelt et al, 2010
||Nearest centroid classifier;
highest correlation coefficient between patient profile and the 5 centroids
||Hu et al, 2006
||Parker et al, 2009
|Subtype clustering model (SCM)
||Desmedt et al, 2008
Wirapati et al, 2008
|subtype.cluster function, R package genefu
||Mixture of three gaussians;
use of ESR1, ERBB2 and AURKA modules
ER+/HER2- low proliferation,
ER+/HER2- high proliferation
Table 2: Molecular subtyping of 5 861 breast cancer patients
included in bc-GenExMiner v3.1 according to 6 molecular subtype predictors
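|MSP||Basal-like No.||Basal-like %||HER2-E No.||HER2-E %||Luminal A No.||Luminal A %||Luminal B No.||Luminal B %||Normal breast-like No.||Normal breast-like %||Unclassified No.||Unclassified %|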
|Sorlie's SSP||795 ||13.6 ||606 ||10.3 ||1503 ||25.6 ||637 ||10.9 ||663 ||11.3 ||1657 ||28.3 |
|Hu's SSP||1268 ||21.6 ||502 ||8.6 ||1339 ||22.8 ||989 ||16.9 ||808 ||13.8 ||955 ||16.3 |
|PAM50 SSP||1144 ||19.5 ||828 ||14.1 ||1581 ||27 ||1068 ||18.2 ||728 ||12.4 ||512 ||8.7 |
|RSSPC||703 ||- ||190 ||- ||761 ||- ||190 ||- ||335 ||- ||- ||- |
|SCMOD1||929 ||15.9 ||861 ||14.7 ||1653 ||28.2 ||1499 ||25.6 ||- ||- ||919 ||15.7 |
|SCMOD2||996 ||17 ||1027 ||17.5 ||1588 ||27.1 ||1418 ||24.2 ||- ||- ||832 ||14.2 |
|SCMGENE||2038 ||34.8 ||911 ||15.5 ||1048 ||17.9 ||945 ||16.1 ||- ||- ||919 ||15.7 |
|RSCMC||699 ||- ||373 ||- ||656 ||- ||524 ||- ||- ||- ||- ||- |
|RMSPC||580 ||- ||124 ||- ||324 ||- ||80 ||- ||- ||- ||- ||- |
| MSP: ||Molecular Subtype Predictor (SSPs + SCMs)|
| No: ||number of patients|
| SSP: ||Single Sample Predictor|
| RSSPC: ||Robust SSP Classification based on patients classified in the same subtype with the three SSPs|
| SCM: ||Subtype Clustering Model|
| RSCMC: ||Robust SCM Classification based on patients classified in the same subtype with the three SCMs|
| RMSPC: ||Robust Molecular Subtype Predictors Classification|
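The robust classifications defined above (RSSPC, RSCMC, RMSPC) keep only patients for whom the relevant predictors agree. The following Python sketch shows that intersection logic under stated assumptions: the function `robust_subtype`, the dictionary layout and the example calls are hypothetical, and subtype labels are assumed to have already been harmonised between SSPs and SCMs.

```python
def robust_subtype(calls):
    """Return the shared subtype if all predictors agree, otherwise None.

    A missing or unclassified call is represented by None and never
    counts as agreement.
    """
    unique = set(calls)
    if len(unique) == 1 and None not in unique:
        return next(iter(unique))
    return None


SSPS = ("Sorlie", "Hu", "PAM50")
SCMS = ("SCMOD1", "SCMOD2", "SCMGENE")

# Hypothetical calls for one patient.
patient = {
    "Sorlie": "Basal-like", "Hu": "Basal-like", "PAM50": "Basal-like",
    "SCMOD1": "Basal-like", "SCMOD2": "Basal-like", "SCMGENE": "HER2-E",
}

rsspc = robust_subtype([patient[k] for k in SSPS])         # three SSPs agree
rscmc = robust_subtype([patient[k] for k in SCMS])         # SCMs disagree
rmspc = robust_subtype([patient[k] for k in SSPS + SCMS])  # all six predictors
print(rsspc, rscmc, rmspc)
```

A patient returned as None is simply excluded from the corresponding robust population.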
Statistical analyses:
Several types of analyses are available: prognostic analyses, correlation analyses and expression analyses,
each of which comes in several variants.
Targeted expression analysis:
Once the analysis criteria have been chosen (gene(s) to be tested, clinical criterion (criteria) to test the gene against),
the distribution of the gene in the available population (all cohorts with the required information available, pooled together)
according to the clinical criterion (criteria) is illustrated by box and whiskers plots.
To assess the significance of the difference in gene distributions between the different groups, Welch's test is performed,
as well as Dunnett-Tukey-Kramer's tests when appropriate.
Exhaustive expression analysis:
Box and whiskers plots
are displayed, along with Welch's (and Dunnett-Tukey-Kramer's) tests,
for every possible clinical criterion for a unique gene.
Customised expression analysis:
As in the targeted analysis, the distribution of a chosen gene is compared between groups, but here the groups are defined based on another gene:
the population (all cohorts with both gene values available, pooled together) is split according to the median of the latter gene, resulting in 2 groups.
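The median split and the two-group comparison can be sketched as follows in Python. This is an illustration only: the function names, the handling of values equal to the median (placed in the low group here) and the example data are assumptions, and for two groups Welch's test is taken to be Welch's unequal-variance t-test.

```python
import math
import statistics


def split_by_median(splitting_gene, tested_gene):
    """Split tested_gene values into low/high groups by the median of splitting_gene."""
    cutoff = statistics.median(splitting_gene)
    low = [y for x, y in zip(splitting_gene, tested_gene) if x <= cutoff]
    high = [y for x, y in zip(splitting_gene, tested_gene) if x > cutoff]
    return low, high


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va = statistics.variance(group_a) / na  # squared standard error, group A
    vb = statistics.variance(group_b) / nb  # squared standard error, group B
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical values: gene_b defines the groups, gene_a is compared between them.
gene_b = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1]
gene_a = [1.0, 1.2, 0.8, 2.0, 2.4, 2.2]

low, high = split_by_median(gene_b, gene_a)
t, df = welch_t(low, high)
print(low, high, round(t, 3), round(df, 3))
```

The p-value would then be read from a Student distribution with `df` degrees of freedom.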
Targeted prognostic analysis:
Once the analysis criteria have been chosen (gene(s) to be tested,
nodal and oestrogen receptor status of the cohorts to be explored, and event),
several statistical tests are conducted on each cohort and on all cohorts pooled together.
The prognostic impact of each gene is evaluated by means of a univariate Cox proportional hazards model.
Results are displayed by cohort (including the pool) and are illustrated in a forest plot.
Kaplan-Meier curves are then computed on the pool, with the gene values
dichotomised according to the gene median (calculated from the pool).
Cox results corresponding to the dichotomised values are
displayed on the curve. In order to minimise unreliability
at the end of the curve, the 15% of patients with the longest follow-up are not plotted (a).
To evaluate the independent prognostic impact of the gene(s) relative to
the well-established clinical markers NPI (b) and AOL (c)
(10-year overall survival) and to the proliferation score (d),
adjusted Cox proportional hazards models are performed on the pool's patients with available data.
Exhaustive prognostic analysis:
A univariate Cox proportional hazards model
is fitted on each of the
18 possible pools corresponding to every combination of population
(nodal and oestrogen receptor status) and event criteria (metastatic relapse [MR], any event [AE]) to assess
the prognostic impact of a unique gene. Results are displayed by
population and event criteria and are ordered by p-value (smallest to largest).
Molecular subtype prognostic analysis:
Patients are pooled according to their molecular subtypes, based on three single sample predictors (SSPs)
and three subtype clustering models (SCMs), and on three supplementary robust molecular subtype classifications
consisting of the intersections of the 3 SSP and/or the 3 SCM classifications:
only patients with concordant molecular subtype assignment for the 3 SSPs (RSSPC),
for the 3 SCMs (RSCMC), or for all predictors (RMSPC) are kept. Univariate Cox proportional hazards analysis
is performed for the chosen gene in each of the molecular subtype populations.
Kaplan-Meier curves are also computed.
Basal-like/TNBC prognostic analysis:
Univariate Cox proportional hazards analyses
are performed, for the chosen gene,
on Basal-like (BL) patients (as defined by PAM50), on Triple-Negative breast cancer (TNBC) patients (as defined by immunohistochemistry [IHC])
and on patients both BL and TNBC. Kaplan-Meier curves
are also computed.
Gene correlation targeted analysis:
Pearson's correlation coefficient is computed, with associated p-value, for each pair of genes based on ten different populations:
all patients pooled together, patients with positive oestrogen receptor status, patients with negative oestrogen receptor status, Basal-like patients,
HER2-E patients, Luminal A patients and Luminal B patients (the last 4 subgroups being determined by the RMSPC),
Basal-like (PAM50) patients, Triple-Negative (IHC) patients and the intersection of the 2 latter populations.
Results are displayed in a correlation map,
where each cell corresponds to a pairwise correlation
and is coloured according to the correlation coefficient value, from dark blue (coefficient = -1) to dark red (coefficient = 1).
Pearson's pairwise correlation plots
are also computed to illustrate each pairwise correlation.
Gene correlation exhaustive analysis:
Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and all other genes that are present in the database,
based on different populations: all patients pooled together, Basal-like patients, HER2-E patients, Luminal A patients and Luminal B patients,
the last 4 subgroups being determined by the RMSPC.
Genes with a correlation above 0.40 in absolute value and an associated p-value less than 0.05 are retained, and the genes with the best correlation coefficients are displayed
in two different tables: one for the first 50 (or fewer) positive correlations, one for the first 50 (or fewer) negative ones.
The lists of all genes fulfilling both criteria (correlation coefficient above 0.40 in absolute value and associated p-value less than 0.05) can be downloaded from the results page.
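The selection rule above (absolute correlation above 0.40, p-value below 0.05, then the 50 strongest positive and negative correlations) can be sketched in Python as follows. The function `select_correlated`, the tuple layout and the gene names are hypothetical and only illustrate the filtering and ordering.

```python
def select_correlated(results, r_min=0.40, p_max=0.05, top=50):
    """Filter (gene, r, p) tuples and return the top positive and negative lists."""
    kept = [row for row in results if abs(row[1]) > r_min and row[2] < p_max]
    # Strongest positive correlations first.
    positive = sorted((row for row in kept if row[1] > 0),
                      key=lambda row: row[1], reverse=True)[:top]
    # Strongest negative correlations (most negative r) first.
    negative = sorted((row for row in kept if row[1] < 0),
                      key=lambda row: row[1])[:top]
    return positive, negative


# Hypothetical correlation results against a chosen gene: (gene, r, p-value).
results = [
    ("GENE_A", 0.72, 1e-10),
    ("GENE_B", 0.41, 0.20),    # fails the p-value criterion
    ("GENE_C", -0.55, 1e-6),
    ("GENE_D", 0.30, 1e-4),    # fails the correlation criterion
    ("GENE_E", 0.45, 0.001),
    ("GENE_F", -0.81, 1e-12),
]

positive, negative = select_correlated(results)
print([g for g, _, _ in positive], [g for g, _, _ in negative])
```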
Gene Ontology analysis:
As a complement to this "screening" analysis, a Gene Ontology analysis is performed.
This analysis focuses on significantly under- or over-represented terms present in the list of genes most positively correlated with the chosen gene (including the chosen gene itself),
in the list of genes most negatively correlated with the chosen gene, and in the union of these two lists.
For each term of each of the Gene Ontology trees (biological process, molecular function and cellular component), comparison is done between
the number of occurrences of this term in the "target list", i.e. the number of times this term is directly linked to a gene,
and the number of occurrences of this term in the "gene universe" (all of the genes that are expressed in the database) by means of Fisher's exact test.
Terms with associated p-values less than 0.01 are kept.
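To illustrate the comparison of term occurrences between the target list and the gene universe, here is a small standard-library Python sketch of a two-sided Fisher's exact test on a 2x2 table. The function name, the table layout (target list versus the rest of the universe) and the counts are assumptions; the source does not specify how the contingency table is built.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one, with the margins held fixed.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts for one Gene Ontology term:
# 4 of 10 target genes carry the term, 5 of the 90 remaining universe genes do.
p_value = fisher_exact_two_sided(4, 6, 5, 85)
keep_term = p_value < 0.01  # threshold quoted in the text
print(p_value, keep_term)
```

Whether the term is over- or under-represented is read from the direction of the difference between the two proportions.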
Gene correlation analysis by chromosomal location:
Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and the genes located around it (up to 15 upstream and 15 downstream) on the same chromosome,
based on seven different populations: all patients pooled together, patients with positive oestrogen receptor status, patients with negative oestrogen receptor status, Basal-like patients,
HER2-E patients, Luminal A patients and Luminal B patients, the last 4 subgroups being determined by the RMSPC.
Detailed results are displayed in a table for each population.
Pearson's pairwise correlation plots
are also drawn to illustrate the correlation of each gene with the chosen one.
Targeted correlation analysis (TCA):
As a complement, results of gene correlation analysis for genes selected via the "TCA" column can be displayed.
Targeted correlation analysis ("TCA" button), which aims at evaluating the robustness of clusters, is proposed:
correlation analyses are automatically computed between all possible pairs of genes that compose a selected cluster.
a Pocock et al.
Lancet. 2002; 359(9318):1686-9
b Galea et al.
Breast Cancer Res Treat. 1982; 45(3):361-6.
c Adjuvant! Online
d Dexter et al.
BMC Syst Biol. 2010; 4:127.
- When working with gene symbols, in case of multiple probesets for
the same gene, the median of the probeset values is taken as the unique value for the gene.
- Cox models performed on pool(s) are stratified by cohort.
- The gene median taken as a cutoff to dichotomise gene expression values and
draw Kaplan-Meier curves on the pool is an arbitrary value and may not be (and in most cases is not)
the best cutoff for the specific gene. Hence, a gene that is significant when considering continuous values
might not remain significant after dichotomisation.
Survival statistical tests
- Aim of the Cox model:
The Cox model is a regression model that expresses the relation between a covariate,
either continuous (e.g. G gene) or ordered discrete (e.g. SBR grade), and the risk
of occurrence of a certain event (e.g. metastatic relapse).
Its simplified formula for G gene can be written as follows:
h(t,g) = h0(t)*exp(ß.g), where h is the hazard function of the event occurrence at time t,
dependent on the value g of G and h0(t) is the positive baseline hazard function,
shared by all patients.
ß is the regression coefficient associated with G, the parameter one wants to evaluate.
- Interpretation of Cox model results:
Two results are of particular interest when building a Cox model: the p-value
associated with ß, which tells us whether the covariate (e.g. gene) has a significant
impact on event-free survival (if the p-value is less than a certain threshold,
usually 5%), and the hazard ratio (HR) (equal to exp(ß)), sometimes summed up by its direction
(the sign of ß).
The HR, which is of interest mainly when the p-value is significant,
is a ratio of the risk of event occurrence between patients with regard
to their relative measurements for the gene under study. To be more specific,
the HR corresponds to the factor by which the risk of occurrence of
the event is multiplied when the risk factor increases by one unit:
h(t,G+1) = h(t,G)*exp(ß).
The direction of this HR (the sign of ß) therefore indicates how the gene will generally affect
the patients' event-free survival.
For example, saying that the parameter ß associated with the gene G under study is negative
(thus exp(ß) < 1) means that the greater the value of G, the lower the risk of event:
if A and B are two patients such that A's G value gA is greater than B's G value gB,
then one can say that patient A has a lower risk of metastatic relapse than patient B:
gA > gB, ß < 0
⇒ ß.gA < ß.gB
⇒ exp(ß.gA) < exp(ß.gB)
⇒ h0(t)*exp(ß.gA) < h0(t)*exp(ß.gB), that is, h(t, gA) < h(t, gB).
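The chain of inequalities above can be checked numerically. In the short Python sketch below, the value of ß and the two gene values are hypothetical; because the baseline hazard h0(t) cancels in the ratio h(t, gA) / h(t, gB) = exp(ß.(gA - gB)), it never needs to be specified.

```python
import math


def hazard_ratio(beta, delta=1.0):
    """Hazard ratio for an increase of `delta` units in the covariate."""
    return math.exp(beta * delta)


beta = -0.5          # hypothetical regression coefficient (negative)
g_a, g_b = 2.0, 1.0  # hypothetical gene values, with gA > gB

hr_per_unit = hazard_ratio(beta)          # exp(ß), below 1 because ß < 0
relative = hazard_ratio(beta, g_a - g_b)  # h(t, gA) / h(t, gB)

print(round(hr_per_unit, 4), relative < 1)
```

With ß negative the ratio is below 1, so patient A (higher gene value) has the lower hazard, as stated in the derivation.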
- The Kaplan-Meier estimator:
The Kaplan-Meier method, also known as the product-limit method, is a non-parametric method
to estimate the survival function S(t) (= Pr(T > t): probability of having a survival
time T longer than time t) of a given population. It is based on the idea that being alive
at time t means being alive just before t and staying alive at t.
Suppose we have a population of n patients, among whom k patients have experienced
an event (metastatic relapse or death, for instance) at distinct times
t1 < t2 < ... < tm
(m=k if all events occurred at different times). For each time ti, let ni denote
the number of patients still at risk just before ti, that is, patients who have not
yet experienced the event and are not censored, and let ei denote the number of
events that occurred at ti. The event-free survival probability at time ti, S(ti),
is then the probability S(ti-1) of not experiencing the event before time ti
(at time ti-1) multiplied by the probability (ni-ei)/ni of not experiencing the event
at time ti (which by definition of ti corresponds to the probability of not experiencing
the event during the interval between ti-1 and ti): S(ti) = S(ti-1) x (ni-ei)/ni.
The Kaplan-Meier estimator of the survival function S(t) is thus the cumulative product:
$$ S(t) = \prod_{t_i \le t} \frac{n_i - e_i}{n_i} $$
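The recursion S(ti) = S(ti-1) x (ni-ei)/ni can be turned into a few lines of Python. This is a minimal sketch, not the tool's code: the function name and the toy data are hypothetical, and it assumes the usual convention that a patient censored at time ti is still counted as at risk for an event occurring at ti.

```python
from itertools import groupby


def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] for each distinct event time t_i.

    times:  follow-up time of each patient
    events: 1 if the event occurred at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    for t, group in groupby(data, key=lambda pair: pair[0]):
        group = list(group)
        n_events = sum(e for _, e in group)
        if n_events > 0:
            survival *= (n_at_risk - n_events) / n_at_risk
            curve.append((t, survival))
        # Patients with an event or censored at t leave the risk set.
        n_at_risk -= len(group)
    return curve


# Hypothetical follow-up (in years) for six patients; 0 marks censoring.
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Each tuple is one step of the staircase: the survival estimate stays constant between two consecutive event times.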
- The curve:
The Kaplan-Meier survival curve, i.e. the plot of the survival function, makes it possible to
visualise the evolution of the survival function (estimate). The curve is shaped like
a staircase, with a step corresponding to events at the end of each [ti-1, ti) interval.
The illustration of the Kaplan-Meier survival estimator by the Kaplan-Meier survival
curve becomes especially interesting when there are different groups of patients
(e.g. according to different treatments or different values of biological markers)
and one wants to compare their relative event-free survival. The different survival
curves are then plotted together and can be visually compared.
- Reliability of the estimation:
Caution must be taken concerning the interpretation of the survival curve,
especially at the end of the survival curve: the censored patients induce a loss
of information and reduce the sample size, making the survival curve less reliable;
the end of the curve is obviously particularly affected. For our analyses, in order
to minimise unreliability at the end of the curve, the 15% of patients with
the longest event-free survival or follow-up are not plotted (a).
A forest plot is a graphical means to view results, i.e. a score (odds or hazard ratio)
and a confidence interval (CI), of the same analysis applied to different populations (studies).
In particular, it makes it possible, via Cox HRs, to survey the impact of a gene
on survival in different cohorts all at once, and thus to get a better (visual) idea of
how the results vary between studies.
A forest plot is organised as follows: for each study, the score (e.g. HR) is represented
by a square centred on the value of the score (HR) and whose size depends on the precision
of the score estimation (the more precise the estimation, the bigger the square).
A horizontal line passing through the square represents the (usually 95%) CI.
At the bottom of the forest plot are represented the score (HR) and CI obtained by
the pool (i.e. all cohorts pooled) in the shape of a diamond with the centre
representing the score (HR) and the right and left ends representing the CI limits.
Finally, a vertical line representing a no effect score (HR=1) is drawn.
a Pocock et al.
Lancet. 2002; 359(9318):1686-9
Gene expression correlations
- The coefficient:
The Pearson correlation coefficient, also known as Pearson's product-moment correlation coefficient and denoted by r, measures the linear dependence (correlation)
between two variables (e.g. genes).
It is obtained by the formula r = cov(G1,G2) / (std(G1)*std(G2)),
where cov(G1,G2) is the covariance between the variables G1 and G2 and std denotes the standard deviation of each variable.
r values can vary from -1 to 1. A negative r means that when the first variable increases, the second one decreases;
a positive r means that both variables increase or decrease together.
The greater the r in absolute value, the stronger the linear dependence between the two variables, with the extreme values of -1 or 1 meaning a perfect linear dependence
between the two variables, in which case, if the two variables are plotted, all data points lie on a line.
- The associated p-value:
Along with the Pearson correlation coefficient, one can test whether this coefficient is different from 0, knowing that, under the hypothesis of no correlation, the statistic
t = r*√(n-2)/√(1-r²) follows a Student distribution with (n-2) degrees of freedom, n being the number of values.
The p-value associated with the Pearson correlation coefficient thus indicates whether a linear dependence exists between the two variables.
Note that one has to be careful when interpreting the p-value associated with a Pearson correlation coefficient: a significant p-value means that a linear dependence
exists between two variables but does not mean that this linear dependence is strong; for example, a coefficient of 0.05 with 1600 data points is associated
with a significant p-value (p = 0.046), but one can certainly not conclude that there is a strong linear dependence between the two variables.
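The coefficient, its t statistic and the associated two-sided p-value can be reproduced with the Python standard library alone. The sketch below is illustrative: the function names are hypothetical, and the p-value is obtained by numerically integrating the Student density with Simpson's rule rather than by calling a statistics package.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient r = cov(x, y) / (std(x) * std(y))."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value from Simpson integration of the Student density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


# The example from the text: r = 0.05 with n = 1600 data points.
t = t_statistic(0.05, 1600)
p = two_sided_p(t, 1600 - 2)
print(round(t, 3), round(p, 3))  # p is about 0.046, as quoted above
```

The result is statistically significant at the 5% level even though r squared is only 0.0025, which is the point made in the paragraph above.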
A correlation map illustrates pairwise correlations among a given group of genes.
A correlation map is a square table where each row and each column represent a gene.
Each cell represents an "interaction" between two genes and is coloured according to the value of the Pearson correlation coefficient between these two genes,
from dark blue (coefficient = -1) to dark red (coefficient = 1).
Cells on the diagonal of the correlation map represent the "interaction" of a gene with itself and are coloured in black.
Pairwise correlation plot
On a correlation plot, the least-squares regression line is plotted along with the data points to illustrate the correlation between two given genes.
Gene expression analyses
Box and whiskers plots
Box and whiskers plots graphically represent descriptive statistics of a continuous variable (e.g. a gene):
the box goes from the lower quartile (Q1) to the upper quartile (Q3), with a horizontal line marking the median.
Below and above the box, whiskers extend from Q1 and Q3 respectively to 1.5 times the interquartile range beyond the box,
that is, to Q1-1.5*(Q3-Q1) and Q3+1.5*(Q3-Q1). Finally, stars indicate outliers, if there are any, that is, patients with values below
or above the ends of the whiskers.
Box and whiskers plots make it possible to visually compare the distributions of a gene among the different population groups.
When there is more than one group, Welch's test is used to evaluate the difference in the gene's expression between the groups.
Moreover, when there are at least three different groups and Welch's p-value is significant (indicating that the gene's expression
differs between at least two subpopulations), the Dunnett-Tukey-Kramer test is used for pairwise comparisons
(this test gives the significance level but not a precise p-value).
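The quantities drawn on a box and whiskers plot can be computed as in the following Python sketch. The function `box_stats` and the example values are hypothetical, and the quartile method (the "inclusive" method of the standard library) is an assumption, since quartile definitions differ between software packages.

```python
import statistics


def box_stats(values):
    """Quartiles, whisker limits Q1 - 1.5*IQR and Q3 + 1.5*IQR, and outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}


# Hypothetical expression values for one group of patients.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
stats = box_stats(values)
print(stats)
```

In this example the value 30 lies above Q3 + 1.5*(Q3-Q1) and would be drawn as a star.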
Biological validation
The complexity of the bioinformatics process may distort genomic data and, downstream,
statistics applied to these data and meta-data may lead to erroneous results.
That is why biological validation of our tool is needed.
Since 2010, we have been screening breast cancer markers (RNA and protein) referenced in the literature
(keywords: breast cancer marker/biomarker). The significance of these genes is then tested in our tool.
These tests showed that bc-GenExMiner captures the biological meaning contained
in annotated genomic data and preserves it from bioinformatics biases,
even when data are merged into new cohorts, and that its results are relevant.
The following tables display concordant conclusions about the significance
of recently published candidate markers in breast cancer.
1) Prognostic module validation
Tested genes for biological validation of prognostic module
1-2) By molecular subtype
Tested genes for biological validation of prognostic module by molecular subtype
2) Correlation module validation
Tested genes for biological validation of targeted correlation analysis
|1 ||ESR1; GATA3; FOXA1; XBP1||Lacroix M et al.||2004 ||Mol Cell Endocrinol|
|2 ||FBP1; ESR1||Dong C et al.||2013 ||Cancer Cell|
|3 ||MKI67; AURKA; UBE2C||Wirapati P et al.|
Jézéquel P et al.
Loussouarn D et al.
|Breast Cancer Res|
Breast Cancer Res Treat
Br J Cancer
|4 ||PIP (GCDFP15);AR||Darb-Esfahani S et al.||2014 ||BMC Cancer|
|5 ||TNFAIP1; POLDIP2|
|Grinchuk OV et al.||2010 ||BMC Genomics|
2-2) Exhaustive and Gene ontology analysis
Tested genes for biological validation of exhaustive correlation and gene ontology analyses
2-3) By chromosomal location
Tested genes for biological validation of correlation analysis by chromosomal location
|1 ||ESR1; C6orf97; C6orf211; RMND1||Dunbier AK et al.||2011 ||PLoS Genet|
|2 ||LSM1; BAG4; DDHD2; PPAPDC1B; WHSC1L1||Bernard-Pierrot I et al.|
André F et al.
Clin Cancer Res
|3 ||Numerous genes||Buness A et al.|
Jézéquel P et al.
|4 ||TRAF4; MED24; GGA3||Bergamaschi A et al.|
Buness A et al.
Hu X et al.
|Genes Chromosomes Cancer|
Mol Cancer Res
3) Expression map module validation
By molecular subtype
Tested genes for biological validation of expression map analysis
|1 ||CALB2||Taliano RJ et al.||2013 ||Hum Pathol|
|2 ||CDH3||Liu N et al.||2012 ||Med Oncol|
|3 ||CDH3||Tsang JYS et al.||2013 ||Hum Pathol|
|4 ||CEACAM6||Tsang JYS et al.||2013 ||Breast Cancer Res Treat|
|5 ||CEACAM6||Balk-Møller E et al.||2014 ||Am J Pathol|
|6 ||CKAP2||Kim HS et al.||2014 ||PLoS One|
|7 ||CRYAB||Malin D et al.||2013 ||Clin Cancer Res|
|8 ||CRYAB||Koletsa T et al.||2014 ||BMC Clin Pathol|
|9 ||CXCR4||Zhang M et al.||2012 ||Ultrastruct Pathol|
|10 ||DACH1||Powe DG et al.||2014 ||PLoS One|
|11 ||ERCC1||Gerhard R et al.||2013 ||Pathol Res Pract|
|12 ||FBP1||Dong C et al.||2013 ||Cancer Cell|
|13 ||FEN1||Abdel-Fatah TMA et al.||2014 ||Mol Oncol|
|14 ||FOXC1||Ray PS et al.||2011 ||Ann Surg Oncol|
|15 ||FSCN1||Esnakula AK et al.||2013 ||J Clin Pathol|
|16 ||FZD7||Yang L et al.||2011 ||Oncogene|
|17 ||LDHB||McCleland ML et al.||2012 ||Cancer Res|
|18 ||LRP6||Yang L et al.||2011 ||Oncogene|
|19 ||MED1; STARD3; TCAP; PNMT; PGAP3; C17orf37; ORMDL3; PSMD3; NR1D1||Kauraniemi P et al.||2006 ||Endocr Relat Cancer|
|20 ||MET; ETS1; KRT6A; KRT6B; ANXA8; MMP9||Charafe-Jauffret E et al.||2006 ||Oncogene|
|21 ||MKI67; AURKA; UBE2C||Wirapati P et al.|
Jézéquel P et al.
Loussouarn D et al.
|Breast Cancer Res|
Breast Cancer Res Treat
Br J Cancer
|22 ||PI3||Labidi-Galy SI et al.||2014 ||Oncogene|
|23 ||PIP (GCDFP15)||Darb-Esfahani S et al.||2014 ||BMC Cancer|
|24 ||PRLR; KRT19||Charafe-Jauffret E et al.||2006 ||Oncogene|
|25 ||SDC1||Nguyen TL et al.||2013 ||Am J Clin Pathol|
|26 ||SFRP1||Jeong YJ et al.||2013 ||Oncol Rep|
|27 ||SOX10||Cimino-Mathews A et al.||2012 ||Hum Pathol|
|28 ||SPDEF||Buchwalter G et al.||2013 ||Cancer Cell|
|29 ||TCF7||Yang L et al.||2011 ||Oncogene|
|30 ||VIM||Tsang JYS et al.||2013 ||Hum Pathol|