8  Revision

* The library is already synchronized with the lockfile.
    ../../results/plots/revision ../../results/rds_files/revision 
                           FALSE                            FALSE 
   ../../results/tables/revision 
                           FALSE 

8.1 Comparing to combat’s performance

We know that the first two components of PCA correspond to cohort and platform specific effects. We can apply combat from the sva package on the unnormalized data.

We now check what is the relationship between the PC1 and PC2 from combat to the actual PCs from EMBER.

And the spearman correlation is for PC3 and PC1:

Figure 8.1

In a nutshell, the EMBER position corresponds to what is obtained by combat, with the advantage that we can actually use the embedding for new samples. Combat is restricted to the samples inputted.

8.2 Autoencoders


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmptVtdKF/filedbe81b4b82f9/h2o_ronchi_started_from_r.out
    /tmp/RtmptVtdKF/filedbe814b09f213/h2o_ronchi_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 327 milliseconds 
    H2O cluster timezone:       Europe/Zurich 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.44.0.3 
    H2O cluster version age:    6 months and 1 day 
    H2O cluster name:           H2O_started_from_R_ronchi_hva325 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   20.00 GB 
    H2O cluster total cores:    32 
    H2O cluster allowed cores:  32 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.2.0 (2022-04-22) 

Features are the columns and the rows are the samples. So for the genomic data we have to transpose it.

Figure 8.2

Just using the autoencoder in the data does not work.

We now use on the normalized data instead on 4 units in the hidden layer.

Figure 8.3

And now coloring by er status. It is not really able to completely distinguish the ER+ and ER- samples.

Figure 8.4

Figure 8.5

We do a pairsplot coloring by cohort to see where the distinction between cohorts is coming from.

Figure 8.6

And now coloring by ER status instead.

Figure 8.7

Perhaps now if we include more layers instead and use 4 components.

In terms of mean squared error all the models are very similar to each other. And based on the previous results it seems that by using

8.3 Mixing analysis

In this section we try to understand what is the minimum number of genes necessary to be able to distinguish ER+ and ER- BC samples. We select the top 10, 25, 50, 100, 200, 500, 750, 1000, 2500, 5000 and all the genes and try to classify the samples bewteen ER+ and ER- based on their position of the third component. This should give an idea of how good the embedding is based on the number of genes.

Figure 8.8

Using less than 25 genes mixes well the cohorts in the first two components. When using more than 50 genes the mixing is already lost in the first two components. So what we actually want is a mixing in just one of the pairs of components, either PC1+PC2 or PC3+PC4 while there is no mixing in the other pair of component. By analysing visually this happens when using at least 1000 genes.

Due to the difference in cohort sizes, METABRIC has 648 samples and TCGA 352, the expected iLISI coefficient is:

[1] 1.84

The expected value for the iLISI coefficient when calculating for ER status is:

[1] 1.59

And the plot below shows the iLISI numbers for each combination of PC1+PC2 and PC3+PC4 annotated by the total number of genes. An integration by ER status means that the combination of components is not capturing the biology.

Figure 8.9

8.4 Stable genes

One of the raised issues is that the average ranking of the stable genes might depend on phenotypes. Here we show that the distribution of average of stable genes is stable across different phenotypes, such as ER status and molecular subtype.

First we plot by ER status in both METABRIC and TCGA.

Figure 8.10

8.5 Scaling

We now show the embedding using no scaling, i.e., just log FPKM and median intensity, using the qPCR like normalization and using a z-normalization on the sample level.

Figure 8.11

And now we compare the iLISI scores of the qPCR-like normalization and z-scaling as well.

Figure 8.12

8.6 Tumor purity: statistical tests

One of the points raised by the reviewers was about the statistics on the differences between the principal components stratified by molecular subtype. We perform ANOVA followed by a Tukey post hoc test.

Figure 8.13

8.7 Metastasis and FFPE data

Accessed 20240425: https://www.cbioportal.org/study/summary?id=brca_mbcproject_2022

In the cBioportal one can check the clinical data directly with all the information. When hovering over the column names it is possible to obtain the column names from the data available when downloaded. For example, the ER percentage is available from the sample in the diagnostics. Ki67, similarly. In general the metastasis do not have ER percentage.

We now try to understand if samples are coming from core biopsy, surgical procedures or from a metastasis setting. To start with that, we look at a single patient ID that has two breast samples.

  BX_TIME_DAYS CALC_TREATMENT_NAIVE    BX_TYPE
1            0                  YES       CORE
2          114                   NO MASTECTOMY

For patient with ID MBCProject_4MF1FlFQ, there are two samples. One at time of diagnosis, core biopsy, and the other at a mastectomy after 114 days of treatment. The column CALC_TREATMENT_NAIVE tells if the sample is treatment naive or not, BX_TYPE corresponds to the type of surgery (PATH Procedure Type column on cBioPortal).

We now look at another patient (MBCProject_5gHasou8):

  BX_TIME_DAYS CALC_TREATMENT_NAIVE    BX_TYPE
1            0                  YES       CORE
2          199                   NO MASTECTOMY

The results are similar for this patient too. We now move on and use EMBER on all available samples and then compare these matched samples.

Total number of stable genes: 44
Total number of genes: 1037
Number of samples: 153

We show all the samples from the project below on top of TCGA, SCAN-B and METABRIC.

Figure 8.14

And now stratified by either local metastasis/primary sample or by distant metastasis.

Figure 8.15

Both combined for the response to reviewers:

8.8 TNBC and ER+

Here we use another cohort with both ER+ and ER- samples coming from FFPE with two different library preparation protocols: Illumina-Access and Nugen-Ovation.

Total number of stable genes: 44
Total number of genes: 1022
Number of samples: 21

We now show the samples highlighted on top of EMBER space with samples from TCGA and METABRIC.

Figure 8.16

8.9 SCAN-B library protocol

SCAN-B has 3 different types of samples based on their library protocol, namely: dUTP, NeoPrep and TruSeq.

Here we show that EMBER is capable of fully removing any effects associated to library protocol.

Figure 8.17

8.10 Scaling before doing PCA

When developing EMBER we did not center or scale the data, as the gene expressions after normalization all had the same range. Here we compare what happens when doing the centering followed by the scaling and then applying PCA.

To understand the effect of centering and scaling the data prior to applying PCA, we compared the loadings of PC1 from EMBER to the average of the genes in the normalized data (a). The correlation between these two variables is -1, showing that PC1 corresponds to centering. When comparing the loadings with PC2 and the standard deviation of genes after centering, the higher loadings in absolute value are correlated with higher standard deviation (b).

Next, we show that PC1 and PC2 are correlated with PC1 obtained after centering and scaling the data (c), showing that EMBER’s PC1 and PC2 is automatically centering and scaling the data at the same time removing batch effects. In order to confirm that no biological information is lost when performing PCA on the uncentered and unscaled data, we compared EMBER’s PC3 and PC4 with PC2 and PC3 from the scaled data respectively (d and e). The correlation is 1 in both cases for both cohorts: METABRIC and TCGA. We show that there is no need to center and scale prior to applying PCA in EMBER’s context.

Figure 8.18