3  Validating the molecular landscape

In the last chapters only TCGA and METABRIC samples were used in the training and testing of the PCA projection. To completely validate, we use another big cohort, SCANB, that was not used in the PCA fitting and testing before. This way we ensure that the embedding is independent of the cohort. Moreover, we use the SMC cohort (Kan et al. 2018), whose patients are from South Korea. We show how well it is embedded, and that is not dependent on population characteristics.

* The library is already synchronized with the lockfile.
    ../../results/plots/validation ../../results/rds_files/validation 
                             FALSE                              FALSE 
   ../../results/tables/validation 
                             FALSE 

3.1 SCANB embedding

The embedding was already calculate before when calculating the fit and tests, since the datasets were organized and loaded together. We also know already that all the 1044 genes are available in the SCANB cohort. SCANB is well mixed with all cohorts as Figure 3.1 shows.

Figure 3.1: PCA embedding colored by cohort. All samples from TCGA, METABRIC and SCANB were used.

Ideally there would be some overlap between SCANB and TCGA already in the first component. Moreover, if there isn’t, the cohort should be closer to TCGA than to METABRIC, due to technologies and normalization procedures, as both of them are log FPKM. Even though SCANB was processed in a way to get directly FPKM values, by using cufflinks, TCGA was normalized to get log FPKM as well, as discussed previously. Figure 6.3 shows how close SCANB is from TCGA, moreover there is already some overlap. The variability in the first component is bigger for METABRIC compared to TCGA and SCANB. Probably this is due to the technology processing, since they have genes in the same scale. Figure 2.33 contrasts the different normalization procedures. If genes are not scaled correctly, the PCA shows an inverse picture.

Figure 3.2: PCA embedding colored by cohort. The components used are the first and second components.

Figure 3.3 shows that the SCANB samples are also well mixed regarding the clinical factors, including the \(SET_{ER/PR}\) signature.

Figure 3.3: PCA embedding of all samples from TCGA, SCANB and METABRIC. (A) Colored by cohort, (B) colored by ER status, (C) colored by PAM50 molecular subtype, (D) colored by the \(SET_{ER/PR}\) signature and only SCANB samples.

3.2 Missing genes

Genes missing in publicly available datasets is very common. Usually this is due to processing pipelines or even data quality, therefore genes are removed. This should not be much of a problem, if the data is good enough, when calculating the qPCR-like normalization as it was shown before. The problem arises when multiplying the loadings obtained in the PCA with the normalized expression of the sample. If several genes are missing, 0s are added to the matrix and therefore the loadings for these genes cannot be used. Since 1000 genes were used, we investigate the loadings of genes and try to calculate a fuzziness score, indicating when the embedding should be trusted if genes are missing, and which genes are missing.

3.2.1 Fuzziness score

The idea of the fuzziness score is that we use the loadings of the missing genes and calculate the cumulative sum of their absolute values. The higher this value the more it indicates the missing genes are important. We start by removing random genes from random samples of the SCANB cohort. Each sample will have a random number of genes removed from the set of 1044 genes. Then two scores are calculated, one for PC3 and another for PC4. We also calculate the fuzziness score for the top loading genes and the low loadings genes, to compare these values.

We start by comparing the scores from top loadings and low loading genes. For this we remove the top 50 loadings and bottom 50 loadings, in terms of absolute values. Figure 3.4 shows the cumulative sum of the absolute values from the top and bottom loadings respectively. See how important the top loadings are compared to the bottom loadings. A way of thinking of this is that the top loadings weights are much more important when calculating the embedding.

Figure 3.4: Cumulative sum of absolute values for the top loadings and bottom loadings as a function of the number of selected genes.

This was using the top loadings, so probably the fuzziness score of a sample with relatively good data will be between 0 and 12 maximum, meaning that they have less than 200 genes missing. When genes are missing, we hope that they are in the bottom part of the loadings, so the fuzziness score is as small as possible.

Suppose now that the genes missing in a sample are random. We will calculate random fuzziness scores based on this to have a feeling of the scores.

Figure 3.5: Distribution of the fuzziness scores from random missing genes. Each color corresponds to a different distribution using a different number of genes.

Figure 3.5 shows the results as a function of the number of missing genes. We see that in average, if 200 genes are missing, the fuzziness score is around 5. If less than a 100 genes are missing, the fuzziness score is around 3. In the next section we show how this influences the position where the sample is projected into the embedding.

Finally, we calculate the fuzziness score as a number of top loadings missing. Suppose 150 genes are missing. We will change the proportion of genes that are in the top loadings and in the missing list. Since the cumulative sum for PC2 and PC3 are very similar, we focus only on PC2.

Figure 3.6: Distribution of the fuzziness scores from 150 random missing genes with different proportions of top loading genes in the list. Each color corresponds to a different distribution using a different proportion of genes.

Figure 3.6 shows the fuzziness score with different proportions of the top loading genes. We see that they are actually very important when calculating the score. The top loading genes were defined as the top 150 genes in the list. So when calculating the number of missing genes it is important to check the number of top loadings in the list.

3.2.2 Validating the score

In this section we calculate the embeddings with different number of genes missing from a given sample and then plot the original embedding with the newly one. For this we use samples from TCGA. The genes are removed even before the normalization procedure, as they are usually not available there. We select randomly a number of top genes from PC3 and PC4 to compose the list of missing genes. The proportion of the top loading genes is varied.

When looking at the figures above, it seems that whenever there are genes missing, and the more of top loadings they are, the closer they are to the origin. This makes sense, since PCA is doing a simple linear combination of the loadings. This should be fine to correct, since we can fit a line going through the origin and goes through the points above. Then the adjustment will depend on the number of genes that are missing and are top loadings. The more they are, the more adjustment we will need.

For the adjustment we only know two things, its current position and how many of the top loadings are missing. Based on this information we should be able to correct the samples. Given the list of genes that are missing, we fit lines that would go through the genes that are missing in the TCGA and METABRIC samples. We then try to find the line that is closes to the point. After performing the fit and the line, we can then move the point to closer to the METABRIC and TCGA samples. The point should be close to the line and close to points that have the same proportion of top loadings missing.

These plots show how the fuzziness score might be indicative of how good the embedding will be depending on the number of missing genes. The video below shows the images for 20 different patients. Notice how all of them are connected to the origin as well.

3.2.3 Systematic approach

We now use a more systematic approach to see the impact of the missing genes in the position in the embedding. The idea is that we extract some metrics from the TCGA samples as we increase the amount of the most important genes removed from the sample. What we simply do is for each patient we calculate the distance of the new embedding to the original embedding. Since for each proportion we ran the analysis 10 times we get 10 distances. We average over those. In the end we have 20 “distances” for each one of the 200 TCGA samples.

Figure 3.13 shows the relationship between the proportion of missing top loading genes among the 200 genes that were selected out and the average distance to the original embedding. Each dot corresponds to the a sample. In total there are 200 dots in each proportion.

Figure 3.13: Average distance to the original embedding

We see that the more top loading genes that are missing the worse the embedding gets. Still the average distance is a bit difficult to interpret, so next we show the relationship between the variance and the proportion of missing genes.

Figure 3.14

Indeed as the proportion of the most important genes goes missing, there is a higher variance in the average distance.

Next we evaluate the number of samples in the neighborhood of samples, so we can understand a bit better what is distance in the molecular embedder.

Figure 3.15 shows the proportion of samples in a neighborhood based on its radius. For samples within a radius of 1, in average the percentage of samples from the whole dataset including TCGA and METABRIC is of 2.5%.

Figure 3.15: Percentage of number of samples in the neighborhood with varyiing radii. The total number of samples is 2873, including all samples from TCGA and METABRIC.

This means that around 30% of top missing genes gives us neighborhoods that do not change much, since the average distance from the original embedding is below 1, so the neighborhood will not change much.

3.3 SMC embedding

By following the same procedures as previously, normalizing and getting the PC coordinates, Figure 3.16 shows the biplot of PC1 and PC2 when coloring by cohorts. The samples seem to be well mixed already in the RNA-seq samples. This is probably due to the fact that the SMC cohort has TPM values.

When checking the total number of genes missing, there are only 4 genes missing. And as seen before, this number is almost irrelevant, in the sense that the embedding will still be robust and locations will be precise enough.

Figure 3.16: Biplot of first two PCA components from SMC, TCGA, SCANB and METABRIC. SMC is highlighted in the plot and has a bigger dot.

As previously, Figure 3.17 shows the biplot of PC3 and PC4, highlighting the well mixing of the cohorts. When coloring by ER status, samples are in the right group. Molecular subtype is also correct. This cohort is special in the sense that there are more luminal B patients, as discussed in the paper. That is why there are more samples in the luminal B region.

Figure 3.17: PCA embedding of all samples from TCGA, SCANB and METABRIC including the SMC samples on top. (A) Colored by cohort, (B) colored by ER status, (C) colored by PAM50 molecular subtype.

3.4 PDX

Another way to validate the results is to use patient derived xenografts. In the Brisken lab the MIND model is used (Sflomos et al. 2016; Scabia et al. 2022). ER+ breast cancer samples coming from patients are processed in a specific way and injected into the mammary glands of mice. Based on that they can stablish themselves and proliferate. The environment should recapitulate what is seen in women as well, in the sense that estrogen (E2) levels are similar to those of post-menopausal women. Since these cells have to proliferate in order to establish and grow in the ducts, we expect them to be more proliferative in terms of RNA-seq components and also to lie in the luminal B region of the molecular landscape. We validate this now.

When checking the total number of genes missing, there are only 27 genes missing. And as seen before, this number is almost irrelevant, in the sense that the embedding will still be robust and locations will be precise enough.

Figure 3.18: Biplot of first two PCA components from PDX, TCGA, SCANB and METABRIC. PDX is highlighted in the plot and has bigger dot size.

As previously, Figure 3.19 shows the biplot of PC3 and PC4, highlighting the well mixing of the cohorts. As expected the samples are overall in the luminal B region, since these are tumor cells in a proliferative state.

Figure 3.19: Embedding of all samples from TCGA and METABRIC including the PDX samples on top.
Figure 3.20: Embedding of only PDX control samples on top of the METABRIC, SCANB and TCGA samples

And we check the PDXs response to estrogen and see how much they move away from the untreated sample.

Figure 3.21: Comparison of position in the molecular landscape between CTRL and P4 treated samples

It looks like some of the PDXs had a bigger shift towards the left when treated with estrogen or at least some stabilization. On the other hand, another set of PDXs, the last three, they didn’t see much change.

We now include as well the embedding for the P4 treated samples as well.

Figure 3.22: Embedding of only PDX control and P4 samples

We see that T110 has a change in position as well, it goes towards the center meaning that there is a decrease in proliferation and also ER signaling. T105 has a very strong effect, it goes up on the diagonal, meaning higher proliferation and lower ER signaling.

Figure 3.23: Comparison of position in the molecular landscape between CTRL, E2 and P4 treated samples

What we can see from these scores is that if there is an increase in G2M checkpoint and the control samples have scores below 0, then the fourth principal component will change it is value, it will go upwards and to the left in the molecular landscape, reflecting the biological response of the tumors to the given hormones.

Before we move on we also do the same using facet_grid instead of facet_wrap.

And the next two tables show the results for the differential expression analysis on the pathway level for each PDX individually. The estimate, i.e. difference, and the p-values are presented.

First for G2M:

Then to Estrogen Early:

Next we add these p-values to the plot.

Figure 3.24

We now further validate the results with the stainings performed on those samples. We will compare the differences with the stainings and the number of PR+ tumors cells.

Figure 3.25: Quantification of PR+ cells based on DAB stainings performed on the patient derived xenografts.

We see that T111 has a big increase in absolute values. T113 as well has a big increase, but has a baseline level already low and it does not go to higher values. Probably the reason there is not much difference in the fourth component of the PDX T113 is because of the low Ki67 index.

Based on the data shown here we see that there is an increase in proliferation upon P4 treatment for T111 and T113. Moreover, P4 reduced estrogen signaling and also change the third component of the three PDXs to a smaller value, towards a more basal like region. Overall, the molecular landscape is able to capture the differences between the treatments.

Lastly we check changes in the embedding upon E2 treatment.

Figure 3.26: Embedding of only PDX control and E2 treated samples

It looks like there is not big changes in the molecular landscape as there was for samples treated with progesterone.

And lastly we show for Ki67 as well.

Figure 3.27: Quantification of Ki67+ cells based on DAB stainings performed on the patient derived xenografts.

And below we plot both results together with the respective p-values.

Figure 3.28

3.4.1 Growth curves

Figure 3.29: Estimates of the 6 PDXs together. Comparison between P4 vs CTRL.
Figure 3.30: Average growth curves for each PDX individually with their estimates. The credible intervals corresponds to the 25th and 50th CIs.

3.5 Normal samples

We now further validate the molecular landscape on the normal samples that were sequenced using the same pipeline as the SCANB data. These are samples obtained from reduction mammoplasty of swiss women and were sent to Sweden for the tissue processing and sequecing. The hypotheses here is that the samples will be embedded in the region of the normal like BC samples.

When checking the total number of genes missing, there are only 0 genes missing, i.e., no genes were missing. This makes sense as these samples were processed in the same way as the ones from SCANB.

Figure 3.31: Biplot of first two PCA components from Normal Swiss Cohort, TCGA, SCANB and METABRIC. The normal cohort is highlighted in the plot and has bigger dot size.

As previously, Figure 3.32 shows the biplot of PC3 and PC4, highlighting the well mixing of the cohorts. As expected the samples are overall in the luminal B region, since these are tumor cells in a proliferative state.

Figure 3.32: Embedding of all samples from TCGA, SCNAB and METABRIC including the Normal samples on top.
Figure 3.33: Embedding of Normal Swiss cohort with the other cohorts colored by the PAM50.

And since we finished getting all the plots for the second figure in the paper, we combine it below.

Figure 3.34: All figures combined for Fig2 in the paper.

3.6 Third component as a prognostic measure?

We have seen that estrogen signaling is a continuous measure across the third component. The farthest to the right you are the higher the ER signaling scores are also. Here we hypothesized then that the third component is correlated to the prognostic nature of the estrogen signaling score. For this we select only samples with PC3 higher than 0, as these would be the ones that are from luminal A and B subtypes and we use PC3 as a score in the survival analysis.

cohort number_patients nb_events
SCANB 2777 456
METABRIC 724 445
SCANB High ER 2253 347

First we check the correlation of PC3 with estrogen response early and G2M checkpoint.

There is indeed the correlation with the components but still a lot of variability. So it will be unlikely to see such strong effects on the third component.

The table below shows the results for the survival analysis using all the samples from SCANB and METABRIC coming from patients that received only endocrine therapy and whose samples had a \(PC3 > 0\). OS means overall survival and RFS means recurrence free survival.

We see that in all cases the HR for PC3 is below 1, for OS the values are 0.98 and 0.97 and the HR and for RFS from 0.80 to 0.93. What is interesting to note here is that in all cases the confidence interval is crossing 1, as expected due to the variability of the correlation with SET ER/PR and estrogen response early. Remember that the hazard ratio scale is not linear. Indeed it seems that the more a sample is on the right side of the plot the better is its outcome. Notice that the scale for HR is different for PC3 and SET ER/PR, since one is a positive score ranging from 0 to 8 and the other is a score ranging from -1 to 1. If we were to rescale SET ER/PR to an interval between 0 to 10 by multiplying by 5 and then summing 5, we would get hazard ratios in the range of 0.8. Moreover, estrogen signaling is not the only thing fully explaining the third component as it can be seen from Figure 2.35. In this case, G2M Checkpoint and PI3K signaling have also a role in the third component.

The following figures correspond to the overall survival analysis and recurrence free survival analysis from the SCANB and METABRIC.

Figure 3.35: Overall and recurrence free survival analysis of the principal components PC3 and PC4. The rows correspond to a different PC and cohort and if the case sub cohort.

3.7 High risk patients

The Prosigna risk of recurrence score takes into account the tumor stage and the molecular subtype, this is the ROR-C and the formula from the original paper is presented below:

\[ ROR-C = 0.05\times basal + 0.11 \times HER2 - 0.23 \times LumA + 0.09 \times LumB + 0.17 \times Tumor size \]

We see that if a tumor sample has a high correlation to the luminal A subtype it will decrease the score, whereas for all other cases there is an increase in the risk. We want to evaluate if the position in the molecular landscape is somehow correlated with the risk score, specially in the forth component for luminal A patients.

In (Staaf et al. 2022) they showed that they are able to predict the Prosigna binary categories high or low/intermediate risk for the patients based only on the RNA-seq provided by their pipeline. We use these categories here to understand the molecular scores of the samples and also their correlation with the position and risk groups. All the following analysis are done for ER+ BC patients only.

First we stratify the risk groups by the molecular subtype.

As expected the majority of luminal A patients are low/intermediate risk, on the other hand most of the luminal B patients are considered of high risk.

If we look more specifically for luminal A patients and stratify on the tumor stage we get the following numbers.

We see that for tumor stage 1 the number of high individuals is lower in percentage than in tumor stage 3, as expected. We now plot (Figure 3.36) for all these patients the G2M molecular score, an index of proliferation, stratified by the tumor stage.

Figure 3.36: Luminal A BC patients G2M scores stratified by tumor stage.

We see that in all three cases in average the G2M scores are higher in the high risk groups than in the Low/Intermediate.

We saw previously that some of the G2M checkpoint genes have higher loadings for the fourth component. We now plot the PC4 components stratified by the risk groups for the luminal A patients (Figure 3.37).

Figure 3.37: Luminal A BC patients PC4 values stratified by risk category.

Indeed, in average the higher the PC4, the more likely it is to be in the High risk group. And now we compare the G2M Checkpoint, Estrogen response early and random 200 signature for each molecular subtype separately (Figure 3.38) of ER+ BC patients.

Figure 3.38: ER+ BC patients and molecular scores stratified by molecular subtype and binary risk categories.

We see that in this case the binary risk category has distinct distributions for the G2M checkpoint for all molecular subtypes, in particular with the basal like.