6  Expanding embedding and removing batch effects

So far we have used a limited number of genes for the embedding. Here we perform the embedding with all the genes (~10,000) and then try to regress out unwanted components.


6.1 Embedding with more than 1000 genes

Total number of stable genes: 39
Total number of genes: 8248
Number of samples: 1005
Total number of stable genes: 39
Total number of genes: 8248
Number of samples: 7216
Total number of stable genes: 39
Total number of genes: 8248
Number of samples: 1868
Total number of stable genes: 39
Total number of genes: 8248
Number of samples: 426
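The embedding step above amounts to a centered PCA on the log-expression matrix, keeping the loadings so that new cohorts can be projected onto the same components. A minimal Python sketch of the idea (the actual pipeline is in R; matrix sizes are toy values):

```python
import numpy as np

def pca_embed(X, n_components=3):
    """PCA embedding of a samples-x-genes log-expression matrix.

    Returns scores (samples x components), loadings (genes x
    components) and the per-gene means, so that new samples can be
    projected onto the same components later."""
    mu = X.mean(axis=0)
    Xc = X - mu                              # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T           # genes x components
    scores = Xc @ loadings                   # samples x components
    return scores, loadings, mu

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))              # toy: 100 samples, 500 genes
scores, loadings, mu = pca_embed(X)
```

Projecting another cohort onto the same space is then just `(X_new - mu) @ loadings`, which is what makes the cross-cohort comparisons in the figures below possible.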

And now we can visualize the results.

Figure 6.1: PCA projections colored by different factors and organized by different components. (A) Plot of the first two components colored by cohort. (B) Plot of the first two components colored by ER status. (C) Plot of the second and third components colored by cohort. (D) Plot of the second and third components colored by ER status. (E) Plot of the second and third components colored by PAM50. (F) Plot of the second and third components colored by INTCLUST; only METABRIC has an assigned value for this variable (NAs are TCGA samples).

And now we can visualize the results.

Figure 6.2: PCA projections colored by different factors and organized by different components. (A) Plot of the first two components colored by cohort. (B) Plot of the first two components colored by ER status. (C) Plot of the second and third components colored by cohort. (D) Plot of the second and third components colored by ER status. (E) Plot of the second and third components colored by PAM50. (F) Plot of the second and third components colored by INTCLUST; only METABRIC has an assigned value for this variable (NAs are TCGA samples).

Figure 6.3: PCA embedding colored by cohort. The components used are the first and second components.

Figure 6.4 shows that the SCANB samples are also well mixed regarding the clinical factors, including the \(SET_{ER/PR}\) signature.

Figure 6.4: PCA embedding of all samples from TCGA, SCANB and METABRIC. (A) Colored by cohort, (B) colored by ER status, (C) colored by PAM50 molecular subtype, (D) colored by the \(SET_{ER/PR}\) signature and only SCANB samples.

Let us also check the other components in the biplots.

We see that the PAM50 subtypes are separated by components 3, 4 and also 8. In component 8 there is some expression that is able to differentiate HER2-like samples from all other samples.

We now plot by cohort.

In PCs 1, 2 and 5 there are small differences between the distributions of the different datasets. We will therefore remove these three components.

6.2 Removing the batch effects

We now try to remove the batch effects by removing the PCs that are not of interest: the first two components and the fifth. We then calculate the scores for each cohort separately and compare them to what was calculated before. We can then try to calculate the scores using samples coming from all cohorts.
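Regressing out selected PCs can be done by subtracting their rank-1 reconstructions from the centered data. A minimal Python sketch (the actual analysis is in R; the zero-based indices 0, 1 and 4 correspond to PCs 1, 2 and 5):

```python
import numpy as np

def remove_pcs(X, pcs_to_remove=(0, 1, 4)):
    """Regress out selected principal components from a
    samples-x-genes matrix by subtracting their reconstruction."""
    mu = X.mean(axis=0)
    Xc = X - mu
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    for k in pcs_to_remove:
        # subtract the rank-1 contribution of component k
        Xc -= np.outer(U[:, k] * S[k], Vt[k])
    return Xc + mu

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200))    # toy: 50 samples, 200 genes
X_reg = remove_pcs(X)
```

After this step the regressed matrix has zero projection onto the removed components, which is what makes the cohorts overlap in the embeddings below.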

In general it looks good, though the luminal B population is slightly compressed towards the luminal A population.

There is a very good merging of the datasets. We see mostly green points because they are the majority.

Now we see that the first loading has a tail, in a similar fashion to the other loadings.

We now calculate the scores of the samples using these new embeddings on TCGA samples only, as we will compare the results later on.

And we can now check and compare the scores.

We see that for all three pathways the correlation is positive and strong. It is a bit noisy, however, so the correlation is not perfect. Interestingly, the scores usually show less variation at the extremes.

We now calculate the scores from SCANB and compare the results. Remember that SCANB was not used for the PCA training.

And we can now check and compare the scores.

The same phenomenon occurs: samples at the edges have lower variability, while samples in the middle have higher variability. For example, several samples with a SET_ERPR score close to 0.25 in the regressed data have negative scores in the original data.

What happens now if we use another dataset and only a small number of samples to calculate the score? Let us include SCANB samples in the pipeline: we select 50 SCANB samples and 100 random samples from TCGA and METABRIC, combine them all together, and calculate the GSVA scores.

We now plot only the scores from the SCANB samples.

So it seems there are still dataset-related problems when calculating the scores. The scores are positively correlated, but they are not quite the same as when using only the SCANB dataset. Moreover, for G2M checkpoint the scores from the regressed data are concentrated around -0.4, whereas in the unregressed data they range from -0.4 to 0.1.

What if we instead add some samples from the SCANB dataset randomly into the mix? How well do the scores compare to the “ground truth”?

It does not work so well when including some SCANB samples together with the METABRIC and TCGA samples. We are trying to achieve a correlation of almost 1 with very low variability since we want to use the scores clinically.

Let us check the distribution of some genes for SCANB and TCGA after regressing out the first two PCs.

We see that even though the expression profiles of the samples overlap, there are discrepancies. This is perhaps why GSVA does not work so well here.

6.3 Scoring strategy with regressed data

We now try a slightly different scoring strategy. Since our data is on a log scale, what we do instead is sum the expression of all genes of interest and then subtract the average expression of the housekeeping genes.
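On a log scale this strategy reduces to a difference against a housekeeping baseline. A minimal Python sketch of the idea (the actual pipeline is in R; gene names and values here are hypothetical, and the signature term is averaged rather than summed so both terms are on the same scale):

```python
import numpy as np

def hk_score(expr, signature_genes, housekeeping_genes):
    """Per-sample score: mean log-expression of the signature genes
    minus mean log-expression of the housekeeping genes.

    expr: dict mapping gene name -> array of log-expression values,
    one entry per sample."""
    sig = np.mean([expr[g] for g in signature_genes], axis=0)
    hk = np.mean([expr[g] for g in housekeeping_genes], axis=0)
    return sig - hk

# toy example with hypothetical genes for two samples
expr = {
    "ESR1":  np.array([5.0, 1.0]),
    "PGR":   np.array([4.0, 2.0]),
    "ACTB":  np.array([3.0, 3.0]),
    "GAPDH": np.array([3.0, 3.0]),
}
score = hk_score(expr, ["ESR1", "PGR"], ["ACTB", "GAPDH"])
# sample 1: 1.5, sample 2: -1.5
```

Because the housekeeping baseline is subtracted per sample, the score does not depend on a cohort-wide distribution, which is what makes it usable in a single-sample setting.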

First we check the distribution of this average for each cohort separately.

Figure 6.5: Distribution of the average of the housekeeping genes for distinct cohorts.

Indeed the TCGA and METABRIC cohorts have overlapping distributions, but SCANB does not overlap completely; it is actually shifted. One should still take into account that the distributions are very close to 0, meaning that the housekeeping genes have values close to 0, as expected in this case. We now calculate the score taking this into account.

Figure 6.6: Scores obtained from regressed data of all the four cohorts TCGA, SCANB, METABRIC and POETIC compared to the original GSVA scores.

The scores are actually highly correlated, especially for the estrogen-related pathways. The correlation between the scores in general also follows the same trends.

The plot below is a subset of the scores above to include in the publication.

Figure 6.7: Scores obtained from regressed data of all the four cohorts TCGA, SCANB, METABRIC and POETIC compared to the original GSVA scores.

We now compare the distributions of these scores across the different cohorts and pathways of interest.

And with all molecular subtypes together.

We see that the overlap is actually pretty good in general, even for SCANB. The only place with a bigger discrepancy is G2M checkpoint. We see below that by calculating the GSVA score on the whole SCANB cohort we do not have this problem.

And as a comparison we plot with the original scores using GSVA on the whole cohort individually.

And all the subtypes together for the original score:

We see that with GSVA the distributions overlap much more than with the regressed data and the simple sum of the values for those genes. Still, for estrogen signaling, the estrogen early signature seems pretty robust.

6.3.1 Checking the patients from POETIC with new scores

We now use the new scores to evaluate the comparison with the neighborhood, just as in the previous chapter where we used the POETIC trial data.
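The neighborhood comparison amounts to locating a patient's score among the quantiles of the average scores in its molecular neighborhood. A minimal pure-Python sketch of that idea (function name and toy values are hypothetical, not the actual pipeline):

```python
def quantile_position(patient_score, neighborhood_scores):
    """Empirical quantile of the patient's score: fraction of
    neighborhood average scores strictly below it."""
    below = sum(s < patient_score for s in neighborhood_scores)
    return below / len(neighborhood_scores)

# toy neighborhood distribution of average signature scores
neighborhood = [0.1 * i for i in range(-5, 6)]   # -0.5 .. 0.5
high = quantile_position(0.45, neighborhood)     # ~0.91, high quantile
low = quantile_position(-0.45, neighborhood)     # ~0.09, low quantile
```

A patient whose score sits in a very low quantile of its neighborhood (as for patient 236 below) stands out against molecularly similar tumors, even if the absolute score looks unremarkable.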

[1] "treated_nb63_baseline"
[1] "treated_nb236_baseline"

Patient 63 (Figure 6.8) is considered a responder. According to the scores, when comparing the estrogen early signature to its average distribution in the neighborhood, the score is higher.

Figure 6.8: Posterior distribution of the average scores in the neighborhood of patient 63 from POETIC trial. Each dot corresponds to a 1% quantile.

Patient 236 (Figure 6.9) was considered a non-responder. According to the scores, when comparing the estrogen early signature to its average distribution in the neighborhood, the score is lower, lying in the 5% quantile.

Figure 6.9: Posterior distribution of the average scores in the neighborhood of patient 236 from POETIC trial. Each dot corresponds to a 1% quantile.

When comparing these two patients, there is also a difference in the androgen response score, which could be a reflection of different estrogen signaling.

It reflects what we saw previously as well.

6.3.2 Responder vs non responder comparison of new scores

We now compare the scores for responders and non responders.

On average, the higher the E2F targets score, the lower the estrogen early score for non-responders, whereas this correlation does not exist for responders. This difference is mostly driven by the non-responder tumors with high E2F targets scores.

The plot below shows the correlation between baseline Ki67% and E2F targets for the score calculated on the regressed data in a single-sample manner. We see a positive correlation between the two scores. Moreover, it is interesting that when the tumor has low Ki67 there is still a spectrum of E2F targets scores. Also, there is no difference in the correlation between responders and non-responders.

And lastly we compare the estrogen signaling signatures between responders and non-responders.

We see that the estrogen response early score has a wider range than the SET ER/PR score. That makes sense, as this signature has over 100 genes, whereas SET ER/PR has only 18. The score was also very good at discriminating responders from non-responders. What we can conclude from this figure is that samples with very low scores are usually non-responders, which reflects their position in the molecular landscape; in general, responders have higher ER signaling than non-responders.

Below is the figure with the GSVA scores before regressing the data. We see that, unlike the regressed scores, SET ER/PR is not able to completely single out the non-responders with very low scores. Even estrogen early shows some differences.

And below is a figure with both GSVA and regressed scores together.

Figure 6.10: Comparing scoring strategies for responders and non responders in the POETIC dataset.

Below we show the correlation between these scores for all cohorts.

These scores are highly correlated but on different absolute scales. We now check the change in Ki67 for all patients and compare the scores with the original ones obtained with GSVA.

Figure 6.11: Comparison of original GSVA and regressed scores between responders and non responders compared to change in Ki67.

We see that the bigger the change, the higher the score when using the regressed data. Also, the non-responders tend to have scores mostly on the negative side, whereas the responders are shifted more to the right. SET ER/PR is also a good measure here.

There is a very good match between the two scores in general, meaning that the regression scoring strategy captures the scores relatively well. There is one patient who is a non-responder with a low regressed estrogen early score but a GSVA score close to 0; let us check this patient.

This patient sits far to the left in the molecular landscape, among the basal-like samples, and had an increase in Ki67 upon ET. So the new scoring system was able to capture this patient with a low score.

6.3.3 Survival analysis with the new scores

We can check how robust these scores are by performing survival analysis.

Figure 6.12: Overall survival analysis from SCANB and METABRIC cohorts and their scores. The formulas used were the same as for the estrogen signaling analysis. The scores were scaled before performing Cox regression.

The plot with TCGA is shown below.

Scores for TCGA are noisier than for SCANB and METABRIC. Still, the scaled scores have comparable hazard ratios, and they follow what we saw previously. The only pathway that differs a bit is PI3K AKT MTOR signaling; this pathway also had a noisier correlation with the GSVA score.

6.4 Regressing the PDXs

We can use the regressing strategy to score the PDXs and then evaluate their scores.

Total number of stable genes: 39
Total number of genes: 8205
Number of samples: 76
[1] "Normalization done."

We start by comparing the scores across the different control samples and their positions in the molecular landscape.

Figure 6.13: Embedding of only regressed PDX control samples on top of the METABRIC, SCANB and TCGA samples

The results are just what we got previously, meaning that by using all these genes the information in the PDXs is still captured. The main difference is that the luminal B region is more compact compared to Figure 3.20.

And now that we have the results we can compare the PDXs as well.

Figure 6.14: Scores obtained from the regressed data of the PDXs.

The random 200 is just a gene set with 200 random genes that works as a control and serves as a comparison to the other pathways.

And below we show the distribution of some of the scores calculated for all the other cohorts as a comparison.

We see that METS15 is the PDX with the highest proliferation and lowest EMT; it comes from a pleural effusion. It is also the PDX with the lowest apoptosis and highest DNA repair, probably due to a higher replication rate. Moreover, all these PDXs have estrogen response early scores above 0 on average, which is about the average of the scores in the ER+ BC samples.

Now we check what happens with some of the PDXs when they are treated with P4, to compare with Figure 3.23.

Figure 6.15: Comparison of position in the molecular landscape between CTRL, E2 and P4 treated samples with the regressed data

The results are very similar to what was seen previously using the GSVA. What is interesting here is that we have an absolute scale, so we can compare the scores.

6.5 singscore

We can try singscore instead; it is a single-sample, rank-based measure, so it does not need to estimate the distribution of gene expression levels across samples before calculating the enrichment score.
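The core idea of a rank-based single-sample score, of which singscore is a refined version, can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual singscore implementation:

```python
def rank_score(sample_expr, gene_set):
    """Simplified rank-based single-sample score: mean rank of the
    gene-set genes within one sample, rescaled to [0, 1].

    sample_expr: dict gene -> expression value for ONE sample."""
    # rank genes by expression within the sample (1 = lowest)
    ranked = sorted(sample_expr, key=sample_expr.get)
    rank = {g: i + 1 for i, g in enumerate(ranked)}
    n = len(ranked)
    mean_rank = sum(rank[g] for g in gene_set) / len(gene_set)
    return (mean_rank - 1) / (n - 1)   # 0 = all lowest, 1 = all highest

# toy sample with hypothetical genes
sample = {"A": 0.1, "B": 2.0, "C": 5.0, "D": 9.0, "E": 1.0}
high = rank_score(sample, {"C", "D"})   # 0.875: set genes are top-ranked
```

Because only the within-sample ranks matter, the score needs no cohort at all, which is exactly the property we want for clinical single-sample use.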

We start by comparing with the GSVA scores.

Figure 6.16

We see that in all cases there is a correlation, but the interpretation changes. For some of the pathways the singscores differ by cohort; this is not seen with GSVA.

Figure 6.17

The next figure shows the same comparison but instead of using singscore, we used ssGSEA.

[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."

The picture is similar to before, but now we actually look at the distributions across the 3 big cohorts. We used ssGSEA without the normalization step, as would be the case when scoring an individual sample.

Figure 6.18

We see that overall there is a discrepancy, sometimes a big one, depending on the pathway, including for the random genes set. Next we show the same distributions for singscore. Similarly, there is a big discrepancy in several cases.

Figure 6.19

But now, Figure 6.20 shows that when we calculate the scores on the regressed data using our methodology, the distributions are matched and adjusted, with the exception of G2M, which seems to reflect something more fundamental than a batch effect. Note that all the distributions overlap significantly.

Figure 6.20

The next two figures are the distributions of the singscore and GSVA filled by cohort.

For several pathways the singscore distributions are similar, but that is not always the case.

On the other hand, GSVA gives very similar distributions for all the different cohorts. Therefore we actually need to use GSVA; in the end it is important to calculate the scores cohort by cohort.

We now move on and try the same analysis on the regressed data.

We start by comparing with the GSVA scores.

We see that in all cases there is a correlation, but the interpretation changes. For some of the pathways the singscores differ by cohort; this is not seen with GSVA.

The next two figures are the distributions of the singscore and GSVA filled by cohort.

In the end the best way of scoring seems to be GSVA on the unregressed data, though it depends on the pathway. For example, the hallmark estrogen response early and SET ER/PR pathways seem to have been regressed quite well, so we can actually compare results using those scores. On the other hand, we cannot use the G2M checkpoint, E2F targets and EMT pathways. The PI3K signaling score seems comparable only within SCANB.