7  Tests and tries

This is a file where we try different things using the molecular data. We keep it here in case anyone finds anything here interesting and would like to further discuss with us for potential collaborations.

* The library is already synchronized with the lockfile.
    ../../results/plots/trying ../../results/rds_files/trying 
                         FALSE                          FALSE 
   ../../results/tables/trying 
                         FALSE 

7.1 Using several pathways or just one pathway for GSVA

We now evaluate the effect of the scores for each pathway when using all the 53 pathways combined and each one individually. We use only TCGA for this step.

7.1.1 One vs all

We now calculate the scores for each gene set individually.

It seems they are all positively correlated, so one could calculate the scores basically using just one gene set with exception of some of the gene sets.

7.2 Minimum number of samples for GSVA

According to the original GSVA paper, a minimum of 10 samples are necessary to have a good power when using GSVA. Since here we have at least 5 molecular subgroups, we hypothesize we need at least 50 samples to get a good score so we can compare across cohorts. We will try with different numbers of samples using TCGA, we start with 10, 25, 50 and 100 samples and we compare with the scores using the full dataset.

The conclusion we can take here is that depending on the gene set not even 150 samples is enough. For some only 25 patients is already enough, such as SET ER/PR, androgen response, estrogen response early and E2F targets.

Could we mix around 20 samples of SCANB together with more samples from TCGA and METABRIC to get the scores? Let us try to adress this question again. We are using the regressed data now here as this is the only way.

There is still some correlation but the scores are shrinked in the end.

7.3 Normalization strategy

We can also try another way of normalizing, namely for each sample we can z scale it.

And now we can visualize the results.

Figure 7.1: PCA projections colored by different factors and organized by different components. (A) Plot of the first two components colored by cohort. (B) Plot of the first two components colored by ER status. (C) Plot of the second and third components colored by cohort. (D) Plot of the second and third components colored by ER status. (E) Plot of the second and third components colored by PAM50. (F) Plot of the second and third components colored by INTCLUST, only METABRIC has an assigned value for this variable, NAs are TCGA samples.

We see that samples are well embedded as well.

Figure 7.2: PCA embedding colored by cohort. The components used are the first and second components.

SCANB and POETIC are in between as expected and they are closer to their sequencing technology. Interestingly the variation for SCANB and TCGA are compressed in the PC1 vs PC2 axis compared to METABRIC and POETIC.

Figure 7.3 shows that the SCANB samples are also well mixed regarding the clinical factors, including the \(SET_{ER/PR}\) signature.

Figure 7.3: PCA embedding of all samples from TCGA, SCANB and METABRIC. (A) Colored by cohort, (B) colored by ER status, (C) colored by PAM50 molecular subtype, (D) colored by the \(SET_{ER/PR}\) signature and only SCANB samples.

In the end normalizing by scaling the sample before doing any embedding works as well. Though the embedding seems to be more compressed. I would argue that the qPCR-like normalization gives a bit better embedding.

7.4 genefu

With genefu it is possible to calculate scores from comercially available signatures. Here we try to use it on SCANB, METABRIC and TCGA. Moreover, we can start with ABiM, as this dataset has the gold standard ROR score obtained from nCounter data. This is a good way of validating the pipeline.

We start by comparing the molecular subtypes:


FALSE  TRUE 
   14    86 

86 out of 100 were correctly called. Here we are using the gold standard wich is the molecular subtype called from FT samples. The normal like subtype is not available from the FT prosigna assay. So we now compare the results with the SSP obtained from the SCANB team.

Below is the confusion matrix to see which subtypes were misclassified.

Confusion Matrix and Statistics

          Reference
Prediction Basal Her2 LumB LumA Normal
    Basal     13    0    0    0      0
    Her2       0    8    0    0      0
    LumB       1    1   33    1      0
    LumA       1    0    7   32      0
    Normal     1    0    0    2      0

Overall Statistics
                                          
               Accuracy : 0.86            
                 95% CI : (0.7763, 0.9213)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7965          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Basal Class: Her2 Class: LumB Class: LumA
Sensitivity                0.8125      0.8889      0.8250      0.9143
Specificity                1.0000      1.0000      0.9500      0.8769
Pos Pred Value             1.0000      1.0000      0.9167      0.8000
Neg Pred Value             0.9655      0.9891      0.8906      0.9500
Prevalence                 0.1600      0.0900      0.4000      0.3500
Detection Rate             0.1300      0.0800      0.3300      0.3200
Detection Prevalence       0.1300      0.0800      0.3600      0.4000
Balanced Accuracy          0.9062      0.9444      0.8875      0.8956
                     Class: Normal
Sensitivity                     NA
Specificity                   0.97
Pos Pred Value                  NA
Neg Pred Value                  NA
Prevalence                    0.00
Detection Rate                0.00
Detection Prevalence          0.03
Balanced Accuracy               NA

All luminal A were predicted correctly. Some normal like were predicted to be luminal A or basal and Her2. There were some mismatches for the LumB, that were misclassified as LumA. Still it is ok.

Let us now calculate the ROR for the ABiM cohort.

Figure 7.4: Correlation between ROR obtained from prosigna assay and R package genefu

The correlation is relatively good and respects the molecular subtypes in general. The ROR obtained from genefu for some luminal B are not quite right as they are below 50.

We now move on to the other signatures: Mammaprint, EndoPredict (EP) and OncotypeDX Risk Score (RS).

Figure 7.5: Correlation between ROR obtained from prosigna assay and the mammaprint signature obtained from the R package genefu

Figure 7.5 shows the correlation between the mammaprint signature (named GENE70) and the ROR from prosigna.

Let us now compare the GENE70 signature with the risk score developed in the previous chapters. We use the ABiM 100 cohort to make the comparison. One side note, we are using only 51 out of the 70 probes available in the package. 15 probes don’t have any corresponding symbol. Out of the available genes, some are duplicated. For the duplicated genes, the correlation coefficients are all similar, so we can select just the first ocurrence in the list.

There is a good correlation between the two signatures, suggesting that the Mammaprint signature is somehow based on the position of the samples in the molecular landscape.

We now proceed and calculate the mammaprint scores for SCANB and METABRIC as well and only for ER+ BC samples.

Figure 7.6: Embedding of ER+ BC samples colored by the mammaprint score.

We see that the pattern is very close to what we have from the risk score created in the previous chapter. The figure below shows the correlation between mammaprint and the risk score once again.

Now for OncotypeDX:

Figure 7.7: Correlation between oncotypeDX and R package genefu

There is also a correlation between the risk score and oncotypeDX. The paper https://www.nature.com/articles/s41523-022-00492-0#Sec8 also used the genefu for calculating the oncotypeDX.

7.4.1 genefu and AIMS

Here we try to use AIMS to calculate the PAM50 subtypes for the and compare with the results obtained previously. We want to compare the classifications with the molecular landscape.

The confusion matrix below show the results of the AIMS PAM50 as the predicted and the reference is the PAM50 subtypes obtained from the SCANB team.

Confusion Matrix and Statistics

          Reference
Prediction luma her2 normal lumb basal
    luma   2301   61     51  664     0
    her2      9  627     16  200     4
    normal  724   37    830    6    11
    lumb     17   17      0  938     0
    basal     0   75     14    0   614

Overall Statistics
                                         
               Accuracy : 0.7359         
                 95% CI : (0.7255, 0.746)
    No Information Rate : 0.4228         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6411         
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: luma Class: her2 Class: normal Class: lumb
Sensitivity               0.7542     0.76744        0.9111      0.5188
Specificity               0.8137     0.96421        0.8766      0.9937
Pos Pred Value            0.7478     0.73248        0.5162      0.9650
Neg Pred Value            0.8188     0.97013        0.9856      0.8607
Prevalence                0.4228     0.11322        0.1262      0.2506
Detection Rate            0.3189     0.08689        0.1150      0.1300
Detection Prevalence      0.4264     0.11863        0.2228      0.1347
Balanced Accuracy         0.7839     0.86583        0.8938      0.7563
                     Class: basal
Sensitivity               0.97615
Specificity               0.98649
Pos Pred Value            0.87340
Neg Pred Value            0.99770
Prevalence                0.08717
Detection Rate            0.08509
Detection Prevalence      0.09742
Balanced Accuracy         0.98132

We see that there is a big misclassification from luminal A samples to normal like samples and several luminal B are misclassified as luminal A.

The plot below shows the coloring by the reference PAM50 and the predicted by AIMS.

Figure 7.8: SCANB embedding stratified by PAM50 algorithms by both the ones provided by the SCANB (SSP) and the AIMS algorithm

On the left there is the coloring for the AIMS algorithm and on the right from the reference. We see the missclassified samples are not randomly misclassified, they are on the boundaries.

The plot below shows the maximum probability score for each sample individually.

The majority of samples have probabilities of either 1 or 0. Not so often in between.

And the figure below shows the confusion matrix for the ABIM cohort.

Confusion Matrix and Statistics

          Reference
Prediction luma lumb her2 normal
    luma     28   17    1      3
    lumb      0   16    0      0
    her2      0    0   10      0
    normal    2    0    2      2

Overall Statistics
                                          
               Accuracy : 0.6914          
                 95% CI : (0.5789, 0.7893)
    No Information Rate : 0.4074          
    P-Value [Acc > NIR] : 2.253e-07       
                                          
                  Kappa : 0.5401          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: luma Class: lumb Class: her2 Class: normal
Sensitivity               0.9333      0.4848      0.7692       0.40000
Specificity               0.5882      1.0000      1.0000       0.94737
Pos Pred Value            0.5714      1.0000      1.0000       0.33333
Neg Pred Value            0.9375      0.7385      0.9577       0.96000
Prevalence                0.3704      0.4074      0.1605       0.06173
Detection Rate            0.3457      0.1975      0.1235       0.02469
Detection Prevalence      0.6049      0.1975      0.1235       0.07407
Balanced Accuracy         0.7608      0.7424      0.8846       0.67368

We see here that in this case that there are several luminal B samples missclassified as luminal A.

The misclassified samples from AIMS they really seem to be on the top region together with the luminal A. Some of the normal samples are even in the luminal A region.

And another validation is using the normal cohort. Below is the number of all predicted molecular subtypes.


Normal 
    66 

All samples are normal-like as expected, since they are normal.

7.5 nanostring signatures

In this section we try the nanostring panel signatures provided by the Breast Cancer 360 test. We will compare some of the signatures by using GSVA on the original datasets and by calculating the sum of the gene expression levels of the genes when regressing out the 1st, 2nd and 5th components.

The table below shows the number of genes available for each signature and each dataset. In general it seems that all signatures have over 70% of the genes available, a good indication.

pathway tcga scanb metabric poetic n average_percentage
adhesion_and_migration 65 61 68 83 83 0.83
angiogenesis 27 27 29 33 34 0.85
antigen_presentation 20 19 19 19 21 0.92
apoptosis 7 7 5 9 9 0.78
cytokine_and_chemokine_signaling 38 29 37 47 50 0.76
dna_damage_repair 126 108 118 133 143 0.85
emt 72 71 63 82 85 0.85
er_signaling 27 25 24 25 27 0.94
epigenetic_regulation 16 16 14 17 18 0.88
hedgehog 14 10 17 20 20 0.76
immune_infiltration 24 21 23 34 34 0.75
jak_stat 37 36 35 45 47 0.81
mapk 67 61 68 95 100 0.73
notch 21 19 16 21 22 0.88
pi3k 72 67 63 91 96 0.76
proliferation 121 113 114 136 144 0.84
stromal_markers 6 6 6 6 6 1
subtypes 69 63 62 69 70 0.94
tgf_beta 47 47 42 53 57 0.83
transcriptional_misregulation 44 37 44 61 63 0.74
triple_negative_biology 38 32 41 49 50 0.8
tumor_metabolism 15 15 14 14 15 0.97
wnt 37 34 44 51 51 0.81
internal_reference_gene 18 18 16 17 18 0.96

We now compare the scores obtained by GSVA on the datasets and see how they relate with nanostring scores obtained with the regressed data.

First we compare GSVA SET ER/PR with ER signaling obtained by the scoring procedure when regressing the data.

Next we compare EMT, G2M with proliferation and PI3K signaling.

In general there is a correlation but the scores are not so well correlated and they are noisy. One that works fairly well is the ER signaling score, but the others not so well. The proliferation score barely shows any difference between luminal A and B patients in any of the cohorts.

Perhaps if we calculate the scores using the hallmarks pathways the scores are better correlated, since they are the same gene sets.

Below we plot the correlation between the proliferation pathway using GSVA and the regressed scoring strategy for the TCGA cohort.

It is perfectly correlated.

And for ER signaling:

Also it is very much correlated. So the problem in the end is comparing the hallmarks signatures with the nanostring. But interestingly the nanostring proliferation signature was not good to pick up the differences between luminal A and B.

7.6 Risk stratification and the molecular landscape

One common problem in the clinics is what to do with patients that are in the so called intermediate risk group. Here we use the SSP classification from the SCANB team to check what is the position of these patients in the molecular landscape.

Figure 7.9: Embedding of the SCANB samples colored by their predicted ROR risk category.

Figure 7.9 shows that the embedding detects the positions where the intermediate risk patients are, namely they are in the intersection of the luminal A and B subtypes.

And Figure 7.10 below shows that there is a shift in PC4 for the intermediate to the high risk group, which makes sense as this is exactly where the distinction between luminal A and B are.

Figure 7.10: Distribution of the principal components for the different risk groups defined by the SSP ROR scores.

7.7 Late distant recurrence: trying to find biomarkers

One of the key problems in breast cancer research is finding patients that will develop late recurrence. Here we use METABRIC as it has a long follow-up history. We start by plotting the embedding of all METABRIC patients that had a followup longer than 10 years and received only ET. We then try to predict which patients recurr or not based on the molecular data by using lasso regression. For this we use the package glmnet that provides such functionality.

In general there is no distinction in terms of the molecular landscape among the patients that recurred, both in terms of molecular subtype and position in the molecular landscape.

We now compare their molecular scores.

Even here we don’t see much difference. Unless there is some kind of linear combination of pathways, I don’t think it would be possible to classify these two groups by using only transcriptomic data and these pathways. In any case we proceed now with the lasso regression using the 50 pathways.

We see that actually it does not work. And when we predict all the samples are predicted as there is no recurrence.

.
 no 
407 

This is probably because of the unbalanced dataset. Next we try with random forest as well.


Call:
 randomForest(formula = as.factor(is_there_recurrence) ~ ., data = et_metabric_late_data[,      c("is_there_recurrence", pathways_to_compare, "AGE_AT_DIAGNOSIS",          "LYMPH_NODES_EXAMINED_POSITIVE", "npi")], ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 7

        OOB estimate of  error rate: 17.44%
Confusion matrix:
     no yes class.error
no  336   3 0.008849558
yes  68   0 1.000000000

That is not the case, still we can’t predict if a patient will develop recurrence or not in this cohort.

7.8 Correlation of pathways with proliferation

One thing that is very common is defining a gene signature and then using this gene signature to calculate survival analysis. The thing is, proliferation related genes are pretty much present in almost any signature and hence all these pathways are correlated to G2M and E2F to some extent (Venet, Dumont, and Detours 2011).

Here we calculate the pathway scores by using the regressed data and do all pairwise comparison of some of the most important pathways to see their correlations.

First we see that G2M and E2F are encoding basically the same information. PI3K AKT MTOR, AR and the Biocarta HER2 and SET ER/PR are all correlated to some extent to proliferation. Surprisingly the two HER2 signaling scores are not able to distinguish the HER2 subtype. The SET ER/PR correlation with G2M is possibly explained by the fact that the basal molecular subtype has higher G2M and lower ER signaling compared to luminal A and B.

7.9 ILC and position in the molecular landscape

Invasive lobular carcinoma is a subtype of ER+ BC. They tend to respond better to ET in the short term but they recurr after 20 years. Here we just show where the ILC BC lie in the molecular landscape.

Figure 7.11: Embedding of the lobular samples from SCANB on top METABRIC and TCGA

We see that they are mostly in the normal and luminal A region. Still there are several samples that are luminal B, meaning they are more proliferative.

Is there any difference in the SET ER/PR for lobulars when comparing to ER+ HER2- Ductal carcinomas?

They seem to be the same, also when checking the risk score calculated for all patients, the lobular patients don’t seem to have a decreased risk overall, it is mostly because they are in the same region of the molecular landscape. Interestingly there is way less luminal B patients in the lobular, but still their distributions are very similar.

And we also plot by intrinsic molecular subtype.

We also compare EMT.

EMT is very similar across both histological subtypes. The plot below shows the densities for ER percentage, notice how the two overlap.

We now check the lobular signature LobSig that is touted to be superior to OncotypeDX, Prosigna’s PAM50 and other signatures. In total, 63% of the genes are available here. Based on my experience this is already a good number to make comparisons. For example, G2M checkpoint has 71% of the genes available, estrogen response early has 66%.

We see that the LobSig is highgly correlated with G2M checkpoint and also the risk of recurrence score obtained from PAM50. The random 200 is a signature of 200 random genes that works as a negative control.

The figure below shows the distinction of the signature among the 3 different risk categories from ROR.

There is an increase in the score as expected based on the previous results that show where the intermediate risk group is according to the molecular landscape.

Lastly we check the CDH1 expression levels on the regressed data to see if the differences are still captured upon the whole process.

The differences are still captured, and CDH1 is decreased when comparing lobular vs NST. Note also that the difference seems to be the biggest in the luminal B subtype.

We also check the PC3 and PC4 loadings for CDH1, to see how important this gene is in the molecular landscape.

  gene          PC3        PC4
1 CDH1 0.0002872375 0.01509738

PC4 has a relatively high value which puts it into the 95 percentile as shown below.

[1] 0.9547769

Next we can plot the lobular samples from Ciriello’s paper and color by the 3 subtypes they found. Note here that there is a mismatch between the cbioportal data and what is found in the original papers (Curtis et al. 2012). We will use the Curtis 2012 annotation as this is what matches Ciriello’s paper. Surprisingly the mismatch is not a systematic issue, rather 88 ILC samples are matching and the non-matching samples are NST but the other info matches usually, such as NPI.

Figure 7.12: Embedding of the lobular samples from METABRIC on top SCANB and TCGA

The table below shows the number of subtypes for each molecular subtype. In general there are samples in each one of the molecular subtypes. The reactive-like is actually more enriched in the normal region.

        subtype basal claudin-low her2 luma lumb normal
 Immune-Related     0           3    2   27   10      4
  Proliferative     3           3    5   11   15      4
  Reactive-like     0           5    1   22    2     20

And below is a plot of the lobular samples from Ciriello/Curtis with the molecular subtypes defined by Ciriello.

It seems that more proliferative saaples are in the luminal B region whereas the Reactive-like and Immune-related are not so distinguishable, they are in the luminal A region.

Some of the samples are NST on cbioportal and Lobular on Curtis. We selected the following samples and checked on cbioPortal (https://www.cbioportal.org/study/clinicalData?id=brca_metabric) their histologic subtype directly.

  sample_name   npi
1     MB-0035 3.056
2     MB-0050 4.066
3     MB-0083 3.026
4     MB-0101 3.068
5     MB-0102 5.080
6     MB-0112 6.300

All the 6 samples are Ductal/NST and not Lobular.

7.10 Robustness to library protocol

The figure below shows the first two components for SCANB stratified by the library protocol used in the study. We see that PC2 is partially capturing the differences due to library protocol.

7.11 Distance in 4th component for POETIC

We have seen previosuly that there is a bigger difference in between the matched samples in the responder than to non-responder patients. Hallmark G2M checkpoint is one of the signatures driving the PC4, meaning that the position wrt the fourth component might be used to define a new response status.

Below we plot the matched samples for all non responder patients with no change at all in Ki67 (change >= 0%) and their respective positions in the molecular landscape.

Now we do the same for the top responders (change in Ki67 \(\leq\) -90%).

In general the distance for the responder samples seem to be higher and not only that we see that the surgery sample goes to the bottom part of the embedding, suggesting that the reduction of the proliferation is reflected in the molecular landscape.