library(rlang)
library(dplyr)
library(readxl)
library(ggplot2)
library(tidyr)
library(janitor)
library(SummarizedExperiment)
3 SCANB (2022 release)
Data were downloaded from https://data.mendeley.com/datasets/yzxtxn4nmd/3 (version 3, 2023-01-25). We downloaded the data that have not been protocol-adjusted to resemble TruSeq. This means we cannot compare gene expression levels directly across the cohorts; we can only calculate scores and then compare those, depending on the tool being used. The data come from the folder StringTie FPKM Gene Data unadjusted, whose description is given as:
Gene expression FPKM data as outputted by StringTie and summarized on gene identifier. The gene expression data used for training SSP models and for classification of samples using the trained SSPs.
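As a rough illustration of why scores can be compared where raw expression levels cannot, here is a minimal sketch of a within-sample rank score. The gene names, FPKM values, gene set, and the `rank_score` helper are all hypothetical and are not part of the SCAN-B data or of any scoring tool used later. The protocol effect is simulated as a single monotone distortion of the whole sample, which is a simplification: real protocol effects can be gene-specific, so a rank-based score is not guaranteed to be unaffected in practice.

```r
# Toy FPKM values for six genes in one sample (hypothetical numbers).
fpkm <- c(geneA = 12.0, geneB = 0.5, geneC = 85.3,
          geneD = 3.1, geneE = 40.2, geneF = 7.7)

# Hypothetical gene set whose activity we want to summarise.
gene_set <- c("geneC", "geneE")

# Within-sample rank score: mean rank of the gene-set members,
# divided by the number of genes so that it lies in (0, 1].
rank_score <- function(x, genes) {
  r <- rank(x, ties.method = "average")
  mean(r[genes]) / length(x)
}

# Simulated protocol effect: a monotone transformation of the whole sample.
fpkm_shifted <- log2(fpkm + 1) * 1.7 + 0.3

score_raw <- rank_score(fpkm, gene_set)
score_shifted <- rank_score(fpkm_shifted, gene_set)

# The raw values differ a lot, but the rank-based score does not change.
all.equal(score_raw, score_shifted)
```

The expression values themselves change under the simulated distortion, while the score depends only on the ordering of genes within the sample.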
3.1 Loading Rdata files
First we load the Rdata files that contain the gene expression levels and also the gene annotation.
sapply(
  list.files(
    path = "../../Data/20230125_scanb",
    full.names = TRUE,
    pattern = "Rdata"
  ),
  load, envir = globalenv()
)
scanb_9206 <- SCANB.9206.mymatrix
rm(SCANB.9206.mymatrix)
abim_100 <- ABiM.100.mymatrix
abim_405 <- ABiM.405.mymatrix
rm(ABiM.100.mymatrix)
rm(ABiM.405.mymatrix)
normal_66 <- Normal.66.mymatrix
rm(Normal.66.mymatrix)
oslo_103 <- OSLO2EMIT0.103.mymatrix
rm(OSLO2EMIT0.103.mymatrix)
gene_id_ann <- Gene.ID.ann
rm(Gene.ID.ann)
datasets <- list()
which_assay <- list()
All the gene annotations are available in the object gene_id_ann, which will be used for scoring.
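The load-and-rename pattern above can also be written so that the raw objects never reach the global environment. The following is a minimal sketch under stated assumptions: the files, the `Toy.*.mymatrix` objects, and the names `raw_env` and `toy` are invented for illustration and are written to a temporary folder, so nothing here touches the real SCAN-B files.

```r
# Write two toy .Rdata files to a temporary folder (stand-ins for the real files).
tmp_dir <- file.path(tempdir(), "toy_scanb")
dir.create(tmp_dir, showWarnings = FALSE)
Toy.A.mymatrix <- matrix(1:4, nrow = 2)
Toy.B.mymatrix <- matrix(5:8, nrow = 2)
save(Toy.A.mymatrix, file = file.path(tmp_dir, "toy_a.Rdata"))
save(Toy.B.mymatrix, file = file.path(tmp_dir, "toy_b.Rdata"))
rm(Toy.A.mymatrix, Toy.B.mymatrix)

# Load every .Rdata file into a dedicated environment instead of globalenv().
raw_env <- new.env()
loaded <- lapply(
  list.files(path = tmp_dir, full.names = TRUE, pattern = "\\.Rdata$"),
  load,
  envir = raw_env
)

# load() returns the names of the objects it restored.
loaded_names <- sort(unlist(loaded))

# Copy the objects into a named list with tidier names, then drop the raw copies.
toy <- list(
  toy_a = get("Toy.A.mymatrix", envir = raw_env),
  toy_b = get("Toy.B.mymatrix", envir = raw_env)
)
rm(raw_env)
```

Loading into a separate environment avoids the explicit rm() call after each rename, at the cost of one extra get() per object. Anchoring the pattern with `\\.Rdata$` also keeps files that merely contain "Rdata" in their name from being picked up.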
3.2 Loading clinical data from SCAN-B
sheets <- c("SCANB.9206", "ABiM.100", "OSLO2EMITO.103", "ABiM.405", "Normal.66")
clin_data <- suppressWarnings({readxl::read_excel(
  "../../Data/20230125_scanb/Supplementary Data Table 1 - 2023-01-13.xlsx",
  sheet = sheets[1],
  progress = TRUE
)})
datasets$scanb <- SummarizedExperiment::SummarizedExperiment(
  assays = list("FPKM" = scanb_9206[, dplyr::pull(clin_data, "GEX.assay")]),
  colData = clin_data %>% data.frame %>%
    `rownames<-`(clin_data$GEX.assay)
)
which_assay$scanb <- "FPKM"
# convert ensembl genes to hugo IDs
rownames(datasets$scanb) <- gene_id_ann[rownames(scanb_9206), "Gene.Name"]
# check what are the duplicated genes to see if we can safely drop them
dup_genes <- rownames(datasets$scanb)[duplicated(rownames(datasets$scanb))] %>%
  table
dup_genes
Because only a small number of genes are duplicated, and each of them has a single extra copy, we keep the first occurrence of each.
datasets$scanb <- datasets$scanb[!duplicated(rownames(datasets$scanb)), ]
saveRDS(datasets$scanb, "../../Data/20230125_scanb/scanb_sumexp.rds")
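To make the effect of the duplicate handling concrete, here is a small base-R sketch on a toy matrix. The Ensembl-like IDs, the expression values, and the annotation table `ann` are hypothetical; only the column name Gene.Name mirrors the annotation object used above. The second option, keeping the row with the highest mean expression, is shown only as a possible alternative and is not what this document does.

```r
# Toy expression matrix: rows are Ensembl-like IDs, columns are samples.
expr <- matrix(
  c(1, 2,
    5, 6,
    9, 8,
    3, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("ENSG01", "ENSG02", "ENSG03", "ENSG04"), c("s1", "s2"))
)

# Hypothetical annotation: two Ensembl IDs map to the same gene symbol.
ann <- data.frame(
  Gene.Name = c("TP53", "DUP1", "DUP1", "ESR1"),
  row.names = c("ENSG01", "ENSG02", "ENSG03", "ENSG04")
)

# Map row IDs to symbols and list the symbols that occur more than once.
symbols <- ann[rownames(expr), "Gene.Name"]
table(symbols[duplicated(symbols)])

# Option 1 (as in the text): keep the first occurrence of each symbol.
keep_first <- !duplicated(symbols)
expr_first <- expr[keep_first, , drop = FALSE]
rownames(expr_first) <- symbols[keep_first]

# Option 2 (alternative, not used here): keep the row with the highest
# mean expression within each symbol.
row_means <- rowMeans(expr)
keep_top <- ave(row_means, symbols,
                FUN = function(m) seq_along(m) == which.max(m)) == 1
expr_top <- expr[keep_top, , drop = FALSE]
rownames(expr_top) <- symbols[keep_top]
```

With `!duplicated()`, which copy survives depends only on row order, so the result is deterministic as long as the annotation order is stable.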