【原】帶臨床信息的腫瘤突變maf文件分析維度更多

健明 2021-09-16

展開全文

每個癌癥都去找各自的腫瘤突變maf文件很麻煩,，所以我們才會選擇 PanCanAtlas Publications Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines ，詳見：https://gdc./about-data/publications/mc3-2017 ：它提供如下所示的文件：

MC3 Public MAF - mc3.v0.2.8.PUBLIC.maf.gz
MC3 Controlled MAF - mc3.v0.2.8.CONTROLLED.maf.gz
mc3.v0.2.8.BROAD_VALIDATIONv2.maf.gz
mc3.v0.2.9.CONTROLLED_lt3_b.maf.gz

目前只有第一個文件 mc3.v0.2.8.PUBLIC.maf.gz 是公開可以下載的，718M的文件,， MC3 這個項目,，使用了以下的標準過濾germline variants:

low normal depth coverage
non-exonic sites
sites outside of capture kit
sites marked by the Broad Panel of Normals
samples marked as being contaminated by ContEst
variants that were only called by a single caller

其中controlled-access MAF 文件包含 22,485,627 variants （來自10,510個腫瘤樣本，共有13,044,511 single-nucleotide variant (SNV) 和 9,441,116 indels）,，但是 open-access MAF文件包含3,600,963 variants （10,295樣本,，3,427,680 SNV和173,283 indels），所以粗略估算一下,，open-access 的MC3 call set 僅占原始MAF中variant call 的44%,。

我看到了GitHub有一個project把這個數(shù)據(jù)文件制作成為了R包，https://github.com/PoisonAlien/TCGAmutations ,，不知道是否有授權,，是否合規(guī)。我們無需使用這樣的r包了,，自己下載文件,，并且讀入這個718M的文件 mc3.v0.2.8.PUBLIC.maf.gz 即可！代碼如下：

rm(list = ls())
require(maftools) 
laml=read.maf('mc3.v0.2.8.PUBLIC.maf.gz')
laml
save(laml,file = 'mc3.Rdata')

rm(list = ls())
require(maftools) 
load(file = 'mc3.Rdata')

加上臨床信息

同樣的是在 https://gdc./about-data/publications/pancanatlas 頁面下載：

TCGA-Clinical Data Resource (CDR) Outcome - TCGA-CDR-SupplementalTableS1.xlsx

簡單檢查臨床信息

文件里面主要關心的臨床指標有：

"age_at_initial_pathologic_diagnosis"
"gender"
"race"
"ajcc_pathologic_tumor_stage"
"clinical_stage"
"histological_type"
"histological_grade"
"initial_pathologic_dx_year"
"menopause_status"
"birth_days_to"

自行下載Excel后打開 TCGA-CDR-SupplementalTableS1.xlsx查看,，如下所示：

主要關心的臨床指標

首先看看性別情況：

library(readxl)
phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1)
library(gplots)
balloonplot(table(phe$gender,phe$type))

可以看到蠻多癌癥具有性別特異性,，比如婦科癌癥的卵巢癌子宮癌宮頸癌等等：

癌癥具有性別特異性

也可以看到 clinical_stage 信息比較亂；

clinical_stage

相比起來 ,，另外一個臨床指標 ajcc_pathologic_tumor_stage 就好看很多：

ajcc_pathologic_tumor_stage

簡單檢查結局事件生存情況

并且病人的結局事件也是多種多樣：

OS: overall survival event, 1 for death from any cause, 0 for alive.
DSS: disease-specific survival event
DFI: disease-free interval event
PFI: progression-free interval event

自行下載Excel后打開 TCGA-CDR-SupplementalTableS1.xlsx查看,，如下所示：

結局事件及其對應的統(tǒng)計時間

也可以簡單的統(tǒng)計：

balloonplot(table(phe$vital_status,phe$type))

各個癌癥腫瘤死亡率不一樣

這個時候，替換為比例好一點,。

首先病人按照臨床指標分組后看突變差異

這里我們僅僅是按照stage舉例,，首先把stage給格式化：

table(phe$ajcc_pathologic_tumor_stage)
phe$new_stage = ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage I','Stage IA','Stage IB'),'s1',
                      ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage II' ,'Stage IIA','Stage IIB','Stage IIC'),'s2',
                             ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage III','Stage IIIA','Stage IIIB','Stage IIIC'),'s3',
                                    ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage IV','Stage IVA' ,'Stage IVB','Stage IVC'),'s4','other'
                                    )  ) ) )
table(phe$new_stage )
balloonplot(table(phe$new_stage,phe$type))

現(xiàn)在看起來清晰很多，我們就選擇BRCA吧：

每次加載前面的 mc3.Rdata 文件,，耗費好幾分鐘,，所以還是按照癌癥拆分開來！

require(maftools) 
load(file = 'mc3.Rdata') 
library(readxl)
phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1)[,2:3]
phe=as.data.frame(phe) 
table(phe$type)
cp_list=split(phe$bcr_patient_barcode,phe$type)
cg_tmp=laml@data
lapply(1:length(cp_list), function(i){
  cp=cp_list[[i]]
  pro=names(cp_list)[i]
  cg_cp_tmp=cg_tmp[substring(cg_tmp$Tumor_Sample_Barcode,1,12) %in% cp,]
  laml = read.maf(maf = cg_cp_tmp)
  save(laml,file= paste0('maftools-',pro,'.Rdata') ) 
})

得到每個癌癥各自的rdata文件大小如下所示：

 14M Sep 16 11:39 maftools-ACC.Rdata
167M Sep 16 11:39 maftools-BLCA.Rdata
149M Sep 16 11:40 maftools-BRCA.Rdata
 98M Sep 16 11:40 maftools-CESC.Rdata
3.4M Sep 16 11:40 maftools-CHOL.Rdata
290M Sep 16 11:40 maftools-COAD.Rdata
8.2M Sep 16 11:40 maftools-DLBC.Rdata
 44M Sep 16 11:40 maftools-ESCA.Rdata
 81M Sep 16 11:41 maftools-GBM.Rdata
128M Sep 16 11:41 maftools-HNSC.Rdata
3.8M Sep 16 11:41 maftools-KICH.Rdata
 36M Sep 16 11:41 maftools-KIRC.Rdata
 37M Sep 16 11:41 maftools-KIRP.Rdata
 10M Sep 16 11:41 maftools-LAML.Rdata
 51M Sep 16 11:41 maftools-LGG.Rdata
 63M Sep 16 11:41 maftools-LIHC.Rdata
235M Sep 16 11:42 maftools-LUAD.Rdata
219M Sep 16 11:42 maftools-LUSC.Rdata
4.4M Sep 16 11:42 maftools-MESO.Rdata
 64M Sep 16 11:42 maftools-OV.Rdata
 36M Sep 16 11:42 maftools-PAAD.Rdata
3.3M Sep 16 11:42 maftools-PCPG.Rdata
 38M Sep 16 11:43 maftools-PRAD.Rdata
 77M Sep 16 11:43 maftools-READ.Rdata
 30M Sep 16 11:43 maftools-SARC.Rdata
542M Sep 16 11:43 maftools-SKCM.Rdata
255M Sep 16 11:44 maftools-STAD.Rdata
3.7M Sep 16 11:44 maftools-TGCT.Rdata
 14M Sep 16 11:44 maftools-THCA.Rdata
5.0M Sep 16 11:44 maftools-THYM.Rdata
907M Sep 16 11:45 maftools-UCEC.Rdata
 13M Sep 16 11:45 maftools-UCS.Rdata
2.4M Sep 16 11:45 maftools-UVM.Rdata
5.6G Sep  5 00:30 mc3.Rdata

需要認真讀文檔：https://www./packages/devel/bioc/vignettes/maftools/inst/doc/maftools.html

首先查看BRCA整個癌癥的突變特征:

load(file = 'Rdata/maftools-BRCA.Rdata')  
oncoplot(maf = laml, top = 10)

然后加上前面制作好的stage信息：

[email protected]
pos=match(substring(as.character([email protected]$Tumor_Sample_Barcode),1,12),phe$bcr_patient_barcode)
[email protected]$stage=phe$new_stage[pos]
table([email protected]$stage)
other    s1    s2    s3    s4 
   24   173   590   220    19 
oncoplot(maf = laml,
         clinicalFeatures=c('stage'),sortByAnnotation=T,
         top = 10)

這樣雖然可以肉眼查看不同stage的BRCA病人的突變全景圖,，但是很難區(qū)分各個基因是否有比例差異,。首先呢,，我們可以借助mafCompare: compare two cohorts (MAF). ，可以舉例說明s1和s3兩個分組的差異,。

df=as.data.frame(laml@data)
cg=as.character([email protected]$Tumor_Sample_Barcode)[[email protected]$stage=='s1']
cg
s1.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])

cg=as.character([email protected]$Tumor_Sample_Barcode)[[email protected]$stage=='s3']
cg
s3.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])

pt.vs.rt <- mafCompare(m1 = s1.maf, m2 = s3.maf, m1Name = 'stage1',
                       m2Name = 'stage3', minMut = 5)
forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1)