每個癌癥都去找各自的腫瘤突變maf文件很麻煩,,所以我們才會選擇 PanCanAtlas Publications Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines ,詳見:https://gdc./about-data/publications/mc3-2017 :它提供如下所示的文件: - MC3 Public MAF - mc3.v0.2.8.PUBLIC.maf.gz
- MC3 Controlled MAF - mc3.v0.2.8.CONTROLLED.maf.gz
- mc3.v0.2.8.BROAD_VALIDATIONv2.maf.gz
- mc3.v0.2.9.CONTROLLED_lt3_b.maf.gz
目前只有第一個文件 mc3.v0.2.8.PUBLIC.maf.gz 是公開可以下載的,718M的文件,, MC3 這個項目,,使用了以下的標準 過濾germline variants: - low normal depth coverage
- sites outside of capture kit
- sites marked by the Broad Panel of Normals
- samples marked as being contaminated by ContEst
- variants that were only called by a single caller
其中controlled-access MAF 文件包含 22,485,627 variants (來自10,510個腫瘤樣本,共有13,044,511 single-nucleotide variant (SNV) 和 9,441,116 indels),, 但是 open-access MAF文件包含3,600,963 variants (10,295樣本,,3,427,680 SNV和173,283 indels),所以粗略估算一下,,open-access 的MC3 call set 僅占原始MAF中variant call 的44%,。 我看到了GitHub有一個project把這個數(shù)據(jù)文件制作成為了R包,https://github.com/PoisonAlien/TCGAmutations ,,不知道是否有授權,,是否合規(guī)。我們無需使用這樣的r包了,,自己下載文件,,并且 讀入這個718M的文件 mc3.v0.2.8.PUBLIC.maf.gz 即可!代碼如下: rm(list = ls()) require(maftools) laml=read.maf('mc3.v0.2.8.PUBLIC.maf.gz') laml save(laml,file = 'mc3.Rdata')
rm(list = ls()) require(maftools) load(file = 'mc3.Rdata')
加上臨床信息同樣的是在 https://gdc./about-data/publications/pancanatlas 頁面下載 : - TCGA-Clinical Data Resource (CDR) Outcome - TCGA-CDR-SupplementalTableS1.xlsx
簡單檢查臨床信息文件里面主要關心的臨床指標有: - "age_at_initial_pathologic_diagnosis"
- "ajcc_pathologic_tumor_stage"
- "initial_pathologic_dx_year"
自行下載Excel后打開 TCGA-CDR-SupplementalTableS1.xlsx查看,,如下所示: 主要關心的臨床指標首先看看性別情況: library(readxl) phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1) library(gplots) balloonplot(table(phe$gender,phe$type))
可以看到蠻多癌癥具有性別特異性,,比如婦科癌癥的卵巢癌子宮癌宮頸癌等等: 癌癥具有性別特異性也可以看到 clinical_stage 信息比較亂; clinical_stage相比起來 ,,另外一個臨床指標 ajcc_pathologic_tumor_stage 就好看很多 : ajcc_pathologic_tumor_stage簡單檢查結局事件生存情況并且病人的結局事件也是 多種多樣: OS: overall survival event, 1 for death from any cause, 0 for alive. DSS: disease-specific survival event DFI: disease-free interval event PFI: progression-free interval event
自行下載Excel后打開 TCGA-CDR-SupplementalTableS1.xlsx查看,,如下所示: 結局事件及其對應的統(tǒng)計時間也可以簡單的統(tǒng)計: balloonplot(table(phe$vital_status,phe$type))
各個癌癥腫瘤死亡率不一樣這個時候,替換為比例好一點,。 首先病人按照臨床指標分組后看突變差異這里我們僅僅是按照stage舉例,,首先把stage給格式化: table(phe$ajcc_pathologic_tumor_stage) phe$new_stage = ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage I','Stage IA','Stage IB'),'s1', ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage II' ,'Stage IIA','Stage IIB','Stage IIC'),'s2', ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage III','Stage IIIA','Stage IIIB','Stage IIIC'),'s3', ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage IV','Stage IVA' ,'Stage IVB','Stage IVC'),'s4','other' ) ) ) ) table(phe$new_stage ) balloonplot(table(phe$new_stage,phe$type))
現(xiàn)在看起來清晰很多,我們就選擇BRCA吧: 每次加載前面的 mc3.Rdata 文件,,耗費好幾分鐘,,所以還是按照癌癥拆分開來! require(maftools) load(file = 'mc3.Rdata') library(readxl) phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1)[,2:3] phe=as.data.frame(phe) table(phe$type) cp_list=split(phe$bcr_patient_barcode,phe$type) cg_tmp=laml@data lapply(1:length(cp_list), function(i){ cp=cp_list[[i]] pro=names(cp_list)[i] cg_cp_tmp=cg_tmp[substring(cg_tmp$Tumor_Sample_Barcode,1,12) %in% cp,] laml = read.maf(maf = cg_cp_tmp) save(laml,file= paste0('maftools-',pro,'.Rdata') ) })
得到每個癌癥各自的rdata文件大小如下所示: 14M Sep 16 11:39 maftools-ACC.Rdata 167M Sep 16 11:39 maftools-BLCA.Rdata 149M Sep 16 11:40 maftools-BRCA.Rdata 98M Sep 16 11:40 maftools-CESC.Rdata 3.4M Sep 16 11:40 maftools-CHOL.Rdata 290M Sep 16 11:40 maftools-COAD.Rdata 8.2M Sep 16 11:40 maftools-DLBC.Rdata 44M Sep 16 11:40 maftools-ESCA.Rdata 81M Sep 16 11:41 maftools-GBM.Rdata 128M Sep 16 11:41 maftools-HNSC.Rdata 3.8M Sep 16 11:41 maftools-KICH.Rdata 36M Sep 16 11:41 maftools-KIRC.Rdata 37M Sep 16 11:41 maftools-KIRP.Rdata 10M Sep 16 11:41 maftools-LAML.Rdata 51M Sep 16 11:41 maftools-LGG.Rdata 63M Sep 16 11:41 maftools-LIHC.Rdata 235M Sep 16 11:42 maftools-LUAD.Rdata 219M Sep 16 11:42 maftools-LUSC.Rdata 4.4M Sep 16 11:42 maftools-MESO.Rdata 64M Sep 16 11:42 maftools-OV.Rdata 36M Sep 16 11:42 maftools-PAAD.Rdata 3.3M Sep 16 11:42 maftools-PCPG.Rdata 38M Sep 16 11:43 maftools-PRAD.Rdata 77M Sep 16 11:43 maftools-READ.Rdata 30M Sep 16 11:43 maftools-SARC.Rdata 542M Sep 16 11:43 maftools-SKCM.Rdata 255M Sep 16 11:44 maftools-STAD.Rdata 3.7M Sep 16 11:44 maftools-TGCT.Rdata 14M Sep 16 11:44 maftools-THCA.Rdata 5.0M Sep 16 11:44 maftools-THYM.Rdata 907M Sep 16 11:45 maftools-UCEC.Rdata 13M Sep 16 11:45 maftools-UCS.Rdata 2.4M Sep 16 11:45 maftools-UVM.Rdata 5.6G Sep 5 00:30 mc3.Rdata
需要認真讀文檔:https://www./packages/devel/bioc/vignettes/maftools/inst/doc/maftools.html 首先查看BRCA整個癌癥的突變特征: load(file = 'Rdata/maftools-BRCA.Rdata') oncoplot(maf = laml, top = 10)
然后加上前面制作好的stage信息: [email protected] pos=match(substring(as.character([email protected]$Tumor_Sample_Barcode),1,12),phe$bcr_patient_barcode) [email protected]$stage=phe$new_stage[pos] table([email protected]$stage) other s1 s2 s3 s4 24 173 590 220 19 oncoplot(maf = laml, clinicalFeatures=c('stage'),sortByAnnotation=T, top = 10)
這樣雖然可以肉眼查看不同stage的BRCA病人的突變全景圖,,但是很難區(qū)分各個基因是否有比例差異,。首先呢,,我們可以借助mafCompare: compare two cohorts (MAF). ,可以舉例說明s1和s3兩個分組的差異,。 df=as.data.frame(laml@data) cg=as.character([email protected]$Tumor_Sample_Barcode)[[email protected]$stage=='s1'] cg s1.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])
cg=as.character([email protected]$Tumor_Sample_Barcode)[[email protected]$stage=='s3'] cg s3.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])
pt.vs.rt <- mafCompare(m1 = s1.maf, m2 = s3.maf, m1Name = 'stage1', m2Name = 'stage3', minMut = 5) forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1)
代碼難度有點多,,需要R語言功底扎實。兩個分組,,差異還是有的,,這個結果就是大家的生物學知識發(fā)揮的用武之地咯! 下期預告:有生存統(tǒng)計信息就可以做生存分析,,根據(jù)突變位點,, 可以在各個癌癥內部批量進行生存分析檢驗,挖掘有意義的基因或者基因列表,。
|