前些天被TCGA的終結新聞刷屏,,但是一直比較忙,,還沒來得及仔細研讀,,但是筆記本躺著的一些TCGA教程快發(fā)霉了,借此契機好好整理一下吧,,預計二十篇左右的筆記 ——jimmy 往期目錄如下: 使用R語言的cgdsr包獲取TCGA數據 第二篇目錄 - TCGA數據源 - R包RTCGA的簡單介紹 - 首先安裝及加載包 - 指定任意基因從任意癌癥里面獲取芯片表達數據 - 繪制指定基因在不同癌癥的表達量區(qū)別boxplot - 更多boxplot參數 - 指定任意基因從任意癌癥里面獲取測序表達數據 - 用全部的rnaseq的表達數據來做主成分分析 - 用5個基因在3個癌癥的表達量做主成分分析 - 用突變數據做生存分析 - 多個基因在多種癌癥的表達量熱圖正文 TCGA數據源 眾所周知,TCGA數據庫是目前最綜合全面的癌癥病人相關組學數據庫,,包括的測序數據有: DNA Sequencing miRNA Sequencing Protein Expression mRNA Sequencing Total RNA Sequencing Array-based Expression DNA Methylation Copy Number 知名的腫瘤研究機構都有著自己的TCGA數據庫探索工具,,比如: Broad Institute FireBrowse portal, The Broad Institute cBioPortal for Cancer Genomics, Memorial Sloan-Kettering Cancer Center TCGA Batch Effects, MD Anderson Cancer Center Regulome Explorer, Institute for Systems Biology Next-Generation Clustered Heat Maps, MD Anderson Cancer Center R包RTCGA的簡單介紹 而RTCGA這個包是 Marcin Marcin Kosinski et al. 等人開發(fā)的,工作流程如下: img 這不是簡單的一個包,,而是一系列根據數據類型分離的包,,相當于要先下載這些離線數據R包之后再直接從離線數據包里面獲取TCGA的所有數據。 作者寫了詳細的文檔: https://rtcga./RTCGA/index.html 最新的數據版本是2016-01-28,,可以加載以下的包: RTCGA.mutations.20160128 RTCGA.rnaseq.20160128 RTCGA.clinical.20160128 RTCGA.mRNA.20160128 RTCGA.miRNASeq.20160128 RTCGA.RPPA.20160128 RTCGA.CNV.20160128 RTCGA.methylation.20160128 舊版本已經可以考慮棄用了,,下面是基于 2015-11-01 版本的 TCGA 數據 RTCGA.mutations RTCGA.rnaseq RTCGA.clinical RTCGA.PANCAN12 RTCGA.mRNA RTCGA.miRNASeq RTCGA.RPPA RTCGA.CNV RTCGA.methylation 這里就介紹如何使用R語言的RTCGA包來獲取任意TCGA數據吧。 首先安裝及加載包 這里僅僅是測序mRNA表達量數據以及臨床信息,,所以只需要下載及安裝下面的包:# Load the bioconductor installer. source("https:///biocLite.R") # Install the main RTCGA package biocLite("RTCGA") # Install the clinical and mRNA gene expression data packages biocLite("RTCGA.clinical") ## 14Mb biocLite('RTCGA.rnaseq') ## (612.6 MB) biocLite("RTCGA.mRNA") ## (85.0 MB) biocLite('RTCGA.mutations') ## (103.8 MB) 安裝成功之后就可以加載,,可以看到,有些數據包非常大,,如果網速不好,,下載會很可怕。也可以自己想辦法獨立下載,。https:///packages/3.6/data/experiment/src/contrib/RTCGA.rnaseq_20151101.8.0.tar.gz https:///packages/3.6/data/experiment/src/contrib/RTCGA.mRNA_1.6.0.tar.gz https:///packages/3.6/data/experiment/src/contrib/RTCGA.clinical_20151101.8.0.tar.gz https:///packages/3.6/data/experiment/src/contrib/RTCGA.mutations_20151101.8.0.tar.gz library(RTCGA) ## Welcome to the RTCGA (version: 1.8.0). all_TCGA_cancers=infoTCGA() DT::datatable(all_TCGA_cancers) library(RTCGA.clinical) library(RTCGA.mRNA) ## ?mRNA ## ?clinical 指定任意基因從任意癌癥里面獲取芯片表達數據 這里我們拿下面3種癌癥做示范: Breast invasive carcinoma (BRCA) Ovarian serous cystadenocarcinoma (OV) Lung squamous cell carcinoma (LUSC)library(RTCGA) library(RTCGA.mRNA) expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA, LUSC.mRNA, extract.cols = c("GATA3", "PTEN", "XBP1","ESR1", "MUC1")) ## Warning in flatten_bindable(dots_values(...)): '.Random.seed' is not an ## integer vector but of type 'NULL', so ignored expr ## # A tibble: 1,305 x 7 ## bcr_patient_barcode dataset GATA3 PTEN XBP1 ## ## 1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA 2.870500 1.3613571 2.983333 ## 2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA 2.166250 0.4283571 2.550833 ## 3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA 1.323500 1.3056429 3.020417 ## 4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA 1.841625 0.8096429 3.131333 ## 5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA -6.025250 0.2508571 -1.451750 ## 6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA 1.804500 1.3107857 4.041083 ## 7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -4.879250 -0.2369286 -0.724750 ## 8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -3.143250 -1.2432143 -1.193083 ## 9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA 2.034000 1.2074286 2.278833 ## 10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.293125 0.2883571 -1.605083 ## # ... with 1,295 more rows, and 2 more variables: ESR1, MUC1 可以看到我們感興趣的5個基因在這3種癌癥的表達量數據都獲取了,,但是樣本量并不一定是最新的TCGA樣本量,如下:nb_samples <- table(expr$dataset) nb_samples ## ## BRCA.mRNA LUSC.mRNA OV.mRNA ## 590 154 561 其中要注意的是mRNA并不是rnaseq,,兩者不太一樣,,具體樣本數量,可以看最前面的表格,。 下面簡化一下標識,,方便可視化展現expr$dataset <- gsub(pattern = ".mRNA", replacement = "", expr$dataset) expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154)) expr ## # A tibble: 1,305 x 7 ## bcr_patient_barcode dataset GATA3 PTEN XBP1 ESR1 ## ## 1 BRCA1 BRCA 2.870500 1.3613571 2.983333 3.0842500 ## 2 BRCA2 BRCA 2.166250 0.4283571 2.550833 2.3860000 ## 3 BRCA3 BRCA 1.323500 1.3056429 3.020417 0.7912500 ## 4 BRCA4 BRCA 1.841625 0.8096429 3.131333 2.4954167 ## 5 BRCA5 BRCA -6.025250 0.2508571 -1.451750 -4.8606667 ## 6 BRCA6 BRCA 1.804500 1.3107857 4.041083 2.7970000 ## 7 BRCA7 BRCA -4.879250 -0.2369286 -0.724750 -4.4860833 ## 8 BRCA8 BRCA -3.143250 -1.2432143 -1.193083 -1.6274167 ## 9 BRCA9 BRCA 2.034000 1.2074286 2.278833 4.1155833 ## 10 BRCA10 BRCA -0.293125 0.2883571 -1.605083 0.4731667 ## # ... with 1,295 more rows, and 1 more variables: MUC1 繪制指定基因在不同癌癥的表達量區(qū)別boxplotlibrary(ggpubr) ## Loading required package: ggplot2 ## Loading required package: magrittr # GATA3 ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco") img# PTEN ggboxplot(expr, x = "dataset", y = "PTEN", title = "PTEN", ylab = "Expression", color = "dataset", palette = "jco") img## 注意這個配色可以自選的: RColorBrewer::display.brewer.all() 這里選擇的是 ggsci 包的配色方案,包括: “npg”, “aaas”, “l(fā)ancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”,,針對常見的SCI雜志的需求開發(fā)的,。 還可以加上P值信息my_comparisons <- list(c("BRCA", "OV"), c("OV", "LUSC")) ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco")+ stat_compare_means(comparisons = my_comparisons) img 這些統計學檢驗,也是被包裝成了函數:compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = expr) ## # A tibble: 9 x 8 ## .y. group1 group2 p p.adj p.format p.signif ## ## 1 GATA3 BRCA OV 1.111768e-177 3.335304e-177 < 2e-16 **** ## 2 GATA3 BRCA LUSC 6.684016e-73 1.336803e-72 < 2e-16 **** ## 3 GATA3 OV LUSC 2.965702e-08 2.965702e-08 3.0e-08 **** ## 4 PTEN BRCA OV 6.791940e-05 6.791940e-05 6.8e-05 **** ## 5 PTEN BRCA LUSC 1.042830e-16 3.128489e-16 < 2e-16 **** ## 6 PTEN OV LUSC 1.280576e-07 2.561153e-07 1.3e-07 **** ## 7 XBP1 BRCA OV 2.551228e-123 7.653685e-123 < 2e-16 **** ## 8 XBP1 BRCA LUSC 1.950162e-42 3.900324e-42 < 2e-16 **** ## 9 XBP1 OV LUSC 4.239570e-11 4.239570e-11 4.2e-11 **** ## # ... with 1 more variables: method 更多boxplot參數label.select.criteria <- list(criteria = "`y` > 3.9 & `x` %in% c('BRCA', 'OV')") ggboxplot(expr, x = "dataset", y = c("GATA3", "PTEN", "XBP1"), combine = TRUE, color = "dataset", palette = "jco", ylab = "Expression", label = "bcr_patient_barcode", # column containing point labels label.select = label.select.criteria, # Select some labels to display font.label = list(size = 9, face = "italic"), # label font repel = TRUE # Avoid label text overplotting ) img 其中 combine = TRUE 會把多個boxplot并排畫在一起,,其實沒有ggplot自帶的分面好用,。 還可以使用 merge = TRUE or merge = “asis” or merge = "flip" 來把多個boxplot 合并,效果不一樣,。 還有翻轉,,如下:ggboxplot(expr, x = "dataset", y = "GATA3", title = "GATA3", ylab = "Expression", color = "dataset", palette = "jco", rotate = TRUE) img 更多可視化詳見: http://www./english/articles/24-ggpubr-publication-ready-plots/77-facilitating-exploratory-data-visualization-application-to-tcga-genomic-data/ 指定任意基因從任意癌癥里面獲取測序表達數據 還是同樣的3種癌癥和5個基因做示范,這個時候的基因ID稍微有點麻煩,,不僅僅是要symbol還要entrez的ID,,具體需要看 https://wiki.nci./display/TCGA/RNASeq+Version+2 的解釋 如下:library(RTCGA) library(RTCGA.rnaseq) expr <- expressionsTCGA(BRCA.rnaseq, OV.rnaseq, LUSC.rnaseq, extract.cols = c("GATA3|2625", "PTEN|5728", "XBP1|7494","ESR1|2099", "MUC1|4582")) expr ## # A tibble: 2,071 x 7 ## bcr_patient_barcode dataset `GATA3|2625` `PTEN|5728` ## ## 1 TCGA-3C-AAAU-01A-11R-A41B-07 BRCA.rnaseq 14337.4623 1724.328 ## 2 TCGA-3C-AALI-01A-11R-A41B-07 BRCA.rnaseq 7437.7379 1106.580 ## 3 TCGA-3C-AALJ-01A-31R-A41B-07 BRCA.rnaseq 10252.9465 1478.695 ## 4 TCGA-3C-AALK-01A-11R-A41B-07 BRCA.rnaseq 8761.6880 1877.120 ## 5 TCGA-4H-AAAK-01A-12R-A41B-07 BRCA.rnaseq 14068.5106 1739.574 ## 6 TCGA-5L-AAT0-01A-12R-A41B-07 BRCA.rnaseq 16511.5120 1596.715 ## 7 TCGA-5L-AAT1-01A-12R-A41B-07 BRCA.rnaseq 6721.2714 1374.083 ## 8 TCGA-5T-A9QA-01A-11R-A41B-07 BRCA.rnaseq 13485.3556 2181.485 ## 9 TCGA-A1-A0SB-01A-11R-A144-07 BRCA.rnaseq 601.4191 2529.114 ## 10 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.rnaseq 12982.8798 1875.775 ## # ... with 2,061 more rows, and 3 more variables: `XBP1|7494`, ## # `ESR1|2099`, `MUC1|4582` nb_samples <- table(expr$dataset) nb_samples ## ## BRCA.rnaseq LUSC.rnaseq OV.rnaseq ## 1212 552 307 library(ggpubr) # ESR1|2099 ggboxplot(expr, x = "dataset", y = "`PTEN|5728`", title = "ESR1|2099", ylab = "Expression", color = "dataset", palette = "jco") img 更多可視化見:http://rtcga./RTCGA/articles/Visualizations.html 用全部的rnaseq的表達數據來做主成分分析## RNASeq expressions library(RTCGA.rnaseq) library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union expressionsTCGA(BRCA.rnaseq, OV.rnaseq, HNSC.rnaseq) %>% dplyr::rename(cohort = dataset) %>% filter(substr(bcr_patient_barcode, 14, 15) == "01") -> BRCA.OV.HNSC.rnaseq.cancer pcaTCGA(BRCA.OV.HNSC.rnaseq.cancer, "cohort") -> pca_plot plot(pca_plot) img 因為是全部的表達數據,所以非常耗時,,但是可以很明顯看到乳腺癌和卵巢癌關系要近一點,,頭頸癌癥就要遠一點。 用5個基因在3個癌癥的表達量做主成分分析expr %>% filter(substr(bcr_patient_barcode, 14, 15) == "01") -> rnaseq.5genes.3cancers DT::datatable(rnaseq.5genes.3cancers) #pcaTCGA(rnaseq.5genes.3cancers, "dataset") -> pca_plot #plot(pca_plot) 該包里面的pcaTCGA函數不好用,其實可以自己做PCA分析,。 用突變數據做生存分析library(RTCGA.mutations) # library(dplyr) if did not load at start library(survminer) mutationsTCGA(BRCA.mutations, OV.mutations) %>% filter(Hugo_Symbol == 'TP53') %>% filter(substr(bcr_patient_barcode, 14, 15) == "01") %>% # cancer tissue mutate(bcr_patient_barcode = substr(bcr_patient_barcode, 1, 12)) -> BRCA_OV.mutations library(RTCGA.clinical) survivalTCGA( BRCA.clinical, OV.clinical, extract.cols = "admin.disease_code" ) %>% dplyr::rename(disease = admin.disease_code) -> BRCA_OV.clinical BRCA_OV.clinical %>% left_join( BRCA_OV.mutations, by = "bcr_patient_barcode" ) %>% mutate(TP53 = ifelse(!is.na(Variant_Classification), "Mut","WILDorNOINFO")) -> BRCA_OV.clinical_mutations BRCA_OV.clinical_mutations %>% select(times, patient.vital_status, disease, TP53) -> BRCA_OV.2plot kmTCGA( BRCA_OV.2plot, explanatory.names = c("TP53", "disease"), break.time.by = 400, xlim = c(0,2000), pval = TRUE) -> km_plot ## Scale for 'colour' is already present. Adding another scale for ## 'colour', which will replace the existing scale. ## Scale for 'fill' is already present. Adding another scale for 'fill', ## which will replace the existing scale. print(km_plot) img 多個基因在多種癌癥的表達量熱圖library(RTCGA.rnaseq) # perfrom plot # library(dplyr) if did not load at start expressionsTCGA( ACC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq, extract.cols = c("MET|4233", "ZNF500|26048", "ZNF501|115560") ) %>% dplyr::rename(cohort = dataset, MET = `MET|4233`) %>% #cancer samples filter(substr(bcr_patient_barcode, 14, 15) == "01") %>% mutate(MET = cut(MET, round(quantile(MET, probs = seq(0,1,0.25)), -2), include.lowest = TRUE, dig.lab = 5)) -> ACC_BLCA_BRCA_OV.rnaseq ACC_BLCA_BRCA_OV.rnaseq %>% select(-bcr_patient_barcode) %>% group_by(cohort, MET) %>% summarise_each(funs(median)) %>% mutate(ZNF500 = round(`ZNF500|26048`), ZNF501 = round(`ZNF501|115560`)) -> ACC_BLCA_BRCA_OV.rnaseq.medians ## `summarise_each()` is deprecated. ## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead. ## To map `funs` over all variables, use `summarise_all()` heatmapTCGA(ACC_BLCA_BRCA_OV.rnaseq.medians, "cohort", "MET", "ZNF500", title = "Heatmap of ZNF500 expression") img 細心的同學可以發(fā)現,,本教程其實里面含有大量的外鏈,因為微信自身的限制沒辦法跳轉,,大家可以去生信技能樹論壇查看,,謝謝合作哦。 一個R包不僅僅是提供一個數據下載接口,,更重要的是里面封裝了一些便于使用的統計分析函數,。 生信技能樹GATK4系列教程 GATK4的gvcf流程 你以為的可能不是你以為的 新鮮出爐的GATK4培訓教材全套PPT,趕快下載學習吧 曾老濕最新私已:GATK4實戰(zhàn)教程 GATK4的CNV流程-hg38 |
|