久久国产成人av_抖音国产毛片_a片网站免费观看_A片无码播放手机在线观看,色五月在线观看,亚洲精品m在线观看,女人自慰的免费网址,悠悠在线观看精品视频,一级日本片免费的,亚洲精品久,国产精品成人久久久久久久

分享

TCGA的28篇教程- 使用R語言的RTCGA包獲取TCGA數據

 健明 2021-07-14

前些天被TCGA的終結新聞刷屏,,但是一直比較忙,,還沒來得及仔細研讀,,但是筆記本躺著的一些TCGA教程快發(fā)霉了,借此契機好好整理一下吧,,預計二十篇左右的筆記

——jimmy

往期目錄如下:

使用R語言的cgdsr包獲取TCGA數據

第二篇目錄

- TCGA數據源

- R包RTCGA的簡單介紹

- 首先安裝及加載包

- 指定任意基因從任意癌癥里面獲取芯片表達數據

- 繪制指定基因在不同癌癥的表達量區(qū)別boxplot

- 更多boxplot參數

- 指定任意基因從任意癌癥里面獲取測序表達數據

- 用全部的rnaseq的表達數據來做主成分分析

- 用5個基因在3個癌癥的表達量做主成分分析

- 用突變數據做生存分析

- 多個基因在多種癌癥的表達量熱圖正文

TCGA數據源

眾所周知,TCGA數據庫是目前最綜合全面的癌癥病人相關組學數據庫,,包括的測序數據有:

DNA Sequencing

miRNA Sequencing

Protein Expression

mRNA Sequencing

Total RNA Sequencing

Array-based Expression

DNA Methylation

Copy Number

知名的腫瘤研究機構都有著自己的TCGA數據庫探索工具,,比如:

Broad Institute FireBrowse portal, The Broad Institute

cBioPortal for Cancer Genomics, Memorial Sloan-Kettering Cancer Center

TCGA Batch Effects, MD Anderson Cancer Center

Regulome Explorer, Institute for Systems Biology

Next-Generation Clustered Heat Maps, MD Anderson Cancer Center

R包RTCGA的簡單介紹

而RTCGA這個包是 Marcin Marcin Kosinski et al. 等人開發(fā)的,工作流程如下:

img

這不是簡單的一個包,,而是一系列根據數據類型分離的包,,相當于要先下載這些離線數據R包之后再直接從離線數據包里面獲取TCGA的所有數據。

作者寫了詳細的文檔: https://rtcga./RTCGA/index.html

最新的數據版本是2016-01-28,,可以加載以下的包:

RTCGA.mutations.20160128

RTCGA.rnaseq.20160128

RTCGA.clinical.20160128

RTCGA.mRNA.20160128

RTCGA.miRNASeq.20160128

RTCGA.RPPA.20160128

RTCGA.CNV.20160128

RTCGA.methylation.20160128

舊版本已經可以考慮棄用了,,下面是基于 2015-11-01 版本的 TCGA 數據

RTCGA.mutations

RTCGA.rnaseq

RTCGA.clinical

RTCGA.PANCAN12

RTCGA.mRNA

RTCGA.miRNASeq

RTCGA.RPPA

RTCGA.CNV

RTCGA.methylation

這里就介紹如何使用R語言的RTCGA包來獲取任意TCGA數據吧。

首先安裝及加載包

這里僅僅是測序mRNA表達量數據以及臨床信息,,所以只需要下載及安裝下面的包:# Load the bioconductor installer.

source("https:///biocLite.R")

# Install the main RTCGA package

biocLite("RTCGA")

# Install the clinical and mRNA gene expression data packages

biocLite("RTCGA.clinical") ## 14Mb

biocLite('RTCGA.rnaseq') ##  (612.6 MB)

biocLite("RTCGA.mRNA") ##  (85.0 MB)

biocLite('RTCGA.mutations')  ## (103.8 MB)

安裝成功之后就可以加載,,可以看到,有些數據包非常大,,如果網速不好,,下載會很可怕。也可以自己想辦法獨立下載,。https:///packages/3.6/data/experiment/src/contrib/RTCGA.rnaseq_20151101.8.0.tar.gz

https:///packages/3.6/data/experiment/src/contrib/RTCGA.mRNA_1.6.0.tar.gz

https:///packages/3.6/data/experiment/src/contrib/RTCGA.clinical_20151101.8.0.tar.gz

https:///packages/3.6/data/experiment/src/contrib/RTCGA.mutations_20151101.8.0.tar.gz

library(RTCGA)

## Welcome to the RTCGA (version: 1.8.0).

all_TCGA_cancers=infoTCGA()

DT::datatable(all_TCGA_cancers)

library(RTCGA.clinical)

library(RTCGA.mRNA)

## ?mRNA

## ?clinical

指定任意基因從任意癌癥里面獲取芯片表達數據

這里我們拿下面3種癌癥做示范:

Breast invasive carcinoma (BRCA)

Ovarian serous cystadenocarcinoma (OV)

Lung squamous cell carcinoma (LUSC)library(RTCGA)

library(RTCGA.mRNA)

expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA, LUSC.mRNA,

extract.cols = c("GATA3", "PTEN", "XBP1","ESR1", "MUC1"))

## Warning in flatten_bindable(dots_values(...)): '.Random.seed' is not an

## integer vector but of type 'NULL', so ignored

expr

## # A tibble: 1,305 x 7

##             bcr_patient_barcode   dataset     GATA3       PTEN      XBP1

##                                        

##  1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.mRNA  2.870500  1.3613571  2.983333

##  2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.mRNA  2.166250  0.4283571  2.550833

##  3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.mRNA  1.323500  1.3056429  3.020417

##  4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.mRNA  1.841625  0.8096429  3.131333

##  5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.mRNA -6.025250  0.2508571 -1.451750

##  6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.mRNA  1.804500  1.3107857  4.041083

##  7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.mRNA -4.879250 -0.2369286 -0.724750

##  8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.mRNA -3.143250 -1.2432143 -1.193083

##  9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.mRNA  2.034000  1.2074286  2.278833

## 10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.mRNA -0.293125  0.2883571 -1.605083

## # ... with 1,295 more rows, and 2 more variables: ESR1, MUC1

可以看到我們感興趣的5個基因在這3種癌癥的表達量數據都獲取了,,但是樣本量并不一定是最新的TCGA樣本量,如下:nb_samples <- table(expr$dataset)

nb_samples

##

## BRCA.mRNA LUSC.mRNA   OV.mRNA

##       590       154       561

其中要注意的是mRNA并不是rnaseq,,兩者不太一樣,,具體樣本數量,可以看最前面的表格,。

下面簡化一下標識,,方便可視化展現expr$dataset <- gsub(pattern = ".mRNA", replacement = "",  expr$dataset)

expr$bcr_patient_barcode <- paste0(expr$dataset, c(1:590, 1:561, 1:154))

expr

## # A tibble: 1,305 x 7

##    bcr_patient_barcode dataset     GATA3       PTEN      XBP1       ESR1

##                                   

##  1               BRCA1    BRCA  2.870500  1.3613571  2.983333  3.0842500

##  2               BRCA2    BRCA  2.166250  0.4283571  2.550833  2.3860000

##  3               BRCA3    BRCA  1.323500  1.3056429  3.020417  0.7912500

##  4               BRCA4    BRCA  1.841625  0.8096429  3.131333  2.4954167

##  5               BRCA5    BRCA -6.025250  0.2508571 -1.451750 -4.8606667

##  6               BRCA6    BRCA  1.804500  1.3107857  4.041083  2.7970000

##  7               BRCA7    BRCA -4.879250 -0.2369286 -0.724750 -4.4860833

##  8               BRCA8    BRCA -3.143250 -1.2432143 -1.193083 -1.6274167

##  9               BRCA9    BRCA  2.034000  1.2074286  2.278833  4.1155833

## 10              BRCA10    BRCA -0.293125  0.2883571 -1.605083  0.4731667

## # ... with 1,295 more rows, and 1 more variables: MUC1

繪制指定基因在不同癌癥的表達量區(qū)別boxplotlibrary(ggpubr)

## Loading required package: ggplot2

## Loading required package: magrittr

# GATA3

ggboxplot(expr, x = "dataset", y = "GATA3",

title = "GATA3", ylab = "Expression",

color = "dataset", palette = "jco")

img# PTEN

ggboxplot(expr, x = "dataset", y = "PTEN",

title = "PTEN", ylab = "Expression",

color = "dataset", palette = "jco")

img## 注意這個配色可以自選的: RColorBrewer::display.brewer.all()  

這里選擇的是 ggsci 包的配色方案,包括: “npg”, “aaas”, “l(fā)ancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”,,針對常見的SCI雜志的需求開發(fā)的,。

還可以加上P值信息my_comparisons <- list(c("BRCA", "OV"), c("OV", "LUSC"))

ggboxplot(expr, x = "dataset", y = "GATA3",

title = "GATA3", ylab = "Expression",

color = "dataset", palette = "jco")+

stat_compare_means(comparisons = my_comparisons)

img

這些統計學檢驗,也是被包裝成了函數:compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = expr)

## # A tibble: 9 x 8

##      .y. group1 group2             p         p.adj p.format p.signif

##                        

## 1  GATA3   BRCA     OV 1.111768e-177 3.335304e-177  < 2e-16     ****

## 2  GATA3   BRCA   LUSC  6.684016e-73  1.336803e-72  < 2e-16     ****

## 3  GATA3     OV   LUSC  2.965702e-08  2.965702e-08  3.0e-08     ****

## 4   PTEN   BRCA     OV  6.791940e-05  6.791940e-05  6.8e-05     ****

## 5   PTEN   BRCA   LUSC  1.042830e-16  3.128489e-16  < 2e-16     ****

## 6   PTEN     OV   LUSC  1.280576e-07  2.561153e-07  1.3e-07     ****

## 7   XBP1   BRCA     OV 2.551228e-123 7.653685e-123  < 2e-16     ****

## 8   XBP1   BRCA   LUSC  1.950162e-42  3.900324e-42  < 2e-16     ****

## 9   XBP1     OV   LUSC  4.239570e-11  4.239570e-11  4.2e-11     ****

## # ... with 1 more variables: method

更多boxplot參數label.select.criteria <- list(criteria = "`y` > 3.9 & `x` %in% c('BRCA', 'OV')")

ggboxplot(expr, x = "dataset",

y = c("GATA3", "PTEN", "XBP1"),

combine = TRUE,

color = "dataset", palette = "jco",

ylab = "Expression",

label = "bcr_patient_barcode",              # column containing point labels

label.select = label.select.criteria,       # Select some labels to display

font.label = list(size = 9, face = "italic"), # label font

repel = TRUE                                # Avoid label text overplotting

)

img

其中 combine = TRUE 會把多個boxplot并排畫在一起,,其實沒有ggplot自帶的分面好用,。

還可以使用 merge = TRUE or merge = “asis” or merge = "flip" 來把多個boxplot 合并,效果不一樣,。

還有翻轉,,如下:ggboxplot(expr, x = "dataset", y = "GATA3",

title = "GATA3", ylab = "Expression",

color = "dataset", palette = "jco",

rotate = TRUE)

img

更多可視化詳見: http://www./english/articles/24-ggpubr-publication-ready-plots/77-facilitating-exploratory-data-visualization-application-to-tcga-genomic-data/

指定任意基因從任意癌癥里面獲取測序表達數據

還是同樣的3種癌癥和5個基因做示范,這個時候的基因ID稍微有點麻煩,,不僅僅是要symbol還要entrez的ID,,具體需要看 https://wiki.nci./display/TCGA/RNASeq+Version+2 的解釋

如下:library(RTCGA)

library(RTCGA.rnaseq)

expr <- expressionsTCGA(BRCA.rnaseq, OV.rnaseq, LUSC.rnaseq,

extract.cols = c("GATA3|2625", "PTEN|5728", "XBP1|7494","ESR1|2099", "MUC1|4582"))

expr

## # A tibble: 2,071 x 7

##             bcr_patient_barcode     dataset `GATA3|2625` `PTEN|5728`

##                                           

##  1 TCGA-3C-AAAU-01A-11R-A41B-07 BRCA.rnaseq   14337.4623    1724.328

##  2 TCGA-3C-AALI-01A-11R-A41B-07 BRCA.rnaseq    7437.7379    1106.580

##  3 TCGA-3C-AALJ-01A-31R-A41B-07 BRCA.rnaseq   10252.9465    1478.695

##  4 TCGA-3C-AALK-01A-11R-A41B-07 BRCA.rnaseq    8761.6880    1877.120

##  5 TCGA-4H-AAAK-01A-12R-A41B-07 BRCA.rnaseq   14068.5106    1739.574

##  6 TCGA-5L-AAT0-01A-12R-A41B-07 BRCA.rnaseq   16511.5120    1596.715

##  7 TCGA-5L-AAT1-01A-12R-A41B-07 BRCA.rnaseq    6721.2714    1374.083

##  8 TCGA-5T-A9QA-01A-11R-A41B-07 BRCA.rnaseq   13485.3556    2181.485

##  9 TCGA-A1-A0SB-01A-11R-A144-07 BRCA.rnaseq     601.4191    2529.114

## 10 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.rnaseq   12982.8798    1875.775

## # ... with 2,061 more rows, and 3 more variables: `XBP1|7494`,

## #   `ESR1|2099`, `MUC1|4582`

nb_samples <- table(expr$dataset)

nb_samples

##

## BRCA.rnaseq LUSC.rnaseq   OV.rnaseq

##        1212         552         307

library(ggpubr)

# ESR1|2099

ggboxplot(expr, x = "dataset", y = "`PTEN|5728`",

title = "ESR1|2099", ylab = "Expression",

color = "dataset", palette = "jco")

img

更多可視化見:http://rtcga./RTCGA/articles/Visualizations.html

用全部的rnaseq的表達數據來做主成分分析## RNASeq expressions

library(RTCGA.rnaseq)

library(dplyr)  

##

## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':

##

##     filter, lag

## The following objects are masked from 'package:base':

##

##     intersect, setdiff, setequal, union

expressionsTCGA(BRCA.rnaseq, OV.rnaseq, HNSC.rnaseq) %>%

dplyr::rename(cohort = dataset) %>%  

filter(substr(bcr_patient_barcode, 14, 15) == "01") -> BRCA.OV.HNSC.rnaseq.cancer

pcaTCGA(BRCA.OV.HNSC.rnaseq.cancer, "cohort") -> pca_plot

plot(pca_plot)

img

因為是全部的表達數據,所以非常耗時,,但是可以很明顯看到乳腺癌和卵巢癌關系要近一點,,頭頸癌癥就要遠一點。

用5個基因在3個癌癥的表達量做主成分分析expr %>%  

filter(substr(bcr_patient_barcode, 14, 15) == "01") -> rnaseq.5genes.3cancers

DT::datatable(rnaseq.5genes.3cancers)

#pcaTCGA(rnaseq.5genes.3cancers, "dataset") -> pca_plot

#plot(pca_plot)

該包里面的pcaTCGA函數不好用,其實可以自己做PCA分析,。

用突變數據做生存分析library(RTCGA.mutations)

# library(dplyr) if did not load at start

library(survminer)

mutationsTCGA(BRCA.mutations, OV.mutations) %>%

filter(Hugo_Symbol == 'TP53') %>%

filter(substr(bcr_patient_barcode, 14, 15) ==

"01") %>% # cancer tissue

mutate(bcr_patient_barcode =

substr(bcr_patient_barcode, 1, 12)) ->

BRCA_OV.mutations

library(RTCGA.clinical)

survivalTCGA(

BRCA.clinical,

OV.clinical,

extract.cols = "admin.disease_code"

) %>%

dplyr::rename(disease = admin.disease_code) ->

BRCA_OV.clinical

BRCA_OV.clinical %>%

left_join(

BRCA_OV.mutations,

by = "bcr_patient_barcode"

) %>%

mutate(TP53 =

ifelse(!is.na(Variant_Classification), "Mut","WILDorNOINFO")) ->

BRCA_OV.clinical_mutations

BRCA_OV.clinical_mutations %>%

select(times, patient.vital_status, disease, TP53) -> BRCA_OV.2plot

kmTCGA(

BRCA_OV.2plot,

explanatory.names = c("TP53", "disease"),

break.time.by = 400,

xlim = c(0,2000),

pval = TRUE) -> km_plot

## Scale for 'colour' is already present. Adding another scale for

## 'colour', which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',

## which will replace the existing scale.

print(km_plot)

img

多個基因在多種癌癥的表達量熱圖library(RTCGA.rnaseq)

# perfrom plot

# library(dplyr) if did not load at start

expressionsTCGA(

ACC.rnaseq,

BLCA.rnaseq,

BRCA.rnaseq,

OV.rnaseq,

extract.cols =

c("MET|4233",

"ZNF500|26048",

"ZNF501|115560")

) %>%

dplyr::rename(cohort = dataset,

MET = `MET|4233`) %>%

#cancer samples

filter(substr(bcr_patient_barcode, 14, 15) ==

"01") %>%

mutate(MET = cut(MET,

round(quantile(MET, probs = seq(0,1,0.25)), -2),

include.lowest = TRUE,

dig.lab = 5)) -> ACC_BLCA_BRCA_OV.rnaseq

ACC_BLCA_BRCA_OV.rnaseq %>%

select(-bcr_patient_barcode) %>%

group_by(cohort, MET) %>%

summarise_each(funs(median)) %>%

mutate(ZNF500 = round(`ZNF500|26048`),

ZNF501 = round(`ZNF501|115560`)) ->

ACC_BLCA_BRCA_OV.rnaseq.medians

## `summarise_each()` is deprecated.

## Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.

## To map `funs` over all variables, use `summarise_all()`

heatmapTCGA(ACC_BLCA_BRCA_OV.rnaseq.medians,

"cohort", "MET", "ZNF500",

title = "Heatmap of ZNF500 expression")

img

細心的同學可以發(fā)現,,本教程其實里面含有大量的外鏈,因為微信自身的限制沒辦法跳轉,,大家可以去生信技能樹論壇查看,,謝謝合作哦。

一個R包不僅僅是提供一個數據下載接口,,更重要的是里面封裝了一些便于使用的統計分析函數,。

生信技能樹GATK4系列教程

GATK4的gvcf流程

你以為的可能不是你以為的

新鮮出爐的GATK4培訓教材全套PPT,趕快下載學習吧

曾老濕最新私已:GATK4實戰(zhàn)教程

GATK4的CNV流程-hg38

    轉藏 分享 獻花(0

    0條評論

    發(fā)表

    請遵守用戶 評論公約

    類似文章 更多