學(xué)習(xí)目標(biāo)本課程介紹RNAseq數(shù)據(jù)的差異表達分析,。它將從原始fastq文件一直到差異表達基因列表,通過將讀數(shù)映射到參考基因組和使用limma包進行統(tǒng)計分析,。 感謝美國蒙納生物信息學(xué)平臺提供本節(jié)的原始材料,。 # Download the installer scriptsource('http:///biocLite.R')biocLite() 數(shù)據(jù)預(yù)處理首先,刪除所有以前保存的R對象,,并定義RNAseq數(shù)據(jù)分析的工作目錄,。 # Delete all previously saved R objectsrm(list=ls()) 考慮到該課程的RNAseq部分的數(shù)據(jù)已經(jīng)從ArrayExpress(http://www./arrayexpress)下載并且對應(yīng)于來自人腦和肝臟的8個RNA測序文庫。 原始測序數(shù)據(jù)通常以FASTQ格式提供,,F(xiàn)ASTQ格式是一種定義明確的基于文本的格式,,用于存儲生物序列(通常是核苷酸序列)及其相應(yīng)的質(zhì)量分?jǐn)?shù)。已將本研究的原始數(shù)據(jù)(8Gb / fastq文件)下載到共享目錄“?/ data / RNAseq / raw_data”中,。 # define shared directory for RNAseq dataRNAseqDATADIR <- 'data/raw_data' #list the fastq files in the raw data directorydir(RNAseqDATADIR) ## [1] 'ERR420388_mini_1.fastq.gz' 'ERR420388_mini_2.fastq.gz' ## [3] 'ERR420388_subsamp_1.fastq.gz' 'ERR420388_subsamp_2.fastq.gz'## [5] 'experiment_design.txt' RNAseq分析的第一步是對您的數(shù)據(jù)進行快速質(zhì)量檢查,,這將使您了解原始數(shù)據(jù)的質(zhì)量,包括每個庫的讀取次數(shù),,讀取長度,,讀數(shù)的平均質(zhì)量得分, GC內(nèi)容,,序列重復(fù)級別,,可能未從數(shù)據(jù)中正確刪除的適配器等。 fastQC工具快速簡便,,可以從這里下載:http://www.bioinformatics.babraham.ac.uk/projects/fastqc/。 為了確保序列的最高質(zhì)量以用于后續(xù)的映射和差異表達分析步驟,,還可以使用Trimmomatic工具修剪讀數(shù)(Lohse等人,,2012,http://www.usadellab.org/cms/,?page = turbimmatic ) ,。 制圖一旦讀數(shù)經(jīng)過質(zhì)量檢查和修剪,下一步就是將讀數(shù)映射到參考基因組(在我們的例子中是人類基因組“hg19”),。這可以通過Bioconductor軟件包Rsubread完成(Y et al.2013),。 library(Rsubread) 注意請記住,您也可以使用命令行工具(如BWA和featureCounts)執(zhí)行這些步驟,,如Unix shell課程中所述,。 如果您的系統(tǒng)上尚未安裝該軟件包,,則可以通過鍵入以下命令進行安裝: source(' biocLite('Rsubread') Rsubread為最常見的生物提供參考基因組指數(shù):人和小鼠。如果您正在使用不同的有機體,,則可以使用buildindex命令構(gòu)建自己的索引,。 # define the reference genome fasta fileREF_GENOME <- 'hg19.fa'# define the output directory for the Rsubread index# (admin note: requires data/ref_data/download_hg19.sh to be run first)RSUBREAD_INDEX_PATH <- 'data/ref_data'# define the basename for the indexRSUBREAD_INDEX_BASE <- 'hg19'# check what is in the reference directorydir(RSUBREAD_INDEX_PATH) 注意出于本課程的目的,已經(jīng)構(gòu)建了索引,,并且映射將在小部分讀取上完成,,因為此步驟每個庫可能需要幾個小時。請不要運行以下命令,。 # build the indexbuildindex(basename=file.path(RSUBREAD_INDEX_PATH,RSUBREAD_INDEX_BASE), reference=REF_GENOME) 創(chuàng)建Rsubread索引后,,您可以通過運行align命令將讀取映射到基因組。 下面的代碼將用于映射特定庫的讀數(shù)與已構(gòu)建索引的基因組,。 注意出于本課程的目的,,映射將僅由教師和小部分讀取執(zhí)行,因為此步驟每個庫可能需要幾個小時,。請不要運行以下命令,。 # list files in the raw data directorydir(RNAseqDATADIR)# define the fastq file with forward readsinputfilefwd <- file.path(RNAseqDATADIR,'ERR420388_subsamp_1.fastq.gz')# define the fastq file with reverse readsinputfilervs <- file.path(RNAseqDATADIR,'ERR420388_subsamp_2.fastq.gz')# run the align command to map the readsalign(index=file.path(RSUBREAD_INDEX_PATH,RSUBREAD_INDEX_BASE), readfile1=inputfilefwd, readfile2=inputfilervs, output_file='ERR420388.sam', output_format='SAM') 注意:可以在align命令中使用nthreads參數(shù)來加速進程并使用多個并行的CPU運行對齊。 propmapped函數(shù)返回輸出SAM文件中映射讀取的比例:輸入讀取總數(shù),,映射讀取數(shù)和映射讀取比例,。我們來看一個小型SAM文件作為示例: 計數(shù)讀取每個功能 Rsubread提供了一個讀取摘要功能featureCounts,它將SAM或BAM文件作為輸入,,并將它們分配給基因組功能,。這給出了每個基因映射的讀數(shù)的數(shù)量,然后可以將其轉(zhuǎn)化為RPKM值(每百萬讀數(shù)每個基元),,標(biāo)準(zhǔn)化并測試差異表達,。 出于本課程的目的,我們將使用先前構(gòu)建的索引及其相應(yīng)的GTF文件來識別和計算映射到每個特征(基因)的讀取,。 # Getting read counts using the index previously builtmycounts<-featureCounts(outputsamfile, annot.ext=file.path(RSUBREAD_INDEX_PATH,'hg19.genes.gtf'), isGTFAnnotationFile=TRUE, isPairedEnd=TRUE)# Checking your output objectsummary(mycounts)dim(mycounts$counts)head(mycounts$annotation)mycounts$targetsmycounts$stat 注意:mm9,,mm10和hg19的內(nèi)置注釋也是為了方便用戶而提供的,但我們不會在本課程中使用它,。請不要運行以下命令 使用內(nèi)置的hg19注釋獲取讀取計數(shù): mycounts.inbuilt < - featureCounts(myoutputfile,,annot.inbuilt =“hg19”,isGTFAnnotationFile = FALSE,,isPairedEnd = TRUE) 出于本課程的目的,,已經(jīng)對所有庫執(zhí)行了讀取摘要步驟。您需要加載相應(yīng)的RData文件才能獲得這些讀取計數(shù),。 MAPPINGDIR <- 'data/mapping'# load the counts previously calculatedload(file.path(MAPPINGDIR,'RawCounts.RData'))# check the presence of read counts for the 8 librariessummary(counts) ## Length Class Mode ## counts 205616 -none- numeric ## annotation 2 data.frame list ## targets 8 -none- character counts$targets ## [1] '/data/Intro_to_R/RNAseq/mapping/ERR420386.sam'## [2] '/data/Intro_to_R/RNAseq/mapping/ERR420387.sam'## [3] '/data/Intro_to_R/RNAseq/mapping/ERR420388.sam'## [4] '/data/Intro_to_R/RNAseq/mapping/ERR420389.sam'## [5] '/data/Intro_to_R/RNAseq/mapping/ERR420390.sam'## [6] '/data/Intro_to_R/RNAseq/mapping/ERR420391.sam'## [7] '/data/Intro_to_R/RNAseq/mapping/ERR420392.sam'## [8] '/data/Intro_to_R/RNAseq/mapping/ERR420393.sam' 然后,,您可以在文本文件中打印這些計數(shù)以供將來參考。 # print out counts table for every samplewrite.table(counts$counts,file='~/raw_read_counts.txt',sep='\t', quote=F,append=F) 注意:要節(jié)省計算機上的磁盤空間,請記住使用samtools將SAM文件轉(zhuǎn)換為BAM格式并壓縮fastq文件,。這可以在Rstudio中的shell窗口中完成:
您創(chuàng)建的BAM文件的可視化也可以在對這些文件進行排序和索引之后通過基因組瀏覽器(例如IGV(http://www./igv/))完成,。 質(zhì)量控制和統(tǒng)計數(shù)據(jù)Rsubread提供映射到每個基因的讀數(shù)的數(shù)量,然后可用于繪制質(zhì)量控制圖和差異表達分析,。 可以繪制映射讀數(shù)計數(shù)的QC數(shù)據(jù)并研究潛在的異常值庫并確認(rèn)樣品的分組,。 在繪制QC數(shù)據(jù)之前,獲得實驗設(shè)計很有用,。這將允許使用它們所屬的樣本組或任何其他感興趣的參數(shù)標(biāo)記數(shù)據(jù),。 對應(yīng)于此研究的實驗設(shè)計文件已從ArrayExpress網(wǎng)頁下載并格式化為制表符分隔文件,以用于此分析目的,。您可以在共享數(shù)據(jù)目錄中找到它,。 # define the experiment design file (tab separated text file is best)EXPMT_DESIGN_FILE <- file.path(RNAseqDATADIR,'experiment_design.txt')# read the experiment design file and save it into memoryexperiment_design<-read.table(EXPMT_DESIGN_FILE,header=T,sep='\t')## set the rownames to the sampleID to allow for orderingrownames(experiment_design) <- experiment_design$SampleID# order the design following the counts sample orderexperiment_design.ord <- experiment_design[colnames(counts$counts),]# look at the designexperiment_design.ord ## SampleID Source_Name organism sex age tissue## ERR420386 ERR420386 brain_sample_1 Homo_sapiens male 26 brain## ERR420387 ERR420387 brain_sample_1 Homo_sapiens male 26 brain## ERR420388 ERR420388 liver_sample_1 Homo_sapiens male 30 liver## ERR420389 ERR420389 liver_sample_1 Homo_sapiens male 30 liver## ERR420390 ERR420390 liver_sample_1 Homo_sapiens male 30 liver## ERR420391 ERR420391 brain_sample_1 Homo_sapiens male 26 brain## ERR420392 ERR420392 brain_sample_1 Homo_sapiens male 26 brain## ERR420393 ERR420393 liver_sample_1 Homo_sapiens male 30 liver## Extract_Name Material_Type Assay_Name technical_replicate_group## ERR420386 GCCAAT RNA Assay4 group_2## ERR420387 ACAGTG RNA Assay2 group_1## ERR420388 GTGAAA RNA Assay7 group_4## ERR420389 GTGAAA RNA Assay8 group_4## ERR420390 CTTGTA RNA Assay6 group_3## ERR420391 ACAGTG RNA Assay1 group_1## ERR420392 GCCAAT RNA Assay3 group_2## ERR420393 CTTGTA RNA Assay5 group_3 # list the ordered samples for future usesamples <- as.character(experiment_design.ord$SampleID)# create factors for future plottinggroup<-factor(experiment_design.ord$tissue)group ## [1] brain brain liver liver liver brain brain liver## Levels: brain liver age<-factor(experiment_design.ord$age)age ## [1] 26 26 30 30 30 26 26 30## Levels: 26 30 基本QC圖每個文庫的對數(shù)強度分布的密度圖可以疊加在單個圖上,以便更好地比較文庫和識別具有奇怪分布的文庫,。在箱線圖上,,原始對數(shù)強度的密度分布預(yù)計不會相同,但仍然不完全不同,。 注意:該PNG函數(shù)將創(chuàng)建一個PNG文件,,其中保存創(chuàng)建后直的情節(jié),將關(guān)閉該文件時dev.off()被調(diào)用,。 要以交互方式查看您的圖表,,只需省略這兩行。 # density plot of raw read counts (log10)png(file='~/Raw_read_counts_per_gene.density.png')logcounts <- log(counts$counts[,1],10) d <- density(logcounts)plot(d,xlim=c(1,8),main='',ylim=c(0,.45),xlab='Raw read counts per gene (log10)', ylab='Density')for (s in 2:length(samples)){ logcounts <- log(counts$counts[,s],10) d <- density(logcounts) lines(d)}dev.off() log10轉(zhuǎn)換后原始讀取的箱圖計數(shù),。 ## box plots of raw read counts (log10)png(file='~/Raw_read_counts_per_gene.boxplot.png')logcounts <- log(counts$counts,10)boxplot(logcounts, main='', xlab='', ylab='Raw read counts per gene (log10)',axes=FALSE)axis(2)axis(1,at=c(1:length(samples)),labels=colnames(logcounts),las=2,cex.axis=0.8)dev.off() 為了研究樣本之間的關(guān)系,,可以使用stats包中的熱圖函數(shù)執(zhí)行分層聚類。在這個例子中,,熱圖計算了100個最高表達基因的映射讀數(shù)計數(shù)的歐氏距離矩陣,。 # select data for the 100 most highly expressed genesselect = order(rowMeans(counts$counts), decreasing=TRUE)[1:100]highexprgenes_counts <- counts$counts[select,]# heatmap with sample name on X-axispng(file='~/High_expr_genes.heatmap.png')heatmap(highexprgenes_counts, col=topo.colors(50), margin=c(10,6))dev.off() 您會注意到樣本聚類不遵循數(shù)據(jù)矩陣中的原始順序(字母順序“ERR420386”到“ERR420393”),。根據(jù)100個基因表達譜的相似性對它們進行了重新排序,。為了理解這種聚類下的生物學(xué)效應(yīng),,可以使用樣本注釋進行標(biāo)記(樣本組,,年齡,性別等),。 # heatmap with condition group as labelscolnames(highexprgenes_counts)<- group# plotpng(file='~/High_exprs_genes.heatmap.group.png')heatmap(highexprgenes_counts, col = topo.colors(50), margin=c(10,6))dev.off() 練習(xí):熱圖為50個最高表達的基因制作熱圖,,并用它們的年齡注釋樣品,。
主成分分析還可以使用cmdscale函數(shù)(來自stats包)對這些數(shù)據(jù)執(zhí)行主成分分析(PCA),,該函數(shù)執(zhí)行數(shù)據(jù)矩陣的經(jīng)典多維縮放,。 在使用cmdscale函數(shù)進行分析之前,需要對讀取計數(shù)進行轉(zhuǎn)置,,即基因應(yīng)該在列中,,樣本應(yīng)該在行中,。這是在進一步的步驟之前轉(zhuǎn)置和檢查數(shù)據(jù)的代碼: # select data for the 1000 most highly expressed genesselect = order(rowMeans(counts$counts), decreasing=TRUE)[1:100]highexprgenes_counts <- counts$counts[select,]# annotate the data with condition group as labelscolnames(highexprgenes_counts)<- group# transpose the data to have variables (genes) as columnsdata_for_PCA <- t(highexprgenes_counts)dim(data_for_PCA) ## [1] 8 100 該cmdscale函數(shù)將計算不相似矩陣從你的換位數(shù)據(jù),也將提供有關(guān)解釋方差通過計算特征值的比例信息,。 ## calculate MDS (matrix of dissimilarities)mds <- cmdscale(dist(data_for_PCA), k=3, eig=TRUE) # k = the maximum dimension of the space which the data are to be represented in# eig = indicates whether eigenvalues should be returned 變量mds $ eig提供前8個主成分的特征值: mds$eig ## [1] 9.490938e+13 1.099639e+13 1.125271e+11 1.026586e+10 1.500500e+07## [6] 6.240239e+06 3.206875e+06 2.285361e-03 將此變量繪制為百分比將幫助您確定有多少組件可以解釋數(shù)據(jù)集中的可變性,,從而確定應(yīng)該查看的維數(shù)。 # transform the Eigen values into percentageeig_pc <- mds$eig * 100 / sum(mds$eig)# plot the PCApng(file='~/PCA_PropExplainedVariance.png')barplot(eig_pc, las=1, xlab='Dimensions', ylab='Proportion of explained variance (%)', y.axis=NULL, col='darkgrey')dev.off() 在大多數(shù)情況下,,前兩個組件解釋了數(shù)據(jù)集中一半以上的可變性,,可用于繪圖。使用默認(rèn)參數(shù)運行的cmdscale函數(shù)將對給定的數(shù)據(jù)矩陣執(zhí)行主成分分析,,繪圖函數(shù)將為個體表示提供散點圖,。 ## calculate MDSmds <- cmdscale(dist(data_for_PCA)) # Performs MDS analysis #Samples representationpng(file='~/PCA_Dim1vsDim2.png')plot(mds[,1], -mds[,2], type='n', xlab='Dimension 1', ylab='Dimension 2', main='')text(mds[,1], -mds[,2], rownames(mds), cex=0.8) dev.off() 前兩個組分的PCA圖顯示了第一維上腦和肝樣品的清晰分離。在每個樣本組中,,我們還可以注意到每組的4個樣本之間的分割,,這些樣本似乎成對聚類。這種觀察可以通過數(shù)據(jù)的另一個變異因素來解釋,,通常是批量效應(yīng)或另一種生物學(xué)變化,,如年齡或性別。 練習(xí):PCA根據(jù)500個最高表達基因的讀數(shù)計數(shù)產(chǎn)生PCA圖并更改標(biāo)記,,直到您可以確定同一組織樣本之間分裂的原因,。
差異表達在進行差異表達分析之前,,過濾掉非常低表達的基因是有用的,。這將有助于提高分析的統(tǒng)計能力,同時保留感興趣的基因,。一種常見的方法是通過過濾掉一半樣本中具有少于1個百萬分之一讀數(shù)(cpm)的基因,。“edgeR”庫提供了可在此處使用的cpm函數(shù)。 # load required librarieslibrary(edgeR) ## Loading required package: limma # get the expression counts from previous alignment stepmycounts <- counts$countsdim(mycounts) ## [1] 25702 8 mycounts[1:5,1:3] ## ERR420386 ERR420387 ERR420388## 1 2 56 5461## 2 3723 2270 27665## 9 14 3 148## 10 1 1 373## 12 42 34 20969 # filtering#Keep genes with least 1 count-per-million reads (cpm) in at least 4 samplesisexpr <- rowSums(cpm(mycounts)>1) >= 4table(isexpr) ## isexpr## FALSE TRUE ## 10686 15016 mycounts <- mycounts[isexpr,]genes <- rownames(mycounts)dim(mycounts) ## [1] 15016 8 的LIMMA包(因為版本3.16.0)提供VOOM函數(shù),,將歸一化讀取計數(shù)和計算差異表達的適度化的t-統(tǒng)計前應(yīng)用的線性模型的歸一化的數(shù)據(jù),。 # load required librarieslibrary(limma)# check your samples groupingexperiment_design.ord[colnames(mycounts),]$tissue == group ## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE # create design matrix for limmadesign <- model.matrix(~0+group)# substitute 'group' from the design column namescolnames(design)<- gsub('group','',colnames(design))# check your design matrixdesign ## brain liver## 1 1 0## 2 1 0## 3 0 1## 4 0 1## 5 0 1## 6 1 0## 7 1 0## 8 0 1## attr(,'assign')## [1] 1 1## attr(,'contrasts')## attr(,'contrasts')$group## [1] 'contr.treatment' # calculate normalization factors between librariesnf <- calcNormFactors(mycounts)## normalise the read counts with 'voom' functiony <- voom(mycounts,design,lib.size=colSums(mycounts)*nf)# extract the normalised read countscounts.voom <- y$E# save normalised expression data into output dirwrite.table(counts.voom,file='~/counts.voom.txt',row.names=T,quote=F,sep='\t')# fit linear model for each gene given a series of librariesfit <- lmFit(y,design)# construct the contrast matrix corresponding to specified contrasts of a set of parameterscont.matrix <- makeContrasts(liver-brain,levels=design)cont.matrix ## Contrasts## Levels liver - brain## brain -1## liver 1 # compute estimated coefficients and standard errors for a given set of contrastsfit <- contrasts.fit(fit, cont.matrix)# compute moderated t-statistics of differential expression by empirical Bayes moderation of the standard errorsfit <- eBayes(fit)options(digits=3)# check the output fitdim(fit) ## [1] 15016 1 所述topTable功能總結(jié)了來自LIMMA以表格式的輸出。通過選擇p值小于所選截止值的基因和/或大于該表中選擇值的倍數(shù)變化,,可以鑒定用于特定比較的顯著DE基因,。默認(rèn)情況下,表格將通過增加調(diào)整的p值進行排序,,顯示頂部最重要的DE基因,。 # set adjusted pvalue threshold and log fold change thresholdmypval=0.01myfc=3# get the coefficient name for the comparison of interestcolnames(fit$coefficients) ## [1] 'liver - brain' mycoef='liver - brain'# get the output table for the 10 most significant DE genes for this comparisontopTable(fit,coef=mycoef) ## logFC AveExpr t P.Value adj.P.Val B## 5265 10.45 7.33 86.7 1.89e-13 1.37e-09 18.6## 3240 11.28 7.62 89.4 1.46e-13 1.37e-09 18.5## 2335 6.82 8.56 67.2 1.53e-12 2.93e-09 18.4## 213 11.77 10.69 69.3 1.19e-12 2.93e-09 18.3## 338 11.51 8.31 70.4 1.04e-12 2.93e-09 18.0## 2243 11.56 7.17 80.0 3.64e-13 1.37e-09 17.7## 716 7.49 6.75 61.0 3.38e-12 4.22e-09 17.7## 229 10.90 6.45 82.0 2.98e-13 1.37e-09 17.4## 1571 9.64 6.62 63.8 2.33e-12 3.18e-09 17.3## 125 11.54 7.08 64.3 2.19e-12 3.18e-09 16.9 # get the full table ('n = number of genes in the fit')limma.res <- topTable(fit,coef=mycoef,n=dim(fit)[1])# get significant DE genes only (adjusted p-value < mypval)limma.res.pval <- topTable(fit,coef=mycoef,n=dim(fit)[1],p.val=mypval)dim(limma.res.pval) ## [1] 8183 6 # get significant DE genes with low adjusted p-value high fold changelimma.res.pval.FC <- limma.res.pval[which(abs(limma.res.pval$logFC)>myfc),]dim(limma.res.pval.FC) ## [1] 3044 6 # write limma output table for significant genes into a tab delimited filefilename = paste('~/DEgenes_',mycoef,'_pval',mypval,'_logFC',myfc,'.txt',sep='')write.table(limma.res.pval.FC,file=filename,row.names=T,quote=F,sep='\t') 練習(xí):Limma獲得技術(shù)組1和技術(shù)組2(所有腦樣本)之間的DE基因數(shù),其中adj值<0.01,。
基因注釋來自RNAseq數(shù)據(jù)的EntrezGene ID的注釋可以使用BioMart數(shù)據(jù)庫完成,該數(shù)據(jù)庫包含許多物種,包括人,,小鼠,,斑馬魚,雞和大鼠,。 # get the Ensembl annotation for human genomelibrary(biomaRt) ## Loading required package: methods mart<- useDataset('hsapiens_gene_ensembl', useMart('ENSEMBL_MART_ENSEMBL',host='www.ensembl.org'))# get entrez gene IDs from limma output tableentrez_genes <- as.character(rownames(limma.res.pval.FC))length(entrez_genes) ## [1] 3044 # interrogate the BioMart database to get gene symbol and description for these genes detags.IDs <- getBM( filters= 'entrezgene', attributes= c('entrezgene','hgnc_symbol','description'), values= entrez_genes, mart= mart)dim(detags.IDs) ## [1] 2876 3 head(detags.IDs) ## entrezgene hgnc_symbol## 1 10 NAT2## 2 10000 AKT3## 3 100033415 SNORD116-3## 4 100033421 SNORD116-9## 5 100033427 SNORD116-15## 6 100033442 SNORD115-5## description## 1 N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]## 2 AKT serine/threonine kinase 3 [Source:HGNC Symbol;Acc:HGNC:393]## 3 small nucleolar RNA, C/D box 116-3 [Source:HGNC Symbol;Acc:HGNC:33069]## 4 small nucleolar RNA, C/D box 116-9 [Source:HGNC Symbol;Acc:HGNC:33075]## 5 small nucleolar RNA, C/D box 116-15 [Source:HGNC Symbol;Acc:HGNC:33081]## 6 small nucleolar RNA, C/D box 115-5 [Source:HGNC Symbol;Acc:HGNC:33024] 在許多情況下,,每個entrez基因ID可獲得幾個注釋。這導(dǎo)致getBM輸出表中的重復(fù)條目,。處理此問題的最簡單方法是刪除重復(fù)項,,盡管它們也可以在某些方面連接。 一旦獲得了所有DE基因的注釋,,該表可以與limma的輸出表合并,,以獲得完整的結(jié)果和更容易的解釋。 # remove duplicatesdetags.IDs.matrix<-detags.IDs[-which(duplicated(detags.IDs$entrezgene)),]# select genes of interest onlyrownames(detags.IDs.matrix)<-detags.IDs.matrix$entrezgeneentrez_genes.annot <- detags.IDs.matrix[as.character(entrez_genes),]# join the two tablesrownames(limma.res.pval.FC) <- limma.res.pval.FC$IDlimma.res.pval.FC.annot <- cbind(entrez_genes.annot,limma.res.pval.FC)# check the annotated tablehead(limma.res.pval.FC.annot) ## entrezgene hgnc_symbol## 5265 5265 SERPINA1## 3240 3240 HP## 2335 2335 FN1## 213 213 ALB## 338 338 APOB## 2243 2243 FGA## description logFC## 5265 serpin family A member 1 [Source:HGNC Symbol;Acc:HGNC:8941] 10.45## 3240 haptoglobin [Source:HGNC Symbol;Acc:HGNC:5141] 11.28## 2335 fibronectin 1 [Source:HGNC Symbol;Acc:HGNC:3778] 6.82## 213 albumin [Source:HGNC Symbol;Acc:HGNC:399] 11.77## 338 apolipoprotein B [Source:HGNC Symbol;Acc:HGNC:603] 11.51## 2243 fibrinogen alpha chain [Source:HGNC Symbol;Acc:HGNC:3661] 11.56## AveExpr t P.Value adj.P.Val B## 5265 7.33 86.7 1.89e-13 1.37e-09 18.6## 3240 7.62 89.4 1.46e-13 1.37e-09 18.5## 2335 8.56 67.2 1.53e-12 2.93e-09 18.4## 213 10.69 69.3 1.19e-12 2.93e-09 18.3## 338 8.31 70.4 1.04e-12 2.93e-09 18.0## 2243 7.17 80.0 3.64e-13 1.37e-09 17.7 基因集富集基因本體論(GO)富集是使用基因本體論分類系統(tǒng)研究基因組的方法,,其中基因被分配給三個主要結(jié)構(gòu)域的特定術(shù)語集:細(xì)胞成分,,生物過程和分子功能。 GOstats包可以使用Hypergeometric測試來測試GO術(shù)語的過度和不足表示,。分析的輸出通常是GO術(shù)語的排序列表,,每個術(shù)語與p值相關(guān)聯(lián)。 超幾何測試將需要一個選定基因列表(即您的DE基因)和一個“宇宙”列表(例如您正在使用的基因組中注釋的所有基因),,所有這些都由它們的“EntrezGene”ID表示,。 # load the librarylibrary(GOstats) ## Loading required package: Biobase ## Loading required package: BiocGenerics ## Loading required package: parallel ## ## Attaching package: 'BiocGenerics' ## The following objects are masked from 'package:parallel':## ## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,## clusterExport, clusterMap, parApply, parCapply, parLapply,## parLapplyLB, parRapply, parSapply, parSapplyLB ## The following object is masked from 'package:limma':## ## plotMA ## The following objects are masked from 'package:stats':## ## IQR, mad, xtabs ## The following objects are masked from 'package:base':## ## anyDuplicated, append, as.data.frame, cbind, colnames,## do.call, duplicated, eval, evalq, Filter, Find, get, grep,## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,## Position, rank, rbind, Reduce, rownames, sapply, setdiff,## sort, table, tapply, union, unique, unsplit, which, which.max,## which.min ## Welcome to Bioconductor## ## Vignettes contain introductory material; view with## 'browseVignettes()'. To cite Bioconductor, see## 'citation('Biobase')', and for packages 'citation('pkgname')'. ## Loading required package: Category ## Loading required package: stats4 ## Loading required package: AnnotationDbi ## Loading required package: IRanges ## Loading required package: S4Vectors ## ## Attaching package: 'S4Vectors' ## The following objects are masked from 'package:base':## ## colMeans, colSums, expand.grid, rowMeans, rowSums ## Loading required package: Matrix ## ## Attaching package: 'Matrix' ## The following object is masked from 'package:S4Vectors':## ## expand ## Loading required package: graph ## ## ## Attaching package: 'GOstats' ## The following object is masked from 'package:AnnotationDbi':## ## makeGOGraph # Define list of genes of interest (DE genes - EntrezGene IDs)entrezgeneids <- as.character(rownames(limma.res.pval.FC))length(entrezgeneids) ## [1] 3044 # Define the universeuniverseids <- rownames(mycounts)length(universeids) ## [1] 15016 在使用hyperGTest函數(shù)運行超幾何測試之前,您需要定義測試的參數(shù)(基因列表,,本體,,測試方向)以及要使用的注釋數(shù)據(jù)庫。待測試的本體可以是三個GO域中的任何一個:生物過程(“BP”),,細(xì)胞組分(“CC”)或分子功能(“MF”),。 在下面的例子中,我們將在我們的差異表達基因列表中測試過度代表的生物過程,。 # define the p-value cut off for the hypergeometric testhgCutoff <- 0.05params <- new('GOHyperGParams',annotation='org.Hs.eg',geneIds=entrezgeneids,universeGeneIds=universeids,ontology='BP',pvalueCutoff=hgCutoff,testDirection='over') ## Loading required package: org.Hs.eg.db ## ## Warning in makeValidParams(.Object): removing geneIds not in## universeGeneIds # Run the testhg <- hyperGTest(params)# Check resultshg ## Gene to GO BP test for over-representation ## 9211 GO BP ids tested (3527 have p < 0.05)## Selected gene set size: 1530 ## Gene universe size: 12138 ## Annotation package: org.Hs.eg 只有通過p.adjust函數(shù)調(diào)整pvalues,,才能從重要GO術(shù)語的測試中獲取輸出表。 ## Get the p-values of the testhg.pv <- pvalues(hg)## Adjust p-values for multiple test (FDR)hg.pv.fdr <- p.adjust(hg.pv,'fdr')## select the GO terms with adjusted p-value less than the cut offsigGO.ID <- names(hg.pv.fdr[hg.pv.fdr < hgCutoff])length(sigGO.ID) ## [1] 2518 # get table from HyperG test resultdf <- summary(hg)# keep only significant GO terms in the tableGOannot.table <- df[df[,1] %in% sigGO.ID,]head(GOannot.table) ## GOBPID Pvalue OddsRatio ExpCount Count Size## 1 GO:0042221 6.43e-79 3.01 351 658 2783## 2 GO:0050896 1.43e-69 2.69 721 1041 5723## 3 GO:0044699 1.12e-65 4.14 1174 1413 9312## 4 GO:0032501 3.05e-64 2.54 573 877 4542## 5 GO:0044707 6.88e-63 2.53 533 831 4226## 6 GO:0044710 2.06e-60 2.56 406 683 3224## Term## 1 response to chemical## 2 response to stimulus## 3 single-organism process## 4 multicellular organismal process## 5 single-multicellular organism process## 6 single-organism metabolic process Gene Ontology富集結(jié)果可以保存在文本文件或html文件中以供將來參考,。 # Create text report of the significantly over-represented GO termswrite.table(GOannot.table,file='~/GOterms_OverRep_BP.txt',sep='\t',row.names=F)# Create html report of all over-represented GO termshtmlReport(hg, file='~/GOterms_OverRep_BP.html') 練習(xí):GOstats在您的DE基因列表中識別分子功能域中過度表示的GO術(shù)語(pvalue <0.01),。
其他軟件可用于研究過度代表的途徑,例如GeneGO(https://portal./)和Ingenuity(http://www./products/ipa),。這些軟件的優(yōu)勢在于它們可以維護精選和最新的廣泛數(shù)據(jù)庫,。它們還提供直觀的可視化和網(wǎng)絡(luò)建模工具。 在進入本課程的下一部分之前,,您可以保存RNAseq分析的圖像,。 RDataFile <- '~/RNAseq_DE_analysis_with_R.RData'save.image(RDataFile) 記錄包和版本信息 |
|