maftools : 總結(jié),、分析,、可視化

頭頭了不起 2020-11-07

展開全文

轉(zhuǎn)自：程涼皮兒：https://www.jianshu.com/p/7b02459defed

maftools_2

clp

02 June, 2020

maftools : 總結(jié)、分析,、可視化 MAF文件maftools : 總結(jié),、分析、可視化 MAF文件

1 簡介

隨著腫瘤基因組學(xué)的發(fā)展,，突變注釋格式(MAF)正被廣泛接受并用于存儲檢測到的體細胞變異,。癌癥基因組圖譜項目已經(jīng)對30多種不同的癌癥進行了排序,，每種癌癥的樣本量超過200。由體細胞變體組成的結(jié)果數(shù)據(jù)以突變注釋格式存儲。只要數(shù)據(jù)為MAF格式,，此R包將嘗試以高效的方式匯總、分析,、注釋和可視化來自TCGA來源或任何內(nèi)部研究的MAF文件,。

1.1 引用

如果你覺得這個軟件有幫助到你，請用如下方法引用：

Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Resarch PMID: 30341162

2 構(gòu)建MAF文件

對于VCF文件或簡單的表格文件,，簡單的選擇是使用vcf2maf軟件,，該軟件將對VCF進行注釋，確定記錄的優(yōu)先順序,，并生成MAF,。
如果您使用ANNOVAR 進行變異注釋，maftools有一個很方便的函數(shù)annovarToMaf,，用于將注釋輸出轉(zhuǎn)換為MAF格式,。

3 MAF field 要求

MAF文件包含從染色體名稱到cosmic注釋的許多字段。然而，maftools中的大多數(shù)分析使用以下字段,。 - 必填字段: Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type and Tumor_Sample_Barcode. - 推薦的可選字段：包含VAF(變異等位基因頻率)和氨基酸變化信息的非MAF特定字段,。

有關(guān)MAF文件的完整規(guī)范，請參閱NCI GDC文檔頁面,。本教程在TCGA LAML隊列1中的一個示例MAF文件上演示了maftools的用法和應(yīng)用,。

4 安裝

#  if (!require("BiocManager"))
#      install.packages("BiocManager")
#  BiocManager::install("maftools")

5 該R包概覽

maftools功能主要分為可視化模塊和分析模塊。下面總結(jié)了這些功能和簡短的描述,，如下所示,。使用方法很簡單，只需使用read.maf讀取MAF文件(如果可用,，還會附帶copy-number數(shù)據(jù)),，然后將生成的MAF對象傳遞給所需的函數(shù)進行繪圖或分析。

6 讀取并初步探索maf文件

6.1 Required input files

MAF文件-可以.gz壓縮文件,，必需輸入文件,。
與MAF中的每個樣本/Tumor_Sample_Barcode相關(guān)的可選但推薦的臨床數(shù)據(jù)。
可選的copy number數(shù)據(jù)(如果可用),?？梢允?code>GISTIC輸出格式，也可以是包含sample names, gene names 和 copy-number status (Amp or Del),。

6.2 Reading MAF files

read.maf函數(shù)讀取MAF文件,，以各種方式對其進行匯總，并將其存儲為MAF對象,。盡管MAF文件足夠獨立,，建議您在MAF中提供與示例相關(guān)聯(lián)的注釋文件。如果可用,，還可以集成拷貝數(shù)(copy number data)數(shù)據(jù),。

library(maftools)

#path to TCGA LAML MAF file
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools') 
#clinical information containing survival information and histology. This is optional
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools') 

laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
#> -Reading
#> -Validating
#> -Silent variants: 475 
#> -Summarizing
#> -Processing clinical data
#> -Finished in 0.243s elapsed (0.869s cpu)

6.3 MAF 對象

匯總MAF文件存儲為MAF對象。MAF對象包含主MAF文件,、匯總數(shù)據(jù)和任何關(guān)聯(lián)的示例注釋,。有一些訪問器方法可以從MAF對象訪問有用的slots。

#Typing laml shows basic summary of MAF file.
laml
#> An object of class  MAF 
#>                    ID          summary  Mean Median
#>  1:        NCBI_Build               37    NA     NA
#>  2:            Center genome.wustl.edu    NA     NA
#>  3:           Samples              193    NA     NA
#>  4:            nGenes             1241    NA     NA
#>  5:   Frame_Shift_Del               52 0.271      0
#>  6:   Frame_Shift_Ins               91 0.474      0
#>  7:      In_Frame_Del               10 0.052      0
#>  8:      In_Frame_Ins               42 0.219      0
#>  9: Missense_Mutation             1342 6.990      7
#> 10: Nonsense_Mutation              103 0.536      0
#> 11:       Splice_Site               92 0.479      0
#> 12:             total             1732 9.021      9

#  #Shows sample summry.
#  getSampleSummary(laml)
#  #Shows gene summary.
#  getGeneSummary(laml)
#  #shows clinical data associated with samples
#  getClinicalData(laml)
#  #Shows all fields in MAF
#  getFields(laml)
#  #Writes maf summary to an output file with basename laml.
#  write.mafSummary(maf = laml, basename = 'laml')

7 可視化

7.1 Plotting MAF summary

我們可以使用plotmafSummary來繪制maf文件的摘要,，該文件將每個樣本中的變體數(shù)量顯示為堆疊的條形圖,，將變量類型顯示為由Variant_Classification分類匯總的箱形圖。我們可以將平均值或中值線添加到堆疊條形圖中,，以顯示隊列中的平均/中位數(shù)變量,。

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

image.png

7.2 Oncoplots瀑布圖

7.2.1 Drawing oncoplots

可以將maf文件更好地表示為附圖，也稱為瀑布圖,。側(cè)方條形圖和頂部條形圖可以分別由draRowBar和draColBar參數(shù)控制,。

#oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)

image.png

注：標記為Multi_Hit的變異是指在同一樣本中發(fā)生多次突變的基因,。有關(guān)自定義的更多詳細信息，請參閱自定義專題圖小節(jié),。

7.3 Oncostrip

我們可以使用oncostrip函數(shù)可視化任何一組基因,，該函數(shù)在每個樣本中繪制突變,，類似于cBioPortal上的OncoPrinter工具,。oncostrip可用于使用top或gene參數(shù)繪制任意數(shù)量的基因。

oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'))

image.png

7.4轉(zhuǎn)換和顛倒

titv 函數(shù)將SNP分類為 Transitions and Transversions ,，并以各種方式返回匯總表的列表,。匯總的數(shù)據(jù)還可以可視化為顯示六個不同轉(zhuǎn)換的總體分布的boxplot圖，以及顯示每個樣本中的轉(zhuǎn)換分數(shù)的堆疊條形圖,。

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)

image.png

7.5 Lollipop plots for amino acid changes

棒糖圖是顯示蛋白質(zhì)結(jié)構(gòu)突變點的簡單有效的方法,。許多癌基因都有一個優(yōu)先位點，其突變頻率比其他任何位點都要高,。這些點被認為是突變熱點,，棒棒糖曲線圖可以用來顯示它們和其他突變。我們可以使用函數(shù)lollipopPlot繪制這樣的曲線圖,。這個函數(shù)要求我們在maf文件中有氨基酸變化信息,。然而，MAF文件對氨基酸變化的字段命名沒有明確的指導(dǎo)方針,，不同的研究對氨基酸變化有不同的字段(或列)名稱,。默認情況下，lollipopPlot查找列AAChange,，如果在MAF文件中找不到,，則打印所有可用字段并顯示一條警告消息。例如,，MAF文件在字段/列名為‘Protein_Change’下包含氨基酸變化,。我們將使用參數(shù)AACol手動指定此參數(shù)。此函數(shù)還將繪圖作為ggplot對象返回,，用戶稍后可以根據(jù)需要對其進行修改,。

#lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE)
#>      HGNC refseq.ID protein.ID aa.length
#> 1: DNMT3A NM_175629  NP_783328       912
#> 2: DNMT3A NM_022552  NP_072046       912
#> 3: DNMT3A NM_153759  NP_715640       723

image.png

請注意，lollipopPlot警告用戶針對給定基因的不同轉(zhuǎn)錄本的可用性,。如果我們事先知道轉(zhuǎn)錄本ID,，我們可以將其指定為refSeqID或protein ID。默認情況下,，lollipopPlot使用較長的轉(zhuǎn)錄本,。

7.5.1 Labelling points

我們還可以使用參數(shù)labelPos標記lollipopPlot上的點。如果labelPos設(shè)置為all,，則所有點都會高亮顯示,。

lollipopPlot(maf = laml, gene = 'DNMT3A', showDomainLabel = FALSE, labelPos = 882)
#>      HGNC refseq.ID protein.ID aa.length
#> 1: DNMT3A NM_175629  NP_783328       912
#> 2: DNMT3A NM_022552  NP_072046       912
#> 3: DNMT3A NM_153759  NP_715640       723

image.png

7.6 Rainfall plots

腫瘤基因組,，特別是實體瘤的特征是基因組位點具有局部性的超突變 5.這種超突變的基因組區(qū)域可以通過在線性基因組尺度上繪制變異間距離圖來可視化。這些地塊通常被稱為rainfallPlot,，我們可以使用rainfallPlot來繪制這樣的地塊,。如果detectChangePoints設(shè)置為TRUE，則rainfall plot還會亮顯事件間距離的潛在更改所在的區(qū)域,。

brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)

rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.6)
#>    Chromosome Start_Position End_Position nMuts Avg_intermutation_dist
#> 1:          8       98129391     98133560     6               833.8000
#> 2:          8       98398603     98403536     8               704.7143
#> 3:          8       98453111     98456466     8               479.2857
#> 4:          8      124090506    124096810    21               315.2000
#> 5:         12       97437934     97439705     6               354.2000
#> 6:         17       29332130     29336153     7               670.5000
#>    Size Tumor_Sample_Barcode C>G C>T
#> 1: 4169         TCGA-A8-A08B   4   2
#> 2: 4933         TCGA-A8-A08B   1   7
#> 3: 3355         TCGA-A8-A08B  NA   8
#> 4: 6304         TCGA-A8-A08B   1  20
#> 5: 1771         TCGA-A8-A08B   3   3
#> 6: 4023         TCGA-A8-A08B   4   3

image.png

“Kataegis”被定義為包含6個或更多連續(xù)突變,、平均突變間距離小于或等于100bp的基因組片段5。

7.7將突變載量與TCGA隊列進行比較

TCGA包含30多個不同的癌癥隊列,，它們之間的中位突變載量從低至7個外顯子(嗜鉻細胞瘤和腎上腺副神經(jīng)節(jié)瘤)到高達315個外顯子(皮膚黑色素瘤)不等,。看看給定MAF中的突變負荷如何與TCGA隊列相抗衡,，這是很有意義的,。這可以通過函數(shù)tcgaComapre來實現(xiàn)，該函數(shù)繪制從33個TCGA地標隊列中的10000多個WXS樣本編譯而成的變異的分布,。生成的圖類似于Alexandrov等人5的描述,。

laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')

image.png

7.8 Plotting VAF

此函數(shù)將不同的等位基因頻率繪制為箱式圖，這有助于快速估計頂級突變基因的克隆狀態(tài)(假設(shè)純樣本,，克隆基因的平均等位基因頻率通常在~50%左右),。

plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')

image.png

7.9 Genecloud

我們可以用geneCloud函數(shù)繪制突變基因的文字云圖。每個基因的大小與其突變/變異的樣本總數(shù)成正比,。

geneCloud(input = laml, minMut = 3)

image.png

8 處理拷貝數(shù)數(shù)據(jù)

8.1.讀取并匯總`gistic`輸出文件

我們可以匯總GISTIC程序生成的輸出文件,。如前所述，我們需要GISTIC生成的四個文件,，即all_lesions.conf_XX.txt,、amp_genes.conf_XX.txt、del_genes.conf_XX.txt和scores.gistic,，其中XX為置信度,。詳情見GISTIC documentation]。

all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")

laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)
#> -Processing Gistic files..
#> --Processing amp_genes.conf_99.txt
#> --Processing del_genes.conf_99.txt
#> --Processing scores.gistic
#> --Summarizing by samples

#GISTIC object
laml.gistic
#> An object of class  GISTIC 
#>           ID summary
#> 1:   Samples     191
#> 2:    nGenes    2622
#> 3: cytoBands      16
#> 4:       Amp     388
#> 5:       Del   26481
#> 6:     total   26869

與MAF對象類似,，訪問GISTIC對象slots的方法有getSampleSummary,、getGeneSummary和getCytoBandSummary?？梢允褂?code>write.Gistic Summary函數(shù)將匯總結(jié)果寫入輸出文件,。 ### 8.2可視化GISTIC結(jié)果。有三種類型的曲線圖可用于可視化GISTIC結(jié)果,。

8.2.1基因組圖

gisticChromPlot(gistic = laml.gistic, markBands = "all")

image.png

8.2.2 Bubble plot

gisticBubblePlot(gistic = laml.gistic)

image.png

8.2.3 oncoplot

除了拷貝數(shù)數(shù)據(jù)外,，與oncoplots類似。如果有注釋,，可以再次根據(jù)注釋對矩陣進行排序,。下圖是LAML的GISTIC結(jié)果,，根據(jù)FAB分類排序。結(jié)果表明,，7q缺失在M4亞型中幾乎不存在,，而在其他亞型中普遍存在。

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)

image.png

8.2.4 Visualizing CBS segments

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)
#> NULL

image.png

9 數(shù)據(jù)分析

9.1 體細胞突變互動

許多癌癥致病基因在其突變模式中是共生的或表現(xiàn)出很強的排他性,。這種互斥或共現(xiàn)的基因組可以使用somaticInteractions函數(shù)來檢測,，該函數(shù)執(zhí)行成對的費舍爾精確測試來檢測這種重要的基因?qū)Α?code>somaticInteractions函數(shù)還使用cometExactTest來識別涉及6中>2 個基因的潛在改變的基因組。

#exclusive/co-occurance event analysis on top 10 mutated genes. 
somaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))

image.png

#>      gene1  gene2       pValue oddsRatio  00 11 01 10              Event
#>   1: ASXL1  RUNX1 0.0001541586 55.215541 176  4 12  1       Co_Occurence
#>   2:  IDH2  RUNX1 0.0002809928  9.590877 164  7  9 13       Co_Occurence
#>   3:  IDH2  ASXL1 0.0004030636 41.077327 172  4  1 16       Co_Occurence
#>   4:  FLT3   NPM1 0.0009929836  3.763161 125 17 16 35       Co_Occurence
#>   5:  SMC3 DNMT3A 0.0010451985 20.177713 144  6 42  1       Co_Occurence
#>  ---                                                                    
#> 296: PLCE1  ASXL1 1.0000000000  0.000000 184  0  5  4 Mutually_Exclusive
#> 297: RAD21  FAM5C 1.0000000000  0.000000 183  0  5  5 Mutually_Exclusive
#> 298: PLCE1  FAM5C 1.0000000000  0.000000 184  0  5  4 Mutually_Exclusive
#> 299: PLCE1  RAD21 1.0000000000  0.000000 184  0  5  4 Mutually_Exclusive
#> 300:  EZH2  PLCE1 1.0000000000  0.000000 186  0  4  3 Mutually_Exclusive
#>              pair event_ratio
#>   1: ASXL1, RUNX1        4/13
#>   2:  IDH2, RUNX1        7/22
#>   3:  ASXL1, IDH2        4/17
#>   4:   FLT3, NPM1       17/51
#>   5: DNMT3A, SMC3        6/43
#>  ---                         
#> 296: ASXL1, PLCE1         0/9
#> 297: FAM5C, RAD21        0/10
#> 298: FAM5C, PLCE1         0/9
#> 299: PLCE1, RAD21         0/9
#> 300:  EZH2, PLCE1         0/7

9.2 基于位置聚類的癌癥驅(qū)動基因檢測

maftools有一個名為oncodrive的函數(shù),，它可以從給定的MAF中識別癌癥基因(driver gene),。oncodrive是一個基于oncodriveCLUST算法的函數(shù),，最初是用Python語言實現(xiàn)的,。這一概念是基于這樣一個事實，即致癌基因中的大多數(shù)變異在少數(shù)幾個特定的位點(也就是熱點)上富集,。這種方法利用這些位置來識別癌癥基因,。如果您使用此函數(shù)，請注明OncodriveCLUST article 7引用,。

laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')

head(laml.sig)
#>    Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
#> 1:        IDH1               0               0            0            0
#> 2:        IDH2               0               0            0            0
#> 3:        NPM1               0              33            0            0
#> 4:        NRAS               0               0            0            0
#> 5:       U2AF1               0               0            0            0
#> 6:         KIT               1               1            0            1
#>    Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
#> 1:                18                 0           0    18             18
#> 2:                20                 0           0    20             20
#> 3:                 1                 0           0    34             33
#> 4:                15                 0           0    15             15
#> 5:                 8                 0           0     8              8
#> 6:                 7                 0           0    10              8
#>    AlteredSamples clusters muts_in_clusters clusterScores protLen   zscore
#> 1:             18        1               18     1.0000000     414 5.546154
#> 2:             20        2               20     1.0000000     452 5.546154
#> 3:             33        2               32     0.9411765     294 5.093665
#> 4:             15        2               15     0.9218951     189 4.945347
#> 5:              8        1                7     0.8750000     240 4.584615
#> 6:              8        2                9     0.8500000     976 4.392308
#>            pval          fdr fract_muts_in_clusters
#> 1: 1.460110e-08 1.022077e-07              1.0000000
#> 2: 1.460110e-08 1.022077e-07              1.0000000
#> 3: 1.756034e-07 8.194826e-07              0.9411765
#> 4: 3.800413e-07 1.330144e-06              1.0000000
#> 5: 2.274114e-06 6.367520e-06              0.8750000
#> 6: 5.607691e-06 1.308461e-05              0.9000000

可以用plotOncodrive來繪制這些結(jié)果.

plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE)

image.png

9.3 Adding and summarizing pfam domains

maftools附帶了函數(shù)pfamDomains,，該函數(shù)將pfam結(jié)構(gòu)域信息添加到氨基酸變化中。pnamDomain還根據(jù)受影響的結(jié)構(gòu)域匯總氨基酸變化,。這是為了知道在給定的癌癥隊列中,，哪個結(jié)構(gòu)域最容易受到影響。該函數(shù)的靈感來自于MuSic tool 8中的PFAM注釋模塊,。

laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)
#> Warning in pfamDomains(maf = laml, AACol = "Protein_Change", top = 10):
#> Removed 50 mutations for which AA position was not available

image.png

#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = FALSE]
#>         HGNC AAPos Variant_Classification  N total  fraction   DomainLabel
#>    1: DNMT3A   882      Missense_Mutation 27    54 0.5000000 AdoMet_MTases
#>    2:   IDH1   132      Missense_Mutation 18    18 1.0000000      PTZ00435
#>    3:   IDH2   140      Missense_Mutation 17    20 0.8500000      PTZ00435
#>    4:   FLT3   835      Missense_Mutation 14    52 0.2692308      PKc_like
#>    5:   FLT3   599           In_Frame_Ins 10    52 0.1923077      PKc_like
#>   ---                                                                     
#> 1512: ZNF646   875      Missense_Mutation  1     1 1.0000000          <NA>
#> 1513: ZNF687   554      Missense_Mutation  1     2 0.5000000          <NA>
#> 1514: ZNF687   363      Missense_Mutation  1     2 0.5000000          <NA>
#> 1515: ZNF75D     5      Missense_Mutation  1     1 1.0000000          <NA>
#> 1516: ZNF827   427        Frame_Shift_Del  1     1 1.0000000          <NA>
#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = FALSE]
#>        DomainLabel nMuts nGenes
#>   1:      PKc_like    55      5
#>   2:      PTZ00435    38      2
#>   3: AdoMet_MTases    33      1
#>   4:         7tm_1    24     24
#>   5:       COG5048    17     17
#>  ---                           
#> 499:    ribokinase     1      1
#> 500:   rim_protein     1      1
#> 501: sigpep_I_bact     1      1
#> 502:           trp     1      1
#> 503:        zf-BED     1      1

上面的圖和結(jié)果表明,，AdoMet_MTases結(jié)構(gòu)域經(jīng)常發(fā)生突變，但與7tm_1結(jié)構(gòu)域等其他結(jié)構(gòu)域相比,，含有該結(jié)構(gòu)域的基因只有一個(DNMT3A),，后者在24個不同的基因中發(fā)生了突變。這表明了甲基轉(zhuǎn)移結(jié)構(gòu)域突變在白血病中的重要性,。

9.4 Pan-Cancer comparison

Lawrence等人對21個癌癥隊列進行了MutSigCV分析,，確定了200多個顯著突變的基因，其中包括以前未訂閱的新基因9,。他們的結(jié)果表明,，只有少數(shù)基因在多個隊列中發(fā)生突變，而其中許多基因是組織/隊列特異性的,。我們可以將mutSig結(jié)果與這個泛癌顯著突變基因列表進行比較,，以查看特定隊列中特定突變的基因。此函數(shù)需要MutSigCV結(jié)果(通常名為sig_genes.txt)作為輸入,。

#MutsigCV results for TCGA-AML
laml.mutsig <- system.file("extdata", "LAML_sig_genes.txt.gz", package = "maftools")
pancanComparison(mutsigResults = laml.mutsig, qval = 0.1, cohortName = 'LAML', inputSampleSize = 200, label = 1)
#>       gene pancan            q nMut log_q_pancan     log_q
#>  1:  CEBPA  1.000 3.500301e-12   13   0.00000000 11.455895
#>  2:   EZH2  1.000 7.463546e-05    3   0.00000000  4.127055
#>  3: GIGYF2  1.000 6.378338e-03    2   0.00000000  2.195292
#>  4:    KIT  0.509 1.137517e-05    8   0.29328222  4.944042
#>  5:   PHF6  0.783 6.457555e-09    6   0.10623824  8.189932
#>  6: PTPN11  0.286 7.664584e-03    9   0.54363397  2.115511
#>  7:  RAD21  0.929 1.137517e-05    5   0.03198429  4.944042
#>  8:  SMC1A  0.801 2.961696e-03    6   0.09636748  2.528460
#>  9:   TET2  0.907 2.281625e-13   17   0.04239271 12.641756
#> 10:    WT1  1.000 2.281625e-13   12   0.00000000 12.641756

image.png

#>          gene   pancan            q nMut log_q_pancan    log_q
#>   1:   ACVR1B 6.11e-02 1.000000e+00    0     1.213959  0.00000
#>   2:     AKT1 2.68e-10 1.000000e+00    0     9.571865  0.00000
#>   3:      APC 1.36e-13 1.000000e+00    0    12.866461  0.00000
#>   4:    APOL2 7.96e-03 1.000000e+00    0     2.099087  0.00000
#>   5: ARHGAP35 2.32e-12 1.000000e+00    1    11.634512  0.00000
#>  ---                                                          
#> 120:    U2AF1 4.07e-08 4.503311e-13    8     7.390406 12.34647
#> 121:      VHL 2.32e-12 1.000000e+00    0    11.634512  0.00000
#> 122:      WT1 1.00e+00 2.281625e-13   12     0.000000 12.64176
#> 123:   ZNF180 8.60e-02 1.000000e+00    0     1.065502  0.00000
#> 124:   ZNF483 2.37e-02 1.000000e+00    0     1.625252  0.00000

9.5生存分析

生存分析是基于隊列的測序項目的重要組成部分,。函數(shù)mafSurvive根據(jù)用戶定義基因的突變狀態(tài)或手動提供的組成一組的樣本進行分組,，進行生存分析并繪制Kaplan Meier曲線。此函數(shù)要求輸入數(shù)據(jù)包含TOMOR_SAMPLE_BARCODE(請確保與MAF文件中的匹配),、二進制事件(1/0)和到達事件的時間,。我們的注釋數(shù)據(jù)已經(jīng)包含生存信息，如果您將生存數(shù)據(jù)存儲在單獨的表中,，請通過參數(shù)clinicalData

9.5.1在任何給定基因中突變來提供這些數(shù)據(jù)

#Survival analysis based on grouping of DNMT3A mutation status
mafSurvival(maf = laml, genes = 'DNMT3A', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)
#> DNMT3A 
#>     48 
#>     Group medianTime   N
#> 1: Mutant        245  45
#> 2:     WT        396 137

image.png

9.5.2 Predict genesets associated with survival

確定導(dǎo)致生存不良的一組基因

#Using top 20 mutated genes to identify a set of genes (of size 2) to predict poor prognostic groups
prog_geneset = survGroup(maf = laml, top = 20, geneSetSize = 2, time = "days_to_last_followup", Status = "Overall_Survival_Status", verbose = FALSE)

print(prog_geneset)
#>     Gene_combination P_value    hr  WT Mutant
#>  1:      FLT3_DNMT3A 0.00104 2.510 164     18
#>  2:      DNMT3A_SMC3 0.04880 2.220 176      6
#>  3:      DNMT3A_NPM1 0.07190 1.720 166     16
#>  4:      DNMT3A_TET2 0.19600 1.780 176      6
#>  5:        FLT3_TET2 0.20700 1.860 177      5
#>  6:        NPM1_IDH1 0.21900 0.495 176      6
#>  7:      DNMT3A_IDH1 0.29300 1.500 173      9
#>  8:       IDH2_RUNX1 0.31800 1.580 176      6
#>  9:        FLT3_NPM1 0.53600 1.210 165     17
#> 10:      DNMT3A_IDH2 0.68000 0.747 178      4
#> 11:      DNMT3A_NRAS 0.99200 0.986 178      4

以上結(jié)果顯示與生存不良相關(guān)的基因組合(N=2)(P < 0.05),。利用函數(shù)mafSurvGroup可以繪制上述結(jié)果的KM曲線

mafSurvGroup(maf = laml, geneSet = c("DNMT3A", "FLT3"), time = "days_to_last_followup", Status = "Overall_Survival_Status")
#>     Group medianTime   N
#> 1: Mutant      242.5  18
#> 2:     WT      379.5 164

image.png

9.6.對比兩個隊列(MAF)。

癌癥在突變模式方面各不相同,。我們可以比較兩個不同的隊列來檢測這種差異突變的基因,。例如，Madan et. al 9的研究表明,，復(fù)發(fā)的急性早幼粒細胞白血病(APL)患者往往存在PML和RAR基因突變,，而這兩種基因在疾病的初發(fā)期是不存在的?？梢允褂煤瘮?shù)mafComapre來檢測兩個隊列(在這種情況下是原發(fā)性和復(fù)發(fā)性APL)之間的這種差異,，該函數(shù)對兩個隊列之間的所有基因執(zhí)行Fisher測試，以檢測差異突變的基因,。

#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)

9.6.1 Forest plots 森林圖

以上結(jié)果顯示兩個基因PML和RARA在復(fù)發(fā)性APL中較原發(fā)性APL高度突變,。我們可以將這些結(jié)果可視化為 forestplot森林圖。

#Considering only genes which are mutated in at-least in 5 samples in one of the cohort to avoid bias due to genes mutated in single sample.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'Primary', m2Name = 'Relapse', minMut = 5)
print(pt.vs.rt)
#> $results
#>    Hugo_Symbol Primary Relapse         pval         or       ci.up
#> 1:         PML       1      11 1.529935e-05 0.03537381   0.2552937
#> 2:        RARA       0       7 2.574810e-04 0.00000000   0.3006159
#> 3:       RUNX1       1       5 1.310500e-02 0.08740567   0.8076265
#> 4:        FLT3      26       4 1.812779e-02 3.56086275  14.7701728
#> 5:      ARID1B       5       8 2.758396e-02 0.26480490   0.9698686
#> 6:         WT1      20      14 2.229087e-01 0.60619329   1.4223101
#> 7:        KRAS       6       1 4.334067e-01 2.88486293 135.5393108
#> 8:        NRAS      15       4 4.353567e-01 1.85209500   8.0373994
#> 9:      ARID1A       7       4 7.457274e-01 0.80869223   3.9297309
#>         ci.low      adjPval
#> 1: 0.000806034 0.0001376942
#> 2: 0.000000000 0.0011586643
#> 3: 0.001813280 0.0393149868
#> 4: 1.149009169 0.0407875250
#> 5: 0.064804160 0.0496511201
#> 6: 0.263440988 0.3343630535
#> 7: 0.337679367 0.4897762916
#> 8: 0.553883512 0.4897762916
#> 9: 0.195710173 0.7457273717
#> 
#> $SampleSummary
#>     Cohort SampleSize
#> 1: Primary        124
#> 2: Relapse         58

forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1, color = c('royalblue', 'maroon'), geneFontSize = 0.8)

image.png

9.6.2 Co-onco plots

顯示上述結(jié)果的另一種替代方式是并排繪制兩個oncoplots,。coOncoplot函數(shù)獲取兩個maf對象并并排打印,，以便更好地進行比較。

genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)

image.png

#coBarplot(m1 = primary.apl, m2 = relapse.apl, m1Name = "Primary", m2Name = "Relapse")

9.6.3 Lollipop plot-2棒棒糖圖

以及顯示隊列差異的曲線圖,，也可以用lollipopPlot2函數(shù)顯示基因差異,。

lollipopPlot2(m1 = primary.apl, m2 = relapse.apl, gene = "PML", AACol1 = "amino_acid_change", AACol2 = "amino_acid_change", m1_name = "Primary", m2_name = "Relapse")

image.png

9.7 臨床富集分析

clinicalEnrichment是另一個函數(shù)，它提取與樣本相關(guān)的任何臨床特征并進行富集分析,。它執(zhí)行各種groupwise和pairwise成對比較,，以確定臨床特征中每個類別的豐富突變。下面是一個識別與FAB_classfication相關(guān)的突變的示例,。

fab.ce = clinicalEnrichment(maf = laml, clinicalFeature = 'FAB_classification')
#> 
#> M0 M1 M2 M3 M4 M5 M6 M7 
#> 19 44 44 21 39 19  3  3

#Results are returned as a list. Significant associations p-value < 0.05
fab.ce$groupwise_comparision[p_value < 0.05]
#>    Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2
#> 1:        IDH1     M1   Rest         11 of 44         7 of 149
#> 2:        TP53     M7   Rest           3 of 3        12 of 190
#> 3:      DNMT3A     M5   Rest         10 of 19        38 of 174
#> 4:       CEBPA     M2   Rest          7 of 44         6 of 149
#> 5:       RUNX1     M0   Rest          5 of 19        11 of 174
#> 6:        NPM1     M5   Rest          7 of 19        26 of 174
#> 7:       CEBPA     M1   Rest          6 of 44         7 of 149
#>         p_value OR_low   OR_high       fdr
#> 1: 0.0002597371      0 0.3926994 0.0308575
#> 2: 0.0003857187      0 0.1315271 0.0308575
#> 3: 0.0057610493      0 0.6406007 0.3072560
#> 4: 0.0117352110      0 0.6874270 0.3757978
#> 5: 0.0117436825      0 0.6466787 0.3757978
#> 6: 0.0248582372      0 0.8342897 0.6628863
#> 7: 0.0478737468      0 0.9869971 1.0000000

以上結(jié)果表明,，與隊列的其余部分相比，IDH1突變在M1亞型白血病中富集,。同樣,，DNMT3A在M5中，RUNX1在M0中,，依此類推,。這些都是眾所周知的結(jié)果，此函數(shù)有效地重新捕獲了它們,?？梢允褂萌魏晤愋偷呐R床特征來執(zhí)行這樣的分析,。還有一個小函數(shù)-plotEnrichmentResults，可用于繪制這些結(jié)果,。

plotEnrichmentResults(enrich_res = fab.ce, pVal = 0.05)

image.png

9.8 藥物與基因的相互作用

從藥物基因相互作用數(shù)據(jù)庫匯編而來的藥物-基因相互作用和基因和藥性信息可用drugInteractions函數(shù)查詢

dgi = drugInteractions(maf = laml, fontSize = 0.75)

image.png

上圖顯示了潛在的可用藥基因類別,，以及與之相關(guān)的最多5個基因。人們還可以提取關(guān)于藥物與基因相互作用的信息,。例如,，下面是已知/報告的藥物與DNMT3A相互作用的結(jié)果。

dnmt3a.dgi = drugInteractions(genes = "DNMT3A", drugs = TRUE)
#> Number of claimed drugs for given genes:
#>      Gene N
#> 1: DNMT3A 7
#Printing selected columns.
dnmt3a.dgi[,.(Gene, interaction_types, drug_name, drug_claim_name)]
#>      Gene interaction_types    drug_name drug_claim_name
#> 1: DNMT3A                                            N/A
#> 2: DNMT3A                   DAUNORUBICIN    Daunorubicin
#> 3: DNMT3A                     DECITABINE      Decitabine
#> 4: DNMT3A                     IDARUBICIN      IDARUBICIN
#> 5: DNMT3A                     DECITABINE      DECITABINE
#> 6: DNMT3A         inhibitor   DECITABINE   CHEMBL1201129
#> 7: DNMT3A         inhibitor  AZACITIDINE      CHEMBL1489

如果您認為此函數(shù)對你有用 10,，請引用DGIdb article,。 免責(zé)聲明：此功能中使用的資源僅用于研究目的。它不應(yīng)用于緊急情況或醫(yī)療或?qū)I(yè)建議,。,。

9.9致癌信號通路

OncogenicPathways函數(shù)查看TCGA隊列中已知的致癌信號通路的富集情況 11。

OncogenicPathways(maf = laml)
#>        Pathway  N n_affected_genes fraction_affected
#>  1:    RTK-RAS 85               18        0.21176471
#>  2:      Hippo 38                7        0.18421053
#>  3:      NOTCH 71                6        0.08450704
#>  4:        MYC 13                3        0.23076923
#>  5:        WNT 68                3        0.04411765
#>  6:       TP53  6                2        0.33333333
#>  7:       NRF2  3                1        0.33333333
#>  8:       PI3K 29                1        0.03448276
#>  9: Cell_Cycle 15                0        0.00000000
#> 10:   TGF-Beta  7                0        0.00000000

image.png

也有可能將完整的通路可視化,。

PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS")

image.png

抑癌基因用紅色表示,，癌基因用藍色字體表示。

9.10 腫瘤異質(zhì)性與數(shù)學(xué)成績

9.10.1 腫瘤樣本的異質(zhì)性

腫瘤一般是異質(zhì)性的,，即由多個克隆組成。這種異質(zhì)性可以通過聚類不同的等位基因頻率來推斷,。inserHetereneity函數(shù)使用VAF信息對變量進行聚類(使用mclust),，從而推斷出克隆性。默認情況下,，inserverHetereneity函數(shù)查找包含VAF信息的t_vaf列,。但是，如果字段名稱與t_vaf不同,，我們可以使用參數(shù)vafCol手動指定,。例如，在本例中,，研究vaf存儲在字段名i_TumorVAF_WU下,。

#Heterogeneity in sample TCGA.AB.2972
tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-2972', vafCol = 'i_TumorVAF_WU')
print(tcga.ab.2972.het$clusterMeans)
#>    Tumor_Sample_Barcode cluster   meanVaf
#> 1:         TCGA-AB-2972       2 0.4496571
#> 2:         TCGA-AB-2972       1 0.2454750
#> 3:         TCGA-AB-2972 outlier 0.3695000
#Visualizing results
plotClusters(clusters = tcga.ab.2972.het)

image.png

上圖清楚地顯示了兩個平均變異等位基因頻率為_{45%的克隆(主克隆)和另一個次要克隆的變異等位基因頻率為}25%。雖然變異等位基因頻率的聚類使我們對異質(zhì)性有了一個很好的認識,，但也可以用數(shù)值來衡量異質(zhì)性的程度,。MATH score(上圖中的副標題)是腫瘤內(nèi)異質(zhì)性的一種簡單的定量測量，它計算出VAF分布的寬度,。研究發(fā)現(xiàn),，較高的MATH scores 與較差的結(jié)果有關(guān)。MATH score也可以作為生存分析的代理變量-11.

9.10.2 忽略拷貝數(shù)更改區(qū)域的變異,。

我們可以使用拷貝數(shù)信息來忽略位于拷貝數(shù)改變區(qū)域的變異,?？截悢?shù)改變會導(dǎo)致變異等位基因頻率異常高/低，這往往會影響聚類,。去除這些變異可以改進聚類和密度估計,，同時保留有生物學(xué)意義的結(jié)果?？截悢?shù)信息可以作為從分割程序生成的分割文件來提供,，例如來自“DNACopy” Bioconductor package的圓形二進制分割的R包6

seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')
tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-3009', segFile = seg, vafCol = 'i_TumorVAF_WU')
#>    Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
#> 1:        PHF6         23      133551224    133551224         TCGA-AB-3009
#>        t_vaf Segment_Start Segment_End Segment_Mean CN
#> 1: 0.9349112            NA          NA           NA NA
#>    Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
#> 1:         NF1         17       29562981     29562981         TCGA-AB-3009
#> 2:       SUZ12         17       30293198     30293198         TCGA-AB-3009
#>        t_vaf Segment_Start Segment_End Segment_Mean       CN    cluster
#> 1: 0.8419000      29054355    30363868      -0.9157 1.060173 CN_altered
#> 2: 0.8958333      29054355    30363868      -0.9157 1.060173 CN_altered
#Visualizing results. Highlighting those variants on copynumber altered variants.
plotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)

image.png

上圖顯示了兩個VAF高的基因NF1和SUZ12，這是由于拷貝數(shù)變異(缺失)所致,。這兩個基因在分析中被忽略了,。

9.11 突變Signatures

每一種癌癥，在進展過程中都會留下一個Signatures,，其Signatures是特定的核苷酸替換模式,。 Alexandrov et.al已經(jīng)顯示出這樣的突變signatures，源自5的7,000多個癌癥樣本,。這樣的signatures可以通過分解核苷酸替換矩陣來提取,，基于突變堿基周圍的直接堿基將其分類為96個替換類別。還可以將提取的signatures與那些signatures進行比較validated signatures,。 signatures分析的第一步是獲得突變堿基周圍的相鄰堿基,，并形成突變矩陣。

注意：早期版本的maftools需要一個fasta文件作為輸入,。但是從1.8.0開始,，BSgenome對象被用于更快的序列提取。

#Requires BSgenome object
library(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)
laml.tnm = trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
#> Warning in trinucleotideMatrix(maf = laml, prefix = "chr", add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19"): Chromosome names in MAF must match chromosome names in reference genome.
#> Ignorinig 101 single nucleotide variants from missing chromosomes chr23
#> -Extracting 5' and 3' adjacent bases
#> -Extracting +/- 20bp around mutated bases for background C>T estimation
#> -Estimating APOBEC enrichment scores
#> --Performing one-way Fisher's test for APOBEC enrichment
#> ---APOBEC related mutations are enriched in  3.315 % of samples (APOBEC enrichment score > 2 ;  6  of  181  samples)
#> -Creating mutation matrix
#> --matrix of dimension 188x96

上述功能執(zhí)行兩個步驟： * 估計APOBEC富集分數(shù) * 為signature分析準備突變矩陣,。

9.11.1 APOBEC Enrichment estimation.

APOBEC誘導(dǎo)的突變在實體瘤中更為常見,，并且主要與TCW基序中發(fā)生的C>T轉(zhuǎn)換事件有關(guān)。上述命令中的APOBEC富集分數(shù)是使用Roberts et al 13描述的方法估計的,。簡而言之,，將發(fā)生在TCW基序內(nèi)的C>T突變富集在給定樣本中的所有C>T突變上，并將其與背景胞嘧啶和發(fā)生在突變堿基20bp內(nèi)的TCW進行比較,。

image.png

如Roberts等人的原始研究中所描述的,，還執(zhí)行單側(cè)fishers精確測試以統(tǒng)計地評估富集分數(shù)。

9.11.2 APOBEC富集與非富集的區(qū)別,。

我們還可以分析APOBEC富集型和非APOBEC富集型的突變模式的差異,，plotApobecDiff是一個函數(shù)，它取trinucleotideMatrix估計的APOBEC富集值,，將樣本分為APOBEC富集型和非APOBEC富集型,。一旦分層，它就會比較這兩組人，以確定差異改變的基因,。請注意,，沒有APOBEC富集的LAML不是此類分析的理想隊列，因此下面的圖表僅用于演示目的,。

plotApobecDiff(tnm = laml.tnm, maf = laml, pVal = 0.2)

image.png

#> $results
#>      Hugo_Symbol Enriched nonEnriched       pval        or      ci.up
#>   1:        TP53        2          13 0.08175632 5.9976455  46.608861
#>   2:        TET2        1          16 0.45739351 1.9407002  18.983979
#>   3:        FLT3        2          45 0.65523131 1.4081851  10.211621
#>   4:      DNMT3A        1          47 1.00000000 0.5335362   4.949499
#>   5:      ADAM11        0           2 1.00000000 0.0000000 164.191472
#>  ---                                                                 
#> 132:         WAC        0           2 1.00000000 0.0000000 164.191472
#> 133:         WT1        0          12 1.00000000 0.0000000  12.690862
#> 134:      ZBTB33        0           2 1.00000000 0.0000000 164.191472
#> 135:      ZC3H18        0           2 1.00000000 0.0000000 164.191472
#> 136:      ZNF687        0           2 1.00000000 0.0000000 164.191472
#>          ci.low adjPval
#>   1: 0.49875432       1
#>   2: 0.03882963       1
#>   3: 0.12341748       1
#>   4: 0.01101929       1
#>   5: 0.00000000       1
#>  ---                   
#> 132: 0.00000000       1
#> 133: 0.00000000       1
#> 134: 0.00000000       1
#> 135: 0.00000000       1
#> 136: 0.00000000       1
#> 
#> $SampleSummary
#>         Cohort SampleSize  Mean Median
#> 1:    Enriched          6 7.167    6.5
#> 2: nonEnriched        172 9.715    9.0

9.11.3 Signature分析,。

Signature分析包括以下步驟。 - 1.estimateSignatures -在一系列值上運行NMF,，并衡量擬合的好壞-以時間為單位Cophenetic correlation,。 - 2.plotCophenetic-繪制elblow圖，幫助您確定最佳Signature數(shù)量,。最佳可能的Signature是共生相關(guān)性顯著下降的值,。 - 3.ExtractSignaturesSignature-使用非負矩陣分解將矩陣分解為三個N個Signature，根據(jù)上述兩個步驟選擇N個Signature,。如果您已經(jīng)對N‘有了很好的估計,，您可以跳過以上兩步。 - 4.將上述步驟中提取的SignaturecompareSignatures與signatures[11](http://127.0.0.1:25995/library/maftools/doc/maftools.html#references)數(shù)據(jù)庫中已知的[COSMIC](https://cancer./cosmic/signatures/SBS/)簽名進行比對,，并計算余弦相似度,，確定最佳匹配。 - 5.plotSignatures`-plots signatures

注意：在以前的版本中,，上述步驟都是由ExtractSignatures自動完成的,。在2.2.0版本之后，Main函數(shù)被拆分成不超過5個stpe,，以方便用戶靈活使用,。

image.png

#  library('NMF')
#  laml.sign = estimateSignatures(mat = laml.tnm, nTry = 6)

繪制elbow曲線圖，根據(jù)上述結(jié)果可視化并確定最佳signatures數(shù)量,。

plotCophenetic(res = laml.sign)

image.png

#  laml.sig = extractSignatures(mat = laml.tnm, n = 3)

最佳可能值是y軸上的相關(guān)值顯著下降的值。在這種情況下,，它看起來是在n = 3,。LAML的突變率較低，不是特征分析的理想例子,，但對于突變負擔(dān)較高的實體腫瘤,，只要有足夠數(shù)量的樣本，就可以期待更多的特征,。一旦估計了n,，我們就可以運行main函數(shù)了。

將檢測到的signatures與COSMIC數(shù)據(jù)庫中的已知signatures進行比較,。

#Compate against original 30 signatures 
laml.og30.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "legacy")
#Compate against updated version3 60 signatures 
laml.v3.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "SBS")

compareSignatures返回COSMICsignatures余弦相似度的完整表,，可以進一步分析。下圖顯示了檢測到的signatures與驗證過的signatures的相似性比較。

library('pheatmap')
pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")

image.png

Finally plot signatures

maftools::plotSignatures(nmfRes = laml.sig, title_size = 0.8, sig_db = "SBS")

image.png

注： - 1.如果您在運行extractSignatures時收到none of the packages are loaded的錯誤,，請手動加載NMF庫并重新運行,。 - 2.如果extractSignatures或estimateSignatures在兩者之間停止，則可能是因為矩陣中的突變計數(shù)較低,。在這種情況下,，重新運行pConstant參數(shù)設(shè)置為小正值(例如0.1)的函數(shù)。

9.11.4 Signature enrichment analysis

Signature可以進一步賦值給樣本,，并使用signatureEnrichment函數(shù)進行富集分析,，該函數(shù)識別每個識別出的Signature中富集的突變。

#  library("barplot3d")
#  #Visualize first signature
#  sig1 = laml.sig$signatures[,1]
#  barplot3d::legoplot3d(contextdata = sig1, labels = FALSE, scalexy = 0.01, sixcolors = "sanger", alpha = 0.5)

laml.se = signatureEnrichment(maf = laml, sig_res = laml.sig)
#> 
#> Signature_1 Signature_2 Signature_3 
#>          60          65          63

image.png

上述結(jié)果可進行和臨床結(jié)果相似的可視化操作,。

plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05)

image.png

10 變異的注釋

10.1 將annovar注釋結(jié)果轉(zhuǎn)換為MAF

ANNOVAR是基因組學(xué)17中使用最廣泛的變異注釋工具之一,。ANNOVAR輸出通常是具有各種注釋列的表格格式。此函數(shù)用于將此類注解輸出文件轉(zhuǎn)換為MAF,。此函數(shù)要求在包含任何基于過濾或區(qū)域的注釋之前,，將基于基因的注釋作為第一個操作來運行注解。例如,，

table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol (refGene),cytoBand,dbnsfp30a -operation (g),r,f -nastring NA

annovarToMaf主要使用基于基因的注釋進行處理,，輸入文件中的其余注釋列將附加到生成的MAF的末尾。作為示例,，我們將注釋上面用來運行oncotate函數(shù)的同一文件,。我們將使用以下命令使用Annovar對其進行注釋。為簡單起見,，這里我們只包含基于基因的注釋,，但是可以根據(jù)需要包含任意多的注釋。但是要確保第一個操作始終是基于基因的注釋,。

$perl table_annovar.pl variants.tsv ~/path/to/humandb/ -buildver hg19 
-out variants --otherinfo -remove -protocol ensGene -operation g -nastring NA

生成的輸出作為此包的一部分存儲,。我們可以使用annovarToMaf將這個注釋輸出轉(zhuǎn)換成MAF。

var.annovar = system.file("extdata", "variants.hg19_multianno.txt", package = "maftools")
var.annovar.maf = annovarToMaf(annovar = var.annovar, Center = 'CSI-NUS', refBuild = 'hg19', tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene')
#> -Reading annovar files
#> -Processing Exonic variants
#> --Adding Variant_Classification
#> --Parsing aa-change
#> -Processing Non-exonic variants
#> --Adding Variant_Classification
#> -Adding Variant_Type
#> -Converting Ensemble Gene IDs into HGNC gene symbols
#> --Done. Original ensemble gene IDs are preserved under field name ens_id
#> Finished in 0.204s elapsed (0.778s cpu)

Annovar在用Ensemble作基因注釋源時,，使用ensemble gene IDs作為基因名稱,。在這種情況下，使用帶有參數(shù)table的annovarToMaf,，該參數(shù)設(shè)置為ensGene將ensemble gene IDs轉(zhuǎn)換為HGNC symbols,。

10.2 將ICGC簡單體細胞突變格式轉(zhuǎn)換為MAF。

就像TCGA一樣,，國際癌癥基因組聯(lián)盟ICGC也將其數(shù)據(jù)公之于眾,。但數(shù)據(jù)存儲在結(jié)構(gòu)上與MAF格式相似的簡單體細胞突變Format](http://docs./submission/guide/icgc-simple-somatic-mutation-format/)中。但是變量的字段名稱和分類與MAF不同%E4%B8%AD%E3%80%82%E4%BD%86%E6%98%AF%E5%8F%98%E9%87%8F%E7%9A%84%E5%AD%97%E6%AE%B5%E5%90%8D%E7%A7%B0%E5%92%8C%E5%88%86%E7%B1%BB%E4%B8%8EMAF%E4%B8%8D%E5%90%8C),，icgcSimpleMutationToMAF是一個讀取ICGC數(shù)據(jù)并將其轉(zhuǎn)換成MAF的函數(shù),。

#Read sample ICGC data for ESCA
esca.icgc <- system.file("extdata", "simple_somatic_mutation.open.ESCA-CN.sample.tsv.gz", package = "maftools")
esca.maf <- icgcSimpleMutationToMAF(icgc = esca.icgc, addHugoSymbol = TRUE)
#> --Removed 427 duplicated variants
#Printing first 16 columns for display convenience.
print(esca.maf[1:5,1:16, with = FALSE])
#>    Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
#> 1:  AC005237.4             NA     NA     GRCh37          2      241987787
#> 2:  AC061992.1            786     NA     GRCh37         17       76425382
#> 3:  AC097467.2             NA     NA     GRCh37          4      156294567
#> 4:    ADAMTS12             NA     NA     GRCh37          5       33684019
#> 5:  AL589642.1          54801     NA     GRCh37          9       32630154
#>    End_Position Strand Variant_Classification Variant_Type
#> 1:    241987787      +                 Intron          SNP
#> 2:     76425382      +                3'Flank          SNP
#> 3:    156294567      +                 Intron          SNP
#> 4:     33684019      +      Missense_Mutation          SNP
#> 5:     32630154      +                    RNA          SNP
#>    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS
#> 1:                C                 C                 T       NA
#> 2:                C                 C                 T       NA
#> 3:                A                 A                 G       NA
#> 4:                A                 A                 C       NA
#> 5:                T                 T                 C       NA
#>    dbSNP_Val_Status Tumor_Sample_Barcode
#> 1:               NA             SA514619
#> 2:               NA             SA514619
#> 3:               NA             SA514619
#> 4:               NA             SA514619
#> 5:               NA             SA514619

請注意,，默認情況下，簡單體細胞突變格式包含一個變異的所有受影響的轉(zhuǎn)錄本,，從而導(dǎo)致同一樣本中有多個相同變異位點的條目,。僅根據(jù)注釋很難選擇單個受影響的轉(zhuǎn)錄本，并且默認情況下,，該軟件會刪除重復(fù)的變異作為重復(fù)條目,。如果希望全部保留，請將removeDuplicatedVariants設(shè)置為FALSE,。另一種選擇是通過刪除重復(fù)項將輸入文件轉(zhuǎn)換為MAF,，然后使用諸如vcf2maf之類的腳本重新注釋受影響的腳本并確定優(yōu)先順序。

10.3 準備MAF文件進行MutSigCV分析,。

MutSig/MutSigCV是18中應(yīng)用最廣泛的驅(qū)動基因檢測方法,。然而，我們觀察到與MutSig捆綁在一起的協(xié)變量文件(gene.covariates.txt和exome_full192.coverage.txt)具有非標準的基因名稱(非Hugo_Symbol),。MAF中的Hugo_Symbol和協(xié)變量文件中的非Hugo_Symbol之間的這種差異導(dǎo)致MutSig程序忽略這些基因,。例如，KMT2D-一個眾所周知的食管癌驅(qū)動基因,，在MutSig協(xié)變量中被表示為MLL2,。這會導(dǎo)致KMT2D在分析中被忽略，并在MutSig結(jié)果中表示為一個無關(guān)緊要的基因,。此函數(shù)嘗試使用與MutSig協(xié)變量列表兼容的手動管理的基因名稱列表更正此類基因符號,。

#  laml.mutsig.corrected = prepareMutSig(maf = laml)
#  # Converting gene names for 1 variants from 1 genes
#  #    Hugo_Symbol MutSig_Synonym N
#  # 1:    ARHGAP35          GRLF1 1
#  # Original symbols are preserved under column OG_Hugo_Symbol.

11 設(shè)置操作

11.1 子集MAF

我們也可以使用subsetMaf函數(shù)來取MAF子集

#Extract data for samples 'TCGA.AB.3009' and 'TCGA.AB.2933'  (Printing just 5 rows for display convenience)
subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'), mafObj = FALSE)[1:5]
#>    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
#> 1:      ABCB11           8647 genome.wustl.edu         37          2
#> 2:       ACSS3          79611 genome.wustl.edu         37         12
#> 3:        ANK3            288 genome.wustl.edu         37         10
#> 4:       AP1G2           8906 genome.wustl.edu         37         14
#> 5:         ARC          23237 genome.wustl.edu         37          8
#>    Start_Position End_Position Strand Variant_Classification Variant_Type
#> 1:      169780250    169780250      +      Missense_Mutation          SNP
#> 2:       81536902     81536902      +      Missense_Mutation          SNP
#> 3:       61926581     61926581      +            Splice_Site          SNP
#> 4:       24033309     24033309      +      Missense_Mutation          SNP
#> 5:      143694874    143694874      +      Missense_Mutation          SNP
#>    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
#> 1:                G                 G                 A
#> 2:                C                 C                 T
#> 3:                C                 C                 A
#> 4:                C                 C                 T
#> 5:                C                 C                 G
#>    Tumor_Sample_Barcode Protein_Change i_TumorVAF_WU i_transcript_name
#> 1:         TCGA-AB-3009       p.A1283V      46.97218       NM_003742.2
#> 2:         TCGA-AB-3009        p.A266V      47.32510       NM_024560.2
#> 3:         TCGA-AB-3009                     43.99478       NM_020987.2
#> 4:         TCGA-AB-3009        p.R346Q      47.08000       NM_003917.2
#> 5:         TCGA-AB-3009        p.W253C      42.96435       NM_015193.3
##Same as above but return output as an MAF object (Default behaviour)
subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'))
#> An object of class  MAF 
#>                    ID          summary Mean Median
#>  1:        NCBI_Build               37   NA     NA
#>  2:            Center genome.wustl.edu   NA     NA
#>  3:           Samples                2   NA     NA
#>  4:            nGenes               34   NA     NA
#>  5:   Frame_Shift_Ins                5  2.5    2.5
#>  6:      In_Frame_Ins                1  0.5    0.5
#>  7: Missense_Mutation               26 13.0   13.0
#>  8: Nonsense_Mutation                2  1.0    1.0
#>  9:       Splice_Site                1  0.5    0.5
#> 10:             total               35 17.5   17.5

11.1.1指定查詢和控制輸出字段。

#Select all Splice_Site mutations from DNMT3A and NPM1
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE,query = "Variant_Classification == 'Splice_Site'")
#>    Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
#> 1:      DNMT3A           1788 genome.wustl.edu         37          2
#> 2:      DNMT3A           1788 genome.wustl.edu         37          2
#> 3:      DNMT3A           1788 genome.wustl.edu         37          2
#> 4:      DNMT3A           1788 genome.wustl.edu         37          2
#> 5:      DNMT3A           1788 genome.wustl.edu         37          2
#> 6:      DNMT3A           1788 genome.wustl.edu         37          2
#>    Start_Position End_Position Strand Variant_Classification Variant_Type
#> 1:       25459874     25459874      +            Splice_Site          SNP
#> 2:       25467208     25467208      +            Splice_Site          SNP
#> 3:       25467022     25467022      +            Splice_Site          SNP
#> 4:       25459803     25459803      +            Splice_Site          SNP
#> 5:       25464576     25464576      +            Splice_Site          SNP
#> 6:       25469029     25469029      +            Splice_Site          SNP
#>    Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2
#> 1:                C                 C                 G
#> 2:                C                 C                 T
#> 3:                A                 A                 G
#> 4:                A                 A                 C
#> 5:                C                 C                 A
#> 6:                C                 C                 A
#>    Tumor_Sample_Barcode Protein_Change i_TumorVAF_WU i_transcript_name
#> 1:         TCGA-AB-2818        p.R803S         36.84       NM_022552.3
#> 2:         TCGA-AB-2822                        62.96       NM_022552.3
#> 3:         TCGA-AB-2891                        34.78       NM_022552.3
#> 4:         TCGA-AB-2898                        46.43       NM_022552.3
#> 5:         TCGA-AB-2934        p.G646V         37.50       NM_022552.3
#> 6:         TCGA-AB-2968        p.E477*         40.00       NM_022552.3

#Same as above but include only 'i_transcript_name' column in the output
subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE, query = "Variant_Classification == 'Splice_Site'", fields = 'i_transcript_name')
#>    Hugo_Symbol Chromosome Start_Position End_Position Reference_Allele
#> 1:      DNMT3A          2       25459874     25459874                C
#> 2:      DNMT3A          2       25467208     25467208                C
#> 3:      DNMT3A          2       25467022     25467022                A
#> 4:      DNMT3A          2       25459803     25459803                A
#> 5:      DNMT3A          2       25464576     25464576                C
#> 6:      DNMT3A          2       25469029     25469029                C
#>    Tumor_Seq_Allele2 Variant_Classification Variant_Type
#> 1:                 G            Splice_Site          SNP
#> 2:                 T            Splice_Site          SNP
#> 3:                 G            Splice_Site          SNP
#> 4:                 C            Splice_Site          SNP
#> 5:                 A            Splice_Site          SNP
#> 6:                 A            Splice_Site          SNP
#>    Tumor_Sample_Barcode i_transcript_name
#> 1:         TCGA-AB-2818       NM_022552.3
#> 2:         TCGA-AB-2822       NM_022552.3
#> 3:         TCGA-AB-2891       NM_022552.3
#> 4:         TCGA-AB-2898       NM_022552.3
#> 5:         TCGA-AB-2934       NM_022552.3
#> 6:         TCGA-AB-2968       NM_022552.3

11.1.2 Subsetting by clinical data

使用subsetMaf函數(shù)中的clinQuery參數(shù)根據(jù)臨床癥狀挑選合適的感興趣的樣本,。

#Select all samples with FAB clasification M4 in clinical data 
laml_m4 = subsetMaf(maf = laml, clinQuery = "FAB_classification %in% 'M4'")

12 預(yù)編譯的TCGA MAF對象

還有一個R數(shù)據(jù)包,，其中包含來自TCGA Firehose和TCGA-MC3項目的預(yù)編譯TCGA MAF對象，對那些使用TCGA突變數(shù)據(jù)的人特別有幫助,。每個數(shù)據(jù)集都存儲為包含體細胞突變和臨床信息的MAF對象,。由于Bioconductor包裝尺寸的限制和其他困難，這份報告沒有提交給Bioconductor,。不過,，你仍然可以從GitHub下載TCGAmutations包。

#  devtools::install_github(repo = "PoisonAlien/TCGAmutations")

13 References

Cancer Genome Atlas Research, N. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-74 (2013).
Mermel, C.H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol 12, R41 (2011).
Olshen, A.B., Venkatraman, E.S., Lucito, R. & Wigler, M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557-72 (2004).
Alexandrov, L.B. et al. Signatures of mutational processes in human cancer. Nature 500, 415-21 (2013).
Tamborero, D., Gonzalez-Perez, A. & Lopez-Bigas, N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics 29, 2238-44 (2013).
Dees, N.D. et al. MuSiC: identifying mutational significance in cancer genomes. Genome Res 22, 1589-98 (2012).
Lawrence MS, Stojanov P, Mermel CH, Robinson JT, Garraway LA, Golub TR, Meyerson M, Gabriel SB, Lander ES, Getz G: Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 2014, 505:495-501.
Griffith, M., Griffith, O. L., Coffman, A. C., Weible, J. V., McMichael, J. F., Spies, N. C., … Wilson, R. K. (2013). DGIdb - Mining the druggable genome. Nature Methods, 10(12), 1209–1210. http:///10.1038/nmeth.2689
Sanchez-Vega F, Mina M, Armenia J, Chatila WK, Luna A, La KC, Dimitriadoy S, Liu DL, Kantheti HS, Saghafinia S et al. 2018. Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell 173: 321-337 e310
Madan, V. et al. Comprehensive mutational analysis of primary and relapse acute promyelocytic leukemia. Leukemia 30, 1672-81 (2016).
Mroz, E.A. & Rocco, J.W. MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma. Oral Oncol 49, 211-5 (2013).
Roberts SA, Lawrence MS, Klimczak LJ, et al. An APOBEC Cytidine Deaminase Mutagenesis Pattern is Widespread in Human Cancers. Nature genetics. 2013;45(9):970-976.
Gaujoux, R. & Seoighe, C. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics 11, 367 (2010).
Welch, J.S. et al. The origin and evolution of mutations in acute myeloid leukemia. Cell 150, 264-78 (2012).
Ramos, A.H. et al. Oncotator: cancer variant annotation tool. Hum Mutat 36, E2423-9 (2015).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010).
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al: Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499:214-218.

14 Session Info

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] pheatmap_1.0.12                   doParallel_1.0.15                
#>  [3] iterators_1.0.12                  foreach_1.4.7                    
#>  [5] NMF_0.22.0                        Biobase_2.46.0                   
#>  [7] cluster_2.1.0                     rngtools_1.4                     
#>  [9] pkgmaker_0.27                     registry_0.5-1                   
#> [11] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.54.0                  
#> [13] rtracklayer_1.46.0                Biostrings_2.54.0                
#> [15] XVector_0.26.0                    GenomicRanges_1.38.0             
#> [17] GenomeInfoDb_1.22.0               IRanges_2.20.1                   
#> [19] S4Vectors_0.24.0                  BiocGenerics_0.32.0              
#> [21] maftools_2.2.10                  
#> 
#> loaded via a namespace (and not attached):
#>  [1] splines_3.6.1               R.utils_2.9.0              
#>  [3] assertthat_0.2.1            GenomeInfoDbData_1.2.2     
#>  [5] Rsamtools_2.2.3             yaml_2.2.0                 
#>  [7] pillar_1.4.2                lattice_0.20-38            
#>  [9] glue_1.3.1                  digest_0.6.23              
#> [11] RColorBrewer_1.1-2          colorspace_1.4-1           
#> [13] R.oo_1.23.0                 htmltools_0.4.0            
#> [15] Matrix_1.2-17               plyr_1.8.5                 
#> [17] XML_3.98-1.20               pkgconfig_2.0.3            
#> [19] bibtex_0.4.2                zlibbioc_1.32.0            
#> [21] purrr_0.3.3                 xtable_1.8-4               
#> [23] scales_1.1.0                BiocParallel_1.20.0        
#> [25] tibble_2.1.3                ggplot2_3.3.0              
#> [27] withr_2.1.2                 SummarizedExperiment_1.16.0
#> [29] survival_2.44-1.1           magrittr_1.5               
#> [31] crayon_1.3.4                mclust_5.4.5               
#> [33] evaluate_0.14               R.methodsS3_1.7.1          
#> [35] tools_3.6.1                 data.table_1.12.6          
#> [37] lifecycle_0.1.0             matrixStats_0.55.0         
#> [39] gridBase_0.4-7              stringr_1.4.0              
#> [41] munsell_0.5.0               DelayedArray_0.12.0        
#> [43] compiler_3.6.1              rlang_0.4.5                
#> [45] grid_3.6.1                  RCurl_1.95-4.12            
#> [47] bitops_1.0-6                rmarkdown_2.1              
#> [49] gtable_0.3.0                codetools_0.2-16           
#> [51] reshape2_1.4.3              R6_2.4.1                   
#> [53] GenomicAlignments_1.22.0    knitr_1.25                 
#> [55] dplyr_0.8.5                 stringi_1.4.3              
#> [57] Rcpp_1.0.3                  wordcloud_2.6              
#> [59] tidyselect_0.2.5            xfun_0.10

參考學(xué)習(xí)資料：http://www./packages/release/bioc/vignettes/maftools/inst/doc/maftools.html