The GWAS pathway-enrichment method used in Nature papers: the MAGMA software | A different kind of public-database mining (practice materials included)

yjt2004us 2018-03-01


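The data and code for this post are stored on Baidu Netdisk. Link: https://eyun.baidu.com/s/3c1GJdNa Password: hWAG (or send the message "練習資料", meaning "practice materials", to the account backend to receive the link and password).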

Many software packages and methods are currently available for analysing GWAS data, for example GSA-SNP, FORGE, MRPEA, INRICH, DGAT, ALIGATOR, MAGENTA, and the Set screen test method. Each has its own strengths and weaknesses, so choose the one that fits your needs.

Here we introduce MAGMA, an analysis tool that has recently been used in many high-profile papers in Nature, Nature Genetics, Nature Neuroscience and similar journals.

The software's own description reads: "MAGMA is a tool for gene analysis and generalized gene-set analysis of GWAS data. It can be used to analyse both raw genotype data as well as summary SNP p-values from a previous GWAS or meta-analysis."

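In other words, the software works at both the gene level and the biological-pathway level, and it accepts either raw GWAS genotype data or GWAS summary data. It is powerful yet easy to use, and it can be downloaded free of charge from the official site: https://ctg./software/magma.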

The software runs on Linux and also on Windows.

The paper describing MAGMA was published in PLoS Computational Biology:

de Leeuw C, Mooij J, Heskes T, Posthuma D: MAGMA: Generalized gene-set analysis of GWAS data. PLoS Comput Biol 11(4): e1004219. doi:10.1371/journal.pcbi.1004219.

First, we need GWAS data. Raw GWAS data for a trait you care about is ideal. If you have none, you can download existing GWAS summary data from a public database and analyse it to look for new findings. Here we download the smoking GWAS data from https://www.med./pgc/downloads: tag.evrsmk.tbl.

Rename the file to ever_smoking.results. Its internal format is as follows:

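Because this TAG GWAS study was published in Nature Genetics (NG) in 2010, its reference genome is hg18, which is fairly old. Here we use liftover tools to convert the coordinates to hg19 before the downstream analysis. The code is as follows: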

1) Use the picard tool to convert a VCF file between builds, for example from hg18 to hg19:

java -jar picard.jar LiftoverVcf \
    I=input.vcf \
    O=lifted_over.vcf \
    CHAIN=hg18tohg19.chain \
    REJECT=rejected_variants.vcf \
    R=reference_sequence.fasta

2) Use the liftOver program to convert hg18 to hg19:

Command pattern: liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed

For example:

./liftOver -bedPlus=4 ever_smoking.results hg18ToHg19.over.chain ever_smoking.results.hg19.bed ever_smoking.results_unmapped.txt
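liftOver reads BED input, where the start coordinate is 0-based and the end coordinate is exclusive, so a summary-statistics table usually has to be reshaped first. The following is only a minimal sketch on toy data: the file names and the input column layout (CHR, SNP, BP, P with one header line) are assumptions, since the real layout of tag.evrsmk.tbl is not shown here.

```shell
# Toy hg18 summary statistics; the column layout (CHR SNP BP P) is an assumption.
cat > toy_sumstats.txt <<'EOF'
CHR SNP BP P
1 rs0000001 1000 0.52
2 rs0000002 2500 0.0031
EOF

# Skip the header, then write: chrom, 0-based start, end, SNP ID, P-value.
awk -v OFS='\t' 'NR > 1 { print "chr" $1, $3 - 1, $3, $2, $4 }' \
    toy_sumstats.txt > toy_sumstats.hg18.bed

cat toy_sumstats.hg18.bed
```

With this convention the 1-based SNP position ends up in the third BED column. Which column holds the position in your own BED file depends on how that file was prepared, so check it before extracting positions.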

Next, we use MAGMA to annotate the SNPs to genes.

###Annotation performed with the following command:

Command pattern: magma --annotate window=5,1.5 --snp-loc [SNPLOC_FILE] --gene-loc [GENELOC_FILE] --out [ANNOT_PREFIX]
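The window=5,1.5 modifier in the pattern above asks for an annotation window of 5 kb upstream and 1.5 kb downstream of each gene, with "upstream" interpreted relative to the gene's strand. The sketch below only illustrates that interval arithmetic on a toy gene location file; it is not MAGMA's own code, and the six-column layout (gene ID, chromosome, start, stop, strand, symbol) and all file names are assumptions.

```shell
# Toy gene location file: gene ID, chromosome, start, stop, strand, symbol (layout assumed).
cat > toy.gene.loc <<'EOF'
101 1 100000 120000 + GENE_A
102 1 200000 230000 - GENE_B
EOF

UP=5000     # upstream window in base pairs (5 kb)
DOWN=1500   # downstream window in base pairs (1.5 kb)

# On the + strand, upstream lies before the start; on the - strand, it lies after the stop.
awk -v up="$UP" -v down="$DOWN" '{
    if ($5 == "-") { lo = $3 - down; hi = $4 + up }
    else           { lo = $3 - up;   hi = $4 + down }
    if (lo < 1) lo = 1              # do not run off the start of the chromosome
    print $6, $2, lo, hi
}' toy.gene.loc > toy.gene.window

cat toy.gene.window
```

A SNP would then be assigned to a gene when its position falls inside the extended interval on the same chromosome.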

The SNP location file has the following format:

#The SNP location file should contain three columns:

The three columns are: SNP ID, chromosome, and base pair position (with no header line).
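A leftover header or a truncated line in the location file is easy to miss, so a quick format check before annotation can save a failed run. This is a rough sketch on a toy file (the file names are hypothetical) that reports lines without exactly three columns or with a non-integer position.

```shell
# Toy location file with two deliberate problems: a header line and a short line.
cat > toy.location <<'EOF'
rs0000001 1 1000
rs0000002 2 2500
SNP CHR BP
rs0000003 X 3100
rs0000004 3
EOF

# Report wrong column counts first, then non-integer positions.
awk 'NF != 3          { printf "line %d: expected 3 columns, found %d\n", NR, NF; next }
     $3 !~ /^[0-9]+$/ { printf "line %d: position \"%s\" is not an integer\n", NR, $3 }' \
    toy.location > toy.location.problems

cat toy.location.problems
```

An empty problems file means every line passed both checks.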

Build the SNP location file:

gawk '{print $4, $1, $2}' ever_smoking.results.hg19.bed > ever_smoking.results.hg19.location

sed -i 's/chr//g' ever_smoking.results.hg19.location   # strip the "chr" prefix from the chromosome column

Build the file that maps each SNP to its P-value:

gawk '{print $4, $5}' ever_smoking.results.hg19.bed > ever_smoking.results_Pval
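Before passing the P-value file to the gene analysis, it can help to look for duplicated SNP IDs and for values that are not usable P-values. The sketch below runs on a toy file with hypothetical names. The rule that a P-value must be a number in (0, 1] is a sanity check chosen for this example, not a statement of what MAGMA itself accepts.

```shell
# Toy SNP/P-value file with a duplicate ID, an out-of-range value and a missing value.
cat > toy_Pval <<'EOF'
rs0000001 0.52
rs0000002 0.0031
rs0000002 0.0040
rs0000003 1.7
rs0000004 NA
EOF

awk '{
    # Flag the second and later occurrences of the same SNP ID.
    if (seen[$1]++) printf "%s: duplicate SNP ID\n", $1
    # Flag values that are not numeric or fall outside (0, 1].
    if ($2 !~ /^[0-9.eE+-]+$/ || $2 + 0 <= 0 || $2 + 0 > 1)
        printf "%s: \"%s\" is not a valid P-value in (0, 1]\n", $1, $2
}' toy_Pval > toy_Pval.problems

cat toy_Pval.problems
```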

1# SNP annotation of the ever-smoking TAG data:

nohup ./magma --snp-loc ./GWAS_Summary_SCZ_Smoking/ever_smoking.results.hg19.location --annotate window=35,10 --gene-loc NCBI37.3.gene.loc --out ever_smoking_SNP_Gene_annotation &

2# Gene-based analysis of the ever-smoking TAG data:

nohup ./magma --bfile g1000_eur --pval ./GWAS_Summary_SCZ_Smoking/ever_smoking.results_Pval N=69409 --gene-annot ever_smoking_SNP_Gene_annotation.genes.annot --out ever_smoking_SNP_Gene_Analysis_P &
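Once the gene-based step has finished, the gene-level results table (the .genes.out file written next to the .genes.raw file) can be screened with a Bonferroni threshold of 0.05 divided by the number of genes tested. The sketch below uses a toy table with made-up numbers. The column names GENE and P are an assumption about the header, and the columns are looked up by name so that their order does not matter.

```shell
# Toy gene results table; all values are made up for illustration.
cat > toy.genes.out <<'EOF'
GENE CHR START STOP NSNPS NPARAM N ZSTAT P
101 1 65000 155000 120 14 69409 4.9 4.8e-07
102 1 165000 240000 80 9 69409 1.2 0.115
103 2 10000 90000 95 11 69409 3.1 0.00097
104 3 50000 99000 60 7 69409 -0.4 0.655
EOF

# Number of genes tested = all lines except the header.
N_GENES=$(awk 'NR > 1' toy.genes.out | wc -l | tr -d ' ')

awk -v n="$N_GENES" '
    NR == 1 {                       # locate the GENE and P columns by header name
        for (i = 1; i <= NF; i++) { if ($i == "GENE") g = i; if ($i == "P") p = i }
        next
    }
    $p + 0 < 0.05 / n { print $g, $p }   # keep genes below the Bonferroni threshold
' toy.genes.out > toy.genes.significant

cat toy.genes.significant
```

With four genes the threshold is 0.0125, so only genes 101 and 103 are kept in this toy example.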

3# Gene-set analysis (or pathway-based analysis) of the ever-smoking TAG data:

nohup ./magma --gene-results ever_smoking_SNP_Gene_Analysis_P.genes.raw --model fwer=10000 --set-annot ./Pathways/GO_PANTHER_INGENUITY_KEGG_REACTOME_BIOCARTA_new --out ever_smoking_pathway_P &
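The file given to --set-annot lists one gene set per row: the set name first, followed by the IDs of its member genes, separated by whitespace. If your pathway annotation comes as one pathway-gene pair per line instead, it can be collapsed into that row-based layout. The following is a minimal sketch on toy data; the pathway names, gene IDs and file names are hypothetical, and the gene IDs must be of the same type as those in the gene location file used for annotation.

```shell
# Toy pathway-gene pairs, one pair per line.
cat > toy_pairs.txt <<'EOF'
PATHWAY_A 101
PATHWAY_A 103
PATHWAY_B 102
PATHWAY_A 104
PATHWAY_B 103
EOF

# Collect the genes of each pathway, keeping pathways in order of first appearance.
awk '!($1 in genes) { order[++n] = $1; genes[$1] = $2; next }
     { genes[$1] = genes[$1] " " $2 }
     END { for (i = 1; i <= n; i++) print order[i], genes[order[i]] }' \
    toy_pairs.txt > toy.sets

cat toy.sets
```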

Summary:

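With the code and data above, we can analyse GWAS data at the gene-based or gene-set level and look for new results. Many published papers are built on this kind of public-data mining of GWAS summary statistics. The key is to settle on the scientific question you want to answer, and then find suitable data to analyse.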