I recently saw that Jimmy's ATAC-seq tutorial series on Bilibili had been updated again, so I spent a few days following along!
I'm sharing my own study notes with everyone here; for the videos themselves, the "Read the original" link at the end of this post takes you straight there, free of charge! Although they were recorded two years ago, that doesn't hurt the learning experience one bit. The fastq + trimming + bowtie steps are much like any other pipeline, so I keep them brief here.

Background

In eukaryotes, nuclear DNA is not naked: it wraps around histones to form nucleosomes, the basic structural unit of chromatin, and nucleosomes are then progressively compacted and folded into higher-order chromosome structure (fully stretched, a human DNA molecule is about 2 m long; this folding shrinks it to nanometre-to-micrometre-scale chromatin that fits inside the tiny nucleus). DNA replication and transcription require this tight structure to be opened up so that regulatory factors (transcription factors and other regulators) can bind. The opened regions are called open chromatin, and the property of open chromatin that lets regulatory factors bind is called chromatin accessibility. Chromatin accessibility is therefore considered closely tied to transcriptional regulation. ATAC-seq exploits the tendency of the Tn5 transposase to insert into open chromatin, then sequences the DNA fragments that Tn5 captures. (Source: Jianshu, part 1: ATAC-seq background and how it differs from ChIP-seq.)

Advantages / Experimental design
(covered by figures in the original post)

Hands-on workflow

Note the blacklist regions (microsatellites, repeat sequences): remove them rather than treating them as peaks.

1. Data download

The raw data are from SRP055881. Grab the SraRunTable and the accession list to match each run to its sample information (sample name, grouping, etc.). Download with ascp; since the samples are large, I only took a subset (4 runs):

cat >srrlist.txt
/sra/sra-instant/reads/ByRun/sra/SRR/SRR292/SRR2927015/SRR2927015.sra
/sra/sra-instant/reads/ByRun/sra/SRR/SRR292/SRR2927016/SRR2927016.sra
/sra/sra-instant/reads/ByRun/sra/SRR/SRR354/SRR3545580/SRR3545580.sra
/sra/sra-instant/reads/ByRun/sra/SRR/SRR292/SRR2927018/SRR2927018.sra

Then download at high speed with the command below (note that ascp stops working from time to time):

nohup /home/xlfang/.aspera/connect/bin/ascp -T -l 200M -P 33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh --mode recv --host ftp-private.ncbi.nlm.nih.gov --user anonftp --file-list ./srrlist.txt ./ &   ##3803

Oddly, the runs starting with SRR292 had no download link at this path, while SRR354 did:

nohup /home/xlfang/.aspera/connect/bin/ascp -T -P 33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR354/SRR3545580/SRR3545580.sra ./ &

So I tried the EBI fastq mirror instead:

cat >fq.txt
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR292/005/SRR2927015/SRR2927015_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR292/005/SRR2927015/SRR2927015_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR292/006/SRR2927016/SRR2927016_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR292/006/SRR2927016/SRR2927016_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR292/008/SRR2927018/SRR2927018_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR292/008/SRR2927018/SRR2927018_2.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR354/000/SRR3545580/SRR3545580_1.fastq.gz
fasp.sra.ebi.ac.uk:/vol1/fastq/SRR354/000/SRR3545580/SRR3545580_2.fastq.gz

cat >fq.command
cat fq.txt | while read id
do
(ascp -QT -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh era-fasp@$id ./)
done

The same problem again — could it be that the databases are being migrated?
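Since the ascp endpoints clearly move around, it can help to have a fallback that skips Aspera entirely. Below is a minimal sketch of my own using sra-tools (prefetch + fasterq-dump) for a single run; it assumes sra-tools is installed in the active conda environment, and the thread count is just an example:

## fallback: fetch one run over HTTPS with sra-tools instead of ascp (assumes sra-tools is installed)
prefetch SRR2927015 -O ./                      ## downloads the .sra file for this run
fasterq-dump --split-files -e 4 SRR2927015     ## convert to _1/_2 fastq with 4 threads
gzip SRR2927015_1.fastq SRR2927015_2.fastq     ## compress to match the ascp downloads

It is slower than a healthy ascp link, but it is not tied to the sra-instant directory layout, so it keeps working when those paths get reshuffled.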
ls -lh *fastq.gz | cut -d " " -f 5-
2.5G Apr 10 11:15 SRR2927015_1.fastq.gz
2.4G Apr 10 11:20 SRR2927015_2.fastq.gz
3.2G Apr 10 11:26 SRR2927016_1.fastq.gz
3.5G Apr 10 11:32 SRR2927016_2.fastq.gz
1.1G Apr 10 11:34 SRR2927018_1.fastq.gz
1.2G Apr 10 11:36 SRR2927018_2.fastq.gz
4.1G Apr 10 11:44 SRR3545580_1.fastq.gz
4.6G Apr 10 11:52 SRR3545580_2.fastq.gz

2. QC and trimming

Once the downloads finish, run QC and adapter trimming (and build the mouse index for the alignment step).

## conda activate rna
## fastqc
nohup fastqc -t 2 -o ./ ./*.fastq.gz &
## multiqc
multiqc ./*zip -o ./
## trim_galore
for i in `ls *_1.fastq.gz`
do
i=${i/_1.fastq.gz/}
echo "trim_galore --phred33 -q 25 --length 35 --stringency 4 -e 0.1 --fastqc --paired -o ../2.clean ${i}_1.fastq.gz ${i}_2.fastq.gz"
done > trim_galore.command
cat trim_galore.command

3. Alignment with bowtie2

After QC, align with bowtie2. The first step is the index. The bowtie2 site provides prebuilt indexes for common species (human, mouse, zebrafish, etc.) that you can download directly; for non-model organisms you have to build your own (not recommended).

## download the prebuilt mouse mm10 index
nohup wget -4 -q ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/mm10.zip &
ls -lh | cut -d " " -f 5-
# 3.2G Apr 11 03:14 mm10.zip   ## 3.2G in size
unzip mm10.zip

Part 1: running bowtie2

cd 2.clean   ## mind your relative paths
index=/zju/phf5a/data/bwindex/mm10
## be absolutely clear about where your bowtie2 is installed and where your index files live!!!
# conda activate rna
for i in `ls *_1_val_1.fq.gz`
do
i=${i/_1_val_1.fq.gz/}
echo "bowtie2 -p 2 --very-sensitive -X 2000 -x ${index} -1 ${i}_1_val_1.fq.gz -2 ${i}_2_val_2.fq.gz | samtools sort -O bam -T ${i} -@ 2 -o - > ${i}.raw.bam"
done > mapping.command

This kept failing with "(ERR): bowtie2-align exited with value 141", which left me scratching my head — the command really looked fine. It turned out to be a bowtie2 version problem: after upgrading bowtie2, the error went away (https://www./p/229653/). Also, samtools sort needs the -T prefix argument here, or it errors out as well.

Aligning these 4 samples (8 fastq files) took a full 24 hours — "very sensitive" indeed (although most of the time is actually spent converting SAM to sorted BAM). The sorted BAM files:

ls -lh *raw.bam | cut -d " " -f 5-
3.7G Apr 11 22:55 SRR2927015.raw.bam
4.6G Apr 12 07:19 SRR2927016.raw.bam
1.7G Apr 12 09:50 SRR2927018.raw.bam
5.4G Apr 12 19:23 SRR3545580.raw.bam

Part 2: deduplication with sambamba

For ATAC-seq data, most tutorials recommend removing PCR duplicates, so we add this step (ref: https://www./p/170294/):

conda activate rna
for i in `ls *.raw.bam`
do
i=${i/.raw.bam/}
echo "samtools index ${i}.raw.bam ; bedtools bamtobed -i ${i}.raw.bam > ${i}.raw.bed ; samtools flagstat ${i}.raw.bam > ${i}.raw.stat ; sambamba markdup --overflow-list-size 600000 --tmpdir='./' -r ${i}.raw.bam ${i}.rmdup.bam ; samtools index ${i}.rmdup.bam"
done > rmdup.command

The run took about two hours, and the deduplicated BAM files come out much smaller than the raw ones. Let's open one of the flagstat files:

cat SRR3545580.raw.stat
# 105342880 + 0 in total (QC-passed reads + QC-failed reads)
# 0 + 0 secondary
# 0 + 0 supplementary
# 0 + 0 duplicates
# 103697403 + 0 mapped (98.44%:-nan%)
# 105342880 + 0 paired in sequencing
# 52671440 + 0 read1
# 52671440 + 0 read2
# 100212070 + 0 properly paired (95.13%:-nan%)
# 103230528 + 0 with itself and mate mapped
# 466875 + 0 singletons (0.44%:-nan%)
# 477956 + 0 with mate mapped to a different chr
# 18927 + 0 with mate mapped to a different chr (mapQ>=5)

Reading it line by line:
- reads passing QC: 105342880, reads failing QC: 0 — so 105342880 reads in total;
- number of duplicate reads (QC-passed and QC-failed);
- reads mapped to the reference genome;
- reads paired in sequencing (with read1 and read2 counts);
- reads properly paired to the reference;
- pairs where both mates mapped, though not necessarily to the same chromosome;
- singletons, where only one mate of a pair mapped to the reference;
- pairs whose mates mapped to different chromosomes;
- pairs whose mates mapped to different chromosomes with mapping quality >= 5.
See [https://www./p/149883/].
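With a .stat file accumulating at every step, it is easy to lose track of the numbers, so here is a small convenience loop of my own (not from the video) that pulls the total and mapped read counts out of every flagstat file for a quick side-by-side look:

## summarise flagstat files across samples; relies only on the standard flagstat layout
for s in *.stat
do
total=$(head -1 $s | cut -d " " -f 1)                 ## first line: total read count
mapped=$(grep -m 1 " mapped (" $s | cut -d " " -f 1)  ## line carrying the overall mapping rate
echo -e "$s\t$mapped / $total mapped"
done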
Part 3: keep only high-quality reads on the ordinary chromosomes

In practice this just means dropping the reads on chrM:

for i in `ls *.raw.bam`
do
i=${i/.raw.bam/}
mtReads=$(samtools idxstats ${i}.rmdup.bam | grep 'chrM' | cut -f 3)
totalReads=$(samtools idxstats ${i}.rmdup.bam | awk '{SUM += $3} END {print SUM}')
echo '==> mtDNA Content:' $(bc <<< "scale=2;100*$mtReads/$totalReads")'%'
echo "samtools flagstat ${i}.rmdup.bam > ${i}.rmdup.stat ; samtools view -h -f 2 -q 30 ${i}.rmdup.bam | grep -v chrM | samtools sort -O bam -T ${i} -@ 2 -o - > ${i}.last.bam ; samtools index ${i}.last.bam ; samtools flagstat ${i}.last.bam > ${i}.last.stat"
done > quality.command

The mitochondrial DNA content really is high — around 20%. It has to be removed, or it will contaminate the downstream results!

I missed a step earlier: converting to BED format.

for i in `ls *.raw.bam`
do
i=${i/.raw.bam/}
echo "bedtools bamtobed -i ${i}.last.bam > ${i}.bed"
done > bed.command

ls -lh *.bed | cut -d " " -f 5-
247M Apr 14 16:40 SRR2927015.bed
352M Apr 14 16:41 SRR2927016.bed
211M Apr 14 16:41 SRR2927018.bed
264M Apr 14 16:41 SRR3545580.bed

Compared with Jimmy's results, mine are about 10M larger on average... no idea why.

4. Peak calling with MACS2

Installing MACS2: note that MACS2 requires Python 3.6 or later, so watch your environment!

conda create -n atac python=3   ## recreate a fresh Python 3 environment
conda activate atac
conda install -c bioconda macs2
macs2 -h   ## if this runs, the install is fine

Running it:

conda activate atac
cat >macs2.command
ls *.bed | while read id ;do (macs2 callpeak -t $id -g mm --nomodel --shift -100 --extsize 200 -n ${id%%.*} --outdir ../3.macs2/) ;done
## parameter meanings: explained in a figure in the original post

ls -lh | cut -d " " -f 5-
# 644K Apr 14 17:06 SRR2927015_peaks.narrowPeak
# 704K Apr 14 17:06 SRR2927015_peaks.xls
# 441K Apr 14 17:06 SRR2927015_summits.bed
# 1.2M Apr 14 17:07 SRR2927016_peaks.narrowPeak
# 1.3M Apr 14 17:07 SRR2927016_peaks.xls
# 835K Apr 14 17:07 SRR2927016_summits.bed
# 374K Apr 14 17:07 SRR2927018_peaks.narrowPeak
# 409K Apr 14 17:07 SRR2927018_peaks.xls
# 256K Apr 14 17:07 SRR2927018_summits.bed
# 1.1M Apr 14 17:07 SRR3545580_peaks.narrowPeak
# 1.2M Apr 14 17:07 SRR3545580_peaks.xls
# 714K Apr 14 17:07 SRR3545580_summits.bed

The .xls files are the results: the first few lines record the command and run information, and after that the columns are, from left to right, chromosome, peak start and end positions, length, summit position, pileup height, p-value, and fold enrichment. Everything downstream builds on these results.
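Besides the .xls, the .narrowPeak file is what downstream tools actually consume. It is BED6+4: columns 1-6 are the usual chrom/start/end/name/score/strand, column 7 is the fold enrichment (signalValue), column 8 is -log10(p-value), column 9 is -log10(q-value), and column 10 is the summit offset from the peak start. As a quick sanity check of my own (not part of the tutorial), you can list the strongest peaks:

## show the 5 most significant peaks; column 8 is -log10(pvalue)
sort -k8,8nr SRR2927015_peaks.narrowPeak | head -5 | cut -f 1-4,7-9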
5. Insert-size distribution

Column 9 of the BAM files generated in the previous step (the TLEN field) holds each fragment's insert size, so we need to pull column 9 out of the 4 BAMs.

conda activate rna
ls *.last.bam >1
ls *.last.stat >2
paste 1 2 >config_bam
cat config_bam
# SRR2927015.last.bam SRR2927015.last.stat
# SRR2927016.last.bam SRR2927016.last.stat
# SRR2927018.last.bam SRR2927018.last.stat
# SRR3545580.last.bam SRR3545580.last.stat

## bash script
cat >length.sh
cat config_bam | while read id;
do
arr=($id)
sample=${arr[0]}
sample_name=${arr[1]}
samtools view $sample | awk '{print $9}' > ${sample_name}_length.txt
done

ls -lh *last.stat_length.txt
-rw-rw-r-- 1 xlfang xlfang 22M Apr 16 10:48 SRR2927015.last.stat_length.txt
-rw-rw-r-- 1 xlfang xlfang 32M Apr 16 10:49 SRR2927016.last.stat_length.txt
-rw-rw-r-- 1 xlfang xlfang 19M Apr 16 10:49 SRR2927018.last.stat_length.txt
-rw-rw-r-- 1 xlfang xlfang 24M Apr 16 10:49 SRR3545580.last.stat_length.txt

indel_length_distribution.R:

cmd = commandArgs(trailingOnly = TRUE)
input = cmd[1]
output = cmd[2]
a = abs(as.numeric(read.table(input)[, 1]))
png(file = output)
hist(a,
     main = "Insertion Size distribution",
     ylab = "Read Count", xlab = "Insert Size", xaxt = "n",
     breaks = seq(0, max(a), by = 10))
axis(side = 1, at = seq(0, max(a), by = 100), labels = seq(0, max(a), by = 100))
dev.off()

ls *last.stat_length.txt >3
ls *last.stat_length.txt | cut -d "." -f 1-3 >4
paste 3 4 >config.indel_length_distribution
cat config.indel_length_distribution
# SRR2927015.last.stat_length.txt SRR2927015.last.stat_length
# SRR2927016.last.stat_length.txt SRR2927016.last.stat_length
# SRR2927018.last.stat_length.txt SRR2927018.last.stat_length
# SRR3545580.last.stat_length.txt SRR3545580.last.stat_length

cat >indel_length_distribution.sh
cat config.indel_length_distribution | while read id;
do
arr=($id)
input=${arr[0]}
output=${arr[1]}
Rscript indel_length_distribution.R $input $output
done

But I still can't get used to R on Linux, so I fell back to Windows:

a = read.delim("SRR2927015.last.stat_length.txt")
a = abs(as.numeric(a[, 1]))
png(file = "SRR2927015_insert_size.png")   ## any output file name works here
hist(a,
     main = "SRR2927015_Insertion Size distribution",
     ylab = "Read Count", xlab = "Insert Size", xaxt = "n",
     breaks = seq(0, max(a), by = 10))
axis(side = 1, at = seq(0, max(a), by = 100), labels = seq(0, max(a), by = 100))
dev.off()

The plot makes it very clear: counts peak for insert sizes of 150-300 bp and shrink steadily as the insert size grows.

6. Computing the FRiP value

FRiP is short for "the fraction of reads in called peak regions": reads falling inside peaks divided by total mapped reads.

conda activate rna
for i in `ls *.narrowPeak`
do
i=${i/_peaks.narrowPeak/}
bed=../2.clean/${i}.bed
Reads=$(bedtools intersect -a ${bed} -b ${i}_peaks.narrowPeak | wc -l | awk '{print $1}')
totalReads=$(wc -l $bed | awk '{print $1}')
echo $Reads $totalReads
echo '==> FRiP value:' $(bc <<< "scale=2;100*$Reads/$totalReads")'%'
done > frip.command

cat frip.command
148214 5105856
==> FRiP value: 2.90%
320405 7292136
==> FRiP value: 4.39%
90854 4399724
==> FRiP value: 2.06%
258981 5466470
==> FRiP value: 4.73%

7. Peak overlaps between samples

A rather crude approach — here I pretend all 4 samples are replicates (they aren't, really...):

library(ChIPseeker)
library(ChIPpeakAnno)
setwd("E://desktop/sept/ATAC-seq_practice/find_peaks_overlaping/")
list.files('./', "*.narrowPeak")
tmp = lapply(list.files('./', "*.narrowPeak"), function(x) {
  return(readPeakFile(file.path('./', x)))
})
tmp
ol <- findOverlapsOfPeaks(tmp[[1]], tmp[[2]], tmp[[3]], tmp[[4]])
png('overlapVenn.png')
makeVennDiagram(ol)
dev.off()

## install IDR
conda activate atac
conda install -y idr

ls -lh *narrowPeak | cut -d " " -f 9 | tr -s "\n" " "
# SRR2927015_peaks.narrowPeak SRR2927016_peaks.narrowPeak SRR2927018_peaks.narrowPeak SRR3545580_peaks.narrowPeak

## overlap peaks between 2 samples: IDR can only compare 2 replicates at a time, so split into 2 groups
idr --samples SRR2927015_peaks.narrowPeak SRR2927016_peaks.narrowPeak \
--input-file-type narrowPeak \
--rank p.value \
--output-file group1-idr \
--plot \
--log-output-file group1.idr.log

idr --samples SRR2927018_peaks.narrowPeak SRR3545580_peaks.narrowPeak \
--input-file-type narrowPeak \
--rank p.value \
--output-file group2-idr \
--plot \
--log-output-file group2.idr.log

ls -lh group* | cut -d " " -f 5-
# 740K Apr 19 22:00 group1-idr
# 205 Apr 19 22:00 group1.idr.log
# 341K Apr 19 22:00 group1-idr.png
# 315K Apr 19 22:00 group2-idr
# 202 Apr 19 22:00 group2.idr.log
# 218K Apr 19 22:00 group2-idr.png

cat group1.idr.log
# Initial parameter values: [0.10 1.00 0.20 0.50]
# Final parameter values: [0.59 1.05 0.62 0.79]
# Number of reported peaks - 5869/5869 (100.0%)
# Number of peaks passing IDR cutoff of 0.05 - 1571/5869 (26.8%)

cat group2.idr.log
# Initial parameter values: [0.10 1.00 0.20 0.50]
# Final parameter values: [0.35 1.06 0.47 0.50]
# Number of reported peaks - 2505/2505 (100.0%)
# Number of peaks passing IDR cutoff of 0.05 - 38/2505 (1.5%)

The idr.log reports how many shared peaks pass the IDR cutoff of 0.05 — far fewer than the naive intersection of the narrowPeak files above.

In the diagnostic plots: top left, Rep1 peak ranks vs Rep2 peak ranks, with peaks that fail the chosen IDR threshold shown in red; top right, Rep1 log10 peak scores vs Rep2 log10 peak scores, again with failing peaks in red; the two bottom panels plot peak rank vs IDR score, with box plots showing the distribution of IDR values (by default, the IDR value threshold is -1E-6).

8. Visualization with deeptools

For an introduction, mainly see Jimmy's post on how to process BAM data with deeptools (如何使用deeptools處理BAM數(shù)據(jù)).

## install deeptools
conda activate atac   ## activate the environment
conda install -y deeptools
cd ../../zju/phf5a/atac/
mkdir deeptools
cd deeptools/
ln -s ../2.clean/*.last.bam ./
## index the BAMs
ls *.bam | xargs -i samtools index {}
## convert the final BAMs to bigwig
ls *last.bam | while read id; do nohup bamCoverage -b ${id} -p 5 --normalizeUsing RPKM -o ${id%%.*}.last.bw & done
## convert the deduplicated BAMs to bigwig
ls *rmdup.bam | while read id; do nohup bamCoverage --normalizeUsing CPM -b $id -o ${id%%.*}.rm.bw & done

## fetch RefSeq gene annotation as BED from the UCSC Table Browser
curl 'http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=646311755_P0RcOBvAQnWZSzQz2fQfBiPPSBen&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_ncbiRefSeq&hgta_ctDesc=table+browser+query+on+ncbiRefSeq&hgta_ctVis=pack&hgta_ctUrl=&fbQual=whole&fbUpBases=200&fbExonBases=0&fbIntronBases=0&fbDownBases=200&hgta_doGetBed=get+BED' > mm10.refseq.bed
## a 16M file that took 18 min to download...
cut -f 1-3 mm10.refseq.bed > ref.bed

computeMatrix reference-point --referencePoint TSS -p 15 \
-b 10000 -a 10000 \
-R ./ref.bed \
-S ./*.bw \
--skipZeros -o matrix1_test_TSS.gz \
--outFileSortedRegions regions1_test_genes.bed

This throws an error for me... probably a Python version issue. Oh well — this is all just for learning; I'll sort it out later!
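For the record, once computeMatrix does run through, the usual next step is plotHeatmap/plotProfile on the matrix it writes. A minimal sketch, assuming matrix1_test_TSS.gz from the command above was produced successfully (the output file names here are my own):

## heatmap of signal around the TSS, one column per bigwig
plotHeatmap -m matrix1_test_TSS.gz -out test_TSS_heatmap.png
## average profile; --perGroup overlays all samples in one panel per region group
plotProfile -m matrix1_test_TSS.gz -out test_TSS_profile.png --perGroup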
A friendly plug to finish: I strongly encourage you to recommend our 生信技能樹 (Biotrainee) community to the postdocs and young biology PIs around you — a little more data literacy can take their research up a level.