在前面的直播基因組系列,,我們講解過那些比對不少我們?nèi)祟惖膮⒖蓟蚪M序列的數(shù)據(jù),,其實(shí)可以細(xì)致的進(jìn)行探究。
直播】我的基因組(十五):提取未比對的測序數(shù)據(jù)
這里主要參考這篇文章的圖4:http://www./ng/journal/v42/n11/figtab/ng.691 F4.html
組裝的contig注釋到物種 這是2010年發(fā)表于nature genetics雜志的Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing 雖然文章選擇的是SOAPdenovo,ABySS,Velvet這3款軟件 來進(jìn)行組裝,,但畢竟是2010年的文章了,現(xiàn)在其實(shí)有更好的選擇,,比如Minia
選擇Minia工具來組裝 Minia軟件也是基于de Bruijn圖原理的短序列組裝工具,,優(yōu)于以前的ABySS和SOAPdenovo,關(guān)鍵是速度非???,十幾分鐘就OK了,不消耗計(jì)算機(jī)資源,,所以這里就選擇它啦,。
下載安裝Minia 安裝官網(wǎng)的指導(dǎo)說明書下載二進(jìn)制版本即可,代碼如下:
## Download and install Minia
# http://minia./
cd ~/ biosoft
mkdir Minia && cd Minia
wget https : //github.com/GATB/minia/releases/download/v2.0.7/minia-v2.0.7-bin-Linux.tar.gz
tar - zxvf minia - v2 . 0.7 - bin - Linux . tar . gz
~ /biosoft/ Minia / minia - v2 . 0.7 - bin - Linux / bin / minia -- help
## eg: ./minia -in reads.fa -kmer-size 31 -abundance-min 3 -out output_prefix
軟件使用方法也非常簡單,,就一行命令,,其中最佳 - kmer - size
需要用KmerGenie來確定。
使用 step1:提取比對失敗的reads samtools view - f4 jmzeng_recal . bam | perl - alne '{print "\@$F[0]\n$F[9]\n+\n$F[10]" }' > unmapped . fq
perl ~ /biosoft/ PRINSEQ / prinseq - lite - 0.20 . 4 / prinseq - lite . pl - verbose - fastq unmapped . fq - graph_data unmapped . gd - out_good null - out_bad null
perl ~ /biosoft/ PRINSEQ / prinseq - lite - 0.20 . 4 / prinseq - graphs . pl - i unmapped . gd - png_all - o unmapped
perl ~ /biosoft/ PRINSEQ / prinseq - lite - 0.20 . 4 / prinseq - graphs . pl - i unmapped . gd - html_all - o unmapped
cd ~ /data/ project / myGenome / gatk / jmzeng / unmapped
共31481084/4=7870271,,僅僅是7.8M的reads
Input Information
Input file(s): unmapped.fq Input format(s): FASTQ # Sequences: 7,870,271 Total bases: 1,180,540,650
step2: 用KmerGenie確定kmer值 KmerGenie estimates the best k-mer length for genome de novo assembly.
KmerGenie predictions can be applied to single-k genome assemblers (e.g. Velvet, SOAPdenovo 2, ABySS, Minia).
## http://kmergenie.bx./
cd ~/ biosoft
mkdir KmerGenie && cd KmerGenie
wget http : //kmergenie.bx./kmergenie-1.7044.tar.gz
tar zxvf kmergenie - 1.7044 . tar . gz
cd kmergenie - 1.7044
make
python setup . py install -- user
~ /.local/ bin / kmergenie -- help
cd ~ /data/ project / myGenome / gatk / jmzeng / unmapped
~ /.local/ bin / kmergenie unmapped . fq
step3: 運(yùn)行Minia cd ~ /data/ project / myGenome / gatk / jmzeng / unmapped
~ /biosoft/ Minia / minia - v2 . 0.7 - bin - Linux / bin / minia - in unmapped . fq - kmer - size 31 - abundance - min 3 - out output_prefix
7.8M的reads組裝之后有272007條contigs
組裝之后: Prinseq v0.20.4 was used to calculate assembly statistics, including N50 contig size, GC content
cd ~ /data/ project / myGenome / gatk / jmzeng / unmapped
perl ~ /biosoft/ PRINSEQ / prinseq - lite - 0.20 . 4 / prinseq - lite . pl - verbose - fasta output_prefix . contigs . fa - graph_data contigs . gd - out_good null - out_bad null
perl ~ /biosoft/ PRINSEQ / prinseq - lite - 0.20 . 4 / prinseq - graphs . pl - i contigs . gd - png_all - o contigs
perl ~ /biosoft/ PRINSEQ / prinseq - lite - 0.20 . 4 / prinseq - graphs . pl - i contigs . gd - html_all - o contigs
perl ~ /biosoft/ PRINSEQ / prinseq - lite - 0.20 . 4 / prinseq - lite . pl - verbose - fasta output_prefix . contigs . fa - stats_assembly
就是給出一些指標(biāo),,如下;
stats_assembly N50 176
stats_assembly N75 113
stats_assembly N90 78
stats_assembly N95 70
Input Information Input file(s): output_prefix.contigs.fa Input format(s): FASTA # Sequences: 272,007 Total bases: 44,868,011
Length Distribution Mean sequence length: 164.95 ± 204.44 bp Minimum length: 63 bp Maximum length: 10,187 bp Length range: 10,125 bp Mode length: 150 bp with 16,461 sequences
然后用RNA-SEQ數(shù)據(jù)來比對驗(yàn)證,! 以后再講
把組裝好的contigs拿去NCBI做blast看看物種分布,Distribution of top nucleotide BLAST hits by species from the NCBI nr database for 1000 random contigs in the assembly,!其實(shí)上面的prinseq軟件也簡單的給出了一個(gè)污染物種分布情況表,但是這個(gè)原理不一樣,。 以后再講