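OPERA-MS: a hybrid assembler for second- and third-generation metagenomic sequencing

Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes

Nature Biotechnology [IF: 31.864], 2019-07-29, Articles
DOI: 10.1038/s41587-019-0191-2

First author: Denis Bertrand (1)
Corresponding author: Niranjan Nagarajan (1,7*)
Other authors: Jim Shaw, Manesh Kalathiyappan, Amanda Hui Qi Ng, M. Senthil Kumar, Chenhao Li (李陳浩), Mirta Dvornicic, Janja Paliska Soldo, Jia Yu Koh, Chengxuan Tong, Oon Tek Ng, Timothy Barkham, Barnaby Young, Kalisvar Marimuthu, Kern Rei Chng, Mile Sikic

Affiliations:
1 Computational & Systems Biology, Genome Institute of Singapore, Singapore, Singapore
7 National University of Singapore, Singapore, Singapore

Rexinchang Daily (熱心腸日報) digest: Nature sub-journal: OPERA-MS, a new hybrid assembler for second- and third-generation metagenomic reads
Written by: Liu Yongxin (劉永鑫). Reviewed by: Liu Yongxin. August 2.
Original title: Hybrid metagenomic assembly enables high-precision analysis of resistance genes and mobile elements in the human microbiome

OPERA-MS combines repeat-aware clustering with exact scaffolding to perform hybrid metagenomic assembly of second- and third-generation reads;
Evaluated on mock and real metagenomic samples, it produces the highest-quality metagenome assemblies currently available, with higher base accuracy than long-read assembly, higher contiguity than short-read assembly and fewer errors than other hybrid assemblers, and it recovers high-quality genomes for low-abundance species;
The software also supports strain-level assembly within a single species and yields high-quality reference genomes for rare species;
With nanopore reads included, it assembled more than 80 complete plasmid or phage sequences, making fine-grained study of the gut antibiotic resistome possible.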
Second-generation sequencing offers high throughput and high accuracy but short reads; third-generation sequencing offers long reads but a high error rate and high cost. Combining the strengths of the two has not yet become common practice in metagenomics, and many technical problems remain unsolved. Recently, the Niranjan Nagarajan group at the Genome Institute of Singapore released OPERA-MS, a hybrid assembler for second- and third-generation reads. Its assemblies have high base accuracy, and their contiguity is an order of magnitude higher than that of short-read assemblies.

OPERA-MS integrates metagenome clustering with an exact scaffolding algorithm. Using sequencing data from a virtual gut microbiome and from mock communities, the researchers obtained near-complete genomes with only 9× long-read coverage, and also assembled high-quality genomes for low-abundance (<1%) species. Notably, OPERA-MS can also resolve genomes below the species level. When nanopore sequencing was applied to the gut metagenomes of antibiotic-treated patients, assembly contiguity improved 200-fold over short-read assemblies. This major result was published on July 29 in the leading journal Nature Biotechnology.

Abstract

Characterization of microbiomes has been enabled by high-throughput metagenomic sequencing. However, existing methods are not designed to combine reads from short- and long-read technologies. We present a hybrid metagenomic assembler named OPERA-MS that integrates assembly-based metagenome clustering with repeat-aware, exact scaffolding to accurately assemble complex communities. Evaluation using defined in vitro and virtual gut microbiomes revealed that OPERA-MS assembles metagenomes with greater base pair accuracy than long-read (>5×; Canu), higher contiguity than short-read (~10× NGA50; MEGAHIT, IDBA-UD, metaSPAdes) and fewer assembly errors than non-metagenomic hybrid assemblers (2×; hybridSPAdes).
OPERA-MS provides strain-resolved assembly in the presence of multiple genomes of the same species, high-quality reference genomes for rare species (<1%) with ~9× long-read coverage and near-complete genomes with higher coverage. We used OPERA-MS to assemble 28 gut metagenomes of antibiotic-treated patients, and showed that the inclusion of long nanopore reads produces more contiguous assemblies (200× improvement over short-read assemblies), including more than 80 closed plasmid or phage sequences and a new 263 kbp jumbo phage. High-quality hybrid assemblies enable an exquisitely detailed view of the gut resistome in human patients.
Main results

Fig. 1: OPERA-MS workflow.

Short reads are first assembled by a metagenomic assembler into contigs, and short and long reads are mapped to them to obtain coverage information and spanning reads (Step 1). Spanning reads are then bundled to get edges between contigs for an assembly graph that represents the contiguity information of the whole metagenome (Step 2). Contigs are organized into a hierarchical clustering where the distance between contigs increases with genomic distance and their difference in coverage (Step 3). The tree is then cut into optimal clusters based on the BIC (Bayesian information criterion) (Step 4). Optionally, to improve the clustering for species where a reference genome is available, the Mash genomic distance between each cluster and a database of complete bacterial genomes is computed (Step 5). Clusters are then merged if there is supporting information in the assembly graph to form species-specific super-clusters (Step 6). These super-clusters are further analyzed to deconvolute contigs that come from distinguishable subspecies genomes (Step 7). Finally, each cluster is independently scaffolded and gap-filled using a program meant for isolate genomes (OPERA-LG; Step 8).
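To make Step 2 of the workflow more concrete, here is a minimal sketch of bundling spanning reads into assembly-graph edges. It is an illustration only, not the OPERA-MS implementation: the input layout (`read_hits`), the function name `bundle_spanning_reads` and the `min_support` threshold are all assumptions made for this example.

```python
from collections import Counter

def bundle_spanning_reads(read_hits, min_support=2):
    """Group long reads that span the same pair of contigs into one edge.

    read_hits: dict mapping a long-read id to the list of contig ids the
               read aligns to, in order along the read (hypothetical layout).
    min_support: minimum number of reads required to keep an edge
                 (an assumed threshold, not a value from the paper).
    """
    support = Counter()
    for contigs in read_hits.values():
        # Each pair of consecutive contigs on a read is one piece of evidence.
        for a, b in zip(contigs, contigs[1:]):
            if a != b:
                support[tuple(sorted((a, b)))] += 1
    # Keep only edges backed by enough independent reads.
    return {edge: n for edge, n in support.items() if n >= min_support}

# Made-up example: two reads link c1 and c2, only one links c2 and c3.
hits = {"r1": ["c1", "c2"], "r2": ["c2", "c1"], "r3": ["c2", "c3"]}
edges = bundle_spanning_reads(hits)
```

Requiring several reads per edge is a common way to keep chimeric or mis-mapped long reads from introducing false links; a real scaffolder would also track orientation and gap size on each edge.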
Fig. 2: Benchmarking hybrid assembly of genomes from metagenomes.

a–c, Increase in assembly contiguity as a function of read coverage for a representative short-read assembler (a; metaSPAdes), long-read assembler (b; Canu) and hybrid assembler (c; OPERA-MS). Note that hybrid assembly improves over short- and long-read assembly in terms of scaling across coverage ranges and producing near-complete genomes (NGA50 >1 Mbp) with as little as 9× long-read coverage. Unassembled genomes are shown as circles with black borders. d, Improvements in assembly contiguity (NGA50) provided by OPERA-MS in comparison with other assemblers as a function of long-read coverage. The number of assembled genomes, in ascending order of coverage, is 3, 12, 20 and 19 for MEGAHIT and IDBA-UD, 3, 13, 21 and 19 for metaSPAdes and hybridSPAdes and 4 and 16 for Canu. Note that Canu does not assemble low-coverage genomes and hence metrics are not provided in those ranges. Data are presented as box plots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers). e, Misassembly rates for different assemblers, with solid lines indicating median values. Most assemblers produce ~1 large misassembly per Mbp (dashed line), except for hybridSPAdes. In each part, each data point represents one genome from the mock communities.
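The contiguity metric used throughout these figures, NGA50, can be sketched as follows. This follows the usual definition (contigs are split at misassemblies into aligned blocks, and NGA50 is the length of the block at which the cumulative aligned length first reaches half the reference genome length). The function name and the choice to return 0 when half the reference is never reached are assumptions for this illustration.

```python
def nga50(aligned_block_lengths, reference_length):
    """Return NGA50 for one genome.

    aligned_block_lengths: lengths of blocks aligned to the reference,
        after breaking contigs at misassemblies.
    reference_length: length of the reference genome.
    Returns 0 if the blocks cover less than half the reference
    (a convention chosen for this sketch).
    """
    half = reference_length / 2
    total = 0
    # Walk from the longest block down until half the genome is covered.
    for length in sorted(aligned_block_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0

# Made-up example: blocks of 400, 300, 200 and 100 bp on a 1,000 bp reference.
example = nga50([400, 300, 200, 100], 1000)
```

Because the denominator is the reference length rather than the assembly length, a genome that is mostly unassembled scores poorly even if its few contigs are long.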
Fig. 3: Assembly of a virtual gut microbiome.

a, Construction of a virtual gut microbiome that represents a complex metagenomic data set while retaining the ability to evaluate assemblies against gold-standard references. b, Improvement in assembly contiguity (NGA50) obtained using OPERA-MS compared with other assemblers over different coverage ranges. Dots represent species that have at least two strains in the metagenome (species present in GIS20 and S2 with an abundance >0.1% as reported by MetaPhlAn2 (ref. 49) (v.2.6.0)). The number of assembled genomes, in ascending order of coverage, was 1 for Canu and 2, 6, 4 and 5 for the other methods. Data are presented as box plots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers). c, Comparison of misassembly rates (one dot per genome) for different assemblers, with solid lines indicating median values. d, Evaluation of Illumina-only (M, MEGAHIT) and hybrid (H, hybridSPAdes; O, OPERA-MS) metagenomic assemblies after binning for their utility in downstream analysis.
Bins that contained the largest fraction of a reference genome (GIS20 references; species with bold names have at least two strains in the metagenome) were evaluated for (1) genome completeness, the fraction of the genome represented in the bin, (2) genome purity, percentage of bases in the bin that correspond to the correct reference, (3) gene completeness, fraction of genes that were fully assembled in the bin and (4) pathway completeness, fraction of pathways with over 90% of their constituent genes being assembled and binned together.
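The first two bin metrics above, genome completeness and genome purity, reduce to simple ratios. The sketch below assumes each contig in a bin has already been assigned to a reference and summarized as a count of aligned bases; that input layout and the function name are hypothetical, and overlapping alignments are ignored for simplicity.

```python
def bin_completeness_and_purity(bin_contigs, target, reference_length):
    """Compute (completeness, purity) of one bin against one reference.

    bin_contigs: list of (reference_name, aligned_bases) pairs, one per
        contig in the bin (hypothetical layout; overlaps are ignored).
    target: name of the reference genome the bin is evaluated against.
    reference_length: length of that reference genome in bases.
    """
    on_target = sum(n for ref, n in bin_contigs if ref == target)
    total = sum(n for _, n in bin_contigs)
    # Completeness: fraction of the reference genome found in the bin.
    completeness = on_target / reference_length
    # Purity: fraction of the bin's bases that belong to the right reference.
    purity = on_target / total if total else 0.0
    return completeness, purity

# Made-up example: 800 bases from genome A and 200 from genome B in one bin,
# evaluated against a 1,000-base reference for A.
completeness, purity = bin_completeness_and_purity(
    [("A", 800), ("B", 200)], "A", 1000
)
```

A bin can be complete but impure (it swallowed contigs from other species) or pure but incomplete (it captured only part of the genome), which is why both numbers are reported.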
Fig. 4: Mobile elements and association with host species in the human gut microbiome.

a, Distribution of genome sizes for fully assembled circular sequences from OPERA-MS in 28 human gut metagenome data sets, illustrating the ability to assemble circular genomes of varying sizes and complexity (plasmids, phages and bacterial genomes). b, Fraction of sequence covered versus average sequence identity of the assembled circular sequences in comparison to sequences in the NCBI nucleotide (nt) database (based on BLAST searches). Many of the assembled sequences showed good alignment and homology to known sequences from end to end (top right corner), but some only had local similarities (top left corner), and a few appear to be new (bottom left corner; 18 sequences). c, Annotation of the largest (263 kbp) observed new circular sequence (no matches in nt database) revealed proteins associated with a phage life cycle, including replication, assembly and host lysis, indicating that the assembled sequence is a putative jumbo phage.
d, A new multiple resistance region assembled by OPERA-MS from the gut microbiome of a patient colonized by carbapenem-resistant Enterobacteriaceae. Apart from the clinically relevant carbapenemase gene cassette, the region also harbors genes that confer resistance to aminoglycosides, trimethoprim and sulfonamides, limiting treatment options. e, Strain-level assembly with OPERA-MS enabled association of plasmid to genome based on correlation in read coverage across timepoints (n = 12). Left panel: Variation in coverage of two Escherichia coli strain genomes seen in the hybrid metagenomic assembly of data from day 76 (black arrow). Right panel: Correlation between the coverage of the plasmid and the two E. coli strains reveals that it is strain L that likely harbors the IMP gene-containing plasmid. The P value was computed using Student’s t-test in R (two-sided).
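The idea behind panel e, linking a plasmid to its host strain through correlated coverage across timepoints, can be illustrated with a plain Pearson correlation. This is a sketch under assumptions: the coverage values are made up, the second strain name "S" and the function names are hypothetical, and no P value is computed (the paper reports one from a t-test in R).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def likely_host(plasmid_cov, strain_covs):
    """Pick the strain whose coverage tracks the plasmid's most closely.

    plasmid_cov: plasmid read coverage at each timepoint.
    strain_covs: dict of strain name -> coverage at the same timepoints.
    Returns (best_strain, {strain: r}).
    """
    scores = {name: pearson_r(plasmid_cov, cov)
              for name, cov in strain_covs.items()}
    return max(scores, key=scores.get), scores

# Made-up coverages over four timepoints (the study used n = 12).
plasmid = [1.0, 2.0, 3.0, 4.0]
strains = {"L": [2.0, 4.0, 6.0, 8.0], "S": [4.0, 3.0, 2.0, 1.0]}
host, scores = likely_host(plasmid, strains)
```

This only works because the strain genomes are assembled separately: with a single collapsed species-level assembly there would be one coverage profile and nothing to discriminate between hosts.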
Summary

This article introduces OPERA-MS, a metagenomic assembler for hybrid data, and compares its performance with several short-read and long-read assemblers. It markedly improves assembly contiguity and can resolve genomes below the species level. It addresses the raw error rate and coverage requirements of long reads as well as the length limitation of short reads, and performs well even on low-coverage data. To test its practical value, the researchers also applied it to a virtual gut microbiome and to gut metagenomes from antibiotic-treated patients, and found that it is useful for clinical metagenomics and for the study of antibiotic resistance genes.

Reference

Denis Bertrand, Jim Shaw, Manesh Kalathiyappan, Amanda Hui Qi Ng, M. Senthil Kumar, Chenhao Li, Mirta Dvornicic, Janja Paliska Soldo, Jia Yu Koh, Chengxuan Tong, Oon Tek Ng, Timothy Barkham, Barnaby Young, Kalisvar Marimuthu, Kern Rei Chng, Mile Sikic, and Niranjan Nagarajan. (2019). Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nature Biotechnology. DOI: 10.1038/s41587-019-0191-2

With OPERA-MS, the human gut microbiome is no longer a headache!