I. Introduction to single-cell RNA-seq

1. Bulk RNA-seq

- Measures the average expression level of each gene across a large population of input cells.
- Useful for quantifying expression signatures from ensembles, e.g. in disease studies.
- Insufficient for studying heterogeneous systems, e.g. early development or complex tissues such as the brain.
- Does not provide insight into the stochastic nature of gene expression.
2. scRNA-seq

- Measures the distribution of expression levels of each gene across a population of cells.
- Makes it possible to study biological questions in which cell-specific changes in the transcriptome are important, e.g. cell type identification, heterogeneity of cell responses, stochasticity of gene expression, and inference of gene regulatory networks across cells.
- Several different protocols are currently in use, e.g. SMART-seq2, CEL-seq and Drop-seq.
- Commercial platforms are also available, including the Fluidigm C1, Wafergen ICELL8 and the 10X Genomics Chromium.
- Several computational analysis methods from bulk RNA-seq can be reused, but in most cases the analysis requires adapting existing methods or developing new ones.
3. Single-cell sequencing workflow

4. Computational analysis

- SCONE (Single-Cell Overview of Normalized Expression): a package for processing single-cell sequencing data, covering quality control and normalization.
- Seurat: an R package for quality control and analysis of single-cell data.
- ASAP (Automated Single-cell Analysis Pipeline): a web-based single-cell analysis platform. https://asap./
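5. Main challenges in single-cell sequencing analysis

The main difference between bulk RNA-seq and single-cell RNA-seq is that each single-cell library represents a single cell, whereas a bulk library represents a population of cells. Attention should therefore be focused on comparing results between different cell types. The two approaches differ mainly in two respects:

- Amplification (up to 1 million fold).
- Gene dropouts (a gene is detected at a moderate expression level in one cell but is not detected at all in another cell).

Both discrepancies arise because the starting amount of RNA transcripts obtained from a single cell is low. Improving transcript capture efficiency and reducing amplification bias are currently very active areas of research. These problems can nevertheless be mitigated through appropriate normalization.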
II. The single-cell expression matrix

Analysis of scRNA-seq data starts from the expression matrix. Each row of the matrix represents a gene and each column represents a cell. Each entry is the expression level of a particular gene in a given cell.

1. Data types of the expression values in the matrix

- counts: Raw count data, e.g. the number of reads or transcripts for a particular gene.
- normcounts: Normalized values on the same scale as the original counts, for example counts divided by cell-specific size factors that are centred at unity.
- logcounts: Log-transformed counts or count-like values. In most cases this is defined as log-transformed normcounts, e.g. using log base 2 and a pseudo-count of 1.
- cpm: Counts-per-million. The read count for each gene in each cell, divided by the library size of that cell in millions.
- tpm: Transcripts-per-million. The number of transcripts for each gene in each cell, divided by the total number of transcripts in that cell (in millions).
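The relationship between these data types can be sketched in base R. This is a minimal illustration on a made-up toy matrix, not the scater implementation; the variable names and the choice of library-size-based size factors are assumptions of the sketch.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up).
counts <- matrix(
  c(10, 0, 5,
     0, 3, 5,
    90, 7, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2", "cell3"))
)

# Library size of each cell = column total.
lib_size <- colSums(counts)

# cpm: divide each column by its library size expressed in millions.
cpm <- sweep(counts, 2, lib_size / 1e6, FUN = "/")

# Assumption: size factors are library sizes rescaled to have mean 1
# ("centred at unity").
size_factors <- lib_size / mean(lib_size)

# normcounts: counts divided by the cell-specific size factors.
normcounts <- sweep(counts, 2, size_factors, FUN = "/")

# logcounts: log2 of normcounts with a pseudo-count of 1.
logcounts <- log2(normcounts + 1)
```

Because the pseudo-count is 1, a zero count stays zero after the log transform, which keeps the sparsity pattern of the matrix.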
2. The scater package

scater is an R package for single-cell sequencing data analysis. It covers quality control, visualization and preprocessing, in preparation for further analysis. scater offers the following functionality:

- Automated computation of QC metrics
- Transcript quantification from read data with pseudo-alignment
- Data format standardisation
- Rich visualizations for exploratory analysis
- Seamless integration into the Bioconductor universe
- Simple normalisation methods

3. Unique molecular identifiers (UMIs)

UMIs are short random barcodes (4-10 bp) added to transcripts during reverse transcription. They allow sequencing reads to be assigned to individual transcript molecules, which removes amplification noise and bias from scRNA-seq data. When UMI-containing libraries are sequenced, the protocol usually sequences only the end of the transcript that carries the UMI (typically the 3' end).
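III. Cleaning the expression matrix

The expression matrix should be checked in order to remove poor-quality cells that were not caught in the read QC or mapping QC steps. Failing to remove low-quality cells at this stage can add technical noise, which may obscure the biological signals of interest in downstream analysis. Because there is currently no standard method for performing scRNA-seq, the expected values of the various QC measures can vary substantially between experiments.

To perform QC we therefore look for cells that are outliers relative to the rest of the dataset, rather than comparing against independent quality standards. For the same reason, care should be taken when comparing quality metrics between datasets collected with different protocols.

- Remove genes that are not expressed in any cell.
- Cells with few reads/molecules have most likely been broken or were not captured properly, and should be removed.
- Manual cell filtering: in this dataset most cells have between 7,000 and 10,000 detected genes, which is normal for high-depth scRNA-seq. This depends on the experimental protocol and the sequencing depth; droplet-based methods and other methods with lower sequencing depth usually detect fewer genes per cell. If detection rates were equal across cells, the distribution should be approximately normal, so we remove the cells in the tail of the distribution (cells with fewer than 7,000 detected genes).
- Automatic cell filtering: automatic outlier detection is used to identify potentially problematic cells. scater builds a matrix (rows are cells, columns are QC metrics), PCA then gives a 2D representation of the cells ordered by their QC metrics, and outliers are detected with methods from the mvoutlier package.
- Gene filtering: remove genes whose expression level is considered undetectable. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts, a reasonable threshold is to require at least five reads in at least two cells. In practice the threshold often depends on the sequencing depth. Importantly, gene filtering must be done after cell filtering, because some genes may be detected only in low-quality cells.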
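The gene detectability rule above can be sketched in base R. The helper `filter_genes` and its parameters `min_count` and `min_cells` are hypothetical names introduced for this illustration, and the matrix values are made up.

```r
# Toy UMI count matrix: rows are genes, columns are cells (values are made up).
umi_counts <- matrix(
  c(0, 0, 0, 0,
    2, 0, 3, 0,
    1, 1, 1, 1,
    5, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:4))
)

# Hypothetical helper: keep a gene if at least `min_cells` cells contain
# more than `min_count` transcripts of it.
filter_genes <- function(mat, min_count = 1, min_cells = 2) {
  keep <- rowSums(mat > min_count) >= min_cells
  mat[keep, , drop = FALSE]
}

# Run this on the matrix that remains after cell filtering.
filtered <- filter_genes(umi_counts)
rownames(filtered)
```

For read counts, the "at least five reads in at least two cells" rule corresponds to `filter_genes(mat, min_count = 4, min_cells = 2)`, since the comparison inside the helper is strict.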
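IV. Data visualization

1. PCA

The principal components (PCs) correspond to the eigenvectors of the covariance matrix. The eigenvectors are sorted by eigenvalue, so the first principal component accounts for as much of the variability in the data as possible, and each subsequent component has the highest possible variance under the constraint that it is orthogonal to the preceding components.

Note: log-transforming the data is beneficial. It reduces the variance on the first principal component and already separates some biological effects. It also makes the distribution of the expression values more normal.

2. tSNE plots

tSNE (t-Distributed Stochastic Neighbor Embedding) combines dimensionality reduction with random walks on the nearest-neighbour network to map high-dimensional data (e.g. a 14,214-dimensional expression matrix) to a 2-dimensional space while preserving local distances between cells. In contrast with PCA, tSNE is a stochastic algorithm, so running it several times on the same dataset produces different plots. Because of its non-linear and stochastic nature, tSNE is difficult to interpret. To ensure reproducibility we can fix the seed of the random-number generator, which guarantees that we always get the same plot.

tSNE also requires a perplexity value, which reflects the number of neighbours used to build the nearest-neighbour network. A high value produces a dense network that clumps cells together; a low value makes the network sparser and allows groups of cells to separate from each other. In scater, the default perplexity is the total number of cells divided by five.

V. Normalization theory

The goal of normalization is to remove the influence of confounding factors.

Because scRNA-seq data are usually sequenced on highly multiplexed platforms, library sizes differ, and the number of reads obtained from each cell can vary substantially. Library size can be corrected quantitatively by multiplying or dividing each column of the expression matrix by a "normalization factor", which is an estimate of the library size relative to the other cells. Many methods for correcting library size have been developed for bulk RNA-seq, and they can equally be applied to scRNA-seq (e.g. UQ, SF, CPM, RPKM, FPKM, TPM).

1. CPM

The simplest way to normalize the data is to convert it to counts per million (CPM) by dividing each column by its total and then multiplying by 1,000,000.

A potential drawback of CPM: if the sample contains genes that are highly expressed or differentially expressed across cells, the total number of molecules in a cell may depend on whether those genes are on or off in that cell. Normalizing by the total number of molecules may then hide the differential expression of those genes, or falsely suggest that the remaining genes are differentially expressed.

RPKM, FPKM and TPM are variants of CPM that further adjust the counts by the length of the respective gene or transcript.

2. RLE (SF)

Size factor (SF): first compute the geometric mean of each gene across all cells. The size factor of each cell is the median, across genes, of the ratio of the gene's expression to the gene's geometric mean.

Drawback: because it uses the geometric mean, only genes with non-zero expression in all cells can be used in the calculation, so the method is not suitable for large, low-depth scRNA-seq experiments.

The edgeR and scater packages call this method RLE (relative log expression).

3. UQ

Upper quartile (UQ): each column is divided by the 75% quantile of the counts for each library. Often the calculated quantile is scaled by the median across cells to keep the absolute level of expression relatively consistent.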
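The upper-quartile scaling just described can be sketched in base R. This is an illustration under stated assumptions, not the scater or edgeR implementation: `uq_normalize` and its parameter `q` are hypothetical names, and the toy values are made up.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up).
# cell2 is simply cell1 sequenced at twice the depth.
counts <- matrix(
  c(1, 2,
    2, 4,
    3, 6,
    4, 8,
    5, 10),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), c("cell1", "cell2"))
)

# Hypothetical helper: upper-quartile scaling of each cell (column).
uq_normalize <- function(mat, q = 0.75) {
  # Per-cell quantile of the counts (the upper quartile when q = 0.75).
  uq <- apply(mat, 2, quantile, probs = q, names = FALSE)
  # Rescale by the median across cells so expression stays on a similar scale.
  scale_factor <- uq / median(uq)
  # Divide each column by its scale factor.
  sweep(mat, 2, scale_factor, FUN = "/")
}

uq_counts <- uq_normalize(counts)
```

In this toy case the two cells differ only in depth, so the two normalized columns come out identical. The quantile `q` is exposed as a parameter because a cell whose chosen quantile is zero would produce a zero scale factor.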
A drawback of UQ is that in low-depth scRNA-seq experiments the large number of undetected genes may cause the 75% quantile to be zero (or close to zero). This limitation can be overcome by generalizing the idea and using a higher quantile (the 99% quantile is the default in scater), or by excluding zeros before calculating the 75% quantile.

4. TMM

TMM (the weighted trimmed mean of M-values): the M-values in question are the gene-wise log2 fold changes between individual cells. One cell is used as the reference, and the M-values for each other cell are calculated relative to this reference. These values are then trimmed by removing the top and bottom ~30%, and the average of the remaining values is calculated by weighting them to account for the effect of the log scale on variance. Each non-reference cell is multiplied by the calculated factor. Two potential issues with this method are that too few non-zero genes may be left after trimming, and that it assumes most genes are not differentially expressed.

5. scran

The scran package implements a variant of CPM designed specifically for single-cell data. Briefly, this method deals with the problem of very large numbers of zero values per cell by pooling cells together and calculating a normalization factor (similar to CPM) for the sum of each pool. Since each cell is found in many different pools, cell-specific factors can be deconvoluted from the collection of pool-specific factors using linear algebra.
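6. Downsampling

The expression matrix is downsampled so that every cell has roughly the same total number of molecules. The benefit of this approach is that downsampling introduces zeros, which eliminates any bias due to differing numbers of detected genes. The drawback is that the process is not deterministic, so each run produces a slightly different expression matrix. Analyses therefore usually have to be run on several downsamplings to make sure the results are robust.

7. Notes

RLE, TMM and UQ were developed for bulk RNA-seq data. Depending on the experimental context they may not be appropriate for single-cell RNA-seq data, because their underlying assumptions may not hold.

- For CPM normalization, we use scater's calculateCPM() function.
- For RLE, UQ and TMM, we use scater's normaliseExprs() function.
- For scran, we use the scran package to compute the size factors (it operates on the SingleCellExperiment class) and then scater's normalize() to normalize the data.
- For downsampling, we use the following function:

    Down_Sample_Matrix <- function(expr_mat) {
      # Target depth: the smallest library size (column total) in the matrix
      min_lib_size <- min(colSums(expr_mat))
      down_sample <- function(x) {
        # Keep each molecule with probability min_lib_size / (this cell's total)
        prob <- min_lib_size / sum(x)
        # rbinom() is vectorised over the size argument
        rbinom(length(x), x, prob)
      }
      down_sampled_mat <- apply(expr_mat, 2, down_sample)
      # rbinom() returns an unnamed vector, so restore the gene names
      rownames(down_sampled_mat) <- rownames(expr_mat)
      down_sampled_mat
    }

All of these normalization functions save their result as logcounts.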
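Because downsampling is random, it can help to fix the seed for a single run and to compare several independent runs. The sketch below restates the downsampling function in lower-case form (`down_sample_matrix`, a name chosen here) so that the block runs on its own; the toy matrix and the seeds are made up for illustration.

```r
# Restatement of the downsampling function, so this block runs alone.
down_sample_matrix <- function(expr_mat) {
  min_lib_size <- min(colSums(expr_mat))
  down_sampled <- apply(expr_mat, 2, function(x) {
    # Binomial thinning: keep each molecule with probability min_lib_size / sum(x).
    rbinom(length(x), x, min_lib_size / sum(x))
  })
  rownames(down_sampled) <- rownames(expr_mat)
  down_sampled
}

# Toy count matrix (made-up values): cell2 is sequenced 10x deeper than cell1.
counts <- matrix(
  c(10, 100,
    20, 200,
    70, 700),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), c("cell1", "cell2"))
)

# Fixing the seed makes a single run reproducible.
set.seed(1)
run_a <- down_sample_matrix(counts)
set.seed(1)
run_b <- down_sample_matrix(counts)

# Several independent downsamplings, for checking that a result is robust.
set.seed(42)
replicates <- lapply(1:5, function(i) down_sample_matrix(counts))
sapply(replicates, colSums)
```

The shallowest cell is kept unchanged (its sampling probability is 1), while deeper cells are thinned, so their totals fluctuate around the minimum library size from run to run.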
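VI. Dealing with confounders

Dealing with technical confounders involves identifying and removing sources of variation in the expression data that are unrelated to (i.e. that confound) the biological signal of interest. A variety of approaches exist: some use spike-in or housekeeping genes, and some use endogenous genes.

Some methods: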