The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties; however, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal-component analysis, performs as well or better than the more sophisticated alternatives. This result highlights limitations of current theoretical analysis as assessed by bottom-line performance benchmarks.

Single-cell RNA-sequencing (RNA-seq) count tables are heteroskedastic. In particular, counts for highly expressed genes vary more than for lowly expressed genes. Accordingly, a change in a gene’s counts from 0 to 100 between different cells is more relevant than, say, a change from 1,000 to 1,100. Analyzing heteroskedastic data is challenging because standard statistical methods typically perform best for data with uniform variance.

One approach to handle such heteroskedasticity is to explicitly model the sampling distributions. For data derived from unique molecular identifiers (UMIs), a theoretically and empirically well-supported model is the gamma-Poisson distribution (also referred to as the negative binomial distribution) [1, 2, 3], but parameter inference can be fiddly and computationally expensive [4, 5]. An alternative choice is to use variance-stabilizing transformations as a preprocessing step and subsequently use the many existing statistical methods that implicitly or explicitly assume uniform variance for best performance [3, 6].

Variance-stabilizing transformations based on the delta method [7] promise an easy fix for heteroskedasticity if the variance predominantly depends on the mean. Instead of working with the raw counts Y, we apply a non-linear function g(Y) designed to make the variances (and possibly, higher moments) more similar across the dynamic range of the data [8]. The gamma-Poisson distribution with mean μ and overdispersion α implies a quadratic mean–variance relationship \(\operatorname{Var}[Y] = \mu + \alpha\mu^2\). The delta-method-based transformations considered are the transformation in equation (1), the shifted logarithm in equation (2) with pseudo-count \(y_0 = 1\) or \(y_0 = 1/(4\alpha)\), and the shifted logarithm with counts per million (CPM).
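The delta-method argument can be checked numerically with a minimal sketch. It assumes the quadratic gamma-Poisson relationship Var[Y] = μ + αμ² and uses the standard closed-form variance-stabilizing transformation for that variance function, g(y) = acosh(2αy + 1)/√α. This excerpt does not print that formula, so treating it as the transformation behind equation (1) is an assumption. The function names, the overdispersion value and the grid of means are illustrative choices, not values from the text.

```python
import math


def gamma_poisson_variance(mu, alpha):
    """Quadratic mean-variance relation: Var[Y] = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2


def acosh_transform(y, alpha):
    """Delta-method transform for the relation above (requires alpha > 0).

    Assumed closed form; the excerpt references it but does not show it.
    """
    return math.acosh(2.0 * alpha * y + 1.0) / math.sqrt(alpha)


def shifted_log(y, pseudo_count=1.0):
    """Shifted logarithm log(y + y0) with pseudo-count y0."""
    return math.log(y + pseudo_count)


def delta_method_sd(g, mu, alpha, h=1e-5):
    """First-order approximation sd[g(Y)] ~ |g'(mu)| * sqrt(Var[Y]).

    g'(mu) is estimated with a central finite difference.
    """
    slope = (g(mu + h) - g(mu - h)) / (2.0 * h)
    return abs(slope) * math.sqrt(gamma_poisson_variance(mu, alpha))


ALPHA = 0.05  # illustrative overdispersion, not a value from the text
MEANS = [1.0, 10.0, 100.0, 1000.0]  # illustrative grid of mean counts

transforms = {
    "acosh": lambda y: acosh_transform(y, ALPHA),
    "log(y + 1)": lambda y: shifted_log(y, 1.0),
    "log(y + 1/(4 alpha))": lambda y: shifted_log(y, 1.0 / (4.0 * ALPHA)),
}

# Approximate standard deviation after each transformation, per mean.
approx_sd = {
    name: [delta_method_sd(g, mu, ALPHA) for mu in MEANS]
    for name, g in transforms.items()
}

for name, values in approx_sd.items():
    print(name, [round(v, 3) for v in values])
```

Under this first-order approximation the acosh row stays close to 1 across the whole grid. The shifted-logarithm rows vary at low means and approach √α at high means. The sketch says nothing about downstream benchmark performance, which is the separate question the text addresses.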