When analyzing data, we usually start by looking at basic descriptive information: sum, mean, standard deviation, standard error, median, quartiles, minimum, maximum, range, skewness, kurtosis, and so on. This is easy in R. So many package functions can do it that it is sometimes hard to choose one. This article covers only two: summary{base}, which ships with the base installation, and stat.desc{pastecs}. It also uses by(), which computes and prints results group by group.

library(foreign)
ma <- read.dta("D:/Temp/STATA/Multivariate.dta")

【1】summary{base}

If we ignore grouping and only want basic information about the weight and height of the 16 subjects, we can use:

summary(ma["weight"])
summary(ma["height"])
var <- c("weight", "height")
summary(ma[var])
summary(ma[c("weight", "height")])   # same as summary(ma[3:4]) or summary(ma[-1:-2])

The commands above are only a demonstration. In practice we want the basic information for groups A and B separately, which requires a grouped computation.

by {base}: Apply a Function to a Data Frame Split by Factors. The usage is by(data, INDICES, FUN, ..., simplify = TRUE); see help("by") for details. The function splits data into several data frames according to INDICES and then applies FUN to each of them. The grouped command for this example is:

by(ma[c("weight", "height")], ma$group, summary)   # split weight and height in ma by group, then print summary() for each group

【2】stat.desc{pastecs}

stat.desc(x, basic=TRUE, desc=TRUE, norm=FALSE, p=0.95) is a rather powerful function that returns a large set of descriptive statistics. x is a data frame or a time series. By default (basic=TRUE, desc=TRUE) the function returns the number of values, null values, and missing values in x, the minimum, maximum, range, sum, median, mean, standard error of the mean, confidence interval of the mean at level p, variance, standard deviation, and coefficient of variation. With norm=TRUE (the default is FALSE) it also returns statistics related to the normal distribution: skewness and kurtosis (each with its significance criterion) and the result of the Shapiro-Wilk normality test. p=0.95 means the confidence interval of the mean is computed at the default confidence level of 0.95.

The commands continue from the data loading step above:

library(pastecs)   # load pastecs; it is not installed by default, so install it first with install.packages("pastecs")
stat.desc(ma[3:4], norm=TRUE, p=0.95)
stat.desc(ma[1:8, 3], norm=TRUE)           # weight, group A
stat.desc(ma[9:16, "weight"], norm=TRUE)   # weight, group B
stat.desc(ma[1:8, "height"], norm=TRUE)    # height, group A
stat.desc(ma[9:16, 4], norm=TRUE)          # height, group B
by(ma[3:4], ma$group, stat.desc)   # split columns 3 and 4 of ma by group, then print stat.desc() for each group
by(ma[3:4], ma$group, function(x) stat.desc(x, norm=TRUE))   # same split, but also return the normal distribution statistics

For reference, the help page of stat.desc{pastecs} reads:

stat.desc{pastecs}: Descriptive statistics on a data frame or time series. Compute a table giving various descriptive statistics about the series in a data frame or in a single/multiple time series.

Usage: stat.desc(x, basic=TRUE, desc=TRUE, norm=FALSE, p=0.95)

x: a data frame or a time series

basic: do we have to return basic statistics (by default, it is TRUE)? These are: the number of values (nbr.val), the number of null values (nbr.null), the number of missing values (nbr.na), the minimal value (min), the maximal value (max), the range (range, that is, max-min) and the sum of all non-missing values (sum)

desc: do we have to return various descriptive statistics (by default, it is TRUE)? These are: the median (median), the mean (mean), the standard error on the mean (SE.mean), the confidence interval of the mean (CI.mean) at the p level, the variance (var), the standard deviation (std.dev) and the variation coefficient (coef.var) defined as the standard deviation divided by the mean

norm: do we have to return normal distribution statistics (by default, it is FALSE)? These are: the skewness coefficient g1 (skewness), its significant criterium (skew.2SE, that is, g1/2.SEg1; if skew.2SE > 1, then skewness is significantly different than zero), the kurtosis coefficient g2 (kurtosis), its significant criterium (kurt.2SE, same remark than for skew.2SE), the statistic of a Shapiro-Wilk test of normality (normtest.W) and its associated probability (normtest.p)
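The Stata file used in this article is not reproduced here, so the following is a minimal sketch of the same by() plus summary() pattern on a small made-up data frame. The data frame demo and all of its values are hypothetical; only its column layout (id, group, weight, height) is assumed to resemble ma.

```r
# Hypothetical stand-in for `ma`: 16 subjects in two groups of 8.
# The values are randomly generated, not taken from the original data set.
set.seed(1)
demo <- data.frame(
  id     = 1:16,
  group  = factor(rep(c("A", "B"), each = 8)),
  weight = round(rnorm(16, mean = 60, sd = 8), 1),
  height = round(rnorm(16, mean = 165, sd = 7), 1)
)

# Ungrouped summary of the two measurement columns
print(summary(demo[c("weight", "height")]))

# Grouped summary: by() splits the data frame on `group`
# and applies summary() to each piece
res <- by(demo[c("weight", "height")], demo$group, summary)
print(res)

# The result for a single group can be pulled out by its label
print(res[["A"]])
```

Passing column names rather than positions, as in demo[c("weight", "height")], keeps the call valid if the column order of the data frame ever changes.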
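To make some of the stat.desc() output rows more concrete, here is a base-R sketch that recomputes a few of them by hand. The helper name describe_by_hand and the sample vector are hypothetical. The sketch assumes that SE.mean is the standard deviation divided by the square root of n and that CI.mean is the half-width of a t-based interval at level p; compare it against real stat.desc() output before relying on it.

```r
# Hypothetical helper: recompute a few stat.desc()-style statistics in base R.
describe_by_hand <- function(x, p = 0.95) {
  x  <- x[!is.na(x)]          # drop missing values
  n  <- length(x)
  se <- sd(x) / sqrt(n)       # standard error of the mean
  c(
    nbr.val  = n,
    min      = min(x),
    max      = max(x),
    range    = max(x) - min(x),
    sum      = sum(x),
    median   = median(x),
    mean     = mean(x),
    SE.mean  = se,
    # Assumption: half-width of the t-based confidence interval at level p
    CI.mean  = qt(0.5 + p / 2, df = n - 1) * se,
    var      = var(x),
    std.dev  = sd(x),
    coef.var = sd(x) / mean(x)  # standard deviation divided by the mean
  )
}

x <- c(52, 58, 61, 63, 66, 70, 74, 80)   # made-up weights for 8 subjects
stats <- describe_by_hand(x)
print(round(stats, 3))
```

A helper like this can also be passed to by() column by column, for example by(demo$weight, demo$group, describe_by_hand) on a data frame with a grouping factor, where demo is again a hypothetical name.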