Basic R — Descriptive Statistics of Univariate Data

This is a basic introductory look at using R to generate descriptive statistics for a univariate data set. Here, we will use the historical data set from Michelson’s experiment to determine the speed of light in air, provided as an ASCII file containing header content followed by the observed speed of light for 100 trials.

We first need to read the data into R. Since the data is in a properly formatted ASCII file, we only need to tell R to skip the first 60 lines, which contain header information. R will then import the data into a list of class data.frame.


> C <- read.table("Michelso.dat", skip=60)
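
To see what the skip argument does without having the original file at hand, the following sketch writes a small stand-in file and reads it back. The three header lines, the four values and the names tmp and demo are illustrative assumptions; this is not the real Michelso.dat.

```r
# Stand-in file: 3 made-up header lines, then one value per line.
# (The real Michelso.dat has 60 header lines; this is only an illustration.)
tmp <- tempfile(fileext = ".dat")
writeLines(c("Header line 1",
             "Header line 2",
             "Header line 3",
             "299.85", "299.74", "299.90", "300.07"), tmp)

# skip = 3 discards the header lines before parsing begins.
demo <- read.table(tmp, skip = 3)

str(demo)     # a data.frame with one numeric column, named V1 by default
nrow(demo)    # number of observations read
unlink(tmp)   # remove the temporary file
```

Because the file has no column header row, read.table() falls back to its default column names (V1, V2, ...).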

We can take a look at the dataset by simply typing its name at the prompt. Here you can see that R automatically assigned the column name V1 to the data.


> C
        V1
1   299.85
2   299.74
3   299.90
4   300.07
...

The summary() command in R provides the summary statistics: minimum, 1st quartile, median, mean, 3rd quartile and maximum. We call this function with the argument 'C$V1', which tells R to act on the named column, V1, in the data.frame C. (The options() commands set the number formatting of the output: scipen=100 discourages scientific notation and digits=10 prints up to 10 significant digits.)


> options(scipen=100)
> options(digits=10)
> summary(C$V1)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
299.6200 299.8075 299.8500 299.8524 299.8925 300.0700 
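
As a rough illustration of where these six numbers come from, the sketch below rebuilds them with quantile() and mean(). It uses only the first four observations listed earlier, so the values differ from the full-data output above; the names x, q and six are made up for this example.

```r
# First four observations from the listing above (not the full data set).
x <- c(299.85, 299.74, 299.90, 300.07)

# quantile() with its default settings returns min, quartiles, median and max.
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Slot the mean in between the median and the 3rd quartile.
six <- c(q[1:3], Mean = mean(x), q[4:5])
print(six, digits = 10)

# For comparison, the built-in summary of the same four values:
summary(x)
```

The quartiles are interpolated between neighbouring sorted values, which is why they need not coincide with any observed data point.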

Standard deviation, trimmed mean and number of data points can be obtained individually.


> sd(C$V1)
[1] 0.07901054782
> mean(C$V1, trim=0.05)
[1] 299.8528889
> length(C$V1)
[1] 100
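
The trim argument drops a fraction of the sorted observations from each end before averaging; with 100 values and trim=0.05 that is the five smallest and the five largest. A minimal sketch on made-up numbers (the vector x and its outlier are assumptions for illustration only):

```r
# Made-up values; 50 is a deliberate outlier.
x <- c(2, 4, 4, 5, 5, 6, 7, 8, 9, 50)
trim <- 0.1

n <- length(x)
k <- floor(n * trim)                 # observations dropped from each end
kept <- sort(x)[(k + 1):(n - k)]     # middle of the sorted data

manual  <- mean(kept)                # trimmed mean computed by hand
builtin <- mean(x, trim = trim)      # trimmed mean from mean()

c(plain = mean(x), manual = manual, builtin = builtin)
```

The plain mean is pulled towards the outlier, while the trimmed mean is not, which is the usual reason for reporting both.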

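If we want skewness and kurtosis, we'll need the fBasics package installed. With method="moment", the kurtosis is reported without subtracting 3, so a normal distribution would give a value of about 3.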


> install.packages("fBasics")
> library(fBasics)
...
> skewness(C$V1, method="moment")
[1] -0.01798640563
attr(,"method")
[1] "moment"
> kurtosis(C$V1, method="moment")
[1] 3.198586275
attr(,"method")
[1] "moment"
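
For readers without fBasics, moment-based skewness and kurtosis can be sketched in base R as below. This is an illustration of the general idea on made-up data, not a reproduction of the package's code. Conventions differ between packages (for example, whether the standard deviation uses n or n - 1 in its denominator), so results need not agree to the last digit. Here the sample standard deviation from sd() is assumed.

```r
# Made-up values with a long right tail.
x <- c(1, 2, 3, 4, 10)

n <- length(x)
d <- x - mean(x)   # deviations from the mean
s <- sd(x)         # sample SD (n - 1 denominator); an assumed convention

skew_moment <- sum(d^3) / (n * s^3)   # third standardized moment
kurt_moment <- sum(d^4) / (n * s^4)   # fourth standardized moment
kurt_excess <- kurt_moment - 3        # excess kurtosis, relative to the normal

c(skewness = skew_moment, kurtosis = kurt_moment, excess = kurt_excess)
```

A positive skewness indicates a longer right tail, and a value near zero, as in the Michelson data above, indicates a roughly symmetric distribution.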

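To determine a confidence interval for the mean, we can use the one-sample t-test. We do not supply a mean to test against, since in our case it is not known and it has no effect on the confidence interval. t.test() then tests against its default of 0, which is why the t statistic and p-value in the output below are so extreme; they can be ignored here.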


> t.test(C$V1, conf.level=0.99)

	One Sample t-test

data:  C$V1 
t = 37950.9329, df = 99, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 0 
99 percent confidence interval:
 299.8316486 299.8731514 
sample estimates:
mean of x 
 299.8524
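
The interval can also be reconstructed by hand from the mean, standard deviation and sample size reported earlier, which makes it clear that the hypothesized mean plays no part in it. A minimal sketch (the variable names are made up for this example):

```r
# Values reported earlier in this post.
xbar <- 299.8524        # sample mean
s    <- 0.07901054782   # sample standard deviation
n    <- 100             # number of observations
conf <- 0.99            # confidence level

se    <- s / sqrt(n)                          # standard error of the mean
tcrit <- qt(1 - (1 - conf) / 2, df = n - 1)   # two-sided critical t value

ci <- c(lower = xbar - tcrit * se, upper = xbar + tcrit * se)
print(ci, digits = 10)

# The t statistic in the output is simply the mean tested against 0:
t0 <- (xbar - 0) / se
t0
```

Up to rounding of the inputs, this should agree with the interval and the t value printed by t.test() above.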

Another method for obtaining much of this information in a single step can be found in the stat.desc() function from the pastecs package.


> install.packages("pastecs")
> library(pastecs)
...
> options(scipen=100)
> options(digits=4)
> stat.desc(C)
                        V1
nbr.val        100.0000000
nbr.null         0.0000000
nbr.na           0.0000000
min            299.6200000
max            300.0700000
range            0.4500000
sum          29985.2400000
median         299.8500000
mean           299.8524000
SE.mean          0.0079011
CI.mean.0.95     0.0156774
var              0.0062427
std.dev          0.0790105
coef.var         0.0002635
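
Several of these rows are simple functions of the mean, standard deviation and sample size. In particular, judging from the numbers, CI.mean.0.95 is the half-width of the 95% confidence interval, not the interval itself. The sketch below recomputes those rows from the values reported earlier; the variable names are made up for this example.

```r
# Values reported earlier in this post.
n    <- 100
xbar <- 299.8524
s    <- 0.07901054782

se_mean   <- s / sqrt(n)                       # SE.mean
ci_half95 <- qt(0.975, df = n - 1) * se_mean   # CI.mean.0.95 (half-width)
variance  <- s^2                               # var
coef_var  <- s / xbar                          # coef.var

round(c(SE.mean = se_mean, CI.mean.0.95 = ci_half95,
        var = variance, coef.var = coef_var), 7)

# The 95% interval itself is the mean plus or minus the half-width.
c(lower = xbar - ci_half95, upper = xbar + ci_half95)
```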

We'll look at the generation of some standard statistical plots for exploratory data analysis in a future post.


Caveat lector — All work and ideas presented here may not be accurate and should be verified before application.
