One sample t-test

For better or worse, the t-test is perhaps the most widely used statistical analysis.  Here’s a quick summary of how to run a one-sample t-test in R.  The function is t.test, and its main arguments are the data, the alternative hypothesis (alternative=), the mean under the null hypothesis (mu=), and the confidence level (conf.level=).

The hardest part about t-tests in R is knowing how to set up the problem.  In these examples the null hypothesis has a mean of 3 (H0: μ = 3), and I have tested the three different alternative hypothesis options: H1: μ ≠ 3, H1: μ < 3, and H1: μ > 3.  In each test I have used a 95% confidence level (alpha = 0.05).  Note: In practice you would choose one alternative before looking at the data rather than running all three; this is merely for example purposes.

Here are examples of a one-sample t-test using the vector x, first with the two-sided alternative and then with each one-sided alternative.

x = c(1,2,4,7,4,3,7,8,3,9)
t.test(x, alternative="two.sided", mu = 3, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  x
## t = 2.0769, df = 9, p-value = 0.0676
## alternative hypothesis: true mean is not equal to 3
## 95 percent confidence interval:
##  2.839464 6.760536
## sample estimates:
## mean of x 
##       4.8
t.test(x, alternative="less", mu = 3, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  x
## t = 2.0769, df = 9, p-value = 0.9662
## alternative hypothesis: true mean is less than 3
## 95 percent confidence interval:
##      -Inf 6.388698
## sample estimates:
## mean of x 
##       4.8
t.test(x, alternative="greater", mu = 3, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  x
## t = 2.0769, df = 9, p-value = 0.0338
## alternative hypothesis: true mean is greater than 3
## 95 percent confidence interval:
##  3.211302      Inf
## sample estimates:
## mean of x 
##       4.8

 

Let’s analyze the results above, starting with the alternative hypothesis that the true mean is not equal to 3 (H1: μ ≠ 3).  The results show that the 95% confidence interval for the true mean is 2.84 to 6.76.  Since the null hypothesis states that the true mean is 3 (H0: μ = 3), and 3 is within the 95% confidence interval, the null hypothesis is not rejected at the 5% level.  The p-value is 0.0676, which is greater than the alpha value of 0.05 (1 minus the confidence level of 0.95).  Since the p-value is not less than 0.05, we fail to reject the null hypothesis; the data do not provide enough evidence that the true mean differs from 3.  Note that this is not the same as showing that the null hypothesis is true.  Other useful information the t-test provides is the degrees of freedom (9) and the t-statistic (2.08).

Let’s look at the results from the alternative hypothesis that the true mean is less than 3 (H1: μ < 3).  The 95% confidence interval is one-sided, running from negative infinity up to 6.39, and the p-value is 0.966.  Once again the p-value is greater than the alpha value of 0.05 (or 1-0.95), so we fail to reject the null hypothesis that the true mean is 3 (H0: μ = 3) in favor of the alternative that the true mean is less than 3 (H1: μ < 3).

Finally let’s look at the results from the t-test using the alternative hypothesis that the true mean is greater than 3 (H1: μ > 3).  The 95% confidence interval is 3.21 and greater, which does not include the value of the null hypothesis.  In this case the p-value is 0.0338, which IS less than the alpha value of 0.05 (or 1-0.95) and the null hypothesis (H0: μ = 3) is rejected in favor of the alternative hypothesis (H1: μ > 3) that the true mean is greater than 3.
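
To see where these numbers come from, the t-statistic and the three p-values can be recomputed by hand and compared with what t.test returns.  This is only a minimal sketch in base R; the variable names (mu0, t_stat, p_two_sided, and so on) are my own and chosen for illustration.

```r
# Same data and null value as in the examples above
x <- c(1, 2, 4, 7, 4, 3, 7, 8, 3, 9)
mu0 <- 3                       # hypothesized mean under H0 (illustrative name)

n <- length(x)
se <- sd(x) / sqrt(n)          # standard error of the sample mean
t_stat <- (mean(x) - mu0) / se # t-statistic
df <- n - 1                    # degrees of freedom

# p-values for the three alternatives, taken from the t distribution
p_two_sided <- 2 * pt(-abs(t_stat), df)
p_less      <- pt(t_stat, df)
p_greater   <- pt(t_stat, df, lower.tail = FALSE)

# The object returned by t.test stores the same quantities
result <- t.test(x, alternative = "two.sided", mu = mu0, conf.level = 0.95)
result$statistic   # t-statistic
result$p.value     # two-sided p-value
result$conf.int    # 95% confidence interval
```

Pulling values out of the result object this way is handy when the p-value or interval is needed later in a script rather than just printed to the console.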

 

 

rep, seq, and cbind: Data Creation and Processing

rep and seq are basic R functions that are useful for a wide variety of tasks.  rep repeats values and seq creates a sequence of numbers.

In its simplest form, the rep function takes two arguments: the value to repeat and the number of times to repeat it.  Here are some examples of the rep function:

Repeat the number 1 five times.

rep(1, 5)
## [1] 1 1 1 1 1

Repeat the sequence of values 1-5, three times.

rep(1:5, 3)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Repeat the days of the week twice.

days=c("mon","tue","wed","thu","fri","sat","sun")
rep(days, 2)
##  [1] "mon" "tue" "wed" "thu" "fri" "sat" "sun" "mon" "tue" "wed" "thu"
## [12] "fri" "sat" "sun"
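
The examples above repeat the whole vector.  As a small addition that is not covered in the examples themselves, rep also accepts the named arguments times and each, which control whether the whole vector or each individual element is repeated.  A minimal sketch:

```r
# times = repeats the whole vector (same as the positional form rep(1:3, 2))
rep(1:3, times = 2)
## [1] 1 2 3 1 2 3

# each = repeats every element before moving on to the next one
rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
```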

Now, let’s look at the seq function.  Its three main arguments are the start of the sequence (from), the end (to), and the interval (by), in that order.

Here’s a sequence from 1 to 5:

seq(from = 1, to = 5, by = 1)
## [1] 1 2 3 4 5

Here’s a sequence from 0 to 10 by 2:

seq(from = 0, to = 10, by = 2)
## [1]  0  2  4  6  8 10
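
Two variations that may be useful alongside the examples above: seq can be given the number of values wanted (length.out) instead of the interval, and a decreasing sequence needs a negative interval.  This is a short sketch rather than a full tour of the function:

```r
# Ask for 5 evenly spaced values instead of giving the step size
seq(from = 0, to = 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00

# A decreasing sequence needs a negative step
seq(from = 10, to = 0, by = -5)
## [1] 10  5  0
```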

Finally, let’s merge this newly created data together using the function cbind.  cbind stands for column bind, and will merge vectors together as the columns of a matrix.

In this example we create two vectors of data (x & y) using the concatenate function, then we will merge them using the cbind function:

x = c(1:5)
y = c(-1:-5)
cbind(x, y)
##      x  y
## [1,] 1 -1
## [2,] 2 -2
## [3,] 3 -3
## [4,] 4 -4
## [5,] 5 -5

Next, let’s add more data to the matrix we created above.

x = c(1:5)
y = c(-1:-5)
data = cbind(x, y)

a = c(1,1,1,1,1)
b = c(2,2,2,2,2)

cbind(data, a, b)
##      x  y a b
## [1,] 1 -1 1 2
## [2,] 2 -2 1 2
## [3,] 3 -3 1 2
## [4,] 4 -4 1 2
## [5,] 5 -5 1 2
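
As a side note to the cbind examples, the row-wise counterpart is rbind, and cbind will recycle a single value to fill a whole column, so the constant columns a and b do not have to be typed out in full.  A minimal sketch (the names by_col, by_row, and with_constants are my own):

```r
x <- 1:5
y <- -1:-5

by_col <- cbind(x, y)   # x and y become columns: 5 rows, 2 columns
by_row <- rbind(x, y)   # x and y become rows: 2 rows, 5 columns

dim(by_col)
## [1] 5 2
dim(by_row)
## [1] 2 5

# A single value is recycled down the column
with_constants <- cbind(by_col, a = 1, b = 2)
dim(with_constants)
## [1] 5 4
```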

For more information on the cbind function, check out my other post here.

Creating data

It might be necessary to create data on the fly using R.  This will also be important to understand for any future posts with example data sets.  First we’ll make a vector of data using the function c, which stands for concatenate.  This code simply produces a vector of the provided values.  You will notice that the commas are not included in the output.  They are used as delimiters for the c function.

Create a vector:

c(1,2,3,4,5)
## [1] 1 2 3 4 5

We can also make a vector of text values by placing each value within double quotation marks “”.  In the next example we will make the same vector as above, but stored as text rather than numbers.  We will also make a vector with the days of the week.

Create a vector of numbers stored as text:

c("1", "2", "3", "4", "5")
## [1] "1" "2" "3" "4" "5"

Create a vector of days of the week:

c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
## [1] "Sunday"    "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"  "Saturday"
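
The difference between numbers and numbers stored as text matters as soon as you try to do arithmetic.  As a rough illustration (the names num_vec and text_vec are my own), class reports how a vector is stored and as.numeric converts text back to numbers:

```r
num_vec  <- c(1, 2, 3, 4, 5)
text_vec <- c("1", "2", "3", "4", "5")

class(num_vec)
## [1] "numeric"
class(text_vec)
## [1] "character"

# Text must be converted before doing arithmetic on it
sum(as.numeric(text_vec))
## [1] 15
```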

 

Next we will make a matrix using the function matrix.  The matrix function has three main parameters: (1) data, which I have decided to fill with NA in this example; (2) nrow, which gives the number of rows in the matrix; and (3) ncol, which gives the number of columns in the matrix.

Create a matrix:

matrix(NA, nrow = 3 , ncol = 2)
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
## [3,]   NA   NA

If we want to fill the matrix with actual data, we can build a vector “y” and set the matrix data parameter equal to y.  This fills the 6 matrix cells with the values 1 through 6.

Fill a matrix with data:

y = c(1,2,3,4,5,6)
matrix(data = y, nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

 

By default, the matrix fills by column.  This can be changed using the parameter “byrow” within the matrix function.  The default value is byrow = FALSE, which does not need to be explicitly coded.  To have the matrix fill by row we simply change this to byrow = TRUE.

Fill a matrix BY ROW with vector y:

y = c(1,2,3,4,5,6)
matrix(data = y, nrow = 3, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
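
One way to check your understanding of byrow: filling a 3 x 2 matrix by row should give the same result as filling a 2 x 3 matrix by column and then transposing it with t.  A small sketch, with variable names of my own choosing:

```r
y <- c(1, 2, 3, 4, 5, 6)

# Fill a 3 x 2 matrix by row
filled_by_row <- matrix(data = y, nrow = 3, ncol = 2, byrow = TRUE)

# Fill a 2 x 3 matrix by column (the default), then transpose it
transposed <- t(matrix(data = y, nrow = 2, ncol = 3))

# Every cell matches
all(filled_by_row == transposed)
## [1] TRUE
```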

 

Next we can give this matrix column names and row names with the functions colnames and rownames, respectively.  These two functions operate in the same manner.  Each takes one input, which is the data that will receive the headers (e.g., colnames(rain_data) will add column headers to the data set “rain_data”).  The second part needed is the actual headers, which are assigned to the result of the function call.  These are provided as text in a vector using the concatenate function c.  The headers are denoted as text by placing them in double quotation marks “”.

Give the matrix column and row headers:

y = c(1,2,3,4,5,6)
mymatrix = matrix(data = y, nrow = 3, ncol = 2, byrow = FALSE)
colnames(mymatrix) = c("Col_1", "Col_2")
rownames(mymatrix) = c("Row-1", "Row-2", "Row-3")
mymatrix
##       Col_1 Col_2
## Row-1     1     4
## Row-2     2     5
## Row-3     3     6
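
Once a matrix has row and column names, they can be used to pull out values instead of numeric positions.  This goes slightly beyond the example above, so treat it as a sketch of one common use:

```r
y <- c(1, 2, 3, 4, 5, 6)
mymatrix <- matrix(data = y, nrow = 3, ncol = 2, byrow = FALSE)
colnames(mymatrix) <- c("Col_1", "Col_2")
rownames(mymatrix) <- c("Row-1", "Row-2", "Row-3")

# A single cell, selected by row name and column name
mymatrix["Row-2", "Col_1"]
## [1] 2

# A whole column, selected by name (the row names are kept as labels)
second_col <- mymatrix[, "Col_2"]
second_col
```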

Finally, we create a data set using random data from a statistical distribution.  This is a popular way to build example data on blogs and websites like Stack Overflow.  I covered how to call statistical distributions from R in my previous post.  For this example, we will generate data using the normal distribution with a mean of 1 and a standard deviation of 2.  Because the values are random, your output will differ from the output shown here.

Get random data from the normal distribution and put it into a matrix:

randomdata = rnorm(n = 12, mean = 1, sd = 2)
matrix(data = randomdata, nrow = 2, ncol = 6)
##            [,1]      [,2]     [,3]     [,4]       [,5]       [,6]
## [1,] 0.05265589 1.9368580 2.631436 1.025770 -1.5012382  4.0731287
## [2,] 0.52010343 0.8585777 2.371716 4.123003 -0.9378838 -0.8740166
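
If you want the same random numbers every time the code runs (for example, so that readers of a post can reproduce your output), you can set the random seed first with set.seed.  The seed value 42 below is an arbitrary choice of mine, and the variable names are for illustration only:

```r
set.seed(42)   # arbitrary seed; any integer works
first_draw <- rnorm(n = 12, mean = 1, sd = 2)

set.seed(42)   # resetting the seed restarts the same random stream
second_draw <- rnorm(n = 12, mean = 1, sd = 2)

identical(first_draw, second_draw)
## [1] TRUE

# Reshape the reproducible draw into a 2 x 6 matrix, as in the example above
random_matrix <- matrix(data = first_draw, nrow = 2, ncol = 6)
dim(random_matrix)
## [1] 2 6
```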

For more information on how to make column names and row names, check out my other post here.

 

 

 

Statistical Distributions

Statistical distributions are the meat and potatoes of R.  Generating random numbers from any distribution is easy in R.  Below I have listed the code for several popular statistical distributions.  The code is nearly the same for each distribution.  Each function name starts with “r”, which designates random, followed by a short name for the distribution: “unif” calls a uniform distribution, “norm” calls a normal distribution, “binom” calls a binomial distribution, etc.

To get random data from the normal distribution (rnorm), three parameters are needed: n is the number of random values to create, mean is the mean of the distribution, and sd is the standard deviation of the distribution.  In the normal distribution example below, I am requesting 10 random values from the normal distribution with a mean of 0 and a standard deviation of 1.

#Uniform Distribution
runif(n=10, min=0, max=1)

#Normal Distribution
rnorm(n=10, mean=0, sd=1)

#Binomial Distribution
rbinom(n=10, size=5, prob=0.2)

#The log-normal Distribution
rlnorm(n=10, meanlog=0, sdlog=1)

#Weibull Distribution
rweibull(n=10, shape=1, scale = 1)

#Exponential Distribution
rexp(n=10, rate = 1)

#Poisson Distribution
rpois(n=10, lambda=1)

#Gamma Distribution
rgamma(n=10, shape=1, rate = 1)

#Chisquare Distribution
rchisq(n=10, df=3, ncp=1)
#where df is degrees of freedom, and ncp is non-centrality parameter
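
The “r” prefix is one of a family.  Although this post only covers random number generation, the same distribution names can be combined with “d” (density), “p” (cumulative probability), and “q” (quantile).  A brief sketch using the standard normal distribution:

```r
# Density of the standard normal at 0, equal to 1 / sqrt(2 * pi)
dnorm(0, mean = 0, sd = 1)

# Probability that a standard normal value is less than or equal to 0
pnorm(0, mean = 0, sd = 1)
## [1] 0.5

# The value below which half of the distribution lies (the median)
qnorm(0.5, mean = 0, sd = 1)
## [1] 0
```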

First Post

This blog was created to try to help others who are starting out learning R.  As a scientist with a non-programming background, I found that R had a steep learning curve.  I also found the R help pages too technical for someone who is just learning R, and therefore of little use to me at the time.  This is my effort to give back to the R community that helped me become comfortable with, and thoroughly enjoy, R.

I have to recommend a site that I still reference from time to time on R.  Quick-R is the best website I have found for beginners using R. http://www.statmethods.net/