A high quality plot

I’ll keep this post short and sweet. Here’s some code to get a really nice looking plot in R. It has a high pixel count to produce a high resolution output that can be used in a word document. Because of this, the size of everything in the plot (axes, points, text, axis labels, etc) has to be increased. I have skipped my normal commentary and instead left comments in the code. If you have any questions, leave a comment. Hope this helps!!

##First let's grab a dataset from the R library
nottem
#if we look at the data type 
typeof(nottem)
class(nottem)
#We need to convert from a "time series" object to a "matrix" to plot it
temp = as.matrix(nottem)
data =  matrix(temp, ncol=12, byrow = TRUE)
colnames(data) = c("jan", "feb", "mar", "apr", "may", "jun",
                   "jul", "aug", "sep", "oct", "nov", "dec")
#This tells you where you data is stored
#This is important because we will store our plot here
getwd()
#Calculate the monthly means
monthlymean<-apply(data, 2, FUN=mean)
#Make boxplots of the observed monthly data 
png("nice-plot.png",width=2400,height=1600)
par(mfrow=c(1,1),mar=c(8,9,10,9))

boxplot(data,col="cornflowerblue",
          xaxt='n', xlab="", ylab="",
          main="", cex.main=2, cex.lab=3, cex.axis=3, outcex=2)
points(monthlymean, pch=24, col="black", bg="red", cex=4)

axis(side=1, at=1:12, labels=FALSE, tick=TRUE, cex.axis=3, tck=-0.01)
mtext(c("Jan","Feb","Mar","Apr","May","Jun","Jul",
        "Aug","Sep","Oct","Nov","Dec"),at=1:12,side=1,line=3,cex=3)
mtext("Temperature (Fahrenheit)",side=2,las=0,cex=3.5,line=4.5)

legend("topleft",pch=c(24), c("Mean Observed Precipitation"),
       col=c("black"),pt.bg=c("red"), cex=3)
mtext("Average Monthly Temperatures at Nottingham", side=3, line=4.5, cex=4)
mtext("1920-1939", side=3, line=0.6, cex=3.5)
box(which="outer",col="black",lwd=1)
dev.off()

A nice plot

Plotting 2

In this post we’ll cover go into more detail on plotting commands. We’ll use a scatterplot (X-Y plot) as our example plot. Again we’ll use the command plot.

##First let's make some data
x<-c(1,3,5,7,9,11)
y<-c(2,4,6,8,10,12)

plot(x,y)

scatter

Next let’s change the axis labels. To change the axis titles we’ll use the commands xlab and ylab for the x-axis and y-axis, respectively. We add these calls within the parenthesis of the plot function. Let’s make the x-axis “Even” and the y-axis “Odd”.

plot(x, y, xlab = "Even", ylab = "Odd")
plot2

Looks good! Now let’s change the x- and y-axis limits. We’ll use the commands xlim and ylim. In each case we give a lower and upper limit, so we need to concatenate them together with the c function. In our example we’ll set the x-axis from 0 to 15 using xlim = c(0, 15), and the y-axis from 1 to 20 using ylim = c(1, 20). Again these commands are added within the plot function.

plot(x, y, xlim = c(0, 15), ylim = c(1, 20), xlab = "Even", ylab = "Odd")
plot3

Next let’s add a title calling it “My Plot”. We’ll use the command main = “add your title here”.

plot(x, y, main = "My Plot", xlim = c(0, 15), ylim = c(1, 20), 
     xlab = "Even", ylab = "Odd")
plot4

Now, let’s spice up the colors of our plot. Let’s make the points red and bigger. We use the calls “col” and “cex” to adjust these items.

plot(x, y, col = "red", cex = 2, main = "My Plot", 
     xlim = c(0, 15), ylim = c(1, 20), xlab = "Even", 
     ylab = "Odd")
plot5

Now let’s make our points a little bit fancier. We can use the command pch to change the points from the standard hollow circle to a filled diamond (pch = 18). You can find a snapshot of the different pch symbols here. Since this is a filled symbol, the call col colors the outline and the call bg colors the fill of the symbol.

plot(x, y, pch = 23, bg = "yellow", col = "red", 
     cex = 2, main = "My Plot", xlim = c(0, 15), 
     ylim = c(1, 20), xlab = "Even", ylab = "Odd")
plot6

Finally let’s complete the plot by adding a legend. The legend is different than the previous calls. It goes outside of the plot() command. Add the legend() command on a second line. The first bit of code “topleft” adds the legend to the top left of the plot. The second bit of code calls the legend item by the name “my data”. The rest of the code defines the legend item as we added it into the plot. The exception is the call “pt.bg” which has to be used instead of just “bg”.

plot(x, y, pch = 23, bg = "yellow", col = "red", 
     cex = 2, main = "My Plot", xlim = c(0, 15), 
     ylim = c(1, 20), xlab = "Even", ylab = "Odd")
legend("topleft", "my data", pch=22, pt.bg="yellow", col="red")
plot7

That’s it for now. We’ll do some more plotting next time!

R Studio and Shiny

R Studio has released a web application that is run (nearly) entirely through R (R Studio). It’s called Shiny and it’s great! It easily lets you turn your R scripts into a webpage. This is great for teaching purposes, showing off some code, and publishing to the web.

R Studio has given its users everything they need to make a web app using templates they have provided.  Everything fits into one “.R” file for easy editing and publishing.

You can find the Shiny page here: http://shiny.rstudio.com/

Here’s a link to my Shiny app. This has 4 statistical distributions (normal, lognormal, weibull, exponential) and let’s the user interact with the variables. The box plot and histogram of the data respond to the user controlled inputs.

Check it out here: My Shiny App
(Make sure to give it about 30 seconds to fully load for the first time.)

screenshot

Unbroken – A Visual Guide via Google Earth

I recently finished reading Unbroken by Laura Hillenbrand. It was an immensely powerful story of the human spirit and will to survive and flourish. The only thing I felt was lacking in the book was a visual guide (i.e., map) of Louie’s crazy trip around the Pacific. Therefore I made this Google Earth kmz file which has pins for many of the significant locations in the book once Louie is sent to Honolulu. The kmz file can be viewed in Google Maps by clicking the link, with the option to download if interested (see link below). Here’s a list of the locations included:

KMZ Download (via google drive)

Honolulu
Midway Islands
Wake Atoll
Canton Island
Makin – Gilbert Islands
Tarawa – Gilbert Islands
Funafuti
Nauru
Palmyra Atoll
Flight path of the Green Hornet
Lost at Sea – Louie, Phil and Mac’s drifting journey
Wotje Atoll
Kwajalein Atoll – Execution Island
Ofuna Camp (very approximate)
Omori POW Camp
Naoetsu POW Camp

R Studio Update!

R Studio currently has version 0.99.442 out for download. I highly recommend upgrading to the latest version since there has been a lot of new functionality added. Here’s a few snapshots of some of the new features:

There’s built-in debugging icons within R scripts to the left of the line number. For this feature to work properly your script has to be saved (e.g., test.R).

Capture1

When I hover my mouse over the error I see the message “unexpected token ‘)’ “, which is letting me know I have an extra closed parenthesis. Sweet!

Capture1b

There’s built-in auto-fill for some of the parameters you may be adding to a function. In the image below I’m starting to add ‘xlim’ to specify the x-axis limits of my plot. My full coding might be something like “xlim = c(0, 6)”. Click the image below for a bigger picture so you can read the dialog box.

Capture2

In the image below, I’m calling the function lm (for linear regression), but I have a comma without another argument. When I hover the mouse over the yellow triangle exclamation point the built-in debugging says there’s a “missing argument to function call”. This Is AWESOME!!!

Capture3

These are just a few of the new features in R Studio. Get out there and check it out!!!

Plotting Introduction

In this post we’ll cover some of the basic plot types in R such as scatter plots, box plots, histograms, and line graphs. Let’s start with the basic scatter plot.  We’ll use the command plot.

##First let's make some data
x<-c(1,2,3,4,5,6)
y<-c(2,4,6,8,10,12)

plot(x,y)

scatter

Next we’ll make a box plot. For this plot we’ll use some data that is available within R already. Let’s take a look at what’s already within R using the command data(). We see that there’s lots of datasets within R that are ready to go! Let’s load ‘rivers’, which says it is the length of major rivers in North America. Use data(rivers) to load rivers into R. Type ‘rivers’ to see what the data looks like.

data(rivers)
rivers
##   [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315
##  [15]  870  906  202  329  290 1000  600  505 1450  840 1243  890  350  407
##  [29]  286  280  525  720  390  250  327  230  265  850  210  630  260  230
##  [43]  360  730  600  306  390  420  291  710  340  217  281  352  259  250
##  [57]  470  680  570  350  300  560  900  625  332 2348 1171 3710 2315 2533
##  [71]  780  280  410  460  260  255  431  350  760  618  338  981 1306  500
##  [85]  696  605  250  411 1054  735  233  435  490  310  460  383  375 1270
##  [99]  545  445 1885  380  300  380  377  425  276  210  800  420  350  360
## [113]  538 1100 1205  314  237  610  360  540 1038  424  310  300  444  301
## [127]  268  620  215  652  900  525  246  360  529  500  720  270  430  671
## [141] 1770

To make a box plot use the function boxplot.

boxplot(rivers)

rivers

Now let’s make a histogram with the same ‘rivers’ dataset.

hist(rivers)

rivers_hist

Let’s make a line plot using some time series data. Let’s load the dataset ‘airmiles’.

data(airmiles)
airmiles
## Time Series:
## Start = 1937 
## End = 1960 
## Frequency = 1 
##  [1]   412   480   683  1052  1385  1418  1634  2178  3362  5948  6109
## [12]  5981  6753  8003 10566 12528 14760 16769 19819 22362 25340 25343
## [23] 29269 30514

plot(airmiles)

airmiles_line

We’ll notice that we used the same function plot to make a line graph here as we did in the first plot using x and y to make a scatter plot. R has it’s own defaults based on the type of data it receives for the function plot.

If we want to force the plot type (a solid line versus points) we can use the parameter ‘type’, where type=”l” makes a line and type=”p” makes points. Let’s see this using the airmiles plot again.

plot(airmiles, type="p")

xx

Finally let’s use the data x and y from the first plot to make a plot with a line rather than points.

plot(x,y, type="l")

xy_line

That’s it for now, we’ll add more plotting options for the graphs in subsequent posts.

Naming columns and rows

It is often convenient to name the columns and rows within a dataset to keep things clearly organized.  This is very easy to do in R, here’s how.  First let’s make some data.

a <- c(62.3, 55.3, 65.3, 59.3, 67.3)
b <- c(2.2, 5.4, 1.3, 2.8, 5.4)
c <- c(0.1, 1.5, 1.6, 2.1, 0.3)
data <- cbind(a, b, c)
data
##         a   b   c
## [1,] 62.3 2.2 0.1
## [2,] 55.3 5.4 1.5
## [3,] 65.3 1.3 1.6
## [4,] 59.3 2.8 2.1
## [5,] 67.3 5.4 0.3

We can see that this dataset “data” already has the column names “a”, “b”, and “c” stored for me when I used the function cbind. ***For more information on cbind and creating data, check out my other post here***.

Now let’s update the column names to something meaningful using the function colnames. For this function we have to define the data we are working on and then the names of the columns.
colnames(data name here) <- c(“column name 1”, “column name 2”, “etc”)

colnames(data) <- c("temp_F", "wind_m/s", "precip_in")
data
##      temp_F wind_m/s precip_in
## [1,]   62.3      2.2       0.1
## [2,]   55.3      5.4       1.5
## [3,]   65.3      1.3       1.6
## [4,]   59.3      2.8       2.1
## [5,]   67.3      5.4       0.3

Now let’s say that we want to update the row names to be meaningful as well. The function is you guessed it rownames and works the same way as colnames. rownames(data name here) <- c(“row name 1”, “row name 2”, “etc”)

rownames(data) <- c("Site 1", "Site 2", "Site 3", "Site 4", "Site 5")
data
##        temp_F wind_m/s precip_in
## Site 1   62.3      2.2       0.1
## Site 2   55.3      5.4       1.5
## Site 3   65.3      1.3       1.6
## Site 4   59.3      2.8       2.1
## Site 5   67.3      5.4       0.3

Let’s try to break rownames quickly by adding too many row names (6 row names instead of 5). We’ll find that R will get angry and let us know with a semi-cryptic error message.

rownames(data) <- c("Site 1", "Site 2", "Site 3", "Site 4", "Site 5", "Site 6")
## Error in `rownames<-`(`*tmp*`, value = c("Site 1", "Site 2", "Site 3",  : 
## length of 'dimnames' [1] not equal to array extent

If you make a data frame, your data will already reflect the column headers as well. Remember from last time that we can make a data frame using the function data.frame. Here’s an example:

numbers <- c(1, 2, 3, 4, 5)
letters <- c("a", "b", "c", "d", "e")
symbols <- c("!", "@", "#", "$", "%")
data2 <- data.frame(numbers, letters, symbols)
data2
##   numbers letters symbols
## 1       1       a       !
## 2       2       b       @
## 3       3       c       #
## 4       4       d       $
## 5       5       e       %

We can see that our new data frame already has the column names “numbers”, “letters”, and “symbols” from the way we build our dataset.

cbind part 2

In an earlier post we discussed creating data using the functions seq, rep, and then merging them together with cbind. **You can see that post here**. In this post we’re going to go more in depth on the limitations of cbind.

If we make some data of say temp and precip, we can combine using cbind:

temp <- c(52.5, 53.7, 55.7, 57.2, 55.9, 57.3, 60.3)
precip <-c (0.11, 0.22, 0.05, 0.0, 0.0, 0.0, 0.0)
data <- cbind(temp, precip)
data
##      temp precip
## [1,] 52.5   0.11
## [2,] 53.7   0.22
## [3,] 55.7   0.05
## [4,] 57.2   0.00
## [5,] 55.9   0.00
## [6,] 57.3   0.00
## [7,] 60.3   0.00

This worked just fine. The temperature and precipitation data are now stored together in a matrix. We know it’s a matrix because we can query the object type using the functions typeof and class. The “double” means the data is numeric with double precision and the “matrix” means that the data class is a matrix.

typeof(data)
class(data)
## [1] "double"
## [1] "matrix"

cbind works great you are working with the same data types. It does not work well when the data types are different. For instance if data “a” is numeric and data “b” is text, cbind has unexpected results. See for yourself:

a <- c(1, 2, 3, 4, 5)
b <- c("a", "b", "c", "d", "e")
class(a)
class(b)
## [1] "numeric"
## [1] "character"
data2 <-cbind(a, b)
data2
##      a   b  
## [1,] "1" "a"
## [2,] "2" "b"
## [3,] "3" "c"
## [4,] "4" "d"
## [5,] "5" "e"
typeof(data2)
class(data2)
## [1] "character"
## [1] "matrix"

In the results shown above we can see that when we put numeric data together with text data using cbind, the function turns all of the data to text (character). This is not what we want!!! If we try and operate on the data it doesn’t work. We’ll try multiplying the matrix by 2.

2*data2
Error in 2 * data2 : non-numeric argument to binary operator

This is a feature of cbind to be aware of. So what’s the work around you ask? Data frames. Using the same data as above, we’ll store vectors “a” and “b” together in a data frame and preserve their integrity. The function is easy to remember, it’s data.frame.

newdata <- data.frame(a, b)
newdata
##   a b
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
typeof(newdata)
class(newdata)
## [1] "list"
## [1] "data.frame"

More importantly though, if we query the data within the data frame, we’ll see that column “a” is numeric and column “b” is text. Remember we can reference columns and rows within a matrix and data frame using square brackets after the name (data[row, column]). Using data[,2] references all of the rows of data within the second column.

typeof(newdata[,1])
class(newdata[,1])
## [1] "double"
## [1] "numeric"
typeof(newdata[,2])
class(newdata[,2])
## [1] "integer"
## [1] "factor"

We can see from the results that column 1 (“a”) is numeric and column 2 (“b”) is a factor (character). We’ll get more into factors at another time, but it’s safe to say that our data is in tact. We can test this to be certain by operating on each column in our new data frame “newdata”.

2*newdata[,1]
## [1]  2  4  6  8 10

We can operate on column 1 (“a”) which we would expect since they are numeric values.

2*newdata[,2]
## Warning in Ops.factor(2, newdata[, 2]): '*' not meaningful for factors
## [1] NA NA NA NA NA

We cannot operate on column 2 (“b), which we also would expect, since they are text values.

One sample t-test

Perhaps the most widely used statistical analysis for better or worse is the t-test.  Here’s a quick summary of how to call the t-test for one sample using R.  The function name is t.test and the main parameters are the data, the test type (alternative=), the mean (mu=), and the confidence level (conf.level=).

The hardest part about t-tests in R is knowing how to set up the problem.  In these examples the null hypothesis has a mean of 4 (H0: μ = 4) and I have tested the three different alternative hypothesis options: H1: μ ≠ 3, H1: μ < 3, and H1: μ > 3.  In each test I have used a 95% confidence interval (alpha = 0.05).  Note: A t-test would not be performed in this manner using all three alternatives, this is merely for example purposes.

Here’s an example of a one-sided t test using the vector x.

x = c(1,2,4,7,4,3,7,8,3,9)
t.test(x, alternative="two.sided", mu = 3, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  x
## t = 2.0769, df = 9, p-value = 0.0676
## alternative hypothesis: true mean is not equal to 3
## 95 percent confidence interval:
##  2.839464 6.760536
## sample estimates:
## mean of x 
##       4.8
t.test(x, alternative="less", mu = 3, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  x
## t = 2.0769, df = 9, p-value = 0.9662
## alternative hypothesis: true mean is less than 3
## 95 percent confidence interval:
##      -Inf 6.388698
## sample estimates:
## mean of x 
##       4.8
t.test(x, alternative="greater", mu = 3, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  x
## t = 2.0769, df = 9, p-value = 0.0338
## alternative hypothesis: true mean is greater than 3
## 95 percent confidence interval:
##  3.211302      Inf
## sample estimates:
## mean of x 
##       4.8

 

Let’s analyze the results above starting with the alternative hypothesis that the true mean is not equal to 3 (H1: μ ≠ 3).  The results show that the 95% confidence interval for the true mean is 2.84 to 6.76.  Since the null hypothesis states that the true mean is 3 (H0: μ = 3), and 3 is within the 95% confidence interval, the null hypothesis is unlikely to be rejected.  The p-value is 0.076, which is greater than the alpha value of 0.05 (or 1-confidenc interval 0.95).  Since the p-value is not less than 0.05, the alternative hypothesis that the true mean is not equal to 3 is rejected in favor of the null hypothesis.  Some other useful information the t-test provides is the degrees of freedom (9) and the t-statistic 2.08.

Let’s look at the results from the alternative hypothesis that the true mean is less than 3 (H1: μ < 3).  The 95% confidence interval is less than 6.39 and the p-value is 0.966.  Once again the p-value is greater than the alpha value of 0.05 (or 1-0.95), so the alternative hypothesis that the sample mean is less than 3 (H1: μ < 3) is rejected in favor of the null hypothesis that the true mean is 3 (H0: μ = 3).

Finally let’s look at the results from the t-test using the alternative hypothesis that the true mean is greater than 3 (H1: μ > 3).  The 95% confidence interval is 3.21 and greater, which does not include the value of the null hypothesis.  In this case the p-value is 0.0338, which IS less than the alpha value of 0.05 (or 1-0.95) and the null hypothesis (H0: μ = 3) is rejected in favor of the alternative hypothesis (H1: μ > 3) that the true mean is greater than 3.

 

 

rep, seq, and cbind: Data Creation and Processing

rep and seq are basic functions in R that are very powerful for a wide variety of tasks in R.  rep repeats characters and seq creates a sequence of characters.

The rep function has 2 parameters, the value to repeat and the number of times to repeat it.  Here’s some examples of the rep function:

Repeat the number 1 five times.

rep(1, 5)
## [1] 1 1 1 1 1

Repeat the sequence of values 1-5, three times.

rep(1:5, 3)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Repeat the days of the week twice.

days=c("mon","tue","wed","thu","fri","sat","sun")
rep(days, 2)
##  [1] "mon" "tue" "wed" "thu" "fri" "sat" "sun" "mon" "tue" "wed" "thu"
## [12] "fri" "sat" "sun"

Now, let’s look at the seq function.  There are three parameters, the start of the sequence, the end, and the interval in that order.

Here’s a sequence from 1 to 5:

seq(from = 1, to = 5, by = 1)
## [1] 1 2 3 4 5

Here’s a sequence from 0 to 10 by 2:

seq(from = 0, to = 10, by = 2)
## [1]  0  2  4  6  8 10

Finally, let’s merge this newly created data together using the function cbind cbind stands for column bind, and will merge vectors together to form a matrix.

In this example we create two vectors of data (x & y) using the concatenate function, then we will merge them using the cbind function:

x = c(1:5)
y = c(-1:-5)
cbind(x, y)
##      x  y
## [1,] 1 -1
## [2,] 2 -2
## [3,] 3 -3
## [4,] 4 -4
## [5,] 5 -5

Finally, let’s add more data to the matrix we created above.

x = c(1:5)
y = c(-1:-5)
data = cbind(x, y)

a = c(1,1,1,1,1)
b = c(2,2,2,2,2)

cbind(data, a, b)
##      x  y a b
## [1,] 1 -1 1 2
## [2,] 2 -2 1 2
## [3,] 3 -3 1 2
## [4,] 4 -4 1 2
## [5,] 5 -5 1 2

For more information on the cbind function, check out my other post here.