Overview

In this lesson, you’ll learn how to create and modify histograms, bar plots, and scatter plots using the base graphics capabilities of R.

Objectives

  1. Create and modify histograms, bar plots and scatter plots using R.

Readings

Lander, Chapter 7.1

1 Base Graphics

Visualization is one of the most effective ways to explore and get to know your data, generate hypotheses, and present results. R has robust graphics built in, and even better graphics through the ggplot2 package. We’ll start with the base graphics, and then look a bit more closely at ggplot2.

In the previous module, we used aggregate to examine mean temperatures by month in the airquality dataset. We can get much richer information by visualizing these data.

1.1 Histogram

To display a single variable, we can generate a simple histogram using the base R function hist:

hist(airquality$Temp, main = "Temperature Histogram for Airquality", xlab = "Temperature", ylab = "Count")

The first argument is the data vector; “main” is the figure title; “xlab” is the label for the x axis; and “ylab” is the label for the y axis. R chooses the bins automatically, but you can request a different number with “breaks”. A single number is treated as a suggestion only: R rounds the bin edges to convenient values, so the number of bars drawn may differ from the number you asked for.

hist(airquality$Temp, main = "Temperature Histogram for Airquality", xlab = "Temperature", ylab = "Count", breaks = 20)
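To see which bins R actually used, and to take full control of them, a minimal sketch follows. It relies on hist() returning its bin information when plot = FALSE, and on “breaks” also accepting a vector of bin edges. The 5-degree edges from 55 to 100 are an assumption chosen for illustration; they must cover every temperature in the data.

```r
# Compute the histogram without drawing it, to inspect the bins R chose
h <- hist(airquality$Temp, breaks = 20, plot = FALSE)
h$breaks   # bin edges actually used
h$counts   # number of observations in each bin

# Supply explicit bin edges instead of a suggested count
# (assumed range: 5-degree bins from 55 to 100)
edges <- seq(55, 100, by = 5)
hist(airquality$Temp, breaks = edges,
     main = "Temperature Histogram for Airquality",
     xlab = "Temperature", ylab = "Count")
```

Passing explicit edges is useful when you want the same bins across several histograms so that they can be compared directly.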

1.2 Boxplot

If we want a boxplot to summarize a single variable instead, we can use:

boxplot(airquality$Temp, main = "Temperature Boxplot for Airquality", ylab = "Temperature")

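The boxplot displays the median (dark line) and a box spanning the middle half of the data (roughly the 25th to 75th percentiles; the height of the box is the interquartile range, IQR). Each whisker extends to the most extreme data point that lies within 1.5 times the IQR of the edge of the box. Any points beyond the whiskers are drawn individually as outliers.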
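The numbers behind the box and whiskers can be inspected with boxplot.stats(), and the formula interface of boxplot() draws one box per group. The sketch below is only an illustration; grouping by Month is our choice, picked to echo the by-month summary from the previous module.

```r
# Values used to draw the box and whiskers, in order:
# lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(airquality$Temp)
s$stats
s$out      # points that would be drawn as outliers (may be empty)

# One box per month, using the formula interface
boxplot(Temp ~ Month, data = airquality,
        main = "Temperature by Month",
        xlab = "Month", ylab = "Temperature")
```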

1.3 Scatter plot

Finally, if we want the simplest two-variable scatter plot, we can use plot.

If we want to examine temperature as a function of date, we need to first construct a nice date variable, since right now the airquality data has dates by month number and day of the month. (Think about how paste is working here.)

# The airquality measurements were taken from May to September 1973
airdate <- as.Date(paste("1973", "-", airquality$Month, "-", airquality$Day, sep = ""))
head(airdate)
[1] "1973-05-01" "1973-05-02" "1973-05-03" "1973-05-04" "1973-05-05"
[6] "1973-05-06"
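As a hint for the question about paste, here is a small sketch with made-up inputs. The year, months, and days below are hypothetical values chosen only to show how paste() handles vectors of different lengths.

```r
# Hypothetical inputs, chosen only to show how paste() recycles its arguments
yr  <- "2000"
mon <- c(5, 5, 6)
day <- c(1, 2, 1)

# The length-one pieces (yr and "-") are reused for every element of mon and day
date_strings <- paste(yr, "-", mon, "-", day, sep = "")
date_strings

# as.Date() parses each string; printed dates show two-digit months and days
as.Date(date_strings)
```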

Now we can plot temperature as a function of day:

plot(airdate, airquality$Temp, xlab = "Date", ylab = "Temperature", main = "Temperature by Day", pch = 20)

“pch” sets the plotting symbol used for each point. The help page ?points lists the available symbols; show.pch() from the Hmisc package (not part of base R) draws a chart of them.
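If you would rather not install an extra package, a quick chart of the standard symbols can be drawn with base graphics alone. This is only a rough sketch; the layout choices (a single row, labels underneath) are ours.

```r
# Draw symbols 0 through 25 in a row, each labelled with its pch number
codes <- 0:25
plot(codes, rep(1, length(codes)), pch = codes,
     ylim = c(0.5, 1.5), axes = FALSE,
     xlab = "", ylab = "", main = "Base R plotting symbols (pch)")
text(codes, 0.8, labels = codes, cex = 0.7)
```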

To add a straight line to the plot, use abline(), where “a” is the intercept, “b” is the slope, and “col” is the color. abline() can also take a fitted linear regression (the output of lm()) and draw the best-fit line.

plot(airdate, airquality$Temp, xlab = "Date", ylab = "Temperature", main = "Temperature by Day", pch = 20)
abline(a = 90, b = 0, col = "red")
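To illustrate the regression form of abline(), here is a minimal sketch. So that it runs on its own, it uses the row number as a stand-in for the date; that substitution, the variable name day_index, and the position of the vertical line are assumptions made for this example, not part of the lesson's data preparation.

```r
# Row number as a simple stand-in for the date (an assumption for this sketch)
day_index <- seq_len(nrow(airquality))

plot(day_index, airquality$Temp, xlab = "Day of record", ylab = "Temperature",
     main = "Temperature by Day", pch = 20)

# Fit a straight line and pass the fitted model to abline()
fit <- lm(Temp ~ day_index, data = airquality)
abline(fit, col = "blue")

# h = and v = draw horizontal and vertical lines directly
abline(h = 90, col = "red")              # same line as a = 90, b = 0
abline(v = 100, col = "grey", lty = 2)   # vertical line at day 100 (arbitrary)
```

A straight line is a crude summary of data that rise and then fall over the summer, so treat the fitted line as a demonstration of the syntax rather than a sensible model.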

Note the slightly odd syntax here, where abline() is a stand-alone function that draws on top of the existing plot. The base graphics in R are quite powerful and flexible in this way, allowing you to layer many different elements onto the same plot, including points, lines, text labels, and many other things. Even so, plotting with the base R functions has to a degree been displaced by the ggplot2 package, which we discuss in the next lesson.
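The layering idea can be sketched as follows, with each call after plot() adding to the same figure. As an assumption to keep the example self-contained, the row number again stands in for the date. The 90-degree threshold, the lowess smoother, and the colors are illustrative choices.

```r
# Row number as a stand-in for the date (an assumption for this sketch)
day_index <- seq_len(nrow(airquality))
temp <- airquality$Temp

# Base layer: all days
plot(day_index, temp, pch = 20, col = "grey40",
     xlab = "Day of record", ylab = "Temperature", main = "Temperature by Day")

# Layer 1: a smooth trend line
lines(lowess(day_index, temp), col = "blue", lwd = 2)

# Layer 2: redraw the days above 90 degrees in red
hot <- temp > 90
points(day_index[hot], temp[hot], pch = 20, col = "red")

# Layer 3: a text label placed just below the hottest day
i_max <- which.max(temp)
text(day_index[i_max], temp[i_max], labels = "hottest day", pos = 1, cex = 0.8)

# Layer 4: a legend describing the layers
legend("bottomright", legend = c("all days", "above 90", "lowess trend"),
       col = c("grey40", "red", "blue"), pch = c(20, 20, NA), lty = c(NA, NA, 1))
```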