Getting Started With R

=Understanding the R environment=

R runs in a command-line environment. It works with a series of “objects” that are held in your workspace, which is saved in your working directory. Some of your data will live in files (e.g., CSV files) that you generated in other software (e.g., Excel) and want to store permanently. Other objects get created as you work through an R session. Each time you exit R, you will be asked whether you want to save the workspace, i.e., the objects created or updated during the session.

R is similar to other computer languages in that you can assign values to variables, and write code like if/else tests or “for” and “while” loops. Note that R __IS__ case sensitive.
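As a small illustration of the points above, here is a minimal sketch of a loop with an if/else test, plus a reminder that names differing only in case are separate objects. The names total and Total are made up for this example.

```r
# Names are case sensitive: 'total' and 'Total' are two different objects.
total <- 0
Total <- 100

# A 'for' loop with an if/else test inside it.
for (i in 1:5) {
  if (i %% 2 == 0) {
    total <- total + i   # add the even numbers only
  } else {
    next                 # skip the odd numbers
  }
}

total   # 6, i.e. 2 + 4
Total   # still 100
```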

//Other tips://

 * Use the up arrow to recall previous commands.
 * If you hit enter and get a ‘+’ sign, it means the command is incomplete. Check for matching parentheses/brackets.
 * When writing a script, use a ‘#’ at the start of a line to indicate a comment.
 * Leave spaces around operators (+, -, *, /).

=Getting Started=

After you download R, the first thing you should do is specify a logical working directory. Right click on the R shortcut on the desktop and click “Properties”. Under the “Shortcut” tab, go to the “Start in” box and type the path to the directory where you will be storing the data you are working on. I suggest you create a new folder for each project and then change the working directory (or have multiple shortcuts, labeled for each project).
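The working directory can also be checked and changed from inside R with getwd() and setwd(). The sketch below uses tempdir() purely as a stand-in so that it runs anywhere; in practice you would put the path to your own project folder there.

```r
# Show the current working directory.
getwd()

# Switch to a project folder. tempdir() is only a placeholder here;
# replace it with the path to your own project directory.
project_dir <- tempdir()
old_dir <- setwd(project_dir)   # setwd() returns the previous directory

# List the files R can now see without a full path.
list.files()

# Go back to where we started.
setwd(old_dir)
```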

When you double click on the R shortcut button, the R console window opens.

Basically, R has 3 windows: a console (the window that opens at startup), an editor window (which you can open by clicking File>New script), and a graphics window (which pops up automatically when you have graphical output).

To search for a command or for help, try Rseek (RSeek.org). Also check out: [] for help and tips from the friendly R community.

=Performing Simple manipulations; numbers and vectors=

//Vectors and assignment//

R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers. To set up a vector named x, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the R command

> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

This is an assignment statement using the function c which in this context can take an arbitrary number of vector arguments and whose value is a vector got by concatenating its arguments end to end.

A number occurring by itself in an expression is taken as a vector of length one.

Notice that the assignment operator (‘<-’) consists of the two characters ‘<’ (“less than”) and ‘-’ (“minus”) occurring strictly side-by-side, and it ‘points’ to the object receiving the value of the expression. In most contexts the ‘=’ operator can be used as an alternative.

Assignment can also be made using the function assign. An equivalent way of making the same assignment as above is with:

> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))

The usual operator, <-, can be thought of as a syntactic short-cut to this. Assignments can also be made in the other direction, using the obvious change in the assignment operator. So the same assignment could be made using

> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x

If an expression is used as a complete command, the value is printed and lost. So now if we were to use the command

> 1/x

the reciprocals of the five values would be printed at the terminal (and the value of x, of course, unchanged). The further assignment

> y <- c(x, 0, x)

would create a vector y with 11 entries consisting of two copies of x with a zero in the middle place.
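The three assignment forms described above can be checked against each other in a short script. This is only a sketch; x2 and x3 are extra names introduced here so that the three results can be compared.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)             # usual form
assign("x2", c(10.4, 5.6, 3.1, 6.4, 21.7))    # same thing via assign()
c(10.4, 5.6, 3.1, 6.4, 21.7) -> x3            # assignment pointing right

identical(x, x2)   # TRUE
identical(x, x3)   # TRUE

1/x                # printed but not stored; x is unchanged

y <- c(x, 0, x)    # two copies of x with a zero in between
length(y)          # 11
y[6]               # 0, the middle entry
```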

//Vector arithmetic//

Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated. So with the above assignments the command

> v <- 2*x + y + 1

generates a new vector v of length 11 constructed by adding together, element by element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times.
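To see the recycling rule at work, the sketch below builds v and then reproduces it with the recycling written out by hand using rep(). Because 11 is not a multiple of 5, R issues a warning for this expression; suppressWarnings() is used here only to keep the output quiet. The names x_recycled and v_by_hand are made up for the example.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(x, 0, x)

# x (length 5) is recycled 2.2 times to match y (length 11).
# R warns because 11 is not a multiple of 5.
v <- suppressWarnings(2*x + y + 1)
length(v)   # 11

# The same calculation with the recycling made explicit.
x_recycled <- rep(x, length.out = length(y))
v_by_hand  <- 2*x_recycled + y + 1
all.equal(v, v_by_hand)   # TRUE
```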

The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. In addition all of the common arithmetic functions are available. log, exp, sin, cos, tan, sqrt, and so on, all have their usual meaning. max and min select the largest and smallest elements of a vector respectively. range is a function whose value is a vector of length two, namely c(min(x), max(x)). length(x) is the number of elements in x, sum(x) gives the total of the elements in x, and prod(x) their product. Two statistical functions are mean(x) which calculates the sample mean, which is the same as sum(x)/length(x), and var(x) which gives sum((x-mean(x))^2)/(length(x)-1) or sample variance.
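The formulas quoted above for the mean and the sample variance can be verified directly. A minimal sketch, reusing the vector x from earlier; mean_by_hand and var_by_hand are names introduced for this example.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

range(x)     # same as c(min(x), max(x))
length(x)    # 5
sum(x)       # total of the elements
prod(x)      # product of the elements

# The built-in functions agree with the formulas given in the text.
mean_by_hand <- sum(x) / length(x)
var_by_hand  <- sum((x - mean(x))^2) / (length(x) - 1)

all.equal(mean(x), mean_by_hand)   # TRUE
all.equal(var(x),  var_by_hand)    # TRUE
```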

=Reading Data from Files=

Usually, you will have a data set you’ve collected and input into an Excel spreadsheet (or other database software package) that you want to run a statistical analysis on. Let’s work through a simple sample data set, plants.csv, which has data on the tube length (T), limb length (L) and tube base length (N) from a sample of 18 flowers (source: Steel and Torrie 1980, pg. 276).

First you have to make sure you save the Excel file as a CSV (comma delimited text) file. Make sure you save it in the working directory for your project (i.e., the same directory you specified for R to start in).

Then you have to get R to “read” the file as follows:

> data<-read.csv("plants.csv")

You can call your data set anything – I use “data” to be generic, but you could call it “fred”.

To make sure it has been read, type > data and you should see the data listed.

Now you are ready to do some analysis. With R, every time you execute a command you have to specify the object (in this case “data”) that you want the command to run on. You can save typing by attaching the data frame; from this point on (unless you attach a different one) the columns of “data” can be referred to directly by name.

> attach(data)
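The read-and-inspect steps above can be tried without the real file. The sketch below writes a few made-up rows to a temporary CSV (these are not the values in plants.csv), reads them back, and shows two ways of reaching a column without attach(). The names csv_path and plants are chosen for this example only.

```r
# Made-up values, only so there is a file to read.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("T,L,N",
             "49,27,19",
             "44,24,16",
             "32,12,12",
             "42,22,17"), csv_path)

plants <- read.csv(csv_path)

str(plants)    # column names and types
head(plants)   # first rows of the data frame
nrow(plants)   # 4

# Two ways to refer to a column without attaching the data frame.
mean(plants$T)
with(plants, mean(L))
```

Using $ or with() avoids surprises if an attached column name clashes with another object of the same name.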

=Some simple statistical tests=


 * 1) //Descriptive statistics//

You’ve attached the data set (data) which you read from the plants.csv file. To calculate the mean, standard deviation, variance and median for each variable (T, L, and N), do the following:

> mean(T) or > mean(L) or > mean(N)
> sd(T) or > sd(L) or > sd(N)
> var(T) or > var(L) or > var(N)
> median(T) or > median(L) or > median(N)

To calculate quantiles:

> quantile(T)

By default this gives the 0%, 25%, 50%, 75% and 100% quantiles. To obtain other quantiles, you have to generate a vector of the probabilities you want analysed. For example, to analyse deciles, specify a vector (pvec) that is a sequence from 0 to 1, with steps of 0.1:

> pvec<-seq(0,1, 0.1)
> pvec
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Then

> quantile(T, pvec)

You can also summarize an entire data frame by

> summary(data)
       T               L               N
 Min.   :32.00   Min.   :12.00   Min.   :10.00
 1st Qu.:36.25   1st Qu.:15.25   1st Qu.:14.25
 Median :39.50   Median :20.00   Median :15.00
 Mean   :40.44   Mean   :19.67   Mean   :16.17
 3rd Qu.:44.75   3rd Qu.:22.75   3rd Qu.:18.50
 Max.   :53.00   Max.   :29.00   Max.   :22.00
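The quantile calls can be tried on any numeric vector. Here is a minimal sketch using made-up measurements (the vector tube is hypothetical, not the T column of the sample data set).

```r
# Made-up measurements standing in for a column such as T.
tube <- c(32, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 47, 53)

# Default call: minimum, quartiles and maximum.
quantile(tube)

# Deciles: pass a vector of probabilities from 0 to 1 in steps of 0.1.
pvec    <- seq(0, 1, 0.1)
deciles <- quantile(tube, pvec)
deciles

names(deciles)   # labels run from "0%" to "100%"
```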


 * 2) //Correlation analysis//

Let’s say you want to test whether tube length (T) and limb length (L) are correlated.

Very easy.

> cor(T, L)

You should see the following: [1] 0.954978

Which tells you the correlation coefficient is 0.954978.

Now type > help(cor)

This opens a help window that gives details on the command “cor”. You’ll see the default values and the various ways you can modify the command. In the above, the default is to calculate Pearson’s correlation. If you wanted Spearman’s, you’d type:

> cor(T, L, method = c("spearman"))

You should get: [1] 0.9611001

If you want to test the significance (i.e., get a p-value):

> cor.test(T, L, method = c("spearman"))
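The result of cor.test can also be stored in an object so that the estimate and p-value can be pulled out individually. A sketch with made-up numbers follows; tube and limb are hypothetical vectors, so the results will not match the values quoted above.

```r
# Made-up paired measurements, for illustration only.
tube <- c(49, 44, 32, 42, 39, 53, 36, 46)
limb <- c(27, 24, 12, 21, 23, 29, 15, 25)

cor(tube, limb)                        # Pearson (the default)
cor(tube, limb, method = "spearman")   # Spearman rank correlation

# Store the test result, then extract pieces from it.
result <- cor.test(tube, limb, method = "spearman")
result$estimate   # the correlation coefficient (rho)
result$p.value    # the p-value of the test
```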


 * 3) //ANOVA//

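Read the data set heartrate.csv (see below). This gives heart rates of different subjects at different points in time. We will do a two-way ANOVA to see how heart rate is affected by subject and time of sample.

> data<-read.csv("heartrate.csv")
> summary(data)

to see a summary of the data, then

> anova(lm(heart.rate~subject + time, data = data))

Note that “lm” means that you are assuming a linear model (as opposed to a generalized linear model). The tilde ‘~’ indicates that heart.rate is a function of both subject and time. The “data = data” argument tells lm which data frame holds the variables; alternatively, run attach(data) again after reading the new file. If subject or time is stored as numbers, wrap it in factor() so that it is treated as a grouping variable.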

To see a summary of the relationship between heart.rate and subject/time

> summary(lm(heart.rate~subject + time, data = data))
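The same two-way ANOVA can be run on a small made-up data frame to see what the output looks like. This is only a sketch: the data frame hr and its values are invented for illustration, and the column names simply mirror those used above.

```r
# Made-up heart rates: 4 subjects, each measured at 3 times.
hr <- data.frame(
  subject    = factor(rep(c("A", "B", "C", "D"), each = 3)),
  time       = factor(rep(c("t1", "t2", "t3"), times = 4)),
  heart.rate = c(72, 75, 80, 65, 70, 74, 80, 84, 90, 60, 66, 69)
)

# Fit the linear model once, then reuse it.
fit <- lm(heart.rate ~ subject + time, data = hr)

tab <- anova(fit)   # ANOVA table: one row per term plus residuals
tab
tab$Df              # degrees of freedom: 3 (subject), 2 (time), 6 (residuals)

summary(fit)        # coefficient estimates and overall fit
```

Storing the fitted model in an object (fit) avoids refitting it for each summary.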

=Quitting=

> q()

=Bailing (if a script loops //ad infinitum//)=

CTRL-C (or the Esc key in the Windows R console)

=Now that you've gotten this far - what's next?=


 * See the "getting help" page on this wiki
 * See the workshop slideshow:
 * Go to help pages for specific stats packages or graphics help.