{"id":80,"date":"2015-03-17T16:45:54","date_gmt":"2015-03-17T16:45:54","guid":{"rendered":"http:\/\/tech.me.holycross.edu\/?p=80"},"modified":"2015-03-17T16:45:54","modified_gmt":"2015-03-17T16:45:54","slug":"introduction-to-r","status":"publish","type":"post","link":"https:\/\/blogs.holycross.edu\/tech\/2015\/03\/17\/introduction-to-r\/","title":{"rendered":"An Introduction to R"},"content":{"rendered":"<div id=\"TOC\">\n<h3><em>(R is not like other statistical software. But that&#8217;s a good thing.)<\/em><\/h3>\n<ul>\n<li><a id=\"toc-prologue\" href=\"#prologue\">Prologue<\/a><\/li>\n<li><a id=\"toc-getting-a-feel-for-r\" href=\"#getting-a-feel-for-r\">Getting a feel for R<\/a>\n<ul>\n<li><a id=\"toc-the-console\" href=\"#the-console\">The console<\/a><\/li>\n<li><a id=\"toc-function-calls-and-arguments\" href=\"#function-calls-and-arguments\">Function calls and arguments<\/a>\n<ul>\n<li><a id=\"toc-hint-getting-help\" href=\"#hint-getting-help\">HINT: Getting help<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a id=\"toc-some-graphics\" href=\"#some-graphics\">Some graphics<\/a>\n<ul>\n<li><a id=\"toc-hint-command-history-and-workspace\" href=\"#hint-command-history-and-workspace\">HINT: Command history and workspace<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><a id=\"toc-getting-data-into-r\" href=\"#getting-data-into-r\">Getting data into R<\/a>\n<ul>\n<li><a id=\"toc-data-frames\" href=\"#data-frames\">Data frames<\/a><\/li>\n<li><a id=\"toc-creating-data-frames\" href=\"#creating-data-frames\">Creating data frames<\/a>\n<ul>\n<li><a id=\"toc-hint-data-management-in-r-and-rstudio.\" href=\"#hint-data-management-in-r-and-rstudio.\">HINT: Data management in R and RStudio.<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><a id=\"toc-exploratory-data-analysis\" href=\"#exploratory-data-analysis\">Exploratory data analysis<\/a>\n<ul>\n<li><a id=\"toc-scatterplot-matrix\" href=\"#scatterplot-matrix\">Scatterplot matrix<\/a>\n<ul>\n<li><a id=\"toc-hint-r-packages\" 
href=\"#hint-r-packages\">HINT: R packages<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a id=\"toc-summary-statistics-and-data-screening\" href=\"#summary-statistics-and-data-screening\">Summary statistics and data screening<\/a><\/li>\n<li><a id=\"toc-box-plots\" href=\"#box-plots\">Box plots<\/a><\/li>\n<li><a id=\"toc-stem-and-leaf-display\" href=\"#stem-and-leaf-display\">Stem-and-leaf display<\/a><\/li>\n<li><a id=\"toc-categorical-variables-factors\" href=\"#categorical-variables-factors\">Categorical variables: factors<\/a><\/li>\n<li><a id=\"toc-some-confirmatory-analysis\" href=\"#some-confirmatory-analysis\">Some confirmatory analysis<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a id=\"toc-more-graphics\" href=\"#more-graphics\">More graphics<\/a><\/li>\n<li><a id=\"toc-r-scripts-command-files\" href=\"#r-scripts-command-files\">R scripts (command files)<\/a><\/li>\n<li><a id=\"toc-an-r-menu-system-r-commander\" href=\"#an-r-menu-system-r-commander\">An R menu system: R-commander<\/a><\/li>\n<li><a id=\"toc-epilogue\" href=\"#epilogue\">Epilogue<\/a><\/li>\n<\/ul>\n<\/div>\n<p><!--more--><\/p>\n<p><a href=\"http:\/\/live-hcblog.pantheonsite.io\/tech\/wp-content\/uploads\/sites\/2\/2015\/03\/r_logo15.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-731\" src=\"https:\/\/blogs.holycross.edu\/tech\/wp-content\/uploads\/sites\/2\/2015\/03\/r_logo15.png\" alt=\"\" width=\"300\" height=\"233\" \/><\/a><\/p>\n<p>Reading time: 22 minute(s) @ 200 WPM.<\/p>\n<div id=\"prologue\" class=\"section level1\">\n<h1>Prologue<\/h1>\n<p><strong>R is a free<\/strong>, open-source, cross-platform statistical and graphical environment that can be obtained from <a href=\"http:\/\/www.r-project.org\/\">www.r-project.org<\/a>.<\/p>\n<p><strong>Some reasons for using R<\/strong>:<\/p>\n<ul>\n<li>R is world-class software for data analysis and visualization.<\/li>\n<li>R is free. 
There are no license fees and no restrictions on how many times or where you can install the software.<\/li>\n<li>R runs on a variety of computer platforms including Windows, macOS, and Unix.<\/li>\n<li>R provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.<\/li>\n<li>R contains advanced statistical routines not yet available in other packages.<\/li>\n<li>R has state-of-the-art graphics capabilities.<\/li>\n<\/ul>\n<p>More reasons for using R can be found <a href=\"https:\/\/thetarzan.wordpress.com\/2011\/07\/15\/why-use-r-a-grad-students-2-cents\/\">here.<\/a><\/p>\n<p>We note that R is <em>way<\/em> easier to use when run inside of <a href=\"https:\/\/posit.co\/products\/open-source\/rstudio\/\">RStudio<\/a>, a free, cross-platform <a href=\"https:\/\/en.wikipedia.org\/wiki\/Integrated_development_environment\">integrated development environment<\/a> (IDE). Every example in this post can be run either in standalone R or in the <a href=\"https:\/\/support.rstudio.com\/hc\/en-us\/articles\/200404846-Working-in-the-Console\">R console panel<\/a> inside of RStudio. (See also <a href=\"\/post\/rstudio-as-a-research-and-writing-platform\/\">RStudio as a Research and Writing Platform<\/a>.) The true power of R lies in its programmability, that is, the R programming language.<\/p>\n<p><strong>There are many excellent sources<\/strong> of R documentation. These include <a href=\"http:\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-intro.html\">An Introduction to R<\/a>, the gigantic <a href=\"http:\/\/cran.r-project.org\/doc\/manuals\/r-release\/fullrefman.pdf\">R Reference Index<\/a> (PDF), and my personal favorite, <a href=\"http:\/\/www.statmethods.net\/\">Quick-R<\/a>. 
There is also an <a href=\"http:\/\/journal.r-project.org\/\">R Journal<\/a>, an <a href=\"http:\/\/cran.r-project.org\/doc\/FAQ\/R-FAQ.html\">R FAQ<\/a> (Frequently Asked Questions), many <a href=\"http:\/\/www.r-project.org\/doc\/bib\/R-books.html\">books<\/a>, the <a href=\"http:\/\/cran.r-project.org\/doc\/contrib\/Short-refcard.pdf\">R Reference Card<\/a> (PDF), various <a href=\"http:\/\/web.mit.edu\/hackl\/www\/lab\/turkshop\/slides\/r-cheatsheet.pdf\">R cheat sheets<\/a> (Google it), and other assorted <a href=\"http:\/\/www.r-project.org\/other-docs.html\">R websites<\/a>.<\/p>\n<p><strong>To install R<\/strong>, go to the <a href=\"http:\/\/www.r-project.org\/\">R website<\/a>. Under \u201cGetting Started:\u201d, click <a href=\"http:\/\/cran.r-project.org\/mirrors.html\">download R<\/a>. Choose a CRAN (<strong>C<\/strong>omprehensive <strong>R<\/strong> <strong>A<\/strong>rchive <strong>N<\/strong>etwork) mirror site that is closest to you. Then, under \u201cDownload and Install R,\u201d download the installer (\u201cPrecompiled binary distribution\u201d) for your particular computer platform. Run the downloaded installer application and you should be off and running. Extensive additional gory details can be found at <a href=\"http:\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-admin.html\">R Installation and Administration<\/a>.<\/p>\n<p><strong>R is not like other statistical software<\/strong>. 
Rather than being a \u201csoftware package\u201d in the usual sense of SPSS, SAS, Systat, etc., where the user fills in the blanks in a series of menus and dialog boxes and then has to sift through reams of canned output, R is actually a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Programming_language\">programming language<\/a> and computing environment customized for data analysis and graphics.<\/p>\n<\/div>\n<div id=\"getting-a-feel-for-r\" class=\"section level1\">\n<h1>Getting a feel for R<\/h1>\n<p>The impact of this introduction to R will be greater if you follow along in R, entering the commands as shown. You can copy and paste R commands from this introduction directly into the R console instead of typing them by hand.<\/p>\n<p>Readers may find the <a href=\"http:\/\/www.stat.berkeley.edu\/~stark\/SticiGui\/Text\/gloss.htm\">Glossary of Statistical Terms<\/a> helpful.<\/p>\n<div id=\"the-console\" class=\"section level2\">\n<h2>The console<\/h2>\n<p>When you start R you will see a command console window, into which you type R commands. It looks like this on the Mac (and will look exactly the same inside of RStudio):<\/p>\n<p><img decoding=\"async\" src=\"\/img\/IntroductionToR\/RConsole.png\" \/><\/p>\n<p>The command prompt is the <code>&gt;<\/code> symbol.<\/p>\n<p>You type R statements into the console, which is a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Command-line_interface\">command-line interface<\/a>. R then evaluates the statements, and if it finds no errors, does something, like printing a result, producing graphics, or storing a value for later use. It then waits for you to type in another command. This is how computing was done in the <a href=\"https:\/\/en.wikipedia.org\/wiki\/History_of_computing\">olden days<\/a>, and it is still <a href=\"https:\/\/code.tutsplus.com\/tutorials\/the-command-line-is-your-best-friend--net-30362\">the most efficient and powerful way to do computing<\/a>. 
Something about <a href=\"https:\/\/en.wiktionary.org\/wiki\/everything_old_is_new_again\">everything old is new again<\/a>, in a manner of speaking.<\/p>\n<p>R commands can also be stored in a <a href=\"https:\/\/www.r-bloggers.com\/r-scripts\/\">script<\/a>, which is simply a text file containing basically the same commands that you would enter interactively on the command line. This is one of the most powerful features of R (R is, after all, a programming language) and enables complex sequences of commands to be stored, edited, and rerun.<\/p>\n<p>R can evaluate expressions and manipulate variables, e.g.,<\/p>\n<pre><code>&gt; x &lt;- 2<\/code><\/pre>\n<p>This assigns the value of 2 to the variable <code>x<\/code>. In R-speak, this assignment operation is pronounced \u201cx gets 2,\u201d not \u201cx equals 2.\u201d Many other programming languages use an equals sign (<code>=<\/code>) for assigning values to variables (e.g., <code>x = 2<\/code>), but R is different. In R you type a less-than symbol immediately followed by a dash, making a left-pointing arrow that points towards the variable to which you want to assign a value. R will <em>let<\/em> you use an equals sign, but <a href=\"http:\/\/blog.revolutionanalytics.com\/2008\/12\/use-equals-or-arrow-for-assignment.html\">R purists<\/a> insist that using <code>&lt;-<\/code> is preferable. This is true both for historical reasons (explained in the previous link) and for consistency, because there is another R assignment operator that uses a <em>double<\/em> arrow, namely the <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/base\/html\/assignOps.html\">&lt;&lt;-<\/a> operator.<\/p>\n<p>To print the contents of a variable, simply type the variable\u2019s name:<\/p>\n<pre><code>&gt; x\n[1] 2<\/code><\/pre>\n<p>The 1 in square brackets is part of R\u2019s way of printing lists of numbers. This convention becomes more useful when there is a longer list of numbers to print. 
The number in brackets is the index of the first number on that line. So if we were to generate 15 random numbers from a normal distribution, our output display might look like:<\/p>\n<pre><code>&gt; rnorm(15)\n\n[1] -1.20251243 -0.93915306 -0.58314349 -0.28205304 -0.72031211 1.12303378\n[7] 1.60557581 1.30062736 1.06739881 -2.09506242 -0.04172139 -1.66868780\n[13] 0.87027623 0.43993863 -0.07720584<\/code><\/pre>\n<p>Here the <code>[7]<\/code> indicates that <code>1.60557581<\/code> is the seventh element in the list of numbers (called a <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/base\/html\/vector.html\">vector<\/a> in R parlance).<\/p>\n<p>In R, the pound symbol (<code>#<\/code>) denotes a comment. Everything from a <code>#<\/code> to the end of the line is treated as a comment and is ignored by the R <a href=\"https:\/\/en.wikipedia.org\/wiki\/Interpreted_language\">interpreter.<\/a><\/p>\n<\/div>\n<div id=\"function-calls-and-arguments\" class=\"section level2\">\n<h2>Function calls and arguments<\/h2>\n<p>Much of R involves function calls, arguments, and return values. As in your high school algebra class, when you wrote <code>f(x)<\/code> (\u201cf of x\u201d), the <code>f<\/code> stands for the name of the function and the <code>x<\/code> stands for the arguments passed into the function. In algebra and in computer programming, a function is an equation or a piece of computer code that takes in one or more values (arguments), does some sort of computation with them, and then returns a result.<\/p>\n<p>For example, in R we can generate a vector of 1000 random values from a normal distribution with mean = 0 and standard deviation = 1 (the default arguments) by typing:<\/p>\n<p><code>&gt; y = rnorm(1000)<\/code><\/p>\n<p>Here, <code>rnorm<\/code> is the name of the function and 1000 is the argument that we pass into the function. The argument tells the function that we want it to generate 1000 values. 
The <code>rnorm<\/code> function generates the values and returns them for storage in the variable that we have named <code>y<\/code>. If you type the name of the variable <code>y<\/code> at the R console you will get a display of those 1000 values. Here are the first and last few lines of that output:<\/p>\n<pre><code>[1] 1.3395720217 -0.0115032420 -0.6823326244 0.2613036332 0.1040746632 0.0454721524 0.3640455956 0.7205450723 0.4994317717\n[10] 1.0467334635 0.0941044374 0.0470323976 1.6612519966 -1.2653165072 0.2894176277 0.2140272012 -0.1166290364 1.9260137594\n[19] -0.1623506191 0.2429738277 -0.0908777982 -1.0126380527 -1.4146009241 0.6927008684 1.5923438893 0.7920350474 -0.3419451639\n...\n[982] 0.2657230634 0.4844520987 1.9939724168 -0.3426382437 -2.6953913082 -0.7822534469 1.0135964703 0.7795363595 0.1870213245\n[991] -1.4046173709 -0.3727388529 -0.2606101406 -0.9251060911 0.1755509390 -2.5640188283 0.5750884848 0.5416196143 -0.5890929375\n[1000] -1.4114665861<\/code><\/pre>\n<p>Instead of using the default, unspecified argument values of mean = 0 and standard deviation = 1, we can invoke <code>rnorm<\/code> like this, specifying a particular mean and sd:<\/p>\n<p><code>&gt; y = rnorm(1000, mean=42, sd=2.5)<\/code><\/p>\n<p>If you type <code>help(rnorm)<\/code> you will get a help screen that explains the various arguments that can be used with <code>rnorm<\/code> and its relatives.<\/p>\n<p>To see if we got what we wanted, we can produce a few summary statistics of the variable y:<\/p>\n<p><code>&gt; summary(y)<\/code><\/p>\n<pre><code>Min. 1st Qu. Median Mean 3rd Qu. 
Max.\n34.09 40.29   41.91 41.91 43.63 50.73<\/code><\/pre>\n<p>We see that the mean is 41.91, which is \u201cclose enough\u201d to 42, given that <code>rnorm<\/code> generates a different batch of random values each time it is invoked.<\/p>\n<hr \/>\n<div id=\"hint-getting-help\" class=\"section level3\">\n<h3>HINT: Getting help<\/h3>\n<p>You can get help on any R command by typing <code>help(command)<\/code>.<\/p>\n<p>e.g.,<\/p>\n<p><code>help(plot)<\/code><\/p>\n<p><code>?command<\/code> works the same, e.g.\u00a0<code>?plot<\/code><\/p>\n<p>You can also search for all help topics that contain a particular command by typing two question marks immediately followed by the command, e.g.:<\/p>\n<p><code>??plot<\/code><\/p>\n<hr \/>\n<\/div>\n<\/div>\n<div id=\"some-graphics\" class=\"section level2\">\n<h2>Some graphics<\/h2>\n<p>To visualize the batch of numbers stored in the variable <code>y<\/code>, we can produce a <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/graphics\/html\/hist.html\">histogram<\/a>:<\/p>\n<p><code>&gt; hist(y)<\/code><\/p>\n<p><a href=\"\/img\/IntroductionToR\/hist.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/hist50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>This graphic shows us the shape of the distribution of the variable <code>y<\/code>. The peak of the histogram is centered on the mean value of 42 and tapers off symmetrically on both sides of the mean. This is because the data we generated using <code>rnorm<\/code> is from a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Normal_distribution\">normal distribution<\/a>, which by definition is in the shape of the famous bell curve.<\/p>\n<p>The plot we just produced, with only one line of R code, can be easily saved to a graphics file in various formats for importing into other applications such as word processors and presentations. 
For example, to export the plot to a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Portable_Document_Format\">PDF<\/a> file, select the graphics window and then invoke the R <code>File|Save As\u2026<\/code> menu option to save the file.<\/p>\n<p>Plots can be saved to other graphics formats as well, but you need to do it from the R console. To save as a JPEG graphics file:<\/p>\n<pre><code>&gt; jpeg('rplot.jpg')\n&gt; hist(y)\n&gt; dev.off()<\/code><\/pre>\n<p>Note that the plot will be saved in the current working directory (see <code>HINT: Command history and workspace<\/code>, below) unless otherwise specified. You also may get a \u201cnull device\u201d message after the <code>dev.off()<\/code> statement; this just means that the JPEG file (the \u201cgraphics device\u201d) has been closed.<\/p>\n<p>You can use similar commands to export R graphics to other popular graphics formats, such as <a href=\"http:\/\/en.wikipedia.org\/wiki\/BMP_file_format\">bmp<\/a>, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Tagged_Image_File_Format\">tiff<\/a>, and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Portable_Network_Graphics\">png<\/a>.<\/p>\n<p>As is the case with many R things, graphics management is easier in <a href=\"https:\/\/posit.co\/products\/open-source\/rstudio\/\">RStudio<\/a>, which has a dedicated <code>Plots<\/code> panel for keeping track of graphics and exporting them in various formats, without the need for typing code into the console.<\/p>\n<p>To summarize thus far, <a href=\"http:\/\/www.statmethods.net\/about\/learningcurve.html\">R works differently<\/a> than other statistical software you may have used:<\/p>\n<blockquote><p>\u201cRather than setting up a complete analysis at once, the process is highly interactive. You run a command (say fit a model), take the results and process it through another command (say a set of diagnostic plots), take those results and process it through another command (say cross-validation), etc. 
The cycle may include transforming the data, and looping back through the whole process again. You stop when you feel that you have fully analyzed the data.\u201d<\/p><\/blockquote>\n<hr \/>\n<div id=\"hint-command-history-and-workspace\" class=\"section level3\">\n<h3>HINT: Command history and workspace<\/h3>\n<p>R maintains a history of all of your commands, which you can access using the up- and down-arrow keys. Use the left- and right-arrow keys to move through and edit a retrieved command, after which you can re-submit the edited command by hitting the <code>Enter<\/code> key. (RStudio also has a separate <code>History<\/code> tab that contains the R command history.)<\/p>\n<p>Some handy workspace commands:<\/p>\n<ul>\n<li>Print the current working directory: <code>getwd()<\/code><\/li>\n<li>Change the working directory to <code>mydirectory<\/code> (the path is given as a quoted string): <code>setwd(\"mydirectory\")<\/code><\/li>\n<li>List the objects in the current workspace: <code>ls()<\/code><\/li>\n<li>Remove one or more objects from the workspace: <code>rm()<\/code>, e.g., <code>rm(y)<\/code><\/li>\n<\/ul>\n<p>For more help on managing your R workspace, go <a href=\"http:\/\/www.statmethods.net\/interface\/workspace.html\">here<\/a>.<\/p>\n<hr \/>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"getting-data-into-r\" class=\"section level1\">\n<h1>Getting data into R<\/h1>\n<div id=\"data-frames\" class=\"section level2\">\n<h2>Data frames<\/h2>\n<p>R uses an internal structure called a data frame (one of several common R <a href=\"http:\/\/www.statmethods.net\/input\/datatypes.html\">data types<\/a>) to store data in a row-column, spreadsheet-like format, where the rows are the objects of interest (your <a href=\"http:\/\/en.wikipedia.org\/wiki\/Unit_of_analysis\">units of analysis<\/a>) and the columns are variables measured on each object. In R, a data frame can be created by reading in raw data <a href=\"http:\/\/www.statmethods.net\/input\/importingdata.html\">from an external file<\/a>. 
A data frame can be saved by <a href=\"http:\/\/www.statmethods.net\/input\/exportingdata.html\">exporting it to an external file<\/a> or to R\u2019s <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/base\/html\/save.html\">internal data format<\/a>.<\/p>\n<p>To illustrate this process we will use a subset (males only) of a famous dataset from biology, the Bumpus <a href=\"http:\/\/www.allaboutbirds.org\/guide\/House_Sparrow\/id\">House Sparrow<\/a> data:<\/p>\n<ul>\n<li>Bumpus sparrow data: <a href=\"\/datasets\/bumpus.dat.txt\">bumpus.dat.txt<\/a><\/li>\n<li>Bumpus sparrow <a href=\"https:\/\/en.wikipedia.org\/wiki\/Metadata\">metadata<\/a>: <a href=\"\/datasets\/bumpus.met.txt\">bumpus.met.txt<\/a><\/li>\n<\/ul>\n<p>Here is a description of the dataset, from <a href=\"https:\/\/www.fieldmuseum.org\/science\/blog\/hermon-bumpus-and-house-sparrows\">Hermon Bumpus and House Sparrows<\/a>:<\/p>\n<blockquote><p>\u2026 on February 1 of the present year (1898), when, after an uncommonly severe storm of snow, rain, and sleet, a number of English sparrows [= House Sparrows, <em>Passer domesticus<\/em>] were brought to the Anatomical Laboratory of Brown University [, Providence, Rhode Island]. Seventy-two of these birds revived; sixty-four perished; \u2026 \u201d (p.\u00a0209). \u201c\u2026 the storm was of long duration, and the birds were picked up, not in one locality, but in several localities; \u2026\u201d (p.\u00a0212). This event . . . described by Hermon Bumpus (1898) . . . has served as a classic example of natural selection in action. 
Bumpus\u2019 paper is of special interest since he included the measurements of these 136 birds in his paper.<\/p><\/blockquote>\n<p>For more information on Hermon Bumpus and House Sparrows, see <a href=\"\/pdf\/Johnston.pdf\">Hermon Bumpus and Natural Selection in the House Sparrow <em>Passer domesticus<\/em><\/a> (PDF).<\/p>\n<\/div>\n<div id=\"creating-data-frames\" class=\"section level2\">\n<h2>Creating data frames<\/h2>\n<p>To create an R data frame of the sparrow data:<\/p>\n<ol style=\"list-style-type: lower-alpha\">\n<li>Download the plain text data file <a href=\"\/datasets\/bumpus.dat.txt\">bumpus.dat.txt<\/a> to a readily accessible location on your computer.<\/li>\n<li>Issue the following command in R:<\/li>\n<\/ol>\n<p><code>&gt; bumpus = read.table(\"d:\/empty\/bumpus.dat.txt\", sep = \"\", header = TRUE)<\/code><\/p>\n<p>replacing <code>d:\/empty\/bumpus.dat.txt<\/code> with the path to the data file on your particular computer.<\/p>\n<p>In the above <code>read.table<\/code> command, <code>sep = \"\"<\/code> (the default) means that the data items are separated by white space (one or more spaces, tabs, or newlines) rather than by a specific character such as a comma, and <code>header = TRUE<\/code> means that the first line in the data file is a header containing the names of the variables.<\/p>\n<p>Once your data are in a data frame, you can then begin to manipulate, analyze, and visualize the data. Typing <code>names(bumpus)<\/code> lists the names of the variables. Typing the name of the data frame (<code>bumpus<\/code>) will scroll all of the data to the console window. To display individual variables, use syntax like <code>bumpus$survive<\/code>. Better still, enter <code>attach(bumpus)<\/code>. Once a data frame is \u201cattached\u201d (added to R\u2019s search path) you can just type the name of a variable by itself to display the variable\u2019s contents. 
Typing <code>fix(bumpus)<\/code> will place the data frame into an editing window.<\/p>\n<hr \/>\n<div id=\"hint-data-management-in-r-and-rstudio.\" class=\"section level3\">\n<h3>HINT: Data management in R and RStudio.<\/h3>\n<p>Actually, you don\u2019t need to download the data file to your computer. You can read it into an R data frame directly if you know the <a href=\"https:\/\/en.wikipedia.org\/wiki\/URL\">URL<\/a> that points to the file location on a server:<\/p>\n<p><code>&gt; bumpus = read.table(\"https:\/\/richardlent.github.io\/datasets\/bumpus.dat.txt\", sep = \"\", header = TRUE)<\/code><\/p>\n<p>This will create the bumpus data frame exactly as if you had first downloaded the source data to your computer.<\/p>\n<p>Importing data from a local file or URL is even easier in RStudio. From RStudio\u2019s File menu, or from the Environment tab, choose <code>Import Dataset|From CSV...<\/code>, and then enter the URL. If instead you choose to import data from a local file, you will be given a dialog box from which you can choose the file from your computer\u2019s filesystem. RStudio can usually figure out the structure of your data as long as you follow standard practice, such as organizing your data into a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Comma-separated_values\">comma-separated values<\/a> (CSV) file. However, RStudio also presents you with various Import Options that you can tweak until the Data Preview indicates that all is well. 
You then hit the <code>Import<\/code> button and a data frame will be created.<\/p>\n<p>A CSV file might look like this, in a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Plain_text\">plain-text<\/a> file we could call <code>cars.csv<\/code>:<\/p>\n<pre><code>Year,Make,Model,Length\n1997,Ford,E350,2.34\n2000,Mercury,Cougar,2.38<\/code><\/pre>\n<p>In the CSV format, the first line of the data file contains field names, and subsequent lines contain the actual data, with all values separated by commas.<\/p>\n<p>Once imported, you can open the data frame in a Data Viewer window from the RStudio <code>Environment<\/code> pane. (See <a href=\"https:\/\/support.rstudio.com\/hc\/en-us\/articles\/205175388-Using-the-Data-Viewer\">Using the Data Viewer<\/a>.) You can\u2019t edit data directly in the Data Viewer; it\u2019s just a viewer. To edit an R data frame, type, for example, <code>fix(bumpus)<\/code> into the console window. This places the data frame into an editor pane in which you can change values. When you quit the editor, the changes will be saved into the data frame. 
You can then save the modified data frame using the <code>save<\/code> command, e.g.,<\/p>\n<p><code>save(bumpus, file=\"bumpus.rdata\")<\/code><\/p>\n<p>This creates a data file in R\u2019s own binary format, which can later be re-loaded into your workspace with the <code>load<\/code> command:<\/p>\n<p><code>load(\"bumpus.rdata\")<\/code><\/p>\n<p>In RStudio you can also load an R data file by choosing it from the <code>Files<\/code> pane.<\/p>\n<p>An R data frame can be exported to a plain-text CSV file using the <code>write.table()<\/code> function:<\/p>\n<p><code>write.table(bumpus, \"bumpus.csv\", sep=\",\")<\/code><\/p>\n<p>Here the three arguments passed to <code>write.table()<\/code> are (1) the name of the R data frame to export, (2) the name of the exported file, and (3) the data separator. By default <code>write.table()<\/code> also writes the row names as an extra first column that has no entry in the header line; add <code>row.names = FALSE<\/code> to leave them out.<\/p>\n<hr \/>\n<p>If we had an <a href=\"https:\/\/en.wikipedia.org\/wiki\/SPSS\">SPSS<\/a> file, we could read the data into an R data frame like this (the <code>read.spss<\/code> function is in the <code>foreign<\/code> package, which ships with R but must first be loaded with <code>library(foreign)<\/code>):<\/p>\n<p><code>&gt; cars = read.spss(\"d:\/empty\/cars.sav\", to.data.frame = TRUE)<\/code><\/p>\n<p>As with CSV files, you can import SPSS, <a href=\"https:\/\/en.wikipedia.org\/wiki\/SAS_Institute\">SAS<\/a>, and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stata\">Stata<\/a> files from the <code>Import Dataset<\/code> menus in RStudio.<\/p>\n<p>Go <a href=\"http:\/\/www.statmethods.net\/input\/importingdata.html\">here<\/a> to see how to import other popular data formats.<\/p>\n<p>The following sequence of commands creates a new data frame from scratch:<\/p>\n<pre><code>age = c(25, 30, 56)\ngender = c(\"male\", \"female\", \"male\")\nweight = c(160, 110, 220)\nmydata = data.frame(age, gender, weight)<\/code><\/pre>\n<p>The <code>c()<\/code> function combines values into a vector, and the <code>data.frame()<\/code> function packages the vectors as columns into a data frame.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"exploratory-data-analysis\" class=\"section level1\">\n<h1>Exploratory data analysis<\/h1>\n<p>The principles of <a 
href=\"http:\/\/en.wikipedia.org\/wiki\/Exploratory_data_analysis\">exploratory data analysis<\/a> (EDA) were pioneered by one of my favorite statisticians, <a href=\"http:\/\/en.wikipedia.org\/wiki\/John_Tukey\">John Tukey<\/a>. An important tenet of EDA is that, before you statistically analyze your data, you should actually look at it.<\/p>\n<p>Tukey stressed the contrasting approaches of <a href=\"http:\/\/www.geog.ucsb.edu\/~joel\/g210_w07\/lecture_notes\/lect01\/oh07_01_2.html\">exploratory versus confirmatory data analysis<\/a>. Classical <a href=\"http:\/\/en.wikipedia.org\/wiki\/Parametric_statistics\">parametric<\/a> statistical techniques like <a href=\"http:\/\/en.wikipedia.org\/wiki\/Analysis_of_variance\">analysis of variance<\/a> and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Linear_regression\">linear regression<\/a> are confirmatory techniques requiring that the data adhere to some fairly rigid assumptions, such as <a href=\"http:\/\/en.wikipedia.org\/wiki\/Normal_distribution\">normality<\/a> and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Linearity\">linearity<\/a>. Modern software makes it easy to crank out sophisticated statistical analyses with a couple of mouse clicks, but it also makes it easy to ignore the underlying assumptions. Performing some exploratory analysis of your data will allow you to assess those assumptions, and will also allow you to detect things like outliers and errors in data entry. (See Tukey\u2019s classic 1980 American Statistician paper <a href=\"\/pdf\/Tukey1980.pdf\">We Need Both Exploratory and Confirmatory<\/a>).<\/p>\n<p>What follows is an EDA session in R using the Bumpus dataset.<\/p>\n<div id=\"scatterplot-matrix\" class=\"section level2\">\n<h2>Scatterplot matrix<\/h2>\n<p>A handy graphical tool for EDA is the <em>scatterplot matrix<\/em>. 
This is simple to produce in R: Just give the <code>plot<\/code> function the name of a data frame:<\/p>\n<p><code>&gt; plot(bumpus)<\/code><\/p>\n<p><a href=\"\/img\/IntroductionToR\/splom.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/splom50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>A scatterplot matrix shows individual scatterplots of all pairwise combinations of variables in the data frame and is an excellent way to search for odd patterns in your data, such as groupings of data points, nonlinearities, etc.<\/p>\n<hr \/>\n<div id=\"hint-r-packages\" class=\"section level3\">\n<h3>HINT: R packages<\/h3>\n<p>R organizes all of its routines into <em>packages<\/em>. For example, the <code>plot<\/code> routine is part of R\u2019s <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/graphics\/html\/00Index.html\">graphics<\/a> package. The packages that are loaded by default when R starts can be listed as follows:<\/p>\n<pre><code>&gt; getOption(\"defaultPackages\")\n\n[1] \"datasets\" \"utils\" \"grDevices\" \"graphics\" \"stats\" \"methods\"<\/code><\/pre>\n<p>And as you may have noticed by now, R commands are case-sensitive.<\/p>\n<p>The <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-patched\/library\/datasets\/html\/00Index.html\">datasets<\/a> package contains example datasets. The <a href=\"http:\/\/cran.r-project.org\/web\/packages\/R.utils\/index.html\">utils<\/a> package contains functions used for programming and developing other packages. The <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/grDevices\/html\/grDevices-package.html\">grDevices<\/a> package provides support for graphics hardware and software. The <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/graphics\/html\/00Index.html\">graphics<\/a> package contains graphics functions for the base R distribution. 
(Additional, more sophisticated graphics functions are in the <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/lattice\/html\/Lattice.html\">lattice<\/a> and <a href=\"http:\/\/ggplot2.tidyverse.org\/index.html\">ggplot2<\/a> packages.) Common statistical routines are found in R\u2019s <a href=\"http:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/stats\/html\/00Index.html\">stats<\/a> package. The <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/methods\/html\/Methods.html\">methods<\/a> package provides R\u2019s formal (S4) classes and methods.<\/p>\n<p>Users of R will probably interact most with the graphics and stats packages. There are many other R <a href=\"http:\/\/www.statmethods.net\/interface\/packages.html\">packages<\/a> that provide a way to expand the functionality of the base software. <a href=\"http:\/\/lib.stat.cmu.edu\/R\/CRAN\/web\/packages\/\">Contributed packages<\/a> are written by members of the R user community. <a href=\"http:\/\/lib.stat.cmu.edu\/R\/CRAN\/web\/views\/\">CRAN Task Views<\/a> present collections of packages organized by topic and provide tools to automatically install all packages for particular areas of interest. Another way to find R packages is to do a Google search, e.g.\u00a0<a href=\"https:\/\/www.google.com\/search?q=how+to+do+principal+coordinates+analysis+in+r&amp;oq=how+to+do+principal+coordinates+analysis+in+r&amp;aqs=chrome.0.69i57j69i62l3.11220j0&amp;sourceid=chrome&amp;ie=UTF-8#safe=off&amp;q=How+do+I+do+principal+coordinates+analysis+in+R\">How do I do principal coordinates analysis in R<\/a>?<\/p>\n<p>If you want to use an R function that is not part of the base packages, you must first install that function\u2019s package and then load it with <code>library()<\/code> before you can use it. A Package Installer is included as part of the R GUI and can be used to download and install contributed packages. 
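<\/p>\n<p>The same steps can be carried out from the console. The following is a minimal sketch, not the only way to do it; it uses the <code>rgl<\/code> package purely as an example, and assumes an internet connection and a default CRAN mirror:<\/p>\n<pre><code>&gt; # install the package only if it is not already installed\n&gt; if (!requireNamespace(\"rgl\", quietly = TRUE)) install.packages(\"rgl\")\n&gt; # load the installed package into the current session\n&gt; library(rgl)<\/code><\/pre>\n<p>Installation is needed only once per R installation, but <code>library()<\/code> must be called in every new session. 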
In RStudio, the Packages pane and Tools menu (<code>Install Packages<\/code>) can be used to facilitate installation and management of packages.<\/p>\n<hr \/>\n<\/div>\n<\/div>\n<div id=\"summary-statistics-and-data-screening\" class=\"section level2\">\n<h2>Summary statistics and data screening<\/h2>\n<p>To produce summary statistics for all variables in data frame <code>bumpus<\/code>, type:<\/p>\n<p><code>&gt; summary(bumpus)<\/code><\/p>\n<pre><code>    survive           length           alar           weight           lbh             lhum             lfem            ltibio          wskull           lkeel       \n Min.   :0.0000   Min.   :153.0   Min.   :39.00   Min.   :3.000   Min.   :30.00   Min.   :0.6590   Min.   :0.6580   Min.   :1.000   Min.   :0.5700   Min.   :0.7880  \n 1st Qu.:0.0000   1st Qu.:158.8   1st Qu.:44.75   1st Qu.:4.875   1st Qu.:30.30   1st Qu.:0.7080   1st Qu.:0.7000   1st Qu.:1.103   1st Qu.:0.5940   1st Qu.:0.8300  \n Median :1.0000   Median :160.0   Median :47.00   Median :5.750   Median :31.00   Median :0.7360   Median :0.7090   Median :1.117   Median :0.6000   Median :0.8500  \n Mean   :0.5893   Mean   :160.0   Mean   :46.98   Mean   :5.764   Mean   :30.96   Mean   :0.7291   Mean   :0.7105   Mean   :1.123   Mean   :0.6012   Mean   :0.8506  \n 3rd Qu.:1.0000   3rd Qu.:161.0   3rd Qu.:50.00   3rd Qu.:6.500   3rd Qu.:31.50   3rd Qu.:0.7460   3rd Qu.:0.7180   3rd Qu.:1.150   3rd Qu.:0.6082   3rd Qu.:0.8795  \n Max.   :1.0000   Max.   :166.0   Max.   :53.00   Max.   :8.300   Max.   :31.90   Max.   :0.7800   Max.   :0.7650   Max.   :1.197   Max.   :0.6330   Max.   :0.9160<\/code><\/pre>\n<p>Previously we used the <code>summary<\/code> function for a single variable. Here we give <code>summary<\/code> the name of our data frame and get summary statistics for all of the variables. <code>survive<\/code> is a binary variable, taking on only two values: 1 for survived, 0 for dead. 
With a binary variable the mean is not very useful as a description of an individual: a bird can\u2019t be 0.59 alive. (It does tell us that about 59% of the birds in this sample survived.) However, the other variables are morphological measures that vary continuously along their scale of measurement (see <a href=\"\/datasets\/bumpus.met.txt\">bumpus.met.txt<\/a>), and therefore the summary statistics can be very informative. Look at the minimum and maximum values carefully for errors, such as a misplaced decimal point. Would a House Sparrow really weigh 3000 grams (6.61 pounds)? Many researchers skip this elementary data-screening step, which is especially important if humans are typing in the data from mud-splattered field notebooks. Also, laboratory instruments can go out of calibration, batteries can become weak, power surges can fry delicate circuitry, etc., etc. Check your data!<\/p>\n<p>The mean and median both measure <a href=\"http:\/\/en.wikipedia.org\/wiki\/Central_tendency\">central tendency<\/a>, but in very different ways. Most people are familiar with the mean, or arithmetic average, calculated by summing the values and dividing the sum by the number of observations. The median, however, is the middle value of ranked data. In other words, if you sorted the values from low to high, the median would be the one in the middle. The mean is very sensitive to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Outlier\">outliers<\/a> (extreme values), whereas the median is not. The following R experiment will illustrate this.<\/p>\n<pre><code>&gt; x = c(1, 2, 3, 4, 5)\n&gt; summary(x)<\/code><\/pre>\n<pre><code>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. \n      1       2       3       3       4       5 <\/code><\/pre>\n<p>Note that the median and mean are equal. But if we add an extreme value (outlier):<\/p>\n<pre><code>&gt; x = c(1, 2, 3, 4, 3000)\n&gt; summary(x)<\/code><\/pre>\n<pre><code>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
\n      1       2       3     602       4    3000 <\/code><\/pre>\n<p>The median is unchanged by the huge outlier (the median is still the middle value, 3) but the mean has increased enormously. When the data are symmetrically distributed (as in a normal distribution), the mean and median are approximately equal (see <a href=\"https:\/\/statistics.laerd.com\/statistical-guides\/measures-central-tendency-mean-mode-median.php\">Measures of Central Tendency<\/a>). So another exploratory diagnostic is to compare the mean and the median. If they are very different, you may have some outliers that are skewing the distribution.<\/p>\n<\/div>\n<div id=\"box-plots\" class=\"section level2\">\n<h2>Box plots<\/h2>\n<p>The summary statistics <code>1st Qu.<\/code> and <code>3rd Qu.<\/code> are the first and third <a href=\"http:\/\/en.wikipedia.org\/wiki\/Quartile\">quartiles<\/a>. Along with the median, the quartiles divide the dataset into four equal groups, each group comprising a quarter of the observations.<\/p>\n<p>This <a href=\"http:\/\/en.wikipedia.org\/wiki\/Five-number_summary\">five-number summary<\/a> (minimum, maximum, median, first quartile, third quartile) is better visualized using another invention of John Tukey: the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Box_plot\">box plot<\/a>:<\/p>\n<p><code>&gt; boxplot(length)<\/code><\/p>\n<p><a href=\"\/img\/IntroductionToR\/boxplot.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/boxplot50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>The line running through the central rectangle is the median. The top and bottom of the rectangle mark the third and first quartiles, respectively. The central rectangle itself spans the distance from the first to third quartiles (the <em>interquartile range<\/em>, or IQR). Above and below the rectangle are dashed lines (called \u201cwhiskers\u201d) that usually terminate in the minimum and maximum values. 
However, if outliers are present (as in this example), each whisker extends only to the most extreme data point that lies within 1.5 times the IQR of the box, and the more extreme values are drawn as individual points.<\/p>\n<p>A box plot is a nice way to visually assess the distribution of your data. For example, if the distribution deviates from a symmetric bell-shaped curve, a box plot will make this obvious. Let\u2019s make a skewed set of numbers using the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Log-normal_distribution\">log-normal<\/a> distribution:<\/p>\n<pre><code>&gt; q = rlnorm(1000)\n&gt; hist(q)<\/code><\/pre>\n<p><a href=\"\/img\/IntroductionToR\/lognormal.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/lognormal50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p><code>&gt; boxplot(q)<\/code><\/p>\n<p><a href=\"\/img\/IntroductionToR\/squished.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/squished50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>The severely squished box plot instantly reveals the highly skewed nature of the data.<\/p>\n<\/div>\n<div id=\"stem-and-leaf-display\" class=\"section level2\">\n<h2>Stem-and-leaf display<\/h2>\n<p>Yet another exploratory tool popularized by Tukey is the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stem-and-leaf_display\">stem-and-leaf display<\/a>:<\/p>\n<p><code>&gt; stem(weight)<\/code><\/p>\n<pre><code>  The decimal point is at the |\n  3 | 089\n  4 | 0133566677899\n  5 | 001456667777899\n  6 | 00000013555567799\n  7 | 01569\n  8 | 033<\/code><\/pre>\n<p>The stem-and-leaf plot is best explained by looking at the variable <code>weight<\/code> as a sorted vector from low to high:<\/p>\n<p><code>&gt; sort(weight)<\/code><\/p>\n<pre><code>[1] 3.0 3.8 3.9 4.0 4.1 4.3 4.3 4.5 4.6 4.6 4.6 4.7 4.7 4.8 4.9 4.9 5.0 5.0 5.1 5.4 5.5 5.6 5.6\n[24] 5.6 5.7 5.7 5.7 5.7 5.8 5.9 5.9 6.0 6.0 6.0 6.0 6.0 6.0 6.1 6.3 6.5 6.5 6.5 6.5 6.6 6.7 6.7\n[47] 6.9 6.9 7.0 7.1 7.5 7.6 7.9 8.0 8.3 8.3<\/code><\/pre>\n<p>The vertical line in the stem-and-leaf plot separates the 
\u201cstems\u201d from the \u201cleaves,\u201d with the stems to the left of the line. The vertical line itself is interpreted as the decimal point. So, in our sorted vector of <code>weight<\/code>, the first value is 3.0. The 3 is the stem, to the left of the vertical line, the line is the decimal point, and then the first leaf to the right of the vertical line is the zero. The next two values are 3.8 and 3.9, filling out the first row of the stem-and-leaf plot. A stem-and-leaf plot is like a histogram lying on its side, with the bars made up of the actual data values. This gives you a much more detailed view of the distribution, right down to individual values, unlike a traditional histogram in which the individual values are thrown into \u201cbins\u201d that are the bars of the histogram. A histogram is based on frequencies, whereas a stem-leaf plot is based on individual values. A histogram obscures the original data by summarizing it, while a stem-leaf plot displays each and every value.<\/p>\n<\/div>\n<div id=\"categorical-variables-factors\" class=\"section level2\">\n<h2>Categorical variables: factors<\/h2>\n<p>In our <code>bumpus<\/code> data frame we have the variable survive, a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Categorical_variable\">categorical variable<\/a> taking on only two values, a 1 indicating that the bird survived the storm and a 0 indicating that it did not survive. R calls such categorical variables <a href=\"http:\/\/www.stat.berkeley.edu\/~s133\/factors.html\">factors<\/a>. 
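<\/p>\n<p>In the data frame as read in, <code>survive<\/code> is stored as the numbers 0 and 1. If you want R to treat it explicitly as a factor with readable labels, one possible sketch is the following (the name <code>survive.f<\/code> and the labels are hypothetical, chosen only for illustration, and are not part of the dataset):<\/p>\n<pre><code>&gt; # convert the 0\/1 codes to a factor with descriptive level labels\n&gt; survive.f = factor(bumpus$survive, levels = c(0, 1), labels = c(\"died\", \"survived\"))\n&gt; # count the birds in each level\n&gt; table(survive.f)<\/code><\/pre>\n<p>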
We can generate multiple box plots according to factor levels, such as:<\/p>\n<p><code>&gt; boxplot(length~survive)<\/code><\/p>\n<p><a href=\"\/img\/IntroductionToR\/boxplot2.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/boxplot2_50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>The <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/base\/html\/tilde.html\">tilde<\/a> symbol is used to separate the left and right sides in a formula. The formula <code>length~survive<\/code> is telling the <code>boxplot<\/code> function that we want a separate box plot of the continuous variable <code>length<\/code> for each level of the factor <code>survive<\/code>.<\/p>\n<p>A variant of the box plot, and one of my favorite statistical graphics, is the <em>notched<\/em> box plot:<\/p>\n<p><code>&gt; boxplot(length~survive, notch=TRUE)<\/code><\/p>\n<p><a href=\"\/img\/IntroductionToR\/notched.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/notched50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>The notches show a 95% <a href=\"https:\/\/en.wikipedia.org\/wiki\/Confidence_interval\">confidence interval<\/a> around the medians. If the notches do not overlap, there is \u201cstrong evidence\u201d that the medians differ. Our notched box plot indicates that larger birds (males in this subset of Bumpus\u2019 data) did not survive the storm, potential evidence for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_selection\">natural selection<\/a>. [Note, however, that the Bumpus dataset has been subjected to multiple analyses and interpretations over the years. 
See, for example, <a href=\"\/pdf\/Johnston.pdf\">Hermon Bumpus and Natural Selection in the House Sparrow <em>Passer domesticus<\/em><\/a> (PDF) and <a href=\"\/pdf\/p0944-p0954.pdf\">Differential overnight survival by Bumpus\u2019 House Sparrows: An alternate interpretation<\/a> (PDF)].<\/p>\n<\/div>\n<div id=\"some-confirmatory-analysis\" class=\"section level2\">\n<h2>Some confirmatory analysis<\/h2>\n<p>At this point we are beginning to stray into confirmatory analysis. A couple more examples will suffice.<\/p>\n<p>A <a href=\"http:\/\/en.wikipedia.org\/wiki\/Student%27s_t-test\">t-test<\/a> of the difference in mean body length between surviving and non-surviving male sparrows is as follows (by default, R\u2019s <code>t.test<\/code> performs Welch\u2019s version of the test, which does not assume equal variances in the two groups):<\/p>\n<p><code>&gt; t.test(length~survive)<\/code><\/p>\n<pre><code>\tWelch Two Sample t-test\n\ndata:  length by survive\nt = 3.2576, df = 50.852, p-value = 0.002005\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n 0.9007984 3.7948537\nsample estimates:\nmean in group 0 mean in group 1 \n       161.3478        159.0000 <\/code><\/pre>\n<p>And your basic one-way <a href=\"http:\/\/en.wikipedia.org\/wiki\/Analysis_of_variance\">analysis of variance<\/a> (ANOVA) for the same data is:<\/p>\n<p><code>&gt; summary(aov(length~survive))<\/code><\/p>\n<p>Note that, in the above R command, we are nesting the ANOVA function (<code>aov<\/code>) inside of the <code>summary<\/code> function. This is common practice in R, and in fact in many other programming languages, where the result returned by a function is used as the input of another function. The <code>summary<\/code> function, like many functions in R, can modify its output according to what kind of input it receives.<\/p>\n<p>Here are the ANOVA results:<\/p>\n<pre><code>            Df Sum Sq Mean Sq F value  Pr(&gt;F)   \nsurvive      1   74.7   74.71   10.16 0.00239 **\nResiduals   54  397.2    7.36                   \n---\nSignif. 
codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1<\/code><\/pre>\n<p>Both tests seem to indicate that there is a significant difference between the two means, with survivors being smaller than non-survivors.<\/p>\n<p>However, before you believe this, read Jacob Cohen\u2019s <a href=\"\/pdf\/Cohen_1990.pdf\">Things I Have Learned (So Far)<\/a>.<\/p>\n<\/div>\n<\/div>\n<div id=\"more-graphics\" class=\"section level1\">\n<h1>More graphics<\/h1>\n<p>To somewhat control the length of this post, I am not including images for all of the graphics produced by the following code. Thus, as before, the examples will be more instructive if you follow along and execute them in R.<\/p>\n<p>The R language makes it possible to create impressive graphics with only a few lines of code. To make a scatterplot of humerus versus femur length using the Bumpus data:<\/p>\n<p><code>&gt; plot(lhum, lfem)<\/code><\/p>\n<p>This yields a simple plot of the data points. However, if we instead enter:<\/p>\n<p><code>&gt; scatter.smooth(lhum, lfem)<\/code><\/p>\n<p>we get the same scatterplot of points but with a smoothed line fit to the data.<\/p>\n<p>Another way. This needs a prior call to <code>plot(lhum,lfem)<\/code>:<\/p>\n<p><code>&gt; lines(loess.smooth(lhum, lfem))<\/code><\/p>\n<p>Next, a linear instead of a lowess\/loess smooth (lowess stands for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Local_regression\">LOcally WEighted Scatterplot Smoothing<\/a>). Note that variables are reversed in the call to <code>abline<\/code>. 
This also needs a prior call to <code>plot(lhum,lfem)<\/code>.<\/p>\n<p><code>&gt; abline(lm(lfem~lhum))<\/code><\/p>\n<p>[<code>lines()<\/code> connects dots to produce a curve; <code>abline()<\/code> plots a single straight line.]<\/p>\n<p>This works because the <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/graphics\/html\/00Index.html\">base graphics package<\/a> in R keeps adding the output of low-level graphics functions such as <code>lines()<\/code> and <code>abline()<\/code> to the current plot until a new plot is started or the graphics window is closed. Issuing a new <code>plot<\/code> command or <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/graphics\/html\/frame.html\">plot.new()<\/a> starts a new plot, replacing the current one; the <a href=\"https:\/\/stat.ethz.ch\/R-manual\/R-devel\/library\/grDevices\/html\/dev.html\">dev.off()<\/a> function will close the current graphics window. The statement <code>plot(lhum,lfem)<\/code> creates the scatterplot of points, and then the <code>abline(lm(lfem~lhum))<\/code> statement overlays the fitted regression line onto the same graphics window, producing a plot that contains both the data points and the straight line.<\/p>\n<p>Change the line color and thickness (again, a prior call to <code>plot(lhum,lfem)<\/code> is needed here):<\/p>\n<p><code>&gt; abline(lm(lfem~lhum), col='blue', lwd=10)<\/code><\/p>\n<p>Here again is our notched box plot, but we now add a title and some axis labels:<\/p>\n<p><code>&gt; boxplot(length~survive, notch=TRUE, main=\"Bumpus' Sparrows\", xlab=\"Survival\", ylab=\"Body length\")<\/code><\/p>\n<p>And now for something a little fancier:<\/p>\n<p>Load the <a href=\"https:\/\/cran.r-project.org\/web\/packages\/rgl\/index.html\">rgl<\/a> 3D graphics package (it must already be installed).<\/p>\n<p><code>&gt; library(rgl)<\/code><\/p>\n<p>Rotate your data in 3D! Drag your mouse cursor over the plot to rotate. 
(It\u2019s much more fun when you are doing it in real time instead of looking at a 2D picture.)<\/p>\n<p><code>&gt; plot3d(length, alar, lfem, col=\"red\", size=6)<\/code><\/p>\n<p><a href=\"\/img\/IntroductionToR\/rotate.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/rotate50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>Create a new variable for labeling points:<\/p>\n<p><code>&gt; bumpus$survcat = ifelse(bumpus$survive == 0, c('N'), c('Y'))<\/code><\/p>\n<p>If you now type <code>names(bumpus)<\/code> you will see that the categorical variable <code>survcat<\/code> has been added to the data frame.<\/p>\n<p>Next, re-plot with smaller plotting symbols:<\/p>\n<p><code>&gt; plot3d(length, alar, lfem, size=2)<\/code><\/p>\n<p>and then add text labels to the plotted data points. Rotate your data interactively with the mouse and ponder the effects of body size on survival:<\/p>\n<p><code>&gt; text3d(length, alar, lfem, texts=bumpus$survcat)<\/code><\/p>\n<p>You can even skip creating a separate labeling variable by inserting the recoding statement inside of the call to <code>text3d()<\/code>:<\/p>\n<p><code>text3d(length, alar, lfem, texts=ifelse(bumpus$survive == 0, c('N'), c('Y')))<\/code><\/p>\n<\/div>\n<div id=\"r-scripts-command-files\" class=\"section level1\">\n<h1>R scripts (command files)<\/h1>\n<p>As mentioned earlier, R statements can be stored in a text file and submitted together as a batch. 
Thus we could produce our 3D rotating scatterplot by storing the following R commands in a plain text file:<\/p>\n<pre><code>library(rgl)\nbumpus = read.table(\"https:\/\/richardlent.github.io\/datasets\/bumpus.dat.txt\", sep = \"\", header = TRUE)\nattach(bumpus)\nplot3d(length, alar, lfem, size=2)\ntext3d(length, alar, lfem, texts=ifelse(bumpus$survive == 0, c('N'), c('Y')))<\/code><\/pre>\n<p>To create R scripts, also known as command files, use whatever plain text editor you like as long as it produces nothing more than <a href=\"http:\/\/en.wikipedia.org\/wiki\/Plain_text\">plain text<\/a>. DO NOT use a word processor, because word processors insert invisible formatting symbols that will mess up your R commands. Plain text editors include <a href=\"http:\/\/en.wikipedia.org\/wiki\/TextEdit\">TextEdit<\/a> for the Mac (just be sure you use plain text, not rich text), <a href=\"http:\/\/en.wikipedia.org\/wiki\/Notepad_%28software%29\">Notepad<\/a> for Windows, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Emacs\">Emacs<\/a> for Unix, etc. RStudio has a plain text editor built in; access it from the <code>File<\/code> menu (choose <code>New File<\/code>, then <code>R Script<\/code>). You can also create scripts using standalone R, from the <code>File<\/code> menu (<code>File|New Document<\/code>).<\/p>\n<p>A script contains R commands written exactly as you would have entered them interactively at the R console. If we named our script <code>plot.R<\/code> (the R suffix is standard practice for R command files), we could then submit this script to R with the following statement:<\/p>\n<p><code>source(\"plot.R\")<\/code><\/p>\n<p>You can also submit command files using the R GUI (<code>File|Source File...<\/code>). And the RStudio text editor has built-in tools for editing and running scripts from inside of RStudio. 
See <a href=\"https:\/\/support.rstudio.com\/hc\/en-us\/articles\/200484448-Editing-and-Executing-Code\">Editing and Executing Code<\/a> for more information.<\/p>\n<p>Scripts are an excellent way to maintain complicated batches of R statements without having to constantly retype them.<\/p>\n<\/div>\n<div id=\"an-r-menu-system-r-commander\" class=\"section level1\">\n<h1>An R menu system: R-commander<\/h1>\n<p>If you simply must have a menu-driven system in R, an R package called <a href=\"http:\/\/socserv.mcmaster.ca\/jfox\/Misc\/Rcmdr\/\">R Commander<\/a> runs on top of R and enables users to access a selection of commonly-used R commands with a simple, menu-driven interface. R Commander is a <a href=\"http:\/\/cran.r-project.org\/web\/packages\/Rcmdr\/index.html\">standard R package<\/a> and can be installed using the R Package Installer that is part of the standard R GUI, or from the RStudio <code>Packages<\/code> pane. R Commander depends on a number of other contributed packages, which must be present in order for it to work correctly. (Be sure to click the <code>Install Dependencies<\/code> checkbox in the Package Installer when you are installing R Commander.) If you are getting error messages when you try to load R Commander, see <a href=\"http:\/\/socserv.mcmaster.ca\/jfox\/Misc\/Rcmdr\/installation-notes.html\">R Commander Installation Notes<\/a>.<\/p>\n<p>After installing the <a href=\"http:\/\/cran.r-project.org\/web\/packages\/Rcmdr\/index.html\">Rcmdr<\/a> package, enter <code>library(Rcmdr)<\/code> at the R prompt and R Commander should appear. It looks like this:<\/p>\n<p><a href=\"\/img\/IntroductionToR\/rcommander.png\"><img decoding=\"async\" src=\"\/img\/IntroductionToR\/rcommander50.png\" \/><\/a><\/p>\n<p>(Click image to enlarge.)<\/p>\n<p>The beauty of R Commander is that it generates the R code that corresponds to your various menu selections, which makes it a great tool for learning the R language. 
You can interactively change the code and resubmit it. This is often necessary because no point-and-click menu system could possibly cover all of the options in a system as powerful and complex as R.<\/p>\n<p>For more information on using R Commander, see <a href=\"https:\/\/socserv.mcmaster.ca\/jfox\/Misc\/Rcmdr\/Getting-Started-with-the-Rcmdr.pdf\">Getting Started With the R Commander<\/a> (PDF).<\/p>\n<\/div>\n<div id=\"epilogue\" class=\"section level1\">\n<h1>Epilogue<\/h1>\n<p>This introduction has only scratched the proverbial surface of what can be done with R. I firmly believe that R is well worth the effort involved in learning it. To quote Robert I. Kabacoff, Ph.D., author of <a href=\"http:\/\/www.statmethods.net\">Quick-R<\/a>, speaking about <a href=\"http:\/\/www.statmethods.net\/about\/learningcurve.html\">Why R has A Steep Learning Curve<\/a>:<\/p>\n<blockquote><p>\u201c[In the end] I believe that you will feel much more intimately in touch with your data and in control of your work. And it\u2019s fun!\u201d<\/p><\/blockquote>\n<p>Yes. You <em>will<\/em> have fun.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>(R is not like other statistical software. But that&#8217;s a good thing.) Prologue Getting a feel for R The console Function calls and arguments HINT: Getting help Some graphics HINT: Command history and workspace Getting data into R Data frames Creating data frames HINT: Data management in R and RStudio. 
Exploratory data analysis Scatterplot matrix &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blogs.holycross.edu\/tech\/2015\/03\/17\/introduction-to-r\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;An Introduction to R&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,3,5],"tags":[],"class_list":["post-80","post","type-post","status-publish","format-standard","hentry","category-data-analysis","category-data-visualization","category-programming"],"_links":{"self":[{"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/posts\/80","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/comments?post=80"}],"version-history":[{"count":0,"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/posts\/80\/revisions"}],"wp:attachment":[{"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/media?parent=80"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/categories?post=80"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.holycross.edu\/tech\/wp-json\/wp\/v2\/tags?post=80"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}