Project 2: Exploratory Data Analysis and Descriptive Statistics in R

Project 2: Exploratory Data Analysis and Descriptive Statistics in R msm26

Background

This week, we covered some key concepts important for conducting research and analysis. Now, it is time for you to put this information to use through some analysis. In this week's project, you will be analyzing some point data describing crimes in St. Louis, Missouri and creating a descriptive analysis of the data. Next week, you will build on this analysis by using some spatial point pattern statistics with the same dataset to look at the spatial patterns of crime in more detail.

To carry out this week's analysis, you will use RStudio and several packages in R. The assignment has been broken down into several components.

You will analyze the data and familiarize yourself with R by copying and pasting the provided R code from Lesson 2 into RStudio. In some cases, you will want to modify this code to analyze additional variables or produce different graphs and charts that you think will help with your analysis. As you work through the analysis, make some informal notes about what you are finding.

The outputs generated by running these analyses provides evidence to answer the questions listed at the bottom of this page. Once you have completed your analysis and obtained the different statistical and graphical outputs, use this evidence to create a more formal written report of the analysis.

Project Resources

You need an installation of R and RStudio, to which you will need to add the packages listed on a subsequent page of these instructions.

You will also need some data. You will find a data archive to download (Geog586_Les2_Project.zip) in the Lesson 2 materials in Canvas. If you have any difficulty in downloading the data, please contact me.

This archive includes the following files:

  • Crime data files for St Louis: crimeStLouis20132014b.csv, crimeStLouis20132014b_agg.csv
  • Boundary of St Louis data file (shapefile): stl_boundary_ll.shp

How to Move Forward with the Analysis

As you work your way through the R code on the following pages of the project instructions, keep the following questions in mind. Jotting down some notes related to each of them will help you to synthesize your findings into your project report.

Discuss Overall Findings

  1. What types of crime were recorded in the city?
  2. Was there a temporal element associated with each of the crimes?
  3. How did the occurrence of crimes changes between the years?
  4. Where did crimes take place? Was there a particular spatial distribution associated with each of the crimes? Did they take place in a particular part of the city?
  5. What additional types of analysis would be useful?

Descriptive Statistics

  1. What type of distribution does your data have?
  2. What is a typical value in this dataset?
  3. How widely do values in the dataset vary?
  4. Are there any unusually high or low value in this dataset? (e.g., outliers)

Visualizing the Data

  1. Map the data: What is spatial distribution of crimes?
  2. Create graphs:
    1. Of the crimes in the dataset, is there a crime that occurs more than others?
    2. Is there a month when crimes occur more than others?
    3. Have crimes increased or decreased over time?

Summary of the Minimum Project 2 Deliverables

To give you an idea of the work that will be required for this project, here is a summary of the minimum items you will create for Project 2.

  • Create necessary maps, graphics, or tables of the data. Include in your written report any and all maps, graphics, and tables that are needed to support your analysis.
  • The analysis should help you present the spatial patterns of crime types across the city as well as what evidence exists showing any temporal trend in crime overall and for individual crime types.
NOTE: When you have completed the project, please submit your writeup to the Canvas assignment for this lesson.

Questions?

Please use the 'Week 2 lesson discussion' forum to ask for clarification on any of these concepts and ideas. Hopefully, some of your classmates will be able to help with answering your questions, and I will also provide further commentary there where appropriate.

Project 2: Getting Started in R and RStudio

Project 2: Getting Started in R and RStudio mjg8

RStudio/R is a standard statistical analysis package that is free, is extensible, and contains many new methods and analysis that are contributed by researchers across the globe.

Install R First

Navigate to the R website and choose the CRAN link under Download in the menu at the left.

Select a CRAN location. R is hosted at multiple locations around the world to enable quicker downloads. Choose a location (any site will do) from which, among other things, you will obtain various packages used during your coding.

On the next page, under Download and Install R, choose the appropriate OS for your computer.

Windows installs

On the next page, for most of you, you will want to choose the Install R for the first time link (on Windows). Note that the link you select is unique to the CRAN site you selected.

On the next page, choose the Download R 4.1.2 for Windows link (or whatever version is current).

The file that is downloaded is titled R-4.1.2-win.exe (note that the 4.1.2 specifies the version while win indicates the OS).

Once the file is downloaded, initiate the executable and follow the prompts to download and install R.

Mac OS X installs

From the Download and Install page, simply choose the link for the latest release package (for Mac OSX) and then double-click the downloaded installer and follow the installation prompts.

Note: If you have a recently purchased Mac with the new M1 chip, you need to make sure you select the installer for that hardware framework (and vice-versa for slightly older machines, which use the Intel chips).

Install RStudio

Once R is successfully installed, you can download and install RStudio.

You'll install RStudio at this link. Versions are available for all the major platforms. RStudio offers several price points. Choose the price point you are interested in and download and install RStudio. (I assume you are interested in the FREE version).

Once both R and RStudio are installed, start RStudio.

The console window appears as shown in Figure 2.4. The console window includes many components. The Console tab is where you will do your coding. You can also go to File - New File - R Script to open the Source window and enter your code there. The Environment tab in the upper-right corner shows information about variables and their contents. The History tab allows you to see the history of your coding and changes. The lower-right corner has several tabs. The Plots tab shows you any graphical output (such as histograms, boxplots, maps, etc.) that your code outputs. The Packages tab allows you to, among other things, check to see which packages are currently installed with your program. The Help tab allows you to search for R functions and learn more about their utility in your code.

As you work through your code, you should get in the habit of saving your work often. Compared to other programs, R is very stable, but you never know. You can save your R code by looking under File and choosing the Save As option. Name your code file something logical like Lesson_2.

The RStudio Console Window showing type commands, range of tabs to view variables and history, range of tabs to view and manage different objects, and install and manage packages
Figure 2.4. The RStudio Console window.
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

Project 2: Obtaining and installing R packages

Project 2: Obtaining and installing R packages ksc17

R contains many packages that researchers around the world have created. To learn more about what is available, look through the following links.

Packages in R

R itself has a relatively modest number of capabilities grouped in the base package. Its real power comes from its extensive set of libraries, which are bundled up groups of functions called packages. However, this power comes at the cost of the fact that knowing which statistical tool is in which particular package can be confusing to new users.

During this course, you will be using many different packages. To use a package, you will first need to obtain it, install it, and then use the library() command to load it. There are two ways in which you can select and install a package.

Installing packages by command line

> install.packages("ggplot2")
> library(ggplot2)

Installing packages with RStudio

Click the Packages tab and click Install (Figure 2.5). A window will pop up. Select repository to install from, and then type in the package(s) you want. To install the package(s), click Install. Once the package has been installed, you will see it listed in the packages library. To turn a package on or off, click on the box next to the package listed in the User Library on the packages tab. Some packages 'depend' on other packages, so sometimes to use one package you have to also download and install a second one. Packages are updated at different intervals, so check for updates regularly.

Troubleshooting

In some cases, a package will fail to load. You may not realize this has happened until you try to use one of the functions and an error will return saying something to the effect that function() was not found. This happens, but you should not take this as a serious error. If you run into a situation where a package fails to install, try installing it again.

If you get an error message saying something like: Error if readOGR(gisfile, layer = "stl_boundary_11", GDAL1_integert64_policy = FALSE) : could not find function "readOGR"

Then, issue the ??readOGR command in the RStudio console window and figure out which package you are missing and install (or reinstall) it.

Screenshot of the R-shiny software. Image displays the installation process described in the previous paragraph.
Figure 2.5: View of the Packages tab and window showing how to install packages.
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

What to install

By whatever method you choose, install the following packages. Links are provided for a few packages, providing a bit more information about the package.

Installing these packages will take a little time. Pay attention to the messages being printed in the RStudio console to make sure that things are installing correctly and that you're not getting error messages. If you get asked whether you want to install a version that needs compiling, you can say no and use the older version.

Various packages used for mapping, spatial analysis, statistics and visualizations:

  • doBy
  • dplyr
  • ggmap (although the tmap package is relatively newer and works with the newer sf format)
  • ggplot2
  • gridExtra
  • gstat
  • leaflet: mapping elements
  • spatstat
  • sf (a relatively new format for R objects)
  • sp
  • spdep
  • raster
  • RColorBrewer

Additional help is available through video tutorials at LinkedInLearning.com. As a PSU student, you can access these videos. Sign in with your Penn State account, browse software, and select R.

Project 2: Introduction to R: the essentials

Project 2: Introduction to R: the essentials ksc17

In Canvas, you will find Statistics_RtheEssentials_2019.pdf in the Additional Reading Materials link in Lesson 2. It contains some essential commands to get you going.

This is not a definitive resource – since it would end up as a book in its own right. Instead, it provides a variety of commands that will allow you to load your data, explore, and visualize your data.


Three Considerations As You Write and Debug Code
Coding takes patience but can be very rewarding. To help you work through code, here are three suggestions.

  1. If there is something that you would like your code to do but is not covered explicitly in the lesson instructions, or you are not able to figure out an error, then post a question to the discussion forum and either another student or I may know the answer. If you are asking about a solution to an error in your code, please provide a screenshot of the error message and lines around the place in the code where the error is reported to occur. Simply posting that "my program doesn't work" or "I am getting an error message" will not help others help you.
  2. If you discover or know some code that is useful or goes beyond that which is covered in the lesson and would be beneficial to others in the class, then please share it. We are all here to learn in a collective environment.
  3. Use scripts! (go to file/new script). These allow you to save your code, so you don't have to retype it later. If you enter your code into the console then the computer will "talk to you" and run the commands, but it will not remember anything - AKA you are having a phone call with the computer but speaking the language R. But with a script, you can enter and save your commands and return back to them (or re-run them later) - so to build on the metaphor, this is like sending an e-mail, you have a written record of what you said.

Project 2: Setting Up Your Analysis Environment in R

Project 2: Setting Up Your Analysis Environment in R msm26

Adding the R packages that will be needed for the analysis

Many of the analyses you will be using require commands contained in different packages. You should have already loaded many of these packages, but just to ensure we document what packages we are using, let's add in the relevant information so that if the packages have not been switched on, they will be when we run our analysis.

Note that the lines that start with #install.packages will be ignored in this instance because of the # in front of the command. Lines of code with a # symbol are treated as comments by the compiler. If you do need to install a package (e.g., if you hadn't already installed all of the relevant packages), then remove the # when you run your code so the line will execute.

Note that for every package you install, you must also issue the library() function to tell R that you wish to access that package. Installing a package does not ensure that you have access to the functions that are a part of that package.

#install and load packages here

install.packages("ggplot2")
library(ggplot2)
install.packages("spatstat")
library (spatstat)
install.packages("leaflet")
library(leaflet)
install.packages("dplyr")
library (dplyr)
install.packages("doBy")
library(doBy)
install.packages("sf")
library(sf)
install.packages("ggmap")
library (ggmap)
install.packages("gridExtra")
library(gridExtra)
install.packages("sp")
library(sp)
install.packages("RColorBrewer")
library (RColorBrewer)

Set the working directory

Before we start writing R code, you should make sure that R knows where to look for the shapefiles and text files. First, you need to set the working directory. In your case, the working directory is the folder location where you placed the contents of the .zip file when you unzipped it. To follow up, make sure that the correct file location is specified. There are several ways you can do this. As shown below, you can set the directory path to a variable and then reference that variable whereever and whenever you need to. Notice that there is a slash at the end of the directory name. If you don't include this, it can lead to the final folder name getting merged with a filename, meaning R cannot find your file.

file_dir_crime <-"C:/Geog586_Les2_Project/crime/"

Note that R is very sensitive to the use of "/" as a directory level indicator. R does not recognize "\" as a directory level indicator and an error will return. DO NOT COPY THE DIRECTORY PATH FROM FILE EXPLORER AS THE FILE PATH USES "\" but R only recognizes "/" for this purpose.

You can also issue the setwd() function that "sets" the working directory for your program.

setwd("C:/Geog586_Les2_Project/crime/")

Finally and most easily, you can also set the working directory through RStudio's interface: Session - Set Working Directory - Choose Directory.

Issue the getwd() function to confirm that R knows what the correct path is.

#check and set the working directory
#if you do not set the working directory ensure that you use the full directory pathname to where you saved the file.
#R is very sensitive to the use of "/" as a directory level indicator. R does not recognize "\" as a directory level indicator and an error will return. 
#DO NOT COPY THE DIRECTORY PATH FROM YOUR FILE EXPLORER AS THE FILE PATH USES "\"
#e.g. "C:/Geog586_Les2_Project/crime/">
#check the directory path that R thinks is correct with the getwd() function.
getwd()

#set directory to where the data is so that you can reference these varibles later rather than typing the directory path out again.
#you will need to adjust this for your own filepath.
file_dir_crime <-"C:/Geog586_Les2_Project/crime/"

#make sure that the path is correct and that the csv files are in the expected directory
list.files(file_dir_crime)
#this is an alternative way of checking that R can see your csv files. In this version of the command, you are asking R to list only the .csv files
#that are in the folder located at the filepath filedircrime.
list.files(path = file_dir_crime, pattern = "csv")

#to view the files in the other directory
file_dir_gis <-"C:/Geog586_Les2_Project/gis/"

#make sure that the path is correct and that the shp file is in the expected directory
list.files(file_dir_gis)
# and again, the version of the command that limits the list to shapefiles only
list.files(path = file_dir_gis, pattern="shp")

#You are not creating any outputs here, but you may want to think about setting up a working directory
#where you can write outputs when and if needed. First create a new directory, in this case called outputs 
#and then set the working directory to point to that directory.
wk_dir_output<-setwd("C:/Geog586_Les2_Project/outputs/")

When you've finished replacing the filepaths above to the relevant ones on your own computer, check the results in the RStudio console with the list.files() command to make sure that R is seeing the files in the directories. You should see outputs that print that list the files in each of the directories. 90% of problems with R code come from either a library not being loaded correctly or a filepath problem, so it is worth taking some care here.

Loading and viewing the data file (Two Code Options)

Add the crime data file that you need and view the contents of the file. Inspect the results to make sure that you are able to read the crime data.

#Option #1: Read in crime file by pasting the path into a variable name
#concatenate the filedircrime (set earlier to your directory) with the filename using paste()
crime_file <-paste(file_dir_crime,"crimeStLouis20132014b.csv", sep = "")
#check that you've accessed the file correctly - next line of code should return TRUE
file.exists(crime_file)

crimesstl <- read.csv(crime_file, header=TRUE,sep=",")

#Option #2: Set the file path and read in the file into a variable name
crimesstl2 <- read.csv("C:/Geog586_Les2_Project/crime/crimeStLouis20132014b.csv", header=TRUE, sep=",")

#returns the first ten rows of the crime data
head(crimesstl, n=10)

#view the variable names
names(crimesstl)

# view data as attribute tables (spreadsheet-style data viewer) opened in a new tab.
# Your opportunity to explorethe data a bit more!
View(crimesstl) #feel free to use the 'View' function whenever you need to explore data as an attribute table
View(crimesstl2)

#create a list of the unique crime types in the data set and view what these are so that you can select using these so that you can explore their distributions.
listcrimetype <-unique(crimesstl$crimetype)
listcrimetype

Project 2: Running the Analysis in R

Project 2: Running the Analysis in R msm26

Now that the packages and the data have been loaded, you can start to analyze the crime data.

Before Beginning: An Important Consideration

The example descriptive statistics that are reported below are just that, examples of what is possible - not necessarily a template that you should follow exactly. In a similiar light, the visuals that are shown below are only examples of the evidence that you could create for your analysis. In other words, avoid the tendancy to simply reproduce the descriptive statistics and visuals that are shown in the sections that follow. This analysis you create should be based on your investigation of the data and the kinds of research questions that you want to answer. Think about the questions about the data that you want answered and the descriptive statistics and visuals that are needed to help you answer those questions.

Descriptive Statistics

Use various descriptive statistics that have been discussed or others that you are aware of to analyze the types of crime that have been recorded, when these crimes were most abundant (e.g., during what year, month, and time of the day), and where crimes were most abundant. Refer to the RtheEssentials.pdf for additional commands and visualizations you can use.

#Summarise data
summary(crimesstl)

#Make a data frame (a data structure) with crimes by crime type
dt <- data.frame(cnt=crimesstl$count, group=crimesstl$crimetype)
#save these grouped data to a variable so you can use it other commands
grp <- group_by(dt, group)

#Summarise data from library (dplyr)
#Summarise the number of counts for each group
summarise(grp, sum=sum(cnt))
#transpose the table
tapply(crimesstl$count, crimesstl$crimetype,sum)

Visualizations of the data

#Descriptive analysis
#Barchart of crimes by month
countsmonth <- table(crimesstl$month)
barplot(countsmonth, col="grey", main="Number of Crimes by Month",xlab="Month",ylab="Number of Crimes")

Bar chart of the number of crimes by month
Figure 2.9: Bar chart of crimes by month
Credit: Griffin, © Penn State University, licensed under CC BY-NC-SA 4.0
#Barchart of crimes by year
countsyr <- table(crimesstl$year)
barplot(countsyr, col="darkcyan", main="Number of Crimes by Year",xlab="Year",ylab="Number of Crimes") 
Figure 2.10: Bar chart of crimes by year
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0
#Barchart of crimes by crimetype
counts <- table(crimesstl$crimetype)
barplot(counts, col = "cornflowerblue", main = "Number of Crimes by Crime Type", xlab="Crime Type", ylab="Number of Crimes")
Figure 2.11: Frequency of crime by crime type
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0
#BoxPlots are useful for comparing data. 
#Use the dataset crimeStLouis20132014b_agg.csv.  
#These data are aggregated by neighbourhood.  
agg_crime_file <-paste(file_dir_crime,"crimeStLouis20132014b_agg.csv", sep = "")
#check everything worked ok with accessing the file
file.exists(agg_crime_file)
crimesstlagg <- read.csv(agg_crime_file, header=TRUE,sep=",")

#Compare crimetypes
boxplot(count~crimetype, data=crimesstlagg,main="Boxplots According to Crime Type",
            xlab="Crime Type", ylab="Number of Crimes",
            col="cornsilk", border="brown", pch=19)
Figure 2.12: Boxplot by crime type
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

Mapping the data

Now let’s see where the crimes are taking place. Create an interactive map so that you can view where the crimes have taken place.

#Create an interactive map that plots the crime points on a background map.

#This will create a map with all of the points
gis_file <- paste(file_dir_gis,"stl_boundary_ll.shp", sep="")
file.exists(gis_file)

#Read the St Louis Boundary Shapefile
StLouisBND <- read_sf(gis_file)

leaflet(crimesstl) %>%
 addTiles() %>%
 addPolygons(data=StLouisBND, color = "#444444", weight = 3, smoothFactor = 0.5,
         opacity = 1.0, fillOpacity = 0.5, fill= FALSE,
         highlightOptions = highlightOptions(color = "white", weight = 2,
                           bringToFront = TRUE)) %>%
 addCircles(lng = ~xL, lat = ~yL, weight = 7, radius = 5,
  popup = paste0("Crime type: ", as.character(crimesstl$crimetype),
               "; Month: ", as.character(crimesstl$month)))


Image of the interactive map created of the distribution of crimes
Figure 2.13: Image of the interactive map created of the distribution of crimes
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

Viewing different crimes

To view different crimes you will need to refer back to the section where you viewed the different crime types in the datasets so that you can specify what crimes to select. View either the boxplot you created or the summaries you created. You will map the distribution of arson, dui, and homicide in the St. Louis area.

To create a map of a specific type of crime you will first need to create a subset of the data.

#Now view each of the crimes
#create a subset of the data
crimess22 <- subset(crimesstl, crimetype == "arson")
crimess33 <- subset(crimesstl, crimetype == "dui")
crimess44 <- subset(crimesstl, crimetype == "homicide")

#Check to see that the selection worked and view the records in the subset. 
crimess22
crimess33
crimess44

#Create an individual plot of each crime to show the variation in crime distribution.
# To view a map using ggplot the shapefile needs to be converted to a spatial data frame.
stLbnd_df <- as_Spatial(StLouisBND)

#create the individual maps using ggplot 
g1<-ggplot() + 
      geom_polygon(data=stLbnd_df, aes(x=long, y=lat,group=group),color='black',size = .2, fill=NA) +
      geom_point(data = crimesstl, aes(x = xL, y = yL),color = "black", size = 1) + ggtitle("All Crimes") +
      coord_fixed(1.3)

g2<- ggplot() + 
     geom_polygon(data=stLbnd_df, aes(x=long, y=lat,group=group),color='black',size = .2, fill=NA) +
      geom_point(data = crimess22, aes(x = xL, y = yL),color = "black", size = 1) +
  ggtitle("Arson") +
      coord_fixed(1.3)

g3<- ggplot() + 
       geom_polygon(data=stLbnd_df, aes(x=long, y=lat,group=group),color='black',size = .2, fill=NA) +
        geom_point(data = crimess33, aes(x = xL, y = yL), color = "black", size = 1) + ggtitle("DUI") +
      coord_fixed(1.3)

g4<- ggplot() + 
  geom_polygon(data=stLbnd_df, aes(x=long, y=lat,group=group),color='black',size = .2, fill=NA) +    
  geom_point(data = crimess44, aes(x = xL, y = yL), color = "black", size = 1) + ggtitle("Homicide") +
      coord_fixed(1.3)


#Arrange the plots in a 2x2 column

grid.arrange(g1,g2,g3,g4, nrow=2,ncol=2)
Distribution of the different crimes
Figure 2.14: Distribution of the different crimes
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0
#Create an interactive map for each crimetype
#To view a particular crime you will want to create a map using a subset of the data.
#Or use leaflet and include a background map to put the crime in context.

leaflet(crimesstl) %>% 
  addTiles() %>%
  addPolygons(data=StLouisBND, color = "#444444", weight = 3, smoothFactor = 0.5,
                  opacity = 1.0, fillOpacity = 0.5, fill= FALSE,
                  highlightOptions = highlightOptions(color = "white", weight = 2,
                                                      bringToFront = TRUE)) %>%
  
  addCircles(data = crimess22, lng = crimess22$xL, lat = crimess22$yL, weight = 5, radius = 10,
  popup = paste0("Crime type: ", as.character(crimess22$crimetype),
               "; Month: ", as.character(crimess22$month)))


#Turn these on when you are ready to view them. To do so remove the # and switch off the addCircles above by adding a #
#addCircles(data = crimess33, lng = crimess33$xL, lat = crimess33$yL, weight = 5, radius = 10,
  #popup = paste0("Crime type: ", as.character(crimess33$crimetype),
               #"; Month: ", as.character(crimess33$month)))

#addCircles(data = crimess44, lng = crimess44$xL, lat = crimess44$yL, weight = 5, radius = 10,
  #popup = paste0("Crime type: ", as.character(crimess44$crimetype),
               #"; Month: ", as.character(crimess44$month)))

OpenStreetMap: Interactive map of arson events
Figure 2.15: Interactive map of arson events
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

Project 2: Finishing Up

Project 2: Finishing Up fck2

Summary of Project 2 Deliverables

To give you an idea of the work that will be required for this project, here is a summary of the minimum items you will create for Project 2.

  • Create necessary maps, graphics, or tables of the data. Include in your written report any and all maps, graphics, and tables that are needed to support your analysis.
  • The analysis should help you present the spatial patterns of crime types across the city as well as what evidence exists showing any temporal trend in crime overall and for individual crime types.

Note that this summary does not supersede elements requested in the main text of this project (refer back to those for full details). Also, you should include discussions of issues important to the lesson material in your write-up, even if they are not explicitly mentioned here.

Use the questions below to organize the narrative that you craft about your analysis.

  • Discuss Overall Findings
    1. What types of crime were recorded in the city?
    2. Was there a temporal element associated with each of the crimes? If so, then were there any considerations about the temporal element in your analysis.
    3. How did crimes changes between the years?
    4. Where did crimes take place? Was there a particular spatial distribution associated with each of the crimes? Did they take place in a particular part of the city?
    5. What additional types of analysis would be useful?
  • Visualizing the Data
    1. Map the data: Explain the spatial distribution of crimes?
    2. Create some graphs:
      1. Of the crimes in the dataset, is there a crime that occurs more than others?
      2. Is there a month when crimes occur more than others?
      3. Have crimes increased or decreased over time?
  • Descriptive Statistics
    1. What type of statistical distribution does your data have?
    2. What is the typical value in this dataset?
    3. How widely do values in the dataset vary?
    4. Are there any unusually high or low value in this dataset?

Please upload your write-up (in Word or PDF format) in the Project 2 Drop Box.

That's it for Project 2!

Project 2: (Optional) Try This - R Markdown

Project 2: (Optional) Try This - R Markdown ksc17

You've managed to complete your first analysis in RStudio, congratulations! As was mentioned earlier in the lesson, R also offers a notebook-like environment in which to do statistical analysis. If you'd like to try this out, follow the instructions on this page to reproduce your Lesson 2 analysis in an R Markdown file.

R Markdown: a quick overview

Since typing commands into the console can get tedious after a while, you might like to experiment with R Markdown.

R Markdown acts as a notebook where you can integrate your notes or the text of a report with your analysis. When you are finished, save the file; if you need to return to your analysis, you can load the file into R and run your analysis again. Once you have completed the analysis and write-up, you can create a final document.

Here are some resources to help you better understand RMarkdown and other tools.

Before doing anything, make sure you have installed and loaded the rmarkdown package.

 install.packages("rmarkdown")
library(rmarkdown)

To create a new RMarkdown file, go to File – New File – R Markdown. Select the Output Format and Document. Select HTML (Figure 2.16).

New R Markdown window
Figure 2.16. Window for creating a new R Markdown File.
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

Opening the RMD File and Coding

In the Lesson 2 data folder, you will find a .rmd file.

In RStudio, load the .rmd file, File – Open File. Select the Les2_crimeAnalysis.Rmd file from the Lesson 2 folder.

The file will open in a window in the top left quadrant of RStudio. This is essentially a text file where you can add your R code, run your analysis, and write up your notes and results.

The header is where you should add a title, date, and your name as the author. Try customizing the information there.

R code should be written in the grey areas between the ``` and ``` comments. These grey areas are called "chunks".

Note that the command {r, echo=TRUE), written at the top of each R code chunk, will include the R code in the output (see line 20 in Figure 2.17). If you do not want to include the R code, then set echo=FALSE. There are several chunk options that you can explore on your own.

  • include = FALSE prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
  • echo = FALSE prevents code, but not the results from appearing in the finished file. This is a useful way to embed figures.
  • message = FALSE prevents messages that are generated by code from appearing in the finished file.
  • warning = FALSE prevents warnings that are generated by code from appearing in the finished.
  • fig.cap = "..." adds a caption to graphical results.

Of course, setting any of these options to TRUE creates the opposite effect. Chunk options must be written in one line; no line breaks are allowed inside chunk options.

Once you have added what you need to the R Markdown file, you can use it to create a formatted text document, as shown below.

Annotated screenshot of an R Markdown file and an example of the output produced by running the R Markdown file. Annotations show where header information, R code and report text can be found within an R Markdown file

Figure 2.17: Example of the rmd file that you edit and a view of a formatted output.

Annotated screenshot of an R Markdown file and an example of the output produced by running the R Markdown file. Annotations show where header information, R code, and report text can be found within an R Markdown file.
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

For additional information on formatting, see the RStudio cheat sheet.

To execute the R code and create a formatted document, use Knit. Knit integrates the pieces in the file to create the output document, in this case an html_document. The Run button will execute the R code and integrate the results into the notebook. There are different options available for running your code. You can run all of your code or just a chunk of code. To run just one chunk, click the green arrow near the top of the chunk.

If you are having trouble with running just one chunk, it might be because the chunk depends on something that was done in a prior chunk. If you haven't run that prior chunk, then those results may not be available, causing either an error or an erroneous result. We recommend generally using the Knit approach or the Run Analysis button at the top (as shown in Figure 2.18).

Try the Run and Knit buttons

Screenshot view of RStudio with an RMarkdown file.
Figure 2.18: View of RStudio with an RMarkdown file.
Credit: Blanford, © Penn State University, licensed under CC BY-NC-SA 4.0

Note for Figure 2.18:

You probably set the output of the "knit" to HTML. Oher output options are available and can be set at the top of the markdown file.

title: "Lesson1_RMarkdownProject1doc"
author: "JustineB"
date: "January 24, 2019"
output:
word_document: default
html_document: default
editor_options:
chunk_output_type: inline

always_allow_html: yes

Add the "always_allow_html: yes" to the bottom

Your Turn!

Now you've seen how the analysis works in an R Markdown file. Try adding another chunk (Insert-R) to the R Markdown document and then copy the next blocks of your analysis code into the new chunk in the R Markdown file and then try to Knit the file. It's best to add one chunk at a time and then test the file/code by knitting it. That will make it easier to find any problems.

If you're successful, continue on until you've built out your whole R Markdown file. If you get stuck, post a description of your problem to the Lesson 2 discussion forum and ask for some help!