
15 - Introduction to R

Answers to exercises

1.
Access http://richardtwatson.com/data/manheim.txt which contains details of the sales of three car models: X, Y, and Z.
1a.
Compute the number of sales for each type of model.
library(readr)
library(dplyr)
url <- 'http://richardtwatson.com/data/manheim.txt'
t <- read_delim(url, delim = ',')
t %>%
group_by(model) %>%
summarize(frequency = n())
1b.
Compute the number of sales for each type of sale.
library(readr)
library(dplyr)
url <- 'http://richardtwatson.com/data/manheim.txt'
t <- read_delim(url, delim = ',')
t %>%
group_by(sale) %>%
summarize(frequency = n())
1c.
Report the mean price for each model.
library(readr)
library(dplyr)
url <- 'http://richardtwatson.com/data/manheim.txt'
t <- read_delim(url, delim = ',')
t %>%
group_by(model) %>%
summarize(mean = mean(price))
1d.
Report the mean price for each type of sale.
library(readr)
library(dplyr)
url <- 'http://richardtwatson.com/data/manheim.txt'
t <- read_delim(url, delim = ',')
t %>%
group_by(sale) %>%
summarize(mean = mean(price))
2.
Use the 'Import Dataset' feature of RStudio to read http://richardtwatson.com/data/electricityprices.csv which contains details of electricity prices for a major city.
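As a scripted alternative to the Import Dataset menu, readr can load the same file directly (the data-frame name electricityprices matches the one used in the answers below):

```r
library(readr)
# read the same CSV file that Import Dataset would load
url <- 'http://richardtwatson.com/data/electricityprices.csv'
electricityprices <- read_csv(url)
```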
2a.
What is the maximum cost?
max(electricityprices$cost)
2b.
What is the minimum cost?
min(electricityprices$cost)
2c.
What is the mean cost?
mean(electricityprices$cost)
2d.
What is the median cost?
median(electricityprices$cost)
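Base R's summary() bundles the four statistics asked for in 2a–2d; a small sketch with an illustrative cost vector (the values are made up, not from the dataset):

```r
# illustrative stand-in for electricityprices$cost
cost <- c(25.1, 30.4, 28.7, 45.0, 22.3)
summary(cost) # reports Min., Median, Mean, and Max. (plus quartiles) in one call
```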
3.
Read http://people.terry.uga.edu/rwatson/data/wealth.csv containing details of the wealth of various countries and complete the following exercises.
3a.
Sort the table by GDP per capita.
library(readr)
library(sqldf)
url <- 'http://people.terry.uga.edu/rwatson/data/wealth.csv'
t <- read_delim(url,delim = ',')
ts <- sqldf('select * from t order by `GDP per capita`;')
3b.
What is the average GDP per capita?
library(readr)
library(sqldf)
url <- 'http://people.terry.uga.edu/rwatson/data/wealth.csv'
t <- read_delim(url,delim = ',')
sqldf('select avg(`GDP per capita`) from t;')
3c.
Compute the ratio of US GDP per capita to the average GDP per capita.
library(readr)
library(sqldf)
url <- 'http://people.terry.uga.edu/rwatson/data/wealth.csv'
t <- read_delim(url, delim = ',')
sqldf('Select `GDP per capita`/(select avg(`GDP per capita`) from t) as "US ratio" from t where Country = "United States";')
3d.
What's the correlation between GDP per capita and wealth per capita? Is the relationship statistically significant?
library(readr)
url <- 'http://people.terry.uga.edu/rwatson/data/wealth.csv'
t <- read_delim(url, delim = ',')
# a p-value below 0.05 indicates a statistically significant relationship
cor.test(t$`GDP per capita`, t$`Wealth per capita`)
4.
Merge the data for weather (database weather at richardtwatson.com discussed in the chapter) and electricity prices (Use RStudio's 'Import Dataset' to read http://people.terry.uga.edu/rwatson/data/electricityprices.csv) and compute the correlation between temperature and electricity price. Recall that the weather data are in the database table accessed earlier. Hint: MySQL might return a timestamp with decimal seconds (e.g., 2010-01-01 01:00:00.0), and you can remove the rightmost two characters using substr(), so that the two timestamp columns are of the same format and length. Also, you need to ensure that the timestamps from the two data frames are of the same data type (e.g., both character).
library(readr)
library(sqldf)
options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
library(DBI)
# connect to the database
conn <- dbConnect(RMySQL::MySQL(),"richardtwatson.com", dbname = "Weather", user = "db2", password = "student")
# Query the database and create file t for use with R
t1 <- dbGetQuery(conn,"SELECT * from record;")
url <- 'http://people.terry.uga.edu/rwatson/data/electricityprices.csv'
t2 <- read_delim(url,delim=',')
# Convert so timestamps are of the same data type
t2$timestamp <- as.character(t2$timestamp)
m <- sqldf('select t1.timestamp, airTemp, humidity, precipitation, cost from t1, t2 where t1.timestamp = t2.timestamp;')
cor.test(m$airTemp,m$cost)
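The substr() trick from the hint, trimming a trailing ".0" so the two timestamp columns have the same length, looks like this on a sample value (the timestamp shown is illustrative):

```r
ts <- "2010-01-01 01:00:00.0"
# drop the rightmost two characters (".0")
substr(ts, 1, nchar(ts) - 2) # "2010-01-01 01:00:00"
```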
5.
Get the list of failed US banks.
5a.
Determine how many banks failed in each state.
# download the table as csv file and read the file using RStudio's Import Dataset
library(sqldf)
sqldf('Select ST, count(*) as frequency from banklist group by ST;')
5b.
How many banks were not acquired (hint: the acquiring institution entry is 'No Acquirer')?
# download the table as csv file and read the file using RStudio's Import Dataset
library(sqldf)
sqldf('Select count(*) from banklist where `Acquiring.Institution` = "No Acquirer";')
5c.
How many banks were closed each year? (hint: use the lubridate package and strptime(), as sqldf does not have a YEAR function)
library(sqldf)
options(sqldf.driver = "SQLite") # to avoid a conflict with RMySQL
library(lubridate)
temp <- strptime(banklist$Closing.Date, "%d-%b-%y")
banklist$closeYear <- year(temp)
sqldf('Select closeYear, count(*) from banklist group by closeYear;')
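To illustrate the hint, strptime() parses the %d-%b-%y closing-date format and lubridate's year() extracts the year (the sample date is made up):

```r
library(lubridate)
# parse a sample closing date such as 23-Oct-15, then extract the year
d <- strptime("23-Oct-15", "%d-%b-%y")
year(d) # 2015
```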
6.
Get Table01.xls of U.S. broccoli data on farms and area harvested. Get rid of unwanted rows to create a spreadsheet for the area harvested with one header row and the 50 states. Change cells without integer values to 0 and save the file in CSV format for reading with R.
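The "change cells without integer values to 0" step can also be done in R after import; a minimal sketch, assuming non-integer cells were read in as NA (the two-row data frame is illustrative):

```r
# stand-in for the imported broccoli sheet; NA marks a non-integer cell
broccoli <- data.frame(state = c('Alabama', 'Alaska'),
                       `1974` = c(120, NA), check.names = FALSE)
# replace every NA cell with 0
broccoli[is.na(broccoli)] <- 0
```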
6a.
Reshape the data so that each observation contains state name, year, and area harvested.
# Read the file using RStudio's Import Dataset
# Set the column names, as R puts an X in front of each year
colnames(broccoli) <- c('state',1974,1978,1982,1987,1992,1997,2002,2007)
library(reshape)
m <- melt(broccoli, id = 'state')
colnames(m) <- c('state','year','acres')
6b.
Add hectares as a column in the table. Round the calculation to two decimal places.
m$hectares <- round(m$acres/2.4711, 2)
6c.
Compute total hectares harvested each year for which data are available.
library(sqldf)
sqldf('select year, sum(hectares) as "Total area" from m group by year;')
6d.
Save the reshaped file.
write.csv(m, "broccoli01.csv", row.names = FALSE)
This page is part of the promotional and support material for Data Management (open edition) by Richard T. Watson
For questions and comments please contact the author
Date revised: 10-Dec-2021