Introduction to tidyverse/R - short version

Brandon LeBeau

2018-07-19

Introduction to tidyverse/R - short version

Brandon LeBeau

University of Iowa

About Me

  • I’m an Assistant Professor in the College of Education
    • I enjoy model building, particularly longitudinal models, and statistical programming.
  • I’ve used R for over 10 years.
    • I have 4 R packages, 3 on CRAN, 1 on GitHub
      • simglm
      • pdfsearch
      • highlightHTML
      • SPSStoR
  • GitHub Repository for this workshop: https://github.com/lebebr01/iowa_data_science

Why teach the tidyverse

  • The tidyverse is a series of packages developed by Hadley Wickham and his team at RStudio. https://www.tidyverse.org/
  • I teach/use the tidyverse for 3 major reasons:
    • Simple functions that do one thing well
    • Consistent implementations across functions within tidyverse (i.e. common APIs)
    • Provides a framework for data manipulation

PSQF 6250

Course Setup

install.packages("tidyverse")
library(tidyverse)

Explore Data

plot of chunk data

First ggplot

ggplot(data = midwest) +
  geom_point(mapping = aes(x = popdensity, y = percollege))

plot of chunk plot1

Equivalent Code

ggplot(midwest) +
  geom_point(aes(x = popdensity, y = percollege))

plot of chunk plot1_reduced

Your Turn

  1. Try plotting popdensity by state.
  2. Try plotting county by state.
    • Does this plot work?
  3. Bonus: Try just using the ggplot(data = midwest) from above.
    • What do you get?
    • Does this make sense?

Add Aesthetics

ggplot(midwest) +
  geom_point(aes(x = popdensity, y = percollege, color = state))

plot of chunk aesthetic

Global Aesthetics

ggplot(midwest) +
  geom_point(aes(x = popdensity, y = percollege), color = 'pink')

plot of chunk global_aes

Your Turn

  1. Instead of using colors, make the shape of the points different for each state.
  2. Instead of color, use alpha instead.
    • What does this do to the plot?
  3. Try the following command: colors().
    • Try a few colors to find your favorite.
  4. What happens if you use the following code:
ggplot(midwest) + 
  geom_point(aes(x = popdensity, y = percollege, color = 'green'))

Additional Geoms

ggplot(midwest) +
  geom_smooth(aes(x = popdensity, y = percollege))

plot of chunk smooth

Add more Aesthetics

ggplot(midwest) +
  geom_smooth(aes(x = popdensity, y = percollege, linetype = state), 
              se = FALSE)

plot of chunk smooth_states

Your Turn

  1. It is possible to combine geoms, which we will do next, but try it first. Try to recreate this plot. plot of chunk combine

Layered ggplot

ggplot(midwest) +
  geom_point(aes(x = popdensity, y = percollege, color = state)) +
  geom_smooth(aes(x = popdensity, y = percollege, color = state), 
              se = FALSE)

plot of chunk combine_geoms

Remove duplicate aesthetics

ggplot(midwest, 
       aes(x = popdensity, y = percollege, color = state)) +
  geom_point() +
  geom_smooth(se = FALSE)

plot of chunk two_geoms

Your Turn

  1. Can you recreate the following figure? plot of chunk differ_aes

Brief plot customization

ggplot(midwest, 
       aes(x = popdensity, y = percollege, color = state)) +
  geom_point() + 
  scale_x_continuous("Population Density", 
                     breaks = seq(0, 80000, 20000)) + 
  scale_y_continuous("Percent College Graduates") + 
  scale_color_discrete("State")

Brief plot customization Output

plot of chunk breaks_x2

Additional ggplot2 resources

R works as a calculator

1 + 2 - 3
## [1] 0
5 * 7
## [1] 35
2/1
## [1] 2

R Calculator 2

sqrt(4)
## [1] 2
2^2
## [1] 4

Can save objects to use later

x <- 1 + 3
x
## [1] 4
x * 3
## [1] 12

R is case sensitive

case_sensitive <- 10
Case_sensitive
## Error in eval(expr, envir, enclos): object 'Case_sensitive' not found

R Functions

set.seed(1)
rnorm(n = 5, mean = 0, sd = 1)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
set.seed(1)
rnorm(5, 0, 1)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
set.seed(1)
rnorm(sd = 1, n = 5, mean = 0)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

Working through Errors

  1. Use ?function_name to explore the details of the function. The examples at the bottom of every R help page can be especially helpful.
  2. If this does not help, copy and paste the error and search on the internet.

Using dplyr for data manipulation

The dplyr package uses verbs for common data manipulation tasks. These include:

  • filter()
  • count()
  • arrange()
  • select()
  • mutate()
  • summarise()

Data

https://fivethirtyeight.com/features/both-republicans-and-democrats-have-an-age-problem/

install.packages('fivethirtyeight')
library(fivethirtyeight)

plot of chunk congress

Using filter

filter(congress_age, congress == 80)
## # A tibble: 555 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 house   M000112  Joseph    Jefferson  Mansfie~ <NA>  
##  2       80 house   D000448  Robert    Lee        Doughton <NA>  
##  3       80 house   S000001  Adolph    Joachim    Sabath   <NA>  
##  4       80 house   E000023  Charles   Aubrey     Eaton    <NA>  
##  5       80 house   L000296  William   <NA>       Lewis    <NA>  
##  6       80 house   G000017  James     A.         Gallagh~ <NA>  
##  7       80 house   W000265  Richard   Joseph     Welch    <NA>  
##  8       80 house   B000565  Sol       <NA>       Bloom    <NA>  
##  9       80 house   H000943  Merlin    <NA>       Hull     <NA>  
## 10       80 house   G000169  Charles   Laceille   Gifford  <NA>  
## # ... with 545 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Save filtered results to object

congress_80 <- filter(congress_age, congress == 80)

Other operators for numbers

  • >
  • <
  • >=
  • <=

Your Turn

  1. Select all rows where the congress member was older than 80 at the start of the term.
  2. Use the is.na function to identify congress members that have missing middlenames.

Filter character variables

senate <- filter(congress_age, chamber == 'senate')

Combine Operations - AND

filter(congress_age, congress == 80, chamber == 'senate')
## # A tibble: 102 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 senate  C000133  Arthur    <NA>       Capper   <NA>  
##  2       80 senate  G000418  Theodore  Francis    Green    <NA>  
##  3       80 senate  M000499  Kenneth   Douglas    McKellar <NA>  
##  4       80 senate  R000112  Clyde     Martin     Reed     <NA>  
##  5       80 senate  M000895  Edward    Hall       Moore    <NA>  
##  6       80 senate  O000146  John      Holmes     Overton  <NA>  
##  7       80 senate  M001108  James     Edward     Murray   <NA>  
##  8       80 senate  M000308  Patrick   Anthony    McCarran <NA>  
##  9       80 senate  T000165  Elmer     <NA>       Thomas   <NA>  
## 10       80 senate  W000021  Robert    Ferdinand  Wagner   <NA>  
## # ... with 92 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Equivalent AND Statement

filter(congress_age, congress == 80 & chamber == 'senate')
## # A tibble: 102 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 senate  C000133  Arthur    <NA>       Capper   <NA>  
##  2       80 senate  G000418  Theodore  Francis    Green    <NA>  
##  3       80 senate  M000499  Kenneth   Douglas    McKellar <NA>  
##  4       80 senate  R000112  Clyde     Martin     Reed     <NA>  
##  5       80 senate  M000895  Edward    Hall       Moore    <NA>  
##  6       80 senate  O000146  John      Holmes     Overton  <NA>  
##  7       80 senate  M001108  James     Edward     Murray   <NA>  
##  8       80 senate  M000308  Patrick   Anthony    McCarran <NA>  
##  9       80 senate  T000165  Elmer     <NA>       Thomas   <NA>  
## 10       80 senate  W000021  Robert    Ferdinand  Wagner   <NA>  
## # ... with 92 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Filter - OR

filter(congress_age, congress == 80 | congress == 81)
## # A tibble: 1,112 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 house   M000112  Joseph    Jefferson  Mansfie~ <NA>  
##  2       80 house   D000448  Robert    Lee        Doughton <NA>  
##  3       80 house   S000001  Adolph    Joachim    Sabath   <NA>  
##  4       80 house   E000023  Charles   Aubrey     Eaton    <NA>  
##  5       80 house   L000296  William   <NA>       Lewis    <NA>  
##  6       80 house   G000017  James     A.         Gallagh~ <NA>  
##  7       80 house   W000265  Richard   Joseph     Welch    <NA>  
##  8       80 house   B000565  Sol       <NA>       Bloom    <NA>  
##  9       80 house   H000943  Merlin    <NA>       Hull     <NA>  
## 10       80 house   G000169  Charles   Laceille   Gifford  <NA>  
## # ... with 1,102 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

%in%

filter(congress_age, congress %in% c(80, 81))
## # A tibble: 1,112 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 house   M000112  Joseph    Jefferson  Mansfie~ <NA>  
##  2       80 house   D000448  Robert    Lee        Doughton <NA>  
##  3       80 house   S000001  Adolph    Joachim    Sabath   <NA>  
##  4       80 house   E000023  Charles   Aubrey     Eaton    <NA>  
##  5       80 house   L000296  William   <NA>       Lewis    <NA>  
##  6       80 house   G000017  James     A.         Gallagh~ <NA>  
##  7       80 house   W000265  Richard   Joseph     Welch    <NA>  
##  8       80 house   B000565  Sol       <NA>       Bloom    <NA>  
##  9       80 house   H000943  Merlin    <NA>       Hull     <NA>  
## 10       80 house   G000169  Charles   Laceille   Gifford  <NA>  
## # ... with 1,102 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Not Operator

filter(congress_age, congress != 80)
## # A tibble: 18,080 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       81 house   D000448  Robert    Lee        Doughton <NA>  
##  2       81 house   S000001  Adolph    Joachim    Sabath   <NA>  
##  3       81 house   E000023  Charles   Aubrey     Eaton    <NA>  
##  4       81 house   W000265  Richard   Joseph     Welch    <NA>  
##  5       81 house   B000565  Sol       <NA>       Bloom    <NA>  
##  6       81 house   H000943  Merlin    <NA>       Hull     <NA>  
##  7       81 house   B000545  Schuyler  Otis       Bland    <NA>  
##  8       81 house   K000138  John      Hosea      Kerr     <NA>  
##  9       81 house   C000932  Robert    <NA>       Crosser  <NA>  
## 10       81 house   K000039  John      <NA>       Kee      <NA>  
## # ... with 18,070 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Not Operator 2

filter(congress_age, congress == 80 & !chamber == 'senate')
## # A tibble: 453 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 house   M000112  Joseph    Jefferson  Mansfie~ <NA>  
##  2       80 house   D000448  Robert    Lee        Doughton <NA>  
##  3       80 house   S000001  Adolph    Joachim    Sabath   <NA>  
##  4       80 house   E000023  Charles   Aubrey     Eaton    <NA>  
##  5       80 house   L000296  William   <NA>       Lewis    <NA>  
##  6       80 house   G000017  James     A.         Gallagh~ <NA>  
##  7       80 house   W000265  Richard   Joseph     Welch    <NA>  
##  8       80 house   B000565  Sol       <NA>       Bloom    <NA>  
##  9       80 house   H000943  Merlin    <NA>       Hull     <NA>  
## 10       80 house   G000169  Charles   Laceille   Gifford  <NA>  
## # ... with 443 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Working with logicals

filter(congress_age, incumbent)
## # A tibble: 15,698 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 house   M000112  Joseph    Jefferson  Mansfie~ <NA>  
##  2       80 house   D000448  Robert    Lee        Doughton <NA>  
##  3       80 house   S000001  Adolph    Joachim    Sabath   <NA>  
##  4       80 house   E000023  Charles   Aubrey     Eaton    <NA>  
##  5       80 house   W000265  Richard   Joseph     Welch    <NA>  
##  6       80 house   B000565  Sol       <NA>       Bloom    <NA>  
##  7       80 house   H000943  Merlin    <NA>       Hull     <NA>  
##  8       80 house   G000169  Charles   Laceille   Gifford  <NA>  
##  9       80 house   B000545  Schuyler  Otis       Bland    <NA>  
## 10       80 house   R000358  John      Marshall   Robsion  <NA>  
## # ... with 15,688 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Your Turn

  1. Select the Senators from Iowa.
  2. Select the Senators from Iowa that are not inbumbents.

All boolean options

Using count

count(congress_age, party, incumbent)
## # A tibble: 11 x 3
##    party incumbent     n
##    <chr> <lgl>     <int>
##  1 AL    FALSE         1
##  2 AL    TRUE          2
##  3 D     FALSE      1519
##  4 D     TRUE       8771
##  5 I     FALSE        13
##  6 I     TRUE         50
##  7 ID    FALSE         2
##  8 ID    TRUE          2
##  9 L     FALSE         1
## 10 R     FALSE      1401
## 11 R     TRUE       6873

Using arrange

arrange(congress_age, state, party)
## # A tibble: 18,635 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 house   B000201  Edward    Lewis      Bartlett <NA>  
##  2       81 house   B000201  Edward    Lewis      Bartlett <NA>  
##  3       82 house   B000201  Edward    Lewis      Bartlett <NA>  
##  4       83 house   B000201  Edward    Lewis      Bartlett <NA>  
##  5       84 house   B000201  Edward    Lewis      Bartlett <NA>  
##  6       85 house   B000201  Edward    Lewis      Bartlett <NA>  
##  7       86 house   R000282  Ralph     Julian     Rivers   <NA>  
##  8       86 senate  G000508  Ernest    <NA>       Gruening <NA>  
##  9       86 senate  B000201  Edward    Lewis      Bartlett <NA>  
## 10       87 house   R000282  Ralph     Julian     Rivers   <NA>  
## # ... with 18,625 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Descending Order

arrange(congress_age, desc(congress))
## # A tibble: 18,635 x 13
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1      113 house   H000067  Ralph     M.         Hall     <NA>  
##  2      113 house   D000355  John      D.         Dingell  <NA>  
##  3      113 house   C000714  John      <NA>       Conyers  Jr.   
##  4      113 house   S000480  Louise    McIntosh   Slaught~ <NA>  
##  5      113 house   R000053  Charles   B.         Rangel   <NA>  
##  6      113 house   J000174  Sam       Robert     Johnson  <NA>  
##  7      113 house   Y000031  C.        W. Bill    Young    <NA>  
##  8      113 house   C000556  Howard    <NA>       Coble    <NA>  
##  9      113 house   L000263  Sander    M.         Levin    <NA>  
## 10      113 house   Y000033  Don       E.         Young    <NA>  
## # ... with 18,625 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Your Turn

  1. Count the number of congress members from each party that are older than 80 at term start.
  2. Arrange the result from above by the variable n.

Using select

select(congress_age, congress, chamber, party, age)
## # A tibble: 18,635 x 4
##    congress chamber party   age
##       <int> <chr>   <chr> <dbl>
##  1       80 house   D      85.9
##  2       80 house   D      83.2
##  3       80 house   D      80.7
##  4       80 house   R      78.8
##  5       80 house   R      78.3
##  6       80 house   R      78  
##  7       80 house   R      77.9
##  8       80 house   D      76.8
##  9       80 house   R      76  
## 10       80 house   R      75.8
## # ... with 18,625 more rows

Helper functions

  • starts_with()
  • ends_with()
  • contains()
  • matches()
  • num_range()
  • :
  • everything()

starts_with helper

select(congress_age, starts_with('s'))
## # A tibble: 18,635 x 2
##    suffix state
##    <chr>  <chr>
##  1 <NA>   TX   
##  2 <NA>   NC   
##  3 <NA>   IL   
##  4 <NA>   NJ   
##  5 <NA>   KY   
##  6 <NA>   PA   
##  7 <NA>   CA   
##  8 <NA>   NY   
##  9 <NA>   WI   
## 10 <NA>   MA   
## # ... with 18,625 more rows

Contains helper

select(congress_age, contains('name'))
## # A tibble: 18,635 x 3
##    firstname middlename lastname 
##    <chr>     <chr>      <chr>    
##  1 Joseph    Jefferson  Mansfield
##  2 Robert    Lee        Doughton 
##  3 Adolph    Joachim    Sabath   
##  4 Charles   Aubrey     Eaton    
##  5 William   <NA>       Lewis    
##  6 James     A.         Gallagher
##  7 Richard   Joseph     Welch    
##  8 Sol       <NA>       Bloom    
##  9 Merlin    <NA>       Hull     
## 10 Charles   Laceille   Gifford  
## # ... with 18,625 more rows

Colon

select(congress_age, congress:birthday)
## # A tibble: 18,635 x 8
##    congress chamber bioguide firstname middlename lastname suffix
##       <int> <chr>   <chr>    <chr>     <chr>      <chr>    <chr> 
##  1       80 house   M000112  Joseph    Jefferson  Mansfie~ <NA>  
##  2       80 house   D000448  Robert    Lee        Doughton <NA>  
##  3       80 house   S000001  Adolph    Joachim    Sabath   <NA>  
##  4       80 house   E000023  Charles   Aubrey     Eaton    <NA>  
##  5       80 house   L000296  William   <NA>       Lewis    <NA>  
##  6       80 house   G000017  James     A.         Gallagh~ <NA>  
##  7       80 house   W000265  Richard   Joseph     Welch    <NA>  
##  8       80 house   B000565  Sol       <NA>       Bloom    <NA>  
##  9       80 house   H000943  Merlin    <NA>       Hull     <NA>  
## 10       80 house   G000169  Charles   Laceille   Gifford  <NA>  
## # ... with 18,625 more rows, and 1 more variable: birthday <date>

Drop variables

select(congress_age, -firstname, -state, -party, -incumbent, -chamber)
## # A tibble: 18,635 x 8
##    congress bioguide middlename lastna~ suffix birthday   termstart    age
##       <int> <chr>    <chr>      <chr>   <chr>  <date>     <date>     <dbl>
##  1       80 M000112  Jefferson  Mansfi~ <NA>   1861-02-09 1947-01-03  85.9
##  2       80 D000448  Lee        Dought~ <NA>   1863-11-07 1947-01-03  83.2
##  3       80 S000001  Joachim    Sabath  <NA>   1866-04-04 1947-01-03  80.7
##  4       80 E000023  Aubrey     Eaton   <NA>   1868-03-29 1947-01-03  78.8
##  5       80 L000296  <NA>       Lewis   <NA>   1868-09-22 1947-01-03  78.3
##  6       80 G000017  A.         Gallag~ <NA>   1869-01-16 1947-01-03  78  
##  7       80 W000265  Joseph     Welch   <NA>   1869-02-13 1947-01-03  77.9
##  8       80 B000565  <NA>       Bloom   <NA>   1870-03-09 1947-01-03  76.8
##  9       80 H000943  <NA>       Hull    <NA>   1870-12-18 1947-01-03  76  
## 10       80 G000169  Laceille   Gifford <NA>   1871-03-15 1947-01-03  75.8
## # ... with 18,625 more rows

Reorder with everything

select(congress_age, congress, chamber, incumbent, age, everything())
## # A tibble: 18,635 x 13
##    congress chamber incumbent   age bioguide firstname middlename lastname
##       <int> <chr>   <lgl>     <dbl> <chr>    <chr>     <chr>      <chr>   
##  1       80 house   TRUE       85.9 M000112  Joseph    Jefferson  Mansfie~
##  2       80 house   TRUE       83.2 D000448  Robert    Lee        Doughton
##  3       80 house   TRUE       80.7 S000001  Adolph    Joachim    Sabath  
##  4       80 house   TRUE       78.8 E000023  Charles   Aubrey     Eaton   
##  5       80 house   FALSE      78.3 L000296  William   <NA>       Lewis   
##  6       80 house   FALSE      78   G000017  James     A.         Gallagh~
##  7       80 house   TRUE       77.9 W000265  Richard   Joseph     Welch   
##  8       80 house   TRUE       76.8 B000565  Sol       <NA>       Bloom   
##  9       80 house   TRUE       76   H000943  Merlin    <NA>       Hull    
## 10       80 house   TRUE       75.8 G000169  Charles   Laceille   Gifford 
## # ... with 18,625 more rows, and 5 more variables: suffix <chr>,
## #   birthday <date>, state <chr>, party <chr>, termstart <date>

rename function

rename(congress_age, first_name = firstname, last_name = lastname)
## # A tibble: 18,635 x 13
##    congress chamber bioguide first_name middlename last_name suffix
##       <int> <chr>   <chr>    <chr>      <chr>      <chr>     <chr> 
##  1       80 house   M000112  Joseph     Jefferson  Mansfield <NA>  
##  2       80 house   D000448  Robert     Lee        Doughton  <NA>  
##  3       80 house   S000001  Adolph     Joachim    Sabath    <NA>  
##  4       80 house   E000023  Charles    Aubrey     Eaton     <NA>  
##  5       80 house   L000296  William    <NA>       Lewis     <NA>  
##  6       80 house   G000017  James      A.         Gallagher <NA>  
##  7       80 house   W000265  Richard    Joseph     Welch     <NA>  
##  8       80 house   B000565  Sol        <NA>       Bloom     <NA>  
##  9       80 house   H000943  Merlin     <NA>       Hull      <NA>  
## 10       80 house   G000169  Charles    Laceille   Gifford   <NA>  
## # ... with 18,625 more rows, and 6 more variables: birthday <date>,
## #   state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>

Your Turn

  1. Using the dplyr helper functions, select all the variables that start with the letter ‘c’.
  2. Rename the first three variables in the congress data to ‘x1’, ‘x2’, ‘x3’.
  3. After renaming the first three variables, use this new data (ensure you saved the previous step to an object) to select these three variables with the num_range function.

Using mutate

congress_red <- select(congress_age, congress, chamber, state, party)

mutate(congress_red, 
       democrat = ifelse(party == 'D', 1, 0),
       num_democrat = sum(democrat)
       )
## # A tibble: 18,635 x 6
##    congress chamber state party democrat num_democrat
##       <int> <chr>   <chr> <chr>    <dbl>        <dbl>
##  1       80 house   TX    D            1        10290
##  2       80 house   NC    D            1        10290
##  3       80 house   IL    D            1        10290
##  4       80 house   NJ    R            0        10290
##  5       80 house   KY    R            0        10290
##  6       80 house   PA    R            0        10290
##  7       80 house   CA    R            0        10290
##  8       80 house   NY    D            1        10290
##  9       80 house   WI    R            0        10290
## 10       80 house   MA    R            0        10290
## # ... with 18,625 more rows

Your Turn

  1. Using the diamonds data, use ?diamonds for more information on the data, use the mutate function to calculate the price per carat. Hint, this operation would involve standardizing the price variable so that all are comparable at 1 carat.
  2. Using mutate, calculate the rank of the original price variable and the new price variable calculated above using the min_rank function. Are there differences in the ranking of the prices?

Using summarise

congress_2 <- mutate(congress_age, 
       democrat = ifelse(party == 'D', 1, 0)
       )

summarise(congress_2, 
          num_democrat = sum(democrat)
          )
## # A tibble: 1 x 1
##   num_democrat
##          <dbl>
## 1        10290

group_by

congress_grp <- group_by(congress_2, congress)

summarise(congress_grp, 
          num_democrat = sum(democrat),
          total = n(),
          prop_democrat = num_democrat / total
)
## # A tibble: 34 x 4
##    congress num_democrat total prop_democrat
##       <int>        <dbl> <int>         <dbl>
##  1       80          247   555         0.445
##  2       81          330   557         0.592
##  3       82          292   555         0.526
##  4       83          274   557         0.492
##  5       84          288   544         0.529
##  6       85          295   547         0.539
##  7       86          356   554         0.643
##  8       87          339   559         0.606
##  9       88          332   552         0.601
## 10       89          371   548         0.677
## # ... with 24 more rows

Explore trend

num_dem <- summarise(congress_grp, 
                     num_democrat = sum(democrat),
                     total = n(),
                     prop_democrat = num_democrat / total
)
ggplot(num_dem, aes(x = congress, y = prop_democrat)) + 
  geom_line()

Explore trend output

plot of chunk trend2

Your Turn

  1. Suppose we wanted to calculate the number and proportion of republicans instead of democrats, assuming these are the only two parties, edit the summarise command above to calculate these values.
  2. Suppose instead of using sum(democrat) above, we used mean(democrat), what does this value return? Why does it return this value?

group_by with mutate

congress_red <- select(congress_age, congress, chamber, state, party)
congress_grp <- group_by(congress_red, congress)

mutate(congress_grp, 
       democrat = ifelse(party == 'D', 1, 0),
       num_democrat = sum(democrat),
       total = n(),
       prop_democrat = num_democrat / total
)

group_by with mutate output

## # A tibble: 18,635 x 8
## # Groups:   congress [34]
##    congress chamber state party democrat num_democrat total prop_democrat
##       <int> <chr>   <chr> <chr>    <dbl>        <dbl> <int>         <dbl>
##  1       80 house   TX    D            1          247   555         0.445
##  2       80 house   NC    D            1          247   555         0.445
##  3       80 house   IL    D            1          247   555         0.445
##  4       80 house   NJ    R            0          247   555         0.445
##  5       80 house   KY    R            0          247   555         0.445
##  6       80 house   PA    R            0          247   555         0.445
##  7       80 house   CA    R            0          247   555         0.445
##  8       80 house   NY    D            1          247   555         0.445
##  9       80 house   WI    R            0          247   555         0.445
## 10       80 house   MA    R            0          247   555         0.445
## # ... with 18,625 more rows

Chaining operations

summarise(
  group_by(
    mutate(
      filter(
        congress_age, congress >= 100
      ), 
      democrat = ifelse(party == 'D', 1, 0)
    ),
    congress, chamber
  ),
  num_democrat = sum(democrat),
  total = n(),
  prop_democrat = num_democrat / total
)

The pipe %>% is the answer

congress_age %>%
  filter(congress >= 100) %>%
  mutate(democrat = ifelse(party == 'D', 1, 0)) %>%
  group_by(congress, chamber) %>%
  summarise(
    num_democrat = sum(democrat),
    total = n(),
    prop_democrat = num_democrat / total
  )

The two are identical

pipe_congress <- congress_age %>%
  filter(congress >= 100) %>%
  mutate(democrat = ifelse(party == 'D', 1, 0)) %>%
  group_by(congress, chamber) %>%
  summarise(
    num_democrat = sum(democrat),
    total = n(),
    prop_democrat = num_democrat / total
  )

nested_congress <- summarise(
  group_by(
    mutate(
      filter(
        congress_age, congress >= 100
      ), 
      democrat = ifelse(party == 'D', 1, 0)
    ),
    congress, chamber
  ),
  num_democrat = sum(democrat),
  total = n(),
  prop_democrat = num_democrat / total
)

identical(pipe_congress, nested_congress)
## [1] TRUE

Your Turn

  1. Look at the following nested code and determine what is being done. Then translate this code to use the pipe operator.
summarise(
  group_by(
    mutate(
      filter(
        diamonds, 
        color %in% c('D', 'E', 'F') & cut %in% c('Fair', 'Good', 'Very Good')
      ),
      f_color = ifelse(color == 'F', 1, 0),
      vg_cut = ifelse(cut == 'Very Good', 1, 0)
    ),
    clarity
  ),
  avg = mean(carat),
  sd = sd(carat),
  avg_p = mean(price),
  num = n(),
  summary_f_color = mean(f_color),
  summary_vg_cut = mean(vg_cut)
)

Read in your own data

Data Import

ufo <- read_csv("https://raw.githubusercontent.com/lebebr01/iowa_data_science/master/data/ufo.csv")
## Parsed with column specification:
## cols(
##   `Date / Time` = col_character(),
##   City = col_character(),
##   State = col_character(),
##   Shape = col_character(),
##   Duration = col_character(),
##   Summary = col_character(),
##   Posted = col_character()
## )

Show Data

ufo
## # A tibble: 8,031 x 7
##    `Date / Time`  City      State Shape Duration Summary            Posted
##    <chr>          <chr>     <chr> <chr> <chr>    <chr>              <chr> 
##  1 12/12/14 17:30 North Wa~ PA    Tria~ 5 minut~ I heard an extrem~ <NA>  
##  2 12/12/14 12:40 Cartersv~ GA    Unkn~ 3.6 min~ Looking up toward~ 12/12~
##  3 12/12/14 06:30 Isle of ~ <NA>  Light 2 secon~ Over the Isle of ~ 12/12~
##  4 12/12/14 01:00 Miamisbu~ OH    Chan~ <NA>     "Bright color cha~ 12/12~
##  5 12/12/14 00:00 Spotsylv~ VA    Unkn~ 1 minute "White then orang~ 12/12~
##  6 12/11/14 23:25 Kenner    LA    Chev~ ~1 minu~ Strange, chevron-~ 12/12~
##  7 12/11/14 23:15 Eugene    OR    Disk  2 minut~ Dual orange orbs ~ 12/12~
##  8 12/11/14 20:04 Phoenix   AZ    Chev~ 3 minut~ 4 Orange Lights S~ 12/12~
##  9 12/11/14 20:00 Franklin  NC    Disk  5 minut~ There were 5 or 6~ 12/12~
## 10 12/11/14 18:30 Longview  WA    Cyli~ 10 seco~ Two cylinder shap~ 12/12~
## # ... with 8,021 more rows

Other text formats

  • tsv - tab separated files - read_tsv
  • fixed width files - read_fwf
  • white space generally - read_table
  • delimiter generally - read_delim

Your Turn

  1. There is a tsv file posted on icon called “lotr_clean.tsv”. Download this and read this data file into R.
  2. Instead of specifying the path, use the function file.choose(). For example, read_tsv(file.choose()).
    • What does this function do?
    • Would you recommend this to be used in a reproducible document?

Excel Files

install.packages('readxl')

read_excel

library(readxl)
read_excel('data/titanic.xlsx')
## # A tibble: 891 x 12
##    PassengerId Survived Pclass Name  Sex     Age SibSp Parch Ticket  Fare
##          <dbl>    <dbl>  <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>  <dbl>
##  1           1        0      3 Brau~ male     22     1     0 A/5 2~  7.25
##  2           2        1      1 Cumi~ fema~    38     1     0 PC 17~ 71.3 
##  3           3        1      3 Heik~ fema~    26     0     0 STON/~  7.92
##  4           4        1      1 Futr~ fema~    35     1     0 113803 53.1 
##  5           5        0      3 Alle~ male     35     0     0 373450  8.05
##  6           6        0      3 Mora~ male     NA     0     0 330877  8.46
##  7           7        0      1 McCa~ male     54     0     0 17463  51.9 
##  8           8        0      3 Pals~ male      2     3     1 349909 21.1 
##  9           9        1      3 John~ fema~    27     0     2 347742 11.1 
## 10          10        1      2 Nass~ fema~    14     1     0 237736 30.1 
## # ... with 881 more rows, and 2 more variables: Cabin <chr>,
## #   Embarked <chr>

Additional Resources