2018-07-19
plot of chunk data
plot of chunk plot1
plot of chunk plot1_reduced
popdensity
by state
.county
by state
.
ggplot(data = midwest)
from above.
plot of chunk aesthetic
plot of chunk global_aes
alpha
instead.
colors()
.
plot of chunk smooth
plot of chunk smooth_states
plot of chunk combine_geoms
plot of chunk two_geoms
plot of chunk breaks_x2
## [1] 0
## [1] 35
## [1] 2
## [1] 2
## [1] 4
## [1] 4
## [1] 12
## Error in eval(expr, envir, enclos): object 'Case_sensitive' not found
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Use ?function_name to explore the details of the function. The examples at the bottom of every R help page can be especially helpful.
dplyr for data manipulation
The dplyr package uses verbs for common data manipulation tasks. These include:
filter()
count()
arrange()
select()
mutate()
summarise()
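As a quick orientation, the six verbs listed above can be sketched on a tiny data frame. The `members` tibble and its columns are hypothetical and made up for illustration; they are not the congress data used in the rest of this document.

```r
library(dplyr)

# Hypothetical toy data (an assumption for illustration only).
members <- tibble(
  name  = c("Ann", "Bob", "Cy", "Dee"),
  party = c("D", "R", "D", "I"),
  age   = c(61, 48, 75, 52)
)

older       <- filter(members, age > 50)                     # keep rows that meet a condition
by_party    <- count(members, party)                         # number of rows per party
youngest    <- arrange(members, age)                         # sort rows, ascending by default
two_cols    <- select(members, name, age)                    # keep only the named columns
with_decade <- mutate(members, decade = floor(age / 10) * 10) # add a new column
overall     <- summarise(members, mean_age = mean(age))      # collapse to a single row
```

Each verb takes a data frame as its first argument and returns a new data frame, leaving `members` unchanged.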
https://fivethirtyeight.com/features/both-republicans-and-democrats-have-an-age-problem/
plot of chunk congress
filter
## # A tibble: 555 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house M000112 Joseph Jefferson Mansfie~ <NA>
## 2 80 house D000448 Robert Lee Doughton <NA>
## 3 80 house S000001 Adolph Joachim Sabath <NA>
## 4 80 house E000023 Charles Aubrey Eaton <NA>
## 5 80 house L000296 William <NA> Lewis <NA>
## 6 80 house G000017 James A. Gallagh~ <NA>
## 7 80 house W000265 Richard Joseph Welch <NA>
## 8 80 house B000565 Sol <NA> Bloom <NA>
## 9 80 house H000943 Merlin <NA> Hull <NA>
## 10 80 house G000169 Charles Laceille Gifford <NA>
## # ... with 545 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
>
<
>=
<=
Use the is.na function to identify congress members that have missing middle names.
## # A tibble: 102 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 senate C000133 Arthur <NA> Capper <NA>
## 2 80 senate G000418 Theodore Francis Green <NA>
## 3 80 senate M000499 Kenneth Douglas McKellar <NA>
## 4 80 senate R000112 Clyde Martin Reed <NA>
## 5 80 senate M000895 Edward Hall Moore <NA>
## 6 80 senate O000146 John Holmes Overton <NA>
## 7 80 senate M001108 James Edward Murray <NA>
## 8 80 senate M000308 Patrick Anthony McCarran <NA>
## 9 80 senate T000165 Elmer <NA> Thomas <NA>
## 10 80 senate W000021 Robert Ferdinand Wagner <NA>
## # ... with 92 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
## # A tibble: 102 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 senate C000133 Arthur <NA> Capper <NA>
## 2 80 senate G000418 Theodore Francis Green <NA>
## 3 80 senate M000499 Kenneth Douglas McKellar <NA>
## 4 80 senate R000112 Clyde Martin Reed <NA>
## 5 80 senate M000895 Edward Hall Moore <NA>
## 6 80 senate O000146 John Holmes Overton <NA>
## 7 80 senate M001108 James Edward Murray <NA>
## 8 80 senate M000308 Patrick Anthony McCarran <NA>
## 9 80 senate T000165 Elmer <NA> Thomas <NA>
## 10 80 senate W000021 Robert Ferdinand Wagner <NA>
## # ... with 92 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
## # A tibble: 1,112 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house M000112 Joseph Jefferson Mansfie~ <NA>
## 2 80 house D000448 Robert Lee Doughton <NA>
## 3 80 house S000001 Adolph Joachim Sabath <NA>
## 4 80 house E000023 Charles Aubrey Eaton <NA>
## 5 80 house L000296 William <NA> Lewis <NA>
## 6 80 house G000017 James A. Gallagh~ <NA>
## 7 80 house W000265 Richard Joseph Welch <NA>
## 8 80 house B000565 Sol <NA> Bloom <NA>
## 9 80 house H000943 Merlin <NA> Hull <NA>
## 10 80 house G000169 Charles Laceille Gifford <NA>
## # ... with 1,102 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
%in%
## # A tibble: 1,112 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house M000112 Joseph Jefferson Mansfie~ <NA>
## 2 80 house D000448 Robert Lee Doughton <NA>
## 3 80 house S000001 Adolph Joachim Sabath <NA>
## 4 80 house E000023 Charles Aubrey Eaton <NA>
## 5 80 house L000296 William <NA> Lewis <NA>
## 6 80 house G000017 James A. Gallagh~ <NA>
## 7 80 house W000265 Richard Joseph Welch <NA>
## 8 80 house B000565 Sol <NA> Bloom <NA>
## 9 80 house H000943 Merlin <NA> Hull <NA>
## 10 80 house G000169 Charles Laceille Gifford <NA>
## # ... with 1,102 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
## # A tibble: 18,080 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 81 house D000448 Robert Lee Doughton <NA>
## 2 81 house S000001 Adolph Joachim Sabath <NA>
## 3 81 house E000023 Charles Aubrey Eaton <NA>
## 4 81 house W000265 Richard Joseph Welch <NA>
## 5 81 house B000565 Sol <NA> Bloom <NA>
## 6 81 house H000943 Merlin <NA> Hull <NA>
## 7 81 house B000545 Schuyler Otis Bland <NA>
## 8 81 house K000138 John Hosea Kerr <NA>
## 9 81 house C000932 Robert <NA> Crosser <NA>
## 10 81 house K000039 John <NA> Kee <NA>
## # ... with 18,070 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
## # A tibble: 453 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house M000112 Joseph Jefferson Mansfie~ <NA>
## 2 80 house D000448 Robert Lee Doughton <NA>
## 3 80 house S000001 Adolph Joachim Sabath <NA>
## 4 80 house E000023 Charles Aubrey Eaton <NA>
## 5 80 house L000296 William <NA> Lewis <NA>
## 6 80 house G000017 James A. Gallagh~ <NA>
## 7 80 house W000265 Richard Joseph Welch <NA>
## 8 80 house B000565 Sol <NA> Bloom <NA>
## 9 80 house H000943 Merlin <NA> Hull <NA>
## 10 80 house G000169 Charles Laceille Gifford <NA>
## # ... with 443 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
## # A tibble: 15,698 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house M000112 Joseph Jefferson Mansfie~ <NA>
## 2 80 house D000448 Robert Lee Doughton <NA>
## 3 80 house S000001 Adolph Joachim Sabath <NA>
## 4 80 house E000023 Charles Aubrey Eaton <NA>
## 5 80 house W000265 Richard Joseph Welch <NA>
## 6 80 house B000565 Sol <NA> Bloom <NA>
## 7 80 house H000943 Merlin <NA> Hull <NA>
## 8 80 house G000169 Charles Laceille Gifford <NA>
## 9 80 house B000545 Schuyler Otis Bland <NA>
## 10 80 house R000358 John Marshall Robsion <NA>
## # ... with 15,688 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
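The ways of combining conditions inside filter can be sketched on a small example. The `toy` tibble below is a hypothetical stand-in with only three of the congress columns, so the row counts differ from the output shown above.

```r
library(dplyr)

# Hypothetical toy data standing in for the congress data.
toy <- tibble(
  congress   = c(80, 80, 81, 82),
  chamber    = c("house", "senate", "house", "senate"),
  middlename = c("Lee", NA, "Otis", NA)
)

# Conditions separated by commas are combined with AND, the same as `&`.
and_comma <- filter(toy, congress == 80, chamber == "senate")
and_amp   <- filter(toy, congress == 80 & chamber == "senate")

# `%in%` is shorthand for several `==` comparisons joined by `|`.
or_bar <- filter(toy, congress == 80 | congress == 81)
or_in  <- filter(toy, congress %in% c(80, 81))

# `!=` and `!` negate a condition; is.na() flags missing values.
not_80      <- filter(toy, congress != 80)
missing_mid <- filter(toy, is.na(middlename))
has_mid     <- filter(toy, !is.na(middlename))
```

Note that `middlename == NA` would not work here, because any comparison with NA returns NA; is.na() is the function to use when testing for missing values.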
count
## # A tibble: 11 x 3
## party incumbent n
## <chr> <lgl> <int>
## 1 AL FALSE 1
## 2 AL TRUE 2
## 3 D FALSE 1519
## 4 D TRUE 8771
## 5 I FALSE 13
## 6 I TRUE 50
## 7 ID FALSE 2
## 8 ID TRUE 2
## 9 L FALSE 1
## 10 R FALSE 1401
## 11 R TRUE 6873
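A minimal sketch of count on a hypothetical five-row tibble (not the congress data) shows how the number of variables passed controls the number of groups:

```r
library(dplyr)

# Hypothetical toy data with the same two column names as the output above.
toy <- tibble(
  party     = c("D", "D", "R", "R", "R"),
  incumbent = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# One row per unique combination of party and incumbent; counts go in column n.
party_counts <- count(toy, party, incumbent)

# A single variable gives one row per party; sort = TRUE puts the largest first.
party_sorted <- count(toy, party, sort = TRUE)
```

Combinations that never occur in the data do not appear in the result, which may be why the output above has 11 rows rather than 12.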
arrange
## # A tibble: 18,635 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house B000201 Edward Lewis Bartlett <NA>
## 2 81 house B000201 Edward Lewis Bartlett <NA>
## 3 82 house B000201 Edward Lewis Bartlett <NA>
## 4 83 house B000201 Edward Lewis Bartlett <NA>
## 5 84 house B000201 Edward Lewis Bartlett <NA>
## 6 85 house B000201 Edward Lewis Bartlett <NA>
## 7 86 house R000282 Ralph Julian Rivers <NA>
## 8 86 senate G000508 Ernest <NA> Gruening <NA>
## 9 86 senate B000201 Edward Lewis Bartlett <NA>
## 10 87 house R000282 Ralph Julian Rivers <NA>
## # ... with 18,625 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
## # A tibble: 18,635 x 13
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 113 house H000067 Ralph M. Hall <NA>
## 2 113 house D000355 John D. Dingell <NA>
## 3 113 house C000714 John <NA> Conyers Jr.
## 4 113 house S000480 Louise McIntosh Slaught~ <NA>
## 5 113 house R000053 Charles B. Rangel <NA>
## 6 113 house J000174 Sam Robert Johnson <NA>
## 7 113 house Y000031 C. W. Bill Young <NA>
## 8 113 house C000556 Howard <NA> Coble <NA>
## 9 113 house L000263 Sander M. Levin <NA>
## 10 113 house Y000033 Don E. Young <NA>
## # ... with 18,625 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
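The two kinds of ordering shown above, sorting by several columns and sorting in descending order, can be sketched as follows. The `toy` tibble and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(
  state    = c("TX", "AK", "AK", "NC"),
  congress = c(80, 81, 80, 80),
  age      = c(85.9, 44.7, 42.7, 83.2)
)

# Sort by state first, then by congress within each state.
by_state <- arrange(toy, state, congress)

# desc() reverses the order for that column, so the oldest member comes first.
oldest_first <- arrange(toy, desc(age))
```

Later columns in arrange only break ties left by the earlier ones.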
select
## # A tibble: 18,635 x 4
## congress chamber party age
## <int> <chr> <chr> <dbl>
## 1 80 house D 85.9
## 2 80 house D 83.2
## 3 80 house D 80.7
## 4 80 house R 78.8
## 5 80 house R 78.3
## 6 80 house R 78
## 7 80 house R 77.9
## 8 80 house D 76.8
## 9 80 house R 76
## 10 80 house R 75.8
## # ... with 18,625 more rows
starts_with()
ends_with()
contains()
matches()
num_range()
:
everything()
starts_with helper
## # A tibble: 18,635 x 2
## suffix state
## <chr> <chr>
## 1 <NA> TX
## 2 <NA> NC
## 3 <NA> IL
## 4 <NA> NJ
## 5 <NA> KY
## 6 <NA> PA
## 7 <NA> CA
## 8 <NA> NY
## 9 <NA> WI
## 10 <NA> MA
## # ... with 18,625 more rows
## # A tibble: 18,635 x 3
## firstname middlename lastname
## <chr> <chr> <chr>
## 1 Joseph Jefferson Mansfield
## 2 Robert Lee Doughton
## 3 Adolph Joachim Sabath
## 4 Charles Aubrey Eaton
## 5 William <NA> Lewis
## 6 James A. Gallagher
## 7 Richard Joseph Welch
## 8 Sol <NA> Bloom
## 9 Merlin <NA> Hull
## 10 Charles Laceille Gifford
## # ... with 18,625 more rows
## # A tibble: 18,635 x 8
## congress chamber bioguide firstname middlename lastname suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house M000112 Joseph Jefferson Mansfie~ <NA>
## 2 80 house D000448 Robert Lee Doughton <NA>
## 3 80 house S000001 Adolph Joachim Sabath <NA>
## 4 80 house E000023 Charles Aubrey Eaton <NA>
## 5 80 house L000296 William <NA> Lewis <NA>
## 6 80 house G000017 James A. Gallagh~ <NA>
## 7 80 house W000265 Richard Joseph Welch <NA>
## 8 80 house B000565 Sol <NA> Bloom <NA>
## 9 80 house H000943 Merlin <NA> Hull <NA>
## 10 80 house G000169 Charles Laceille Gifford <NA>
## # ... with 18,625 more rows, and 1 more variable: birthday <date>
## # A tibble: 18,635 x 8
## congress bioguide middlename lastna~ suffix birthday termstart age
## <int> <chr> <chr> <chr> <chr> <date> <date> <dbl>
## 1 80 M000112 Jefferson Mansfi~ <NA> 1861-02-09 1947-01-03 85.9
## 2 80 D000448 Lee Dought~ <NA> 1863-11-07 1947-01-03 83.2
## 3 80 S000001 Joachim Sabath <NA> 1866-04-04 1947-01-03 80.7
## 4 80 E000023 Aubrey Eaton <NA> 1868-03-29 1947-01-03 78.8
## 5 80 L000296 <NA> Lewis <NA> 1868-09-22 1947-01-03 78.3
## 6 80 G000017 A. Gallag~ <NA> 1869-01-16 1947-01-03 78
## 7 80 W000265 Joseph Welch <NA> 1869-02-13 1947-01-03 77.9
## 8 80 B000565 <NA> Bloom <NA> 1870-03-09 1947-01-03 76.8
## 9 80 H000943 <NA> Hull <NA> 1870-12-18 1947-01-03 76
## 10 80 G000169 Laceille Gifford <NA> 1871-03-15 1947-01-03 75.8
## # ... with 18,625 more rows
everything
## # A tibble: 18,635 x 13
## congress chamber incumbent age bioguide firstname middlename lastname
## <int> <chr> <lgl> <dbl> <chr> <chr> <chr> <chr>
## 1 80 house TRUE 85.9 M000112 Joseph Jefferson Mansfie~
## 2 80 house TRUE 83.2 D000448 Robert Lee Doughton
## 3 80 house TRUE 80.7 S000001 Adolph Joachim Sabath
## 4 80 house TRUE 78.8 E000023 Charles Aubrey Eaton
## 5 80 house FALSE 78.3 L000296 William <NA> Lewis
## 6 80 house FALSE 78 G000017 James A. Gallagh~
## 7 80 house TRUE 77.9 W000265 Richard Joseph Welch
## 8 80 house TRUE 76.8 B000565 Sol <NA> Bloom
## 9 80 house TRUE 76 H000943 Merlin <NA> Hull
## 10 80 house TRUE 75.8 G000169 Charles Laceille Gifford
## # ... with 18,625 more rows, and 5 more variables: suffix <chr>,
## # birthday <date>, state <chr>, party <chr>, termstart <date>
rename function
## # A tibble: 18,635 x 13
## congress chamber bioguide first_name middlename last_name suffix
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 80 house M000112 Joseph Jefferson Mansfield <NA>
## 2 80 house D000448 Robert Lee Doughton <NA>
## 3 80 house S000001 Adolph Joachim Sabath <NA>
## 4 80 house E000023 Charles Aubrey Eaton <NA>
## 5 80 house L000296 William <NA> Lewis <NA>
## 6 80 house G000017 James A. Gallagher <NA>
## 7 80 house W000265 Richard Joseph Welch <NA>
## 8 80 house B000565 Sol <NA> Bloom <NA>
## 9 80 house H000943 Merlin <NA> Hull <NA>
## 10 80 house G000169 Charles Laceille Gifford <NA>
## # ... with 18,625 more rows, and 6 more variables: birthday <date>,
## # state <chr>, party <chr>, incumbent <lgl>, termstart <date>, age <dbl>
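The select helpers and rename shown above can be sketched on a one-row tibble. The data are hypothetical and use only a subset of the congress column names.

```r
library(dplyr)

# Hypothetical one-row tibble borrowing a few of the column names used above.
toy <- tibble(
  congress = 80, chamber = "house", firstname = "Sol",
  lastname = "Bloom", state = "NY", age = 76.8
)

name_cols  <- select(toy, ends_with("name"))      # columns whose names end in "name"
s_cols     <- select(toy, starts_with("s"))       # columns whose names start with "s"
range_cols <- select(toy, congress:firstname)     # a contiguous range of columns
dropped    <- select(toy, -chamber)               # everything except chamber
age_first  <- select(toy, age, everything())      # move age to the front, keep the rest
renamed    <- rename(toy, first_name = firstname) # syntax is new_name = old_name
```

select keeps only the columns it is given, while rename keeps every column and changes only the names listed.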
Using the dplyr helper functions, select all the variables that start with the letter ‘c’.
num_range function.
mutate
## # A tibble: 18,635 x 6
## congress chamber state party democrat num_democrat
## <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 80 house TX D 1 10290
## 2 80 house NC D 1 10290
## 3 80 house IL D 1 10290
## 4 80 house NJ R 0 10290
## 5 80 house KY R 0 10290
## 6 80 house PA R 0 10290
## 7 80 house CA R 0 10290
## 8 80 house NY D 1 10290
## 9 80 house WI R 0 10290
## 10 80 house MA R 0 10290
## # ... with 18,625 more rows
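The output above has the same value of num_democrat on every row. A minimal sketch on hypothetical data shows why: without grouping, a summary function inside mutate is computed once over the whole column and then repeated.

```r
library(dplyr)

# Hypothetical toy data with a single party column.
toy <- tibble(party = c("D", "D", "R", "D", "R"))

with_flags <- mutate(
  toy,
  democrat     = ifelse(party == "D", 1, 0), # row-wise indicator: 1 for D, 0 otherwise
  num_democrat = sum(democrat)               # one total, recycled to every row
)
```

Columns created earlier in the same mutate call, such as `democrat` here, can be used by later expressions in that call.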
Using the diamonds data (use ?diamonds for more information on the data), use the mutate function to calculate the price per carat. Hint: this operation involves standardizing the price variable so that all prices are comparable at 1 carat.
Using mutate, calculate the rank of the original price variable and the new price variable calculated above using the min_rank function. Are there differences in the ranking of the prices?
summarise
## # A tibble: 1 x 1
## num_democrat
## <dbl>
## 1 10290
group_by
## # A tibble: 34 x 4
## congress num_democrat total prop_democrat
## <int> <dbl> <int> <dbl>
## 1 80 247 555 0.445
## 2 81 330 557 0.592
## 3 82 292 555 0.526
## 4 83 274 557 0.492
## 5 84 288 544 0.529
## 6 85 295 547 0.539
## 7 86 356 554 0.643
## 8 87 339 559 0.606
## 9 88 332 552 0.601
## 10 89 371 548 0.677
## # ... with 24 more rows
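A small sketch of how group_by changes what summarise returns, using hypothetical data with two congresses rather than the 34 shown above:

```r
library(dplyr)

# Hypothetical toy data: three members in congress 80, two in congress 81.
toy <- tibble(
  congress = c(80, 80, 80, 81, 81),
  party    = c("D", "R", "R", "D", "D")
)

flagged <- mutate(toy, democrat = ifelse(party == "D", 1, 0))

# Without grouping, summarise collapses the whole table to one row.
overall <- summarise(flagged, num_democrat = sum(democrat))

# With grouping, summarise returns one row per congress.
by_congress <- summarise(
  group_by(flagged, congress),
  num_democrat  = sum(democrat),
  total         = n(),                 # n() counts the rows in the current group
  prop_democrat = num_democrat / total
)
```

The grouping itself does not change the data; it only tells the verbs that follow to work within each group.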
plot of chunk trend2
summarise command above to calculate these values.
Instead of sum(democrat) above, suppose we used mean(democrat). What does this value return? Why does it return this value?
group_by with mutate
group_by with mutate output
## # A tibble: 18,635 x 8
## # Groups: congress [34]
## congress chamber state party democrat num_democrat total prop_democrat
## <int> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
## 1 80 house TX D 1 247 555 0.445
## 2 80 house NC D 1 247 555 0.445
## 3 80 house IL D 1 247 555 0.445
## 4 80 house NJ R 0 247 555 0.445
## 5 80 house KY R 0 247 555 0.445
## 6 80 house PA R 0 247 555 0.445
## 7 80 house CA R 0 247 555 0.445
## 8 80 house NY D 1 247 555 0.445
## 9 80 house WI R 0 247 555 0.445
## 10 80 house MA R 0 247 555 0.445
## # ... with 18,625 more rows
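The contrast with summarise can be sketched on hypothetical data: after group_by, mutate keeps every row and attaches the group-level value to each one.

```r
library(dplyr)

# Hypothetical toy data: three members in congress 80, two in congress 81.
toy <- tibble(
  congress = c(80, 80, 80, 81, 81),
  party    = c("D", "R", "R", "D", "D")
)

flagged <- mutate(toy, democrat = ifelse(party == "D", 1, 0))

per_row <- mutate(
  group_by(flagged, congress),
  num_democrat  = sum(democrat),       # computed within each congress
  total         = n(),
  prop_democrat = num_democrat / total
)

# The result is still grouped; ungroup() removes the grouping for later steps.
per_row <- ungroup(per_row)
```

This form is handy when a group-level value is needed alongside the individual rows, for example to compare each member with their congress as a whole.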
%>% is the answer
pipe_congress <- congress_age %>%
  filter(congress >= 100) %>%
  mutate(democrat = ifelse(party == 'D', 1, 0)) %>%
  group_by(congress, chamber) %>%
  summarise(
    num_democrat = sum(democrat),
    total = n(),
    prop_democrat = num_democrat / total
  )
nested_congress <- summarise(
  group_by(
    mutate(
      filter(
        congress_age, congress >= 100
      ),
      democrat = ifelse(party == 'D', 1, 0)
    ),
    congress, chamber
  ),
  num_democrat = sum(democrat),
  total = n(),
  prop_democrat = num_democrat / total
)
identical(pipe_congress, nested_congress)
## [1] TRUE
summarise(
  group_by(
    mutate(
      filter(
        diamonds,
        color %in% c('D', 'E', 'F') & cut %in% c('Fair', 'Good', 'Very Good')
      ),
      f_color = ifelse(color == 'F', 1, 0),
      vg_cut = ifelse(cut == 'Very Good', 1, 0)
    ),
    clarity
  ),
  avg = mean(carat),
  sd = sd(carat),
  avg_p = mean(price),
  num = n(),
  summary_f_color = mean(f_color),
  summary_vg_cut = mean(vg_cut)
)
## Parsed with column specification:
## cols(
## `Date / Time` = col_character(),
## City = col_character(),
## State = col_character(),
## Shape = col_character(),
## Duration = col_character(),
## Summary = col_character(),
## Posted = col_character()
## )
## # A tibble: 8,031 x 7
## `Date / Time` City State Shape Duration Summary Posted
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 12/12/14 17:30 North Wa~ PA Tria~ 5 minut~ I heard an extrem~ <NA>
## 2 12/12/14 12:40 Cartersv~ GA Unkn~ 3.6 min~ Looking up toward~ 12/12~
## 3 12/12/14 06:30 Isle of ~ <NA> Light 2 secon~ Over the Isle of ~ 12/12~
## 4 12/12/14 01:00 Miamisbu~ OH Chan~ <NA> "Bright color cha~ 12/12~
## 5 12/12/14 00:00 Spotsylv~ VA Unkn~ 1 minute "White then orang~ 12/12~
## 6 12/11/14 23:25 Kenner LA Chev~ ~1 minu~ Strange, chevron-~ 12/12~
## 7 12/11/14 23:15 Eugene OR Disk 2 minut~ Dual orange orbs ~ 12/12~
## 8 12/11/14 20:04 Phoenix AZ Chev~ 3 minut~ 4 Orange Lights S~ 12/12~
## 9 12/11/14 20:00 Franklin NC Disk 5 minut~ There were 5 or 6~ 12/12~
## 10 12/11/14 18:30 Longview WA Cyli~ 10 seco~ Two cylinder shap~ 12/12~
## # ... with 8,021 more rows
read_tsv
read_fwf
read_table
read_delim
file.choose(). For example, read_tsv(file.choose()).
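A minimal, non-interactive sketch of two of the readr functions named above. It writes a small tab-separated file to a temporary path and reads it back; the file contents and column names are hypothetical, and the temporary path stands in for whatever file.choose() would return in an interactive session.

```r
library(readr)

# Create a small tab-separated file in a temporary location (made-up data).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(
  c("city\tstate\tduration", "Eugene\tOR\t2", "Kenner\tLA\t1"),
  tsv_path
)

# read_tsv() assumes tab-separated columns.
# col_types = "cci" declares two character columns and one integer column.
sightings_tsv <- read_tsv(tsv_path, col_types = "cci")

# read_delim() is the general form: the delimiter is given explicitly.
sightings_delim <- read_delim(tsv_path, delim = "\t", col_types = "cci")
```

Supplying col_types is optional; when it is omitted, readr guesses the types and reports the guess, as in the "Parsed with column specification" message shown earlier.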
read_excel
## # A tibble: 891 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 0 3 Brau~ male 22 1 0 A/5 2~ 7.25
## 2 2 1 1 Cumi~ fema~ 38 1 0 PC 17~ 71.3
## 3 3 1 3 Heik~ fema~ 26 0 0 STON/~ 7.92
## 4 4 1 1 Futr~ fema~ 35 1 0 113803 53.1
## 5 5 0 3 Alle~ male 35 0 0 373450 8.05
## 6 6 0 3 Mora~ male NA 0 0 330877 8.46
## 7 7 0 1 McCa~ male 54 0 0 17463 51.9
## 8 8 0 3 Pals~ male 2 3 1 349909 21.1
## 9 9 1 3 John~ fema~ 27 0 2 347742 11.1
## 10 10 1 2 Nass~ fema~ 14 1 0 237736 30.1
## # ... with 881 more rows, and 2 more variables: Cabin <chr>,
## # Embarked <chr>