<h1>Web Scraping to Item Response Theory: A College Football Adventure</h1>
<h2>Brandon LeBeau, Andrew Zieffler, and Kyle Nickodem</h2>
<h3>University of Iowa & University of Minnesota</h3>
# Background
- Began after Tim Brewster was fired
- Wanted to try to predict next great coach
# Data Available
- Data is available at three levels
1. Coach
2. Game by Game
3. Team
# Coach
- Data
- Overall record
- Team history
- Not Available
- Coordinator history
# Example Coach Data
```
## Year Team Win Loss Tie Pct PF PA Delta coach
## 1 2010 Iowa 8 5 0 0.61538 376 221 155 Kirk Ferentz
## 2 2011 Iowa 7 6 0 0.53846 358 310 48 Kirk Ferentz
## 3 2012 Iowa 4 8 0 0.33333 232 275 -43 Kirk Ferentz
## 4 2013 Iowa 8 5 0 0.61538 342 246 96 Kirk Ferentz
## 5 2014 Iowa 7 6 0 0.53846 367 333 34 Kirk Ferentz
```
# Game by Game
- Data
- Final score of each game
- Date played
- Location
- Not Available
- No information within a game
# Example GBG Data
```
## Team Official Year Date WL Opponent PF PA
## 1 Iowa University of Iowa 2014 8/30/2014 W Northern Iowa 31 23
## 2 Iowa University of Iowa 2014 9/6/2014 W Ball St. (IN) 17 13
## 3 Iowa University of Iowa 2014 9/13/2014 L Iowa St. 17 20
## 4 Iowa University of Iowa 2014 9/20/2014 W Pittsburgh (PA) 24 20
## 5 Iowa University of Iowa 2014 9/27/2014 W Purdue (IN) 24 10
## 6 Iowa University of Iowa 2014 10/11/2014 W Indiana 45 29
## 7 Iowa University of Iowa 2014 10/18/2014 L Maryland 31 38
## 8 Iowa University of Iowa 2014 11/1/2014 W Northwestern (IL) 48 7
## 9 Iowa University of Iowa 2014 11/8/2014 L Minnesota 14 51
## 10 Iowa University of Iowa 2014 11/15/2014 W Illinois 30 14
## 11 Iowa University of Iowa 2014 11/22/2014 L Wisconsin 24 26
## 12 Iowa University of Iowa 2014 11/28/2014 L Nebraska 34 37
## 13 Iowa University of Iowa 2014 1/2/2015 L Tennessee 28 45
## Location
## 1 Iowa City, IA
## 2 Iowa City, IA
## 3 Iowa City, IA
## 4 Pittsburgh, PA
## 5 West Lafayette, IN
## 6 Iowa City, IA
## 7 College Park, MD
## 8 Iowa City, IA
## 9 Minneapolis, MN
## 10 Champaign, IL
## 11 Iowa City, IA
## 12 Iowa City, IA
## 13 Jacksonville, FL
```
# Team
- Data
- Overall team record
- Team statistics
- Rankings
- Conference Affiliation
- Data is very similar to that of the coach level
# Web Scraping
- Data were obtained from many sources
- Much from <http://cfbdatawarehouse.com>
- Also used wikipedia, ESPN, and rivals
# Iowa Coaches Over Time
<img src="/figs/iowa.png" alt="" height = "500" width = "1200"/>
# Iowa State Coaches Over Time
<img src="/figs/iowa_state.png" alt="" height = "500" width="1200"/>
# Strengths in web scraping
- Data is relatively easily obtained
- Structured process for obtaining data
- Can be easily updated
# Challenges of web scraping
- At the mercy of the website
- Many sites are old
- Not up to date on current design standards
- Data validation can be difficult and time consuming
- Need some basic knowledge of html
# When is Web Scraping Worthwhile?
- Best when scraping many pages
- Particularly when web addresses are not structured
- Useful when data need to be updated
<hr>
- Not useful if only scraping a single page/table
# HTML Basics
<ul>
<li> HTML is structured by start tags (e.g. <code><table></code>) and end tags (e.g. <code><⁄table></code>) </li>
<li> Common tags
</li>
</ul>
<div style="float: left; width: 75%;">
<ul>
<li><code><h1></code> - <code><h6></code></li>
<li><code><b></code> <code><i></code></li>
<li><code><a href="http://www.google.com"></code></li>
<li><code><table></code></li>
</ul>
</div>
<div style="float: right; width: 25%;">
<ul>
<li><code><p></code></li>
<li><code><ul></code> & <code><li></code></li>
<li><code><div></code></li>
<li><code><img></code></li>
</ul>
</div>
<hr>
<ul>
<li> Highly structured pages are the easiest to scrape </li>
</ul>
# HTML Code Example
<img src="/figs/ferentz_wikiside.png" alt="" height = "500" width="1200"/>
# Tools for web scraping
- R
- `rvest`: <http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>
- `XML`: <http://www.omegahat.org/RSXML/>
- Python
- `beautiful soup`: <http://www.crummy.com/software/BeautifulSoup/>
- Misc
- `SelectorGadget`: <http://selectorgadget.com/>
# Basics of rvest
- `read_html` is the most basic function
- `html_node` or `html_nodes`
- These functions need css selectors or xpath
- SelectorGadget is the easiest way to get this
# SelectorGadget
- SelectorGadget is a Javascript addon for web browsers
- Can quickly identify a css selector or xpath to select correct portion of web page
- Demo:
- <https://en.wikipedia.org/wiki/Kirk_Ferentz>
# Combine SelectorGadget with rvest
```r
library(rvest)
wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz")
wiki_kirk_extract <- wiki_kirk %>%
html_nodes(".vcard td , .vcard th")
head(wiki_kirk_extract)
```
```
## {xml_nodeset (6)}
## [1] <td colspan="2" style="text-align:center"><a href="/wiki/File:Kirk_p ...
## [2] <th scope="row">Sport(s)</th>
## [3] <td class="category">\n <a href="/wiki/American_football" title="Am ...
## [4] <th colspan="2" style="text-align:center;background-color: lightgray ...
## [5] <th scope="row">Title</th>
## [6] <td>\n <a href="/wiki/Head_coach" title="Head coach">Head coach</a> ...
```
# Extract text
- Use the `html_text` function
```r
wiki_kirk_extract <- wiki_kirk %>%
html_nodes(".vcard td , .vcard th") %>%
html_text()
head(wiki_kirk_extract)
```
```
## [1] "\nFerentz at the 2010 Orange Bowl\n"
## [2] "Sport(s)"
## [3] "Football"
## [4] "Current position"
## [5] "Title"
## [6] "Head coach"
```
# Encoding problems
- Two solutions to fix encoding problems
- `guess_encoding`
- `repair_encoding`: fix encoding problems
```r
wiki_kirk %>%
html_nodes(".vcard td , .vcard th") %>%
html_text() %>%
guess_encoding()
```
```
## encoding language confidence
## 1 UTF-8 1.00
## 2 windows-1252 en 0.36
## 3 windows-1250 ro 0.18
## 4 windows-1254 tr 0.13
## 5 UTF-16BE 0.10
## 6 UTF-16LE 0.10
```
# Fix Encoding Problems
- Best practice to reload page with correct encoding
```r
wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz",
encoding = 'UTF-8')
```
- Can also repair encoding after the fact
```r
wiki_kirk_extract <- wiki_kirk %>%
html_nodes(".vcard td , .vcard th") %>%
html_text() %>%
repair_encoding()
```
# Extract html tags
- Use the `html_name` function
```r
wiki_kirk_extract <- wiki_kirk %>%
html_nodes(".vcard td , .vcard th") %>%
html_name()
head(wiki_kirk_extract)
```
```
## [1] "td" "th" "td" "th" "th" "td"
```
# Extract html attributes
- Use the `html_attrs` function
```r
wiki_kirk_extract <- wiki_kirk %>%
html_nodes(".vcard td , .vcard th") %>%
html_attrs()
head(wiki_kirk_extract)
```
```
## [[1]]
## colspan style
## "2" "text-align:center"
##
## [[2]]
## scope
## "row"
##
## [[3]]
## class
## "category"
##
## [[4]]
## colspan
## "2"
## style
## "text-align:center;background-color: lightgray;"
##
## [[5]]
## scope
## "row"
##
## [[6]]
## named character(0)
```
# Extract links
- Use the `html_attrs` function again
```r
wiki_kirk_extract <- wiki_kirk %>%
html_nodes(".vcard a") %>%
html_attr('href')
head(wiki_kirk_extract)
```
```
## [1] "/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "/wiki/American_football"
## [3] "/wiki/Head_coach"
## [4] "/wiki/Iowa_Hawkeyes_football"
## [5] "/wiki/Big_Ten_Conference"
## [6] "/wiki/Iowa_City,_Iowa"
```
# Valid Links
- The `paste0` function is helpful for this
```r
valid_links <- paste0('https://www.wikipedia.org', wiki_kirk_extract)
head(valid_links)
```
```
## [1] "https://www.wikipedia.org/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "https://www.wikipedia.org/wiki/American_football"
## [3] "https://www.wikipedia.org/wiki/Head_coach"
## [4] "https://www.wikipedia.org/wiki/Iowa_Hawkeyes_football"
## [5] "https://www.wikipedia.org/wiki/Big_Ten_Conference"
## [6] "https://www.wikipedia.org/wiki/Iowa_City,_Iowa"
```
# Extract Tables
- The `html_table` function is useful to scrape well formatted tables
```r
record_kirk <- wiki_kirk %>%
html_nodes(".wikitable") %>%
.[[1]] %>%
html_table(fill = TRUE)
```
# Caveats to Web Scraping
- Keep in mind when scraping we are using their bandwidth
- Do not want to repeatedly do expensive bandwidth operations
- Better to scrape once, then run only to update data
- Some websites are copyrighted (i.e. illegal to scrape)
# Data Modeling
- Research Questions
1. Who is the next great coach?
2. What characteristics are in common for these coaches?
# IRT modeling
- So far we have explored the win/loss records of teams in the BCS era with item response theory (IRT)
- IRT is commonly used to model assessment data to estimate item parameters and person 'ability'
- We recode the Win/Loss/Tie game by game results
- 1 = Win
- 0 = Otherwise
# Example code with lme4
- A 1 parameter multilevel IRT model can be fitted using `glmer` in the `lme4` package
```r
library(lme4)
fm1a <- glmer(wingbg ~ 0 + (1|coach) + (1|Team),
data = yby_coach, family = binomial)
```
# Plot Showing Team Ability
<img src="/figs/team_ability.png" alt="" height = "500" width="1200"/>
# Connect
- e-mail: brandon-lebeau (at) uiowa.edu
- Twitter: @blebeau11; <https://twitter.com/blebeau11>
- Linkedin: <https://www.linkedin.com/in/lebeaubr>
- Website: <http://brandonlebeau.org>
- <http://brandonlebeau.org/2015/12/04/centraliowaruser/>