When I use `plyr`/`dplyr`

My last post I talked about how I use the data.table package for aggregating and removing duplicate observations. Although I use the data.table package quite often, there are many times when I use plyr (and now the new dplyr) package, primarily because of its easy, intuitive syntax.

Arrange

One of my personal favorite functions in the plyr suite of basic functions is the arrange function. The base functions for sorting/ordering are more difficult to use. Not to mention there have been many times that I have used the base::sort function when I really need to use the base::order function (sort to me is the word I think of first). arrange is great due to the easy, general syntax used for it as shown below:

library(dplyr)
arrange(dataframe, col1, col2, col3)

When using the base::order function, this needs to be done through the indexing operators and is not nearly as intuitive to me. I always have to think for a second to get it right. Here are two general examples:

dataframe[order(dataframe$col1, dataframe$col2, dataframe$col3), ]
with(dataframe, dataframe[order(col1, col2, col3), ])

Both involve much more typing and are more difficult to read the code in my opinion.

Simple, Intuitive syntax

The other aspect of the plyr (and dplyr) suite of functions that keeps me coming back is their simple, intuitive syntax. For example, if I am teaching a student how to aggregate or sort, plyr is my go to package. Easy to explain, easy to understand. The common structure across all of the functions is brilliantly programmed and a standard for everyone else to replicate.

New bonus use with dplyr

The new ability to use the chain function or alternatively the %>% operator is a great addition to R. One of the difficulties with code readability in R is the whenever functions are nested together. By default R interprets from inside to out, not how most of us read written words let along code. The chain function and %>% operator allows the user to write the functions in the order they will be processed by R, therefore the code can read from left to right.

Using the mtcars dataset, suppose we wanted to select the columns we wanted, aggregate the miles per gallon, and filter so we select the average miles per gallon greater than 20.

library(dplyr)
mtcars %>% 
  group_by(cyl, am) %>%
  select(mpg, cyl, wt, am) %>%
  summarise(avgmpg = mean(mpg), avgwt = mean(wt)) %>%
  filter(avgmpg > 20)
## # A tibble: 3 x 4
## # Groups:   cyl [2]
##     cyl    am avgmpg avgwt
##   <dbl> <dbl>  <dbl> <dbl>
## 1     4     0   22.9  2.94
## 2     4     1   28.1  2.04
## 3     6     1   20.6  2.76

Compare the above syntax to:

filter(
  summarise(
    select(
      group_by(mtcars, cyl, am),
      mpg, cyl, wt, am),
    avgmpg = mean(mpg), avgwt = mean(wt)),
  avgmpg > 20)
## # A tibble: 3 x 4
## # Groups:   cyl [2]
##     cyl    am avgmpg avgwt
##   <dbl> <dbl>  <dbl> <dbl>
## 1     4     0   22.9  2.94
## 2     4     1   28.1  2.04
## 3     6     1   20.6  2.76

Both chunks give you the same result, however the first one is much easier to see the process taken to get to the end result. Much easier to adapt the code to add/remove parts of it.

Conclusion

I use both data.table and plyr/dplyr packages. All of these packages are a great tool to for certian data problems. If I want to write a single line of code to do a lot of manipulations I will revert toward data.table. However, if I am writing code where I am doing more exploration or if I am collaborating with others I tend to write my code using plyr/dplyr. The versatility that both packages bring together in tandem is a great combination. I do not have time to be a complete package elitest, the correct tool for the right data problem.