Introduction to Dplyr

Dplyr is an R package, designed for the purpose of manipulating data. This package is the creation of Hadley Wickem, and it offers many advantageous features. For those of you familiar with Plyr, Dplyr is the next iteration, and it specifically focuses only on dataframes.

Why You Should Use It

Efficiency. Simplicity. Speed.

Dplyr provides tools to allow for increased speed, smaller utilization of memory, and efficiency/simplicity in writing functions. For instance:

Speed: dplyr functions are often faster than some of the more traditional functions in R.
Direct connection to external databases: dplyr allows you to access your database without reading it entirely into memory.
Syntax simplicity: dplyr provides functions, in the form of simple “verbs”, to help tackle the most common data manipulation tasks and help you translate your thoughts into code.
Function chaining: dplyr incorporates pipe operators, %>%, that allow functions to be chained together. This allows users to write code in the order that they want the functions performed, rather than a nested format, in which inner most functions are performed first before outmost functions. In addition to improved ease of reading code, piping also reduces the amount of intermittent objects needed to complete the task.

For more details regarding the benefits to dplyr, I recommend checking out this blog post created by Zev Ross. However, when reading please substitute the old piping syntax %.% for the updated syntax %>%.

Functions

Dplyr implements the following verbs useful for data manipulation:

select(): evaluates a subset of specified variables (columns) within a dataset.
filter(): extracts data (all columns) pertaining to a row that meets certain criteria.
arrange(): changes the ordering of the rows.
mutate(): adds new variables that are functions of existing variables; displays all data included new variables.
transmute(): adds new variables that are functions of existing variables; displays only these new variables.
summarise(): reduces each group to a smaller number of summary statistics.
group_by(): is sub-function often used in conjunction with the summarise() function. Specifically, group_by() creates a group object, which clusters the data using one or more variables. Then summarise() can be used, where functions are evaluated on each of the unique groups.
sample_n(): randomly samples a total of n row from the dataset.
sample_frac(): randomly samples a fraction of rows from the dataset.
rename(): renames a variable (column name) in the dataset.

Dplyr functions used with select() – aka select helpers functions:

starts_with(): starts with a prefix.
ends_with(): ends with a prefix.
contains(): contains a literal string.
matches(): matches a regular expression.
num_range(): selects columns that have a similar formatting but differ based on a numerical range like x01, x02, x03.
one_of(): variables in character vector.
everything(): all variables.

Getting Started

First, download the dplyr package and tidyverse package. The dplyr package will give you access to all of the functions mentioned above. The tidyverse package will assist in transforming each dataframe into a tibble. This will be done for convenience purposes since printing a tibble only shows the first ten rows, rather than the entire dataset. For more information regarding tibbles, look here.

library(dplyr)
library(tidyverse)

Next, download three datasets that will be used in various examples. The first dataset is called Credit.

library(ISLR)
Credit <- as_tibble(ISLR::Credit)
Credit

## # A tibble: 400 x 12
##       ID Income Limit Rating Cards   Age Education Gender  Student Married
##    <int>  <dbl> <int>  <int> <int> <int>     <int> <fct>   <fct>   <fct>  
##  1     1   14.9  3606    283     2    34        11 " Male" No      Yes    
##  2     2  106.   6645    483     3    82        15 Female  Yes     Yes    
##  3     3  105.   7075    514     4    71        11 " Male" No      No     
##  4     4  149.   9504    681     3    36        11 Female  No      No     
##  5     5   55.9  4897    357     2    68        16 " Male" No      Yes    
##  6     6   80.2  8047    569     4    77        10 " Male" No      No     
##  7     7   21.0  3388    259     2    37        12 Female  No      No     
##  8     8   71.4  7114    512     2    87         9 " Male" No      No     
##  9     9   15.1  3300    266     5    66        13 Female  No      No     
## 10    10   71.1  6819    491     3    41        19 Female  Yes     Yes    
## # ... with 390 more rows, and 2 more variables: Ethnicity <fct>,
## #   Balance <int>

The second dataset is called Flights.

library(nycflights13)
flights <- as_tibble(nycflights13::flights)
flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

The third dataset is called Air Quality.

library(datasets)
airquality <- as_tibble(datasets::airquality)
airquality

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    36     118   8      72     5     2
##  3    12     149  12.6    74     5     3
##  4    18     313  11.5    62     5     4
##  5    NA      NA  14.3    56     5     5
##  6    28      NA  14.9    66     5     6
##  7    23     299   8.6    65     5     7
##  8    19      99  13.8    59     5     8
##  9     8      19  20.1    61     5     9
## 10    NA     194   8.6    69     5    10
## # ... with 143 more rows

Examples

In the following examples I will demonstrate how to use each of the dplyr functions, and I will also show how to do that same task using Base R. Each example will be formatted so (a) performs the task using dplyr (b) performs the task using Base R.

I) Select Function

Example 1

The purpose of Example 1 is to select specific variables of the airquality dataset, and display the subset of data in the exact order shown within the function (e.g- “Month, Day, Ozone, Temp”). Using DPLYR (a) this is done using the select function, and using Base R (b) this is done by subsetting the airquality dataset using the which function.

Example 1(a): DPLYR

select(airquality, Month, Day, Ozone, Temp)

## # A tibble: 153 x 4
##    Month   Day Ozone  Temp
##    <int> <int> <int> <int>
##  1     5     1    41    67
##  2     5     2    36    72
##  3     5     3    12    74
##  4     5     4    18    62
##  5     5     5    NA    56
##  6     5     6    28    66
##  7     5     7    23    65
##  8     5     8    19    59
##  9     5     9     8    61
## 10     5    10    NA    69
## # ... with 143 more rows

Example 1(b): Base R

airquality[,c(which(colnames(airquality)=="Month"), which(colnames(airquality)=="Day"), 
              which(colnames(airquality)=="Ozone"), which(colnames(airquality)=="Temp"))]

## # A tibble: 153 x 4
##    Month   Day Ozone  Temp
##    <int> <int> <int> <int>
##  1     5     1    41    67
##  2     5     2    36    72
##  3     5     3    12    74
##  4     5     4    18    62
##  5     5     5    NA    56
##  6     5     6    28    66
##  7     5     7    23    65
##  8     5     8    19    59
##  9     5     9     8    61
## 10     5    10    NA    69
## # ... with 143 more rows

Example 2

The purpose of Example 2 is to display the subset of data in the exact order shown within the function (e.g- “Temp, Month, Day”). Using DPLYR (a) this is done using the select function and ‘:’ notation, which means show all variables between the variable on the left of the colon and the variable on the right of colon. Using Base R (b), this is done by subsetting the airquality dataset using the which function.

Example 2(a): DPLYR

select(airquality, Temp:Day)

## # A tibble: 153 x 3
##     Temp Month   Day
##    <int> <int> <int>
##  1    67     5     1
##  2    72     5     2
##  3    74     5     3
##  4    62     5     4
##  5    56     5     5
##  6    66     5     6
##  7    65     5     7
##  8    59     5     8
##  9    61     5     9
## 10    69     5    10
## # ... with 143 more rows

Example 2(b): Base R

airquality[,c(which(colnames(airquality)=="Temp"), which(colnames(airquality)=="Month"), which(colnames(airquality)=="Day"))]

## # A tibble: 153 x 3
##     Temp Month   Day
##    <int> <int> <int>
##  1    67     5     1
##  2    72     5     2
##  3    74     5     3
##  4    62     5     4
##  5    56     5     5
##  6    66     5     6
##  7    65     5     7
##  8    59     5     8
##  9    61     5     9
## 10    69     5    10
## # ... with 143 more rows

Example 3

Example 3 is similar to Example 2 in that its purpose is to display a subset of varibales. However, the ‘-’ sign in Example 3, removes all of the variables listed, and displays only the remaining variables in the dataset.

Example 3(a): DPLYR

select(airquality, -(Temp:Day))

## # A tibble: 153 x 3
##    Ozone Solar.R  Wind
##    <int>   <int> <dbl>
##  1    41     190   7.4
##  2    36     118   8  
##  3    12     149  12.6
##  4    18     313  11.5
##  5    NA      NA  14.3
##  6    28      NA  14.9
##  7    23     299   8.6
##  8    19      99  13.8
##  9     8      19  20.1
## 10    NA     194   8.6
## # ... with 143 more rows

Example 3(b): Base R

airquality[,-c(which(colnames(airquality)=="Temp"), which(colnames(airquality)=="Month"), which(colnames(airquality)=="Day"))]

## # A tibble: 153 x 3
##    Ozone Solar.R  Wind
##    <int>   <int> <dbl>
##  1    41     190   7.4
##  2    36     118   8  
##  3    12     149  12.6
##  4    18     313  11.5
##  5    NA      NA  14.3
##  6    28      NA  14.9
##  7    23     299   8.6
##  8    19      99  13.8
##  9     8      19  20.1
## 10    NA     194   8.6
## # ... with 143 more rows

II) Select Helper Functions

Example 1

The purpose of Example 1 is to identify all variables (column names) in the dataset that start with a letter “E”. Using DPLYR (a) this is done simulataneously using both the select and starts_with functions. Using Base R (b) this is done by subsetting the dataset and using the startsWith function. For (a) capitalization is not important; however, for (b) capitalization is important– variables need to be exactly the same as the way they exist within the dataset in order for them to be recognized by the function.

Example 1(a): DPLYR

dplyr::select(Credit, starts_with("E"))

## # A tibble: 400 x 2
##    Education Ethnicity       
##        <int> <fct>           
##  1        11 Caucasian       
##  2        15 Asian           
##  3        11 Asian           
##  4        11 Asian           
##  5        16 Caucasian       
##  6        10 Caucasian       
##  7        12 African American
##  8         9 Asian           
##  9        13 Caucasian       
## 10        19 African American
## # ... with 390 more rows

Example 1(b): Base R

Credit[,startsWith(colnames(ISLR::Credit), "E")]

## # A tibble: 400 x 2
##    Education Ethnicity       
##        <int> <fct>           
##  1        11 Caucasian       
##  2        15 Asian           
##  3        11 Asian           
##  4        11 Asian           
##  5        16 Caucasian       
##  6        10 Caucasian       
##  7        12 African American
##  8         9 Asian           
##  9        13 Caucasian       
## 10        19 African American
## # ... with 390 more rows

Example 2

The purpose of Example 2 is to identify and select all variables in the dataset that end with a letter “E”. Using DPLYR (a) this is done simulataneously using both the select and ends_with functions. Using Base R (b) this is done by subsetting the dataset and using the endsWith function. For (a) capitalization is not important; however, for (b) capitalization is important– variables need to be exactly the same as the way they exist within the dataset in order for them to be recognized by the function.

Example 2(a): DPLYR

dplyr::select(Credit, ends_with("E"))

## # A tibble: 400 x 3
##    Income   Age Balance
##     <dbl> <int>   <int>
##  1   14.9    34     333
##  2  106.     82     903
##  3  105.     71     580
##  4  149.     36     964
##  5   55.9    68     331
##  6   80.2    77    1151
##  7   21.0    37     203
##  8   71.4    87     872
##  9   15.1    66     279
## 10   71.1    41    1350
## # ... with 390 more rows

Example 2(b): Base R

Credit[,endsWith(colnames(ISLR::Credit), "e")]

## # A tibble: 400 x 3
##    Income   Age Balance
##     <dbl> <int>   <int>
##  1   14.9    34     333
##  2  106.     82     903
##  3  105.     71     580
##  4  149.     36     964
##  5   55.9    68     331
##  6   80.2    77    1151
##  7   21.0    37     203
##  8   71.4    87     872
##  9   15.1    66     279
## 10   71.1    41    1350
## # ... with 390 more rows

Example 3

The purpose of Example 3 is to select all variables in the data that contain the sequence of letters ‘arr’. Using DPLYR (a) this is done simulataneously using both the select and matches functions. Using Base R (b) this is done by subsetting the dataset and using the grepl function.

Example 3(a): DPLYR

dplyr::select(flights, matches("arr"))

## # A tibble: 336,776 x 4
##    arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl> <chr>  
##  1      830            819        11 UA     
##  2      850            830        20 UA     
##  3      923            850        33 AA     
##  4     1004           1022       -18 B6     
##  5      812            837       -25 DL     
##  6      740            728        12 UA     
##  7      913            854        19 B6     
##  8      709            723       -14 EV     
##  9      838            846        -8 B6     
## 10      753            745         8 AA     
## # ... with 336,766 more rows

Example 3(b): Base R

flights[,grepl("arr", colnames(flights))]

## # A tibble: 336,776 x 4
##    arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl> <chr>  
##  1      830            819        11 UA     
##  2      850            830        20 UA     
##  3      923            850        33 AA     
##  4     1004           1022       -18 B6     
##  5      812            837       -25 DL     
##  6      740            728        12 UA     
##  7      913            854        19 B6     
##  8      709            723       -14 EV     
##  9      838            846        -8 B6     
## 10      753            745         8 AA     
## # ... with 336,766 more rows

Example 4

The purpose of Example 4 is to select all variables in the data that contain the sequence of letters ‘time’. Using DPLYR (a) this is done simulataneously using both the select and contains functions. Using Base R (b) this is done by subsetting the dataset and using the grepl function.

Example 4(a): DPLYR

dplyr::select(flights, contains("time"))

## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # ... with 336,766 more rows, and 1 more variable: time_hour <dttm>

Example 4(b): Base R

flights[,grepl("time", colnames(flights))]

## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # ... with 336,766 more rows, and 1 more variable: time_hour <dttm>

Example 5

In this example, suppose that the column names of a dataset have a similar formatting but differ based on a numerical range, e.g- x01, x02, x03, x04, etc. The purpose of Example 5 is to select only variables that have column names between the numerical range of 8-11. Using DPLYR (a) this is done simulataneously using both the select and num_range functions. Using Base R (b) this is done by subsetting the dataset and using the which function.

Example 5(a): DPLYR

flight_data <- flights 
colnames(flight_data) <- sprintf("x%d", 1:19) 
dplyr::select(flight_data, num_range("x", 8:11))

## # A tibble: 336,776 x 4
##       x8    x9 x10     x11
##    <int> <dbl> <chr> <int>
##  1   819    11 UA     1545
##  2   830    20 UA     1714
##  3   850    33 AA     1141
##  4  1022   -18 B6      725
##  5   837   -25 DL      461
##  6   728    12 UA     1696
##  7   854    19 B6      507
##  8   723   -14 EV     5708
##  9   846    -8 B6       79
## 10   745     8 AA      301
## # ... with 336,766 more rows

Example 5(b): Base R

flight_data <- flights 
colnames(flight_data) <- sprintf("x%d", 1:19) 
flight_data[,c(which(colnames(flight_data)=="x8"), which(colnames(flight_data)=="x9"), 
               which(colnames(flight_data)=="x10"), which(colnames(flight_data)=="x11"))]

## # A tibble: 336,776 x 4
##       x8    x9 x10     x11
##    <int> <dbl> <chr> <int>
##  1   819    11 UA     1545
##  2   830    20 UA     1714
##  3   850    33 AA     1141
##  4  1022   -18 B6      725
##  5   837   -25 DL      461
##  6   728    12 UA     1696
##  7   854    19 B6      507
##  8   723   -14 EV     5708
##  9   846    -8 B6       79
## 10   745     8 AA      301
## # ... with 336,766 more rows

III) Filter Function

Example 1

The purpose of Example 1 is to extract data pertaining to all rows (airquality days) in the month of September. Using DPLYR (a) this is done using the filter function. Using Base R (b) this is done by simultaneously subsetting and specifying the condition of interest.

Example 1(a): DPLYR

dplyr::filter(airquality, Month == 9)

## # A tibble: 30 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    96     167   6.9    91     9     1
##  2    78     197   5.1    92     9     2
##  3    73     183   2.8    93     9     3
##  4    91     189   4.6    93     9     4
##  5    47      95   7.4    87     9     5
##  6    32      92  15.5    84     9     6
##  7    20     252  10.9    80     9     7
##  8    23     220  10.3    78     9     8
##  9    21     230  10.9    75     9     9
## 10    24     259   9.7    73     9    10
## # ... with 20 more rows

Example 1(b): Base R

airquality[airquality$Month == 9, ]

## # A tibble: 30 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    96     167   6.9    91     9     1
##  2    78     197   5.1    92     9     2
##  3    73     183   2.8    93     9     3
##  4    91     189   4.6    93     9     4
##  5    47      95   7.4    87     9     5
##  6    32      92  15.5    84     9     6
##  7    20     252  10.9    80     9     7
##  8    23     220  10.3    78     9     8
##  9    21     230  10.9    75     9     9
## 10    24     259   9.7    73     9    10
## # ... with 20 more rows

Example 2

The purpose of Example 2 is to extract data pertaining to all rows, also known as airquality days, in the month of September. Using DPLYR (a) this is done using the filter function. Using Base R (b) this is done by simultaneously subsetting and specifying the condition of interest.

Example 2(a): DPLYR

dplyr::filter(airquality, Month == 9, Solar.R > 100)

## # A tibble: 23 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    96     167   6.9    91     9     1
##  2    78     197   5.1    92     9     2
##  3    73     183   2.8    93     9     3
##  4    91     189   4.6    93     9     4
##  5    20     252  10.9    80     9     7
##  6    23     220  10.3    78     9     8
##  7    21     230  10.9    75     9     9
##  8    24     259   9.7    73     9    10
##  9    44     236  14.9    81     9    11
## 10    21     259  15.5    76     9    12
## # ... with 13 more rows

Example 2(b): Base R

airquality[airquality$Month == 9 & airquality$Solar.R > 100, ]

## # A tibble: 23 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    96     167   6.9    91     9     1
##  2    78     197   5.1    92     9     2
##  3    73     183   2.8    93     9     3
##  4    91     189   4.6    93     9     4
##  5    20     252  10.9    80     9     7
##  6    23     220  10.3    78     9     8
##  7    21     230  10.9    75     9     9
##  8    24     259   9.7    73     9    10
##  9    44     236  14.9    81     9    11
## 10    21     259  15.5    76     9    12
## # ... with 13 more rows

IV) Arrange Function

Example 1

The purpose of Example 1 is to rearrange the original airquality dataset so that the variables Month and Day are displayed in ascending order. Using DPLYR (a) this is done using the arrange function. Using Base R (b) this is done by subsetting and using the order function. Keep in mind, for both techniques, the ordering of the variables are prioritized left to right. In other words, Month is arranged in ascending first, and then Day is arranged in ascending order within the Month variables.

Example 1(a): DPLYR

dplyr::arrange(airquality, Month, Day)

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    36     118   8      72     5     2
##  3    12     149  12.6    74     5     3
##  4    18     313  11.5    62     5     4
##  5    NA      NA  14.3    56     5     5
##  6    28      NA  14.9    66     5     6
##  7    23     299   8.6    65     5     7
##  8    19      99  13.8    59     5     8
##  9     8      19  20.1    61     5     9
## 10    NA     194   8.6    69     5    10
## # ... with 143 more rows

Example 1(b): Base R

airquality[order(airquality$Month, airquality$Day),]

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    36     118   8      72     5     2
##  3    12     149  12.6    74     5     3
##  4    18     313  11.5    62     5     4
##  5    NA      NA  14.3    56     5     5
##  6    28      NA  14.9    66     5     6
##  7    23     299   8.6    65     5     7
##  8    19      99  13.8    59     5     8
##  9     8      19  20.1    61     5     9
## 10    NA     194   8.6    69     5    10
## # ... with 143 more rows

Example 2

The purpose of Example 2 is to rearrange the original airquality dataset so that the variables Day and Month are displayed in ascending order. Using DPLYR (a) this is done using the arrange function. Using Base R (b) this is done by subsetting and using the order function. The difference between Example 2 compared to Example 1 is that the ordering of the variables are prioritized differently. Specifically, Day is arranged in ascending first, and then Month is arranged in ascending order within the Day variables.

Example 2(a): DPLYR

dplyr::arrange(airquality, Day, Month)

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    NA     286   8.6    78     6     1
##  3   135     269   4.1    84     7     1
##  4    39      83   6.9    81     8     1
##  5    96     167   6.9    91     9     1
##  6    36     118   8      72     5     2
##  7    NA     287   9.7    74     6     2
##  8    49     248   9.2    85     7     2
##  9     9      24  13.8    81     8     2
## 10    78     197   5.1    92     9     2
## # ... with 143 more rows

Example 2(b): Base R

airquality[order(airquality$Day, airquality$Month),]

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    NA     286   8.6    78     6     1
##  3   135     269   4.1    84     7     1
##  4    39      83   6.9    81     8     1
##  5    96     167   6.9    91     9     1
##  6    36     118   8      72     5     2
##  7    NA     287   9.7    74     6     2
##  8    49     248   9.2    85     7     2
##  9     9      24  13.8    81     8     2
## 10    78     197   5.1    92     9     2
## # ... with 143 more rows

Example 3

The purpose of Example 3 is to rearrange the original airquality dataset so that the variables Month and Day are displayed in descending order. Using DPLYR (a) this is done using the arrange and desc function. Using Base R (b) this is done by subsetting and using the order & desc functions. Keep in mind, for both techniques, the ordering of the variables are prioritized left to right.

Example 3(a): DPLYR

dplyr::arrange(airquality, desc(Month), desc(Day))

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    20     223  11.5    68     9    30
##  2    18     131   8      76     9    29
##  3    14     191  14.3    75     9    28
##  4    NA     145  13.2    77     9    27
##  5    30     193   6.9    70     9    26
##  6    14      20  16.6    63     9    25
##  7     7      49  10.3    69     9    24
##  8    36     139  10.3    81     9    23
##  9    23      14   9.2    71     9    22
## 10    13     238  12.6    64     9    21
## # ... with 143 more rows

Example 3(b): Base R

airquality[order(-airquality$Month, -airquality$Day),]

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    20     223  11.5    68     9    30
##  2    18     131   8      76     9    29
##  3    14     191  14.3    75     9    28
##  4    NA     145  13.2    77     9    27
##  5    30     193   6.9    70     9    26
##  6    14      20  16.6    63     9    25
##  7     7      49  10.3    69     9    24
##  8    36     139  10.3    81     9    23
##  9    23      14   9.2    71     9    22
## 10    13     238  12.6    64     9    21
## # ... with 143 more rows

Example 4

The purpose of Example 4 is to rearrange the original airquality dataset so that the variable Ozone is displayed in ascending order. Using DPLYR (a) this is done using the arrange function. Using Base R (b) this is done by subsetting and using the order function.

Example 4(a): DPLYR

dplyr::arrange(airquality, Ozone)

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1     1       8   9.7    59     5    21
##  2     4      25   9.7    61     5    23
##  3     6      78  18.4    57     5    18
##  4     7      NA   6.9    74     5    11
##  5     7      48  14.3    80     7    15
##  6     7      49  10.3    69     9    24
##  7     8      19  20.1    61     5     9
##  8     9      24  13.8    81     8     2
##  9     9      36  14.3    72     8    22
## 10     9      24  10.9    71     9    14
## # ... with 143 more rows

Example 4(b): Base R

airquality[order(airquality$Ozone),]

## # A tibble: 153 x 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1     1       8   9.7    59     5    21
##  2     4      25   9.7    61     5    23
##  3     6      78  18.4    57     5    18
##  4     7      NA   6.9    74     5    11
##  5     7      48  14.3    80     7    15
##  6     7      49  10.3    69     9    24
##  7     8      19  20.1    61     5     9
##  8     9      24  13.8    81     8     2
##  9     9      36  14.3    72     8    22
## 10     9      24  10.9    71     9    14
## # ... with 143 more rows

V) Mutate Function

Example 1

The purpose of Example 1 is to add new variables, which are functions of current variables, to the existing dataset and then display the modified dataset. For instance, in Example 1, the variables gain, gain_per_hour, and speed are added to the dataset. Using DPLYR (a) this is done using the mutate function. Using Base R (b) this is done manually.

Example 1(a): DPLYR

mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60),
  speed = distance / air_time * 60)

## # A tibble: 336,776 x 22
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 15 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>,
## #   speed <dbl>

Example 1(b): Base R

flights_data <- flights 
flights_data$gain = flights_data$arr_delay - flights_data$dep_delay
flights_data$gain_per_hour = flights_data$gain / (flights_data$air_time / 60)
flights_data$speed = flights_data$distance / (flights_data$air_time * 60)
flights_data

## # A tibble: 336,776 x 22
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 15 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>,
## #   speed <dbl>

VI) Transmute Function

Example 1

The purpose of Example 1 is to add new variables, which are functions of current variables, to the existing dataset and then display only the new variables. For instance, in Example 1, the variables gain and gain_per_hour are added to the dataset. Using DPLYR (a) this is done using the transmute function. Using Base R (b) this is done manually.

Example 1(a): DPLYR

transmute(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60))

## # A tibble: 336,776 x 2
##     gain gain_per_hour
##    <dbl>         <dbl>
##  1     9          2.38
##  2    16          4.23
##  3    31         11.6 
##  4   -17         -5.57
##  5   -19         -9.83
##  6    16          6.4 
##  7    24          9.11
##  8   -11        -12.5 
##  9    -5         -2.14
## 10    10          4.35
## # ... with 336,766 more rows

Example 1(b): Base R

flights$gain = flights$arr_delay - flights$dep_delay
flights$gain_per_hour = flights$gain / (flights$air_time / 60)
flights[,c(which(colnames(flights)=="gain"), which(colnames(flights)=="gain_per_hour"))]

## # A tibble: 336,776 x 2
##     gain gain_per_hour
##    <dbl>         <dbl>
##  1     9          2.38
##  2    16          4.23
##  3    31         11.6 
##  4   -17         -5.57
##  5   -19         -9.83
##  6    16          6.4 
##  7    24          9.11
##  8   -11        -12.5 
##  9    -5         -2.14
## 10    10          4.35
## # ... with 336,766 more rows

VII) Summarise & GroupBy Function

Example 1

The purpose of Example 1 is to categorize people based on their unique Education levels. Within the Credit data there are 16 unique Education levels, represented by years of Education. In this example, the 400 people are arranged into these 16 groups. Then for each group three functions are performed: total count within each group, mean total limit for each group, and mean total balance for each group. Using DPLYR (a) this is done using the group_by and summarise function. Using Base R (b) this is done manually using a for loop.

Example 1(a): DPLYR

sort(unique(Credit$Education))

##  [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

by_Education <- group_by(Credit, Education)

mean_info <- summarise(by_Education,
  count = n(),
  mean_total_limit = mean(Limit, na.rm = TRUE),
  mean_total_balance = mean(Balance ,na.rm = TRUE))
mean_info

## # A tibble: 16 x 4
##    Education count mean_total_limit mean_total_balance
##        <int> <int>            <dbl>              <dbl>
##  1         5     1            5666                863 
##  2         6     5            5330                555 
##  3         7     8            4911.               520 
##  4         8    14            5272.               493.
##  5         9    25            5119.               692.
##  6        10    24            4145                459.
##  7        11    33            4883.               441.
##  8        12    37            4971.               599.
##  9        13    38            4124.               396.
## 10        14    48            4718.               493.
## 11        15    49            4523.               486.
## 12        16    50            4732.               584.
## 13        17    34            5319.               604.
## 14        18    22            4478.               443.
## 15        19    10            4638.               544.
## 16        20     2            4584.               421

Example 1(b): Base R

unique_id <- sort(unique(Credit$Education))
count_n <- rep(0, length(unique(Credit$Education)))
total_balance <- rep(0, length(unique(Credit$Education))) 
total_limit <- rep(0, length(unique(Credit$Education)))

for(i in 1:length(unique(Credit$Education))){
  for(j in 1:nrow(Credit)){
    if(unique_id[i]==Credit$Education[j]){
      count_n[i] = count_n[i] + 1
      total_balance[i] = Credit$Balance[j] + total_balance[i]
      total_limit[i] = Credit$Limit[j] + total_limit[i]
    }  
    
  }
}

mean_total_limit <- round(total_limit/count_n,0)
mean_total_balance <- round(total_balance/count_n,0)

df <- data.frame(unique_id, count_n, mean_total_limit, mean_total_balance)
colnames(df) <- c("ID", "Count", "Mean Total Limit", "Mean Total Balance")
df

##    ID Count Mean Total Limit Mean Total Balance
## 1   5     1             5666                863
## 2   6     5             5330                555
## 3   7     8             4911                520
## 4   8    14             5272                493
## 5   9    25             5119                692
## 6  10    24             4145                459
## 7  11    33             4883                441
## 8  12    37             4971                599
## 9  13    38             4124                396
## 10 14    48             4718                493
## 11 15    49             4523                486
## 12 16    50             4732                584
## 13 17    34             5319                604
## 14 18    22             4478                443
## 15 19    10             4638                544
## 16 20     2             4584                421

VIII) Sample_n & Sample_frac Function

Example 1

The purpose of Example 1 is to randomly sample a total of n rows from the dataset. Using DPLYR (a) this is done using the sample_n function. Using Base R (b) this is done by subsetting the dataset and using the sample function.

Example 1(a): DPLYR

set.seed(123)
sample_n(Credit, 10)

## # A tibble: 10 x 12
##       ID Income Limit Rating Cards   Age Education Gender  Student Married
##    <int>  <dbl> <int>  <int> <int> <int>     <int> <fct>   <fct>   <fct>  
##  1   116   40.4  4828    369     5    81         8 Female  No      No     
##  2   315  116.   9272    656     2    69        14 " Male" No      No     
##  3   163   63.8  7530    515     1    56        12 " Male" No      Yes    
##  4   351   30.0  1561    155     4    70        13 Female  No      Yes    
##  5   373   19.8  3782    293     2    46        16 Female  Yes     No     
##  6    18   36.5  4378    339     3    69        15 Female  No      Yes    
##  7   209   51.3  4327    320     3    46        15 " Male" No      No     
##  8   397   13.4  3838    296     5    65        17 " Male" No      No     
##  9   217   15.5  2762    215     3    60        18 " Male" No      No     
## 10   179   28.3  4391    316     2    29        10 Female  No      No     
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>

dim(sample_n(Credit, 10))

## [1] 10 12

Example 1(b): Base R

set.seed(123)
numb <- c(1:nrow(Credit))
Credit[sample(numb, size = 10, replace = FALSE),]

## # A tibble: 10 x 12
##       ID Income Limit Rating Cards   Age Education Gender  Student Married
##    <int>  <dbl> <int>  <int> <int> <int>     <int> <fct>   <fct>   <fct>  
##  1   116   40.4  4828    369     5    81         8 Female  No      No     
##  2   315  116.   9272    656     2    69        14 " Male" No      No     
##  3   163   63.8  7530    515     1    56        12 " Male" No      Yes    
##  4   351   30.0  1561    155     4    70        13 Female  No      Yes    
##  5   373   19.8  3782    293     2    46        16 Female  Yes     No     
##  6    18   36.5  4378    339     3    69        15 Female  No      Yes    
##  7   209   51.3  4327    320     3    46        15 " Male" No      No     
##  8   397   13.4  3838    296     5    65        17 " Male" No      No     
##  9   217   15.5  2762    215     3    60        18 " Male" No      No     
## 10   179   28.3  4391    316     2    29        10 Female  No      No     
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>

dim(Credit[sample(numb, size = 10, replace = FALSE),])

## [1] 10 12

Example 2

The purpose of Example 2 is to randomly sample a fraction of rows from the dataset. Using DPLYR (a) this is done using the sample_frac function. Using Base R (b) this is done by subsetting the dataset and using the sample function.

Example 2(a): DPLYR

set.seed(123)
sample_frac(Credit, 0.01)

## # A tibble: 4 x 12
##      ID Income Limit Rating Cards   Age Education Gender  Student Married
##   <int>  <dbl> <int>  <int> <int> <int>     <int> <fct>   <fct>   <fct>  
## 1   116   40.4  4828    369     5    81         8 Female  No      No     
## 2   315  116.   9272    656     2    69        14 " Male" No      No     
## 3   163   63.8  7530    515     1    56        12 " Male" No      Yes    
## 4   351   30.0  1561    155     4    70        13 Female  No      Yes    
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>

dim(sample_frac(Credit, 0.01))

## [1]  4 12

Example 2(b): Base R

frac <- round(0.01*nrow(Credit),0) #evaluate a random 1% of the data
Credit[sample(numb, size = frac, replace = FALSE),]

## # A tibble: 4 x 12
##      ID Income Limit Rating Cards   Age Education Gender Student Married
##   <int>  <dbl> <int>  <int> <int> <int>     <int> <fct>  <fct>   <fct>  
## 1   221   44.8  5765    437     3    53        13 Female Yes     No     
## 2   183   58.9  6420    459     2    66         9 Female No      Yes    
## 3   381  115.   7760    538     3    83        14 Female No      No     
## 4   180   58.0  7499    560     5    67        11 Female No      No     
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>

dim(Credit[sample(numb, size = frac, replace = FALSE),])

## [1]  4 12

IX) Rename Function

Example 1

The purpose of Example 1 is to rename the Card variable to number_credit_cards, and then displayed the modified dataset. Using DPLYR (a) this is done using the rename function. Using Base R (b) this is done using the an ifelse statement and the colnames function.

Example 1(a): DPLYR

Credit_data <- rename(Credit, number_credit_cards = Cards)
Credit_data

## # A tibble: 400 x 12
##       ID Income Limit Rating number_credit_cards   Age Education Gender 
##    <int>  <dbl> <int>  <int>               <int> <int>     <int> <fct>  
##  1     1   14.9  3606    283                   2    34        11 " Male"
##  2     2  106.   6645    483                   3    82        15 Female 
##  3     3  105.   7075    514                   4    71        11 " Male"
##  4     4  149.   9504    681                   3    36        11 Female 
##  5     5   55.9  4897    357                   2    68        16 " Male"
##  6     6   80.2  8047    569                   4    77        10 " Male"
##  7     7   21.0  3388    259                   2    37        12 Female 
##  8     8   71.4  7114    512                   2    87         9 " Male"
##  9     9   15.1  3300    266                   5    66        13 Female 
## 10    10   71.1  6819    491                   3    41        19 Female 
## # ... with 390 more rows, and 4 more variables: Student <fct>,
## #   Married <fct>, Ethnicity <fct>, Balance <int>

Example 1(b): Base R

colnames(Credit) <- ifelse(colnames(Credit)=="Cards", "number_credit_cards", colnames(Credit))
Credit

## # A tibble: 400 x 12
##       ID Income Limit Rating number_credit_cards   Age Education Gender 
##    <int>  <dbl> <int>  <int>               <int> <int>     <int> <fct>  
##  1     1   14.9  3606    283                   2    34        11 " Male"
##  2     2  106.   6645    483                   3    82        15 Female 
##  3     3  105.   7075    514                   4    71        11 " Male"
##  4     4  149.   9504    681                   3    36        11 Female 
##  5     5   55.9  4897    357                   2    68        16 " Male"
##  6     6   80.2  8047    569                   4    77        10 " Male"
##  7     7   21.0  3388    259                   2    37        12 Female 
##  8     8   71.4  7114    512                   2    87         9 " Male"
##  9     9   15.1  3300    266                   5    66        13 Female 
## 10    10   71.1  6819    491                   3    41        19 Female 
## # ... with 390 more rows, and 4 more variables: Student <fct>,
## #   Married <fct>, Ethnicity <fct>, Balance <int>

Function References

Base R Functions used for the following Dplyr functions:

select(): in base R this can be done by subsetting data
filter(): in base R this can be done by subsetting data
mutate(): in base R this can be done using transform()
arrange(): in base R this can be done using order()
sample_n(): in base R this can be done using sample()
sample_frac(): in base R this can be done using sample()
transmute(): in base R this can be done using which() & simultaneously subsetting data
rename(): in base R this can be done using ifelse()
starts_with(): in base R this can be done using startsWith()
ends_with(): in base R this can be done using endsWith()
contains(): in base R this can be done using grepl()
matches(): in base R this can be done using grepl()

Summary

Dplyr provides tools to allow for increased speed, greater efficiency, and more simplisitic syntax compared to some of the other leading methods that can perform the techniques discussed in these examples. I hope you found this useful. If you have any questions, feel free to leave a comment or reach out to me via e-mail/Twitter shown on the homepage.

Data Manipulation with DPLYR

PUBLISHED ON FEB 18, 2018 — R

Introduction to Dplyr

Why You Should Use It

Functions

Getting Started

Examples

I) Select Function

Example 1

Example 2

Example 3

II) Select Helper Functions

Example 1

Example 2

Example 3

Example 4

Example 5

III) Filter Function

Example 1

Example 2

IV) Arrange Function

Example 1

Example 2

Example 3

Example 4

V) Mutate Function

Example 1

VI) Transmute Function

Example 1

VII) Summarise & GroupBy Function

Example 1

VIII) Sample_n & Sample_frac Function

Example 1

Example 2

IX) Rename Function

Example 1

Function References

Summary

TAGS: R MARKDOWN, DATA MANIPULATION, DPLYR

home