Dplyr is an R package, designed for the purpose of manipulating data. This package is the creation of Hadley Wickem, and it offers many advantageous features. For those of you familiar with Plyr, Dplyr is the next iteration, and it specifically focuses only on dataframes.
Efficiency. Simplicity. Speed.
Dplyr provides tools to allow for increased speed, smaller utilization of memory, and efficiency/simplicity in writing functions. For instance:
For more details regarding the benefits to dplyr, I recommend checking out this blog post created by Zev Ross. However, when reading please substitute the old piping syntax %.% for the updated syntax %>%.
Dplyr implements the following verbs useful for data manipulation:
select(): evaluates a subset of specified variables (columns) within a dataset.
filter(): extracts data (all columns) pertaining to a row that meets certain criteria.
arrange(): changes the ordering of the rows.
mutate(): adds new variables that are functions of existing variables; displays all data included new variables.
transmute(): adds new variables that are functions of existing variables; displays only these new variables.
summarise(): reduces each group to a smaller number of summary statistics.
group_by(): is sub-function often used in conjunction with the summarise() function. Specifically, group_by() creates a group object, which clusters the data using one or more variables. Then summarise() can be used, where functions are evaluated on each of the unique groups.
sample_n(): randomly samples a total of n row from the dataset.
sample_frac(): randomly samples a fraction of rows from the dataset.
rename(): renames a variable (column name) in the dataset.
Dplyr functions used with select() – aka select helpers functions:
starts_with(): starts with a prefix.
ends_with(): ends with a prefix.
contains(): contains a literal string.
matches(): matches a regular expression.
num_range(): selects columns that have a similar formatting but differ based on a numerical range like x01, x02, x03.
one_of(): variables in character vector.
everything(): all variables.
First, download the dplyr package and tidyverse package. The dplyr package will give you access to all of the functions mentioned above. The tidyverse package will assist in transforming each dataframe into a tibble. This will be done for convenience purposes since printing a tibble only shows the first ten rows, rather than the entire dataset. For more information regarding tibbles, look here.
library(dplyr)
library(tidyverse)
Next, download three datasets that will be used in various examples. The first dataset is called Credit.
library(ISLR)
Credit <- as_tibble(ISLR::Credit)
Credit
## # A tibble: 400 x 12
## ID Income Limit Rating Cards Age Education Gender Student Married
## <int> <dbl> <int> <int> <int> <int> <int> <fct> <fct> <fct>
## 1 1 14.9 3606 283 2 34 11 " Male" No Yes
## 2 2 106. 6645 483 3 82 15 Female Yes Yes
## 3 3 105. 7075 514 4 71 11 " Male" No No
## 4 4 149. 9504 681 3 36 11 Female No No
## 5 5 55.9 4897 357 2 68 16 " Male" No Yes
## 6 6 80.2 8047 569 4 77 10 " Male" No No
## 7 7 21.0 3388 259 2 37 12 Female No No
## 8 8 71.4 7114 512 2 87 9 " Male" No No
## 9 9 15.1 3300 266 5 66 13 Female No No
## 10 10 71.1 6819 491 3 41 19 Female Yes Yes
## # ... with 390 more rows, and 2 more variables: Ethnicity <fct>,
## # Balance <int>
The second dataset is called Flights.
library(nycflights13)
flights <- as_tibble(nycflights13::flights)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The third dataset is called Air Quality.
library(datasets)
airquality <- as_tibble(datasets::airquality)
airquality
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## # ... with 143 more rows
In the following examples I will demonstrate how to use each of the dplyr functions, and I will also show how to do that same task using Base R. Each example will be formatted so (a) performs the task using dplyr (b) performs the task using Base R.
The purpose of Example 1 is to select specific variables of the airquality dataset, and display the subset of data in the exact order shown within the function (e.g- “Month, Day, Ozone, Temp”). Using DPLYR (a) this is done using the select function, and using Base R (b) this is done by subsetting the airquality dataset using the which function.
Example 1(a): DPLYR
select(airquality, Month, Day, Ozone, Temp)
## # A tibble: 153 x 4
## Month Day Ozone Temp
## <int> <int> <int> <int>
## 1 5 1 41 67
## 2 5 2 36 72
## 3 5 3 12 74
## 4 5 4 18 62
## 5 5 5 NA 56
## 6 5 6 28 66
## 7 5 7 23 65
## 8 5 8 19 59
## 9 5 9 8 61
## 10 5 10 NA 69
## # ... with 143 more rows
Example 1(b): Base R
airquality[,c(which(colnames(airquality)=="Month"), which(colnames(airquality)=="Day"),
which(colnames(airquality)=="Ozone"), which(colnames(airquality)=="Temp"))]
## # A tibble: 153 x 4
## Month Day Ozone Temp
## <int> <int> <int> <int>
## 1 5 1 41 67
## 2 5 2 36 72
## 3 5 3 12 74
## 4 5 4 18 62
## 5 5 5 NA 56
## 6 5 6 28 66
## 7 5 7 23 65
## 8 5 8 19 59
## 9 5 9 8 61
## 10 5 10 NA 69
## # ... with 143 more rows
The purpose of Example 2 is to display the subset of data in the exact order shown within the function (e.g- “Temp, Month, Day”). Using DPLYR (a) this is done using the select function and ‘:’ notation, which means show all variables between the variable on the left of the colon and the variable on the right of colon. Using Base R (b), this is done by subsetting the airquality dataset using the which function.
Example 2(a): DPLYR
select(airquality, Temp:Day)
## # A tibble: 153 x 3
## Temp Month Day
## <int> <int> <int>
## 1 67 5 1
## 2 72 5 2
## 3 74 5 3
## 4 62 5 4
## 5 56 5 5
## 6 66 5 6
## 7 65 5 7
## 8 59 5 8
## 9 61 5 9
## 10 69 5 10
## # ... with 143 more rows
Example 2(b): Base R
airquality[,c(which(colnames(airquality)=="Temp"), which(colnames(airquality)=="Month"), which(colnames(airquality)=="Day"))]
## # A tibble: 153 x 3
## Temp Month Day
## <int> <int> <int>
## 1 67 5 1
## 2 72 5 2
## 3 74 5 3
## 4 62 5 4
## 5 56 5 5
## 6 66 5 6
## 7 65 5 7
## 8 59 5 8
## 9 61 5 9
## 10 69 5 10
## # ... with 143 more rows
Example 3 is similar to Example 2 in that its purpose is to display a subset of varibales. However, the ‘-’ sign in Example 3, removes all of the variables listed, and displays only the remaining variables in the dataset.
Example 3(a): DPLYR
select(airquality, -(Temp:Day))
## # A tibble: 153 x 3
## Ozone Solar.R Wind
## <int> <int> <dbl>
## 1 41 190 7.4
## 2 36 118 8
## 3 12 149 12.6
## 4 18 313 11.5
## 5 NA NA 14.3
## 6 28 NA 14.9
## 7 23 299 8.6
## 8 19 99 13.8
## 9 8 19 20.1
## 10 NA 194 8.6
## # ... with 143 more rows
Example 3(b): Base R
airquality[,-c(which(colnames(airquality)=="Temp"), which(colnames(airquality)=="Month"), which(colnames(airquality)=="Day"))]
## # A tibble: 153 x 3
## Ozone Solar.R Wind
## <int> <int> <dbl>
## 1 41 190 7.4
## 2 36 118 8
## 3 12 149 12.6
## 4 18 313 11.5
## 5 NA NA 14.3
## 6 28 NA 14.9
## 7 23 299 8.6
## 8 19 99 13.8
## 9 8 19 20.1
## 10 NA 194 8.6
## # ... with 143 more rows
The purpose of Example 1 is to identify all variables (column names) in the dataset that start with a letter “E”. Using DPLYR (a) this is done simulataneously using both the select and starts_with functions. Using Base R (b) this is done by subsetting the dataset and using the startsWith function. For (a) capitalization is not important; however, for (b) capitalization is important– variables need to be exactly the same as the way they exist within the dataset in order for them to be recognized by the function.
Example 1(a): DPLYR
dplyr::select(Credit, starts_with("E"))
## # A tibble: 400 x 2
## Education Ethnicity
## <int> <fct>
## 1 11 Caucasian
## 2 15 Asian
## 3 11 Asian
## 4 11 Asian
## 5 16 Caucasian
## 6 10 Caucasian
## 7 12 African American
## 8 9 Asian
## 9 13 Caucasian
## 10 19 African American
## # ... with 390 more rows
Example 1(b): Base R
Credit[,startsWith(colnames(ISLR::Credit), "E")]
## # A tibble: 400 x 2
## Education Ethnicity
## <int> <fct>
## 1 11 Caucasian
## 2 15 Asian
## 3 11 Asian
## 4 11 Asian
## 5 16 Caucasian
## 6 10 Caucasian
## 7 12 African American
## 8 9 Asian
## 9 13 Caucasian
## 10 19 African American
## # ... with 390 more rows
The purpose of Example 2 is to identify and select all variables in the dataset that end with a letter “E”. Using DPLYR (a) this is done simulataneously using both the select and ends_with functions. Using Base R (b) this is done by subsetting the dataset and using the endsWith function. For (a) capitalization is not important; however, for (b) capitalization is important– variables need to be exactly the same as the way they exist within the dataset in order for them to be recognized by the function.
Example 2(a): DPLYR
dplyr::select(Credit, ends_with("E"))
## # A tibble: 400 x 3
## Income Age Balance
## <dbl> <int> <int>
## 1 14.9 34 333
## 2 106. 82 903
## 3 105. 71 580
## 4 149. 36 964
## 5 55.9 68 331
## 6 80.2 77 1151
## 7 21.0 37 203
## 8 71.4 87 872
## 9 15.1 66 279
## 10 71.1 41 1350
## # ... with 390 more rows
Example 2(b): Base R
Credit[,endsWith(colnames(ISLR::Credit), "e")]
## # A tibble: 400 x 3
## Income Age Balance
## <dbl> <int> <int>
## 1 14.9 34 333
## 2 106. 82 903
## 3 105. 71 580
## 4 149. 36 964
## 5 55.9 68 331
## 6 80.2 77 1151
## 7 21.0 37 203
## 8 71.4 87 872
## 9 15.1 66 279
## 10 71.1 41 1350
## # ... with 390 more rows
The purpose of Example 3 is to select all variables in the data that contain the sequence of letters ‘arr’. Using DPLYR (a) this is done simulataneously using both the select and matches functions. Using Base R (b) this is done by subsetting the dataset and using the grepl function.
Example 3(a): DPLYR
dplyr::select(flights, matches("arr"))
## # A tibble: 336,776 x 4
## arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <chr>
## 1 830 819 11 UA
## 2 850 830 20 UA
## 3 923 850 33 AA
## 4 1004 1022 -18 B6
## 5 812 837 -25 DL
## 6 740 728 12 UA
## 7 913 854 19 B6
## 8 709 723 -14 EV
## 9 838 846 -8 B6
## 10 753 745 8 AA
## # ... with 336,766 more rows
Example 3(b): Base R
flights[,grepl("arr", colnames(flights))]
## # A tibble: 336,776 x 4
## arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <chr>
## 1 830 819 11 UA
## 2 850 830 20 UA
## 3 923 850 33 AA
## 4 1004 1022 -18 B6
## 5 812 837 -25 DL
## 6 740 728 12 UA
## 7 913 854 19 B6
## 8 709 723 -14 EV
## 9 838 846 -8 B6
## 10 753 745 8 AA
## # ... with 336,766 more rows
The purpose of Example 4 is to select all variables in the data that contain the sequence of letters ‘time’. Using DPLYR (a) this is done simulataneously using both the select and contains functions. Using Base R (b) this is done by subsetting the dataset and using the grepl function.
Example 4(a): DPLYR
dplyr::select(flights, contains("time"))
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows, and 1 more variable: time_hour <dttm>
Example 4(b): Base R
flights[,grepl("time", colnames(flights))]
## # A tibble: 336,776 x 6
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ... with 336,766 more rows, and 1 more variable: time_hour <dttm>
In this example, suppose that the column names of a dataset have a similar formatting but differ based on a numerical range, e.g- x01, x02, x03, x04, etc. The purpose of Example 5 is to select only variables that have column names between the numerical range of 8-11. Using DPLYR (a) this is done simulataneously using both the select and num_range functions. Using Base R (b) this is done by subsetting the dataset and using the which function.
Example 5(a): DPLYR
flight_data <- flights
colnames(flight_data) <- sprintf("x%d", 1:19)
dplyr::select(flight_data, num_range("x", 8:11))
## # A tibble: 336,776 x 4
## x8 x9 x10 x11
## <int> <dbl> <chr> <int>
## 1 819 11 UA 1545
## 2 830 20 UA 1714
## 3 850 33 AA 1141
## 4 1022 -18 B6 725
## 5 837 -25 DL 461
## 6 728 12 UA 1696
## 7 854 19 B6 507
## 8 723 -14 EV 5708
## 9 846 -8 B6 79
## 10 745 8 AA 301
## # ... with 336,766 more rows
Example 5(b): Base R
flight_data <- flights
colnames(flight_data) <- sprintf("x%d", 1:19)
flight_data[,c(which(colnames(flight_data)=="x8"), which(colnames(flight_data)=="x9"),
which(colnames(flight_data)=="x10"), which(colnames(flight_data)=="x11"))]
## # A tibble: 336,776 x 4
## x8 x9 x10 x11
## <int> <dbl> <chr> <int>
## 1 819 11 UA 1545
## 2 830 20 UA 1714
## 3 850 33 AA 1141
## 4 1022 -18 B6 725
## 5 837 -25 DL 461
## 6 728 12 UA 1696
## 7 854 19 B6 507
## 8 723 -14 EV 5708
## 9 846 -8 B6 79
## 10 745 8 AA 301
## # ... with 336,766 more rows
The purpose of Example 1 is to extract data pertaining to all rows (airquality days) in the month of September. Using DPLYR (a) this is done using the filter function. Using Base R (b) this is done by simultaneously subsetting and specifying the condition of interest.
Example 1(a): DPLYR
dplyr::filter(airquality, Month == 9)
## # A tibble: 30 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 96 167 6.9 91 9 1
## 2 78 197 5.1 92 9 2
## 3 73 183 2.8 93 9 3
## 4 91 189 4.6 93 9 4
## 5 47 95 7.4 87 9 5
## 6 32 92 15.5 84 9 6
## 7 20 252 10.9 80 9 7
## 8 23 220 10.3 78 9 8
## 9 21 230 10.9 75 9 9
## 10 24 259 9.7 73 9 10
## # ... with 20 more rows
Example 1(b): Base R
airquality[airquality$Month == 9, ]
## # A tibble: 30 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 96 167 6.9 91 9 1
## 2 78 197 5.1 92 9 2
## 3 73 183 2.8 93 9 3
## 4 91 189 4.6 93 9 4
## 5 47 95 7.4 87 9 5
## 6 32 92 15.5 84 9 6
## 7 20 252 10.9 80 9 7
## 8 23 220 10.3 78 9 8
## 9 21 230 10.9 75 9 9
## 10 24 259 9.7 73 9 10
## # ... with 20 more rows
The purpose of Example 2 is to extract data pertaining to all rows, also known as airquality days, in the month of September. Using DPLYR (a) this is done using the filter function. Using Base R (b) this is done by simultaneously subsetting and specifying the condition of interest.
Example 2(a): DPLYR
dplyr::filter(airquality, Month == 9, Solar.R > 100)
## # A tibble: 23 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 96 167 6.9 91 9 1
## 2 78 197 5.1 92 9 2
## 3 73 183 2.8 93 9 3
## 4 91 189 4.6 93 9 4
## 5 20 252 10.9 80 9 7
## 6 23 220 10.3 78 9 8
## 7 21 230 10.9 75 9 9
## 8 24 259 9.7 73 9 10
## 9 44 236 14.9 81 9 11
## 10 21 259 15.5 76 9 12
## # ... with 13 more rows
Example 2(b): Base R
airquality[airquality$Month == 9 & airquality$Solar.R > 100, ]
## # A tibble: 23 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 96 167 6.9 91 9 1
## 2 78 197 5.1 92 9 2
## 3 73 183 2.8 93 9 3
## 4 91 189 4.6 93 9 4
## 5 20 252 10.9 80 9 7
## 6 23 220 10.3 78 9 8
## 7 21 230 10.9 75 9 9
## 8 24 259 9.7 73 9 10
## 9 44 236 14.9 81 9 11
## 10 21 259 15.5 76 9 12
## # ... with 13 more rows
The purpose of Example 1 is to rearrange the original airquality dataset so that the variables Month and Day are displayed in ascending order. Using DPLYR (a) this is done using the arrange function. Using Base R (b) this is done by subsetting and using the order function. Keep in mind, for both techniques, the ordering of the variables are prioritized left to right. In other words, Month is arranged in ascending first, and then Day is arranged in ascending order within the Month variables.
Example 1(a): DPLYR
dplyr::arrange(airquality, Month, Day)
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## # ... with 143 more rows
Example 1(b): Base R
airquality[order(airquality$Month, airquality$Day),]
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## # ... with 143 more rows
The purpose of Example 2 is to rearrange the original airquality dataset so that the variables Day and Month are displayed in ascending order. Using DPLYR (a) this is done using the arrange function. Using Base R (b) this is done by subsetting and using the order function. The difference between Example 2 compared to Example 1 is that the ordering of the variables are prioritized differently. Specifically, Day is arranged in ascending first, and then Month is arranged in ascending order within the Day variables.
Example 2(a): DPLYR
dplyr::arrange(airquality, Day, Month)
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 NA 286 8.6 78 6 1
## 3 135 269 4.1 84 7 1
## 4 39 83 6.9 81 8 1
## 5 96 167 6.9 91 9 1
## 6 36 118 8 72 5 2
## 7 NA 287 9.7 74 6 2
## 8 49 248 9.2 85 7 2
## 9 9 24 13.8 81 8 2
## 10 78 197 5.1 92 9 2
## # ... with 143 more rows
Example 2(b): Base R
airquality[order(airquality$Day, airquality$Month),]
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 NA 286 8.6 78 6 1
## 3 135 269 4.1 84 7 1
## 4 39 83 6.9 81 8 1
## 5 96 167 6.9 91 9 1
## 6 36 118 8 72 5 2
## 7 NA 287 9.7 74 6 2
## 8 49 248 9.2 85 7 2
## 9 9 24 13.8 81 8 2
## 10 78 197 5.1 92 9 2
## # ... with 143 more rows
The purpose of Example 3 is to rearrange the original airquality dataset so that the variables Month and Day are displayed in descending order. Using DPLYR (a) this is done using the arrange and desc function. Using Base R (b) this is done by subsetting and using the order & desc functions. Keep in mind, for both techniques, the ordering of the variables are prioritized left to right.
Example 3(a): DPLYR
dplyr::arrange(airquality, desc(Month), desc(Day))
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 20 223 11.5 68 9 30
## 2 18 131 8 76 9 29
## 3 14 191 14.3 75 9 28
## 4 NA 145 13.2 77 9 27
## 5 30 193 6.9 70 9 26
## 6 14 20 16.6 63 9 25
## 7 7 49 10.3 69 9 24
## 8 36 139 10.3 81 9 23
## 9 23 14 9.2 71 9 22
## 10 13 238 12.6 64 9 21
## # ... with 143 more rows
Example 3(b): Base R
airquality[order(-airquality$Month, -airquality$Day),]
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 20 223 11.5 68 9 30
## 2 18 131 8 76 9 29
## 3 14 191 14.3 75 9 28
## 4 NA 145 13.2 77 9 27
## 5 30 193 6.9 70 9 26
## 6 14 20 16.6 63 9 25
## 7 7 49 10.3 69 9 24
## 8 36 139 10.3 81 9 23
## 9 23 14 9.2 71 9 22
## 10 13 238 12.6 64 9 21
## # ... with 143 more rows
The purpose of Example 4 is to rearrange the original airquality dataset so that the variable Ozone is displayed in ascending order. Using DPLYR (a) this is done using the arrange function. Using Base R (b) this is done by subsetting and using the order function.
Example 4(a): DPLYR
dplyr::arrange(airquality, Ozone)
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 1 8 9.7 59 5 21
## 2 4 25 9.7 61 5 23
## 3 6 78 18.4 57 5 18
## 4 7 NA 6.9 74 5 11
## 5 7 48 14.3 80 7 15
## 6 7 49 10.3 69 9 24
## 7 8 19 20.1 61 5 9
## 8 9 24 13.8 81 8 2
## 9 9 36 14.3 72 8 22
## 10 9 24 10.9 71 9 14
## # ... with 143 more rows
Example 4(b): Base R
airquality[order(airquality$Ozone),]
## # A tibble: 153 x 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 1 8 9.7 59 5 21
## 2 4 25 9.7 61 5 23
## 3 6 78 18.4 57 5 18
## 4 7 NA 6.9 74 5 11
## 5 7 48 14.3 80 7 15
## 6 7 49 10.3 69 9 24
## 7 8 19 20.1 61 5 9
## 8 9 24 13.8 81 8 2
## 9 9 36 14.3 72 8 22
## 10 9 24 10.9 71 9 14
## # ... with 143 more rows
The purpose of Example 1 is to add new variables, which are functions of current variables, to the existing dataset and then display the modified dataset. For instance, in Example 1, the variables gain, gain_per_hour, and speed are added to the dataset. Using DPLYR (a) this is done using the mutate function. Using Base R (b) this is done manually.
Example 1(a): DPLYR
mutate(flights,
gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60),
speed = distance / air_time * 60)
## # A tibble: 336,776 x 22
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 15 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>,
## # speed <dbl>
Example 1(b): Base R
flights_data <- flights
flights_data$gain = flights_data$arr_delay - flights_data$dep_delay
flights_data$gain_per_hour = flights_data$gain / (flights_data$air_time / 60)
flights_data$speed = flights_data$distance / (flights_data$air_time * 60)
flights_data
## # A tibble: 336,776 x 22
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 15 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, gain <dbl>, gain_per_hour <dbl>,
## # speed <dbl>
The purpose of Example 1 is to add new variables, which are functions of current variables, to the existing dataset and then display only the new variables. For instance, in Example 1, the variables gain and gain_per_hour are added to the dataset. Using DPLYR (a) this is done using the transmute function. Using Base R (b) this is done manually.
Example 1(a): DPLYR
transmute(flights,
gain = arr_delay - dep_delay,
gain_per_hour = gain / (air_time / 60))
## # A tibble: 336,776 x 2
## gain gain_per_hour
## <dbl> <dbl>
## 1 9 2.38
## 2 16 4.23
## 3 31 11.6
## 4 -17 -5.57
## 5 -19 -9.83
## 6 16 6.4
## 7 24 9.11
## 8 -11 -12.5
## 9 -5 -2.14
## 10 10 4.35
## # ... with 336,766 more rows
Example 1(b): Base R
flights$gain = flights$arr_delay - flights$dep_delay
flights$gain_per_hour = flights$gain / (flights$air_time / 60)
flights[,c(which(colnames(flights)=="gain"), which(colnames(flights)=="gain_per_hour"))]
## # A tibble: 336,776 x 2
## gain gain_per_hour
## <dbl> <dbl>
## 1 9 2.38
## 2 16 4.23
## 3 31 11.6
## 4 -17 -5.57
## 5 -19 -9.83
## 6 16 6.4
## 7 24 9.11
## 8 -11 -12.5
## 9 -5 -2.14
## 10 10 4.35
## # ... with 336,766 more rows
The purpose of Example 1 is to categorize people based on their unique Education levels. Within the Credit data there are 16 unique Education levels, represented by years of Education. In this example, the 400 people are arranged into these 16 groups. Then for each group three functions are performed: total count within each group, mean total limit for each group, and mean total balance for each group. Using DPLYR (a) this is done using the group_by and summarise function. Using Base R (b) this is done manually using a for loop.
Example 1(a): DPLYR
sort(unique(Credit$Education))
## [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
by_Education <- group_by(Credit, Education)
mean_info <- summarise(by_Education,
count = n(),
mean_total_limit = mean(Limit, na.rm = TRUE),
mean_total_balance = mean(Balance ,na.rm = TRUE))
mean_info
## # A tibble: 16 x 4
## Education count mean_total_limit mean_total_balance
## <int> <int> <dbl> <dbl>
## 1 5 1 5666 863
## 2 6 5 5330 555
## 3 7 8 4911. 520
## 4 8 14 5272. 493.
## 5 9 25 5119. 692.
## 6 10 24 4145 459.
## 7 11 33 4883. 441.
## 8 12 37 4971. 599.
## 9 13 38 4124. 396.
## 10 14 48 4718. 493.
## 11 15 49 4523. 486.
## 12 16 50 4732. 584.
## 13 17 34 5319. 604.
## 14 18 22 4478. 443.
## 15 19 10 4638. 544.
## 16 20 2 4584. 421
Example 1(b): Base R
unique_id <- sort(unique(Credit$Education))
count_n <- rep(0, length(unique(Credit$Education)))
total_balance <- rep(0, length(unique(Credit$Education)))
total_limit <- rep(0, length(unique(Credit$Education)))
for(i in 1:length(unique(Credit$Education))){
for(j in 1:nrow(Credit)){
if(unique_id[i]==Credit$Education[j]){
count_n[i] = count_n[i] + 1
total_balance[i] = Credit$Balance[j] + total_balance[i]
total_limit[i] = Credit$Limit[j] + total_limit[i]
}
}
}
mean_total_limit <- round(total_limit/count_n,0)
mean_total_balance <- round(total_balance/count_n,0)
df <- data.frame(unique_id, count_n, mean_total_limit, mean_total_balance)
colnames(df) <- c("ID", "Count", "Mean Total Limit", "Mean Total Balance")
df
## ID Count Mean Total Limit Mean Total Balance
## 1 5 1 5666 863
## 2 6 5 5330 555
## 3 7 8 4911 520
## 4 8 14 5272 493
## 5 9 25 5119 692
## 6 10 24 4145 459
## 7 11 33 4883 441
## 8 12 37 4971 599
## 9 13 38 4124 396
## 10 14 48 4718 493
## 11 15 49 4523 486
## 12 16 50 4732 584
## 13 17 34 5319 604
## 14 18 22 4478 443
## 15 19 10 4638 544
## 16 20 2 4584 421
The purpose of Example 1 is to randomly sample a total of n rows from the dataset. Using DPLYR (a) this is done using the sample_n function. Using Base R (b) this is done by subsetting the dataset and using the sample function.
Example 1(a): DPLYR
set.seed(123)
sample_n(Credit, 10)
## # A tibble: 10 x 12
## ID Income Limit Rating Cards Age Education Gender Student Married
## <int> <dbl> <int> <int> <int> <int> <int> <fct> <fct> <fct>
## 1 116 40.4 4828 369 5 81 8 Female No No
## 2 315 116. 9272 656 2 69 14 " Male" No No
## 3 163 63.8 7530 515 1 56 12 " Male" No Yes
## 4 351 30.0 1561 155 4 70 13 Female No Yes
## 5 373 19.8 3782 293 2 46 16 Female Yes No
## 6 18 36.5 4378 339 3 69 15 Female No Yes
## 7 209 51.3 4327 320 3 46 15 " Male" No No
## 8 397 13.4 3838 296 5 65 17 " Male" No No
## 9 217 15.5 2762 215 3 60 18 " Male" No No
## 10 179 28.3 4391 316 2 29 10 Female No No
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>
dim(sample_n(Credit, 10))
## [1] 10 12
Example 1(b): Base R
set.seed(123)
numb <- c(1:nrow(Credit))
Credit[sample(numb, size = 10, replace = FALSE),]
## # A tibble: 10 x 12
## ID Income Limit Rating Cards Age Education Gender Student Married
## <int> <dbl> <int> <int> <int> <int> <int> <fct> <fct> <fct>
## 1 116 40.4 4828 369 5 81 8 Female No No
## 2 315 116. 9272 656 2 69 14 " Male" No No
## 3 163 63.8 7530 515 1 56 12 " Male" No Yes
## 4 351 30.0 1561 155 4 70 13 Female No Yes
## 5 373 19.8 3782 293 2 46 16 Female Yes No
## 6 18 36.5 4378 339 3 69 15 Female No Yes
## 7 209 51.3 4327 320 3 46 15 " Male" No No
## 8 397 13.4 3838 296 5 65 17 " Male" No No
## 9 217 15.5 2762 215 3 60 18 " Male" No No
## 10 179 28.3 4391 316 2 29 10 Female No No
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>
dim(Credit[sample(numb, size = 10, replace = FALSE),])
## [1] 10 12
The purpose of Example 2 is to randomly sample a fraction of rows from the dataset. Using DPLYR (a) this is done using the sample_frac function. Using Base R (b) this is done by subsetting the dataset and using the sample function.
Example 2(a): DPLYR
set.seed(123)
sample_frac(Credit, 0.01)
## # A tibble: 4 x 12
## ID Income Limit Rating Cards Age Education Gender Student Married
## <int> <dbl> <int> <int> <int> <int> <int> <fct> <fct> <fct>
## 1 116 40.4 4828 369 5 81 8 Female No No
## 2 315 116. 9272 656 2 69 14 " Male" No No
## 3 163 63.8 7530 515 1 56 12 " Male" No Yes
## 4 351 30.0 1561 155 4 70 13 Female No Yes
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>
dim(sample_frac(Credit, 0.01))
## [1] 4 12
Example 2(b): Base R
frac <- round(0.01*nrow(Credit),0) #evaluate a random 1% of the data
Credit[sample(numb, size = frac, replace = FALSE),]
## # A tibble: 4 x 12
## ID Income Limit Rating Cards Age Education Gender Student Married
## <int> <dbl> <int> <int> <int> <int> <int> <fct> <fct> <fct>
## 1 221 44.8 5765 437 3 53 13 Female Yes No
## 2 183 58.9 6420 459 2 66 9 Female No Yes
## 3 381 115. 7760 538 3 83 14 Female No No
## 4 180 58.0 7499 560 5 67 11 Female No No
## # ... with 2 more variables: Ethnicity <fct>, Balance <int>
dim(Credit[sample(numb, size = frac, replace = FALSE),])
## [1] 4 12
The purpose of Example 1 is to rename the Card variable to number_credit_cards, and then displayed the modified dataset. Using DPLYR (a) this is done using the rename function. Using Base R (b) this is done using the an ifelse statement and the colnames function.
Example 1(a): DPLYR
Credit_data <- rename(Credit, number_credit_cards = Cards)
Credit_data
## # A tibble: 400 x 12
## ID Income Limit Rating number_credit_cards Age Education Gender
## <int> <dbl> <int> <int> <int> <int> <int> <fct>
## 1 1 14.9 3606 283 2 34 11 " Male"
## 2 2 106. 6645 483 3 82 15 Female
## 3 3 105. 7075 514 4 71 11 " Male"
## 4 4 149. 9504 681 3 36 11 Female
## 5 5 55.9 4897 357 2 68 16 " Male"
## 6 6 80.2 8047 569 4 77 10 " Male"
## 7 7 21.0 3388 259 2 37 12 Female
## 8 8 71.4 7114 512 2 87 9 " Male"
## 9 9 15.1 3300 266 5 66 13 Female
## 10 10 71.1 6819 491 3 41 19 Female
## # ... with 390 more rows, and 4 more variables: Student <fct>,
## # Married <fct>, Ethnicity <fct>, Balance <int>
Example 1(b): Base R
colnames(Credit) <- ifelse(colnames(Credit)=="Cards", "number_credit_cards", colnames(Credit))
Credit
## # A tibble: 400 x 12
## ID Income Limit Rating number_credit_cards Age Education Gender
## <int> <dbl> <int> <int> <int> <int> <int> <fct>
## 1 1 14.9 3606 283 2 34 11 " Male"
## 2 2 106. 6645 483 3 82 15 Female
## 3 3 105. 7075 514 4 71 11 " Male"
## 4 4 149. 9504 681 3 36 11 Female
## 5 5 55.9 4897 357 2 68 16 " Male"
## 6 6 80.2 8047 569 4 77 10 " Male"
## 7 7 21.0 3388 259 2 37 12 Female
## 8 8 71.4 7114 512 2 87 9 " Male"
## 9 9 15.1 3300 266 5 66 13 Female
## 10 10 71.1 6819 491 3 41 19 Female
## # ... with 390 more rows, and 4 more variables: Student <fct>,
## # Married <fct>, Ethnicity <fct>, Balance <int>
Base R Functions used for the following Dplyr functions:
Dplyr provides tools to allow for increased speed, greater efficiency, and more simplisitic syntax compared to some of the other leading methods that can perform the techniques discussed in these examples. I hope you found this useful. If you have any questions, feel free to leave a comment or reach out to me via e-mail/Twitter shown on the homepage.