5 Data analysis
dplyr
is a package dedicated to data manipulation, namely for transforming, selecting, grouping and summarizing variables. dplyr
offers a very intuitive syntax at the expense of some efficiency when it comes to large dataset (for which you might want to take a look at the package data.table
instead). Additionally, it enable to translate R code into SQL queries when querying a remote sources (such as a MySQL database) from R.
5.1 The pipe operator
To use dplyr efficiently, we need to familiarize with a specific way of writing code that builds on the pipe operator %>%
. This operator enables to chain functions rather than nesting them, which makes the code easier to process by humans. What the %>%
does, is to pass the value that preceeds %>%
to the first position of the subsequent function (or alternatively anywhere you place a .
). For instance:
mean(c(1:10, NA), na.rm = T)
## [1] 5.5
Can be written as
c(1:10, NA) %>%
mean(na.rm = T)
## [1] 5.5
Which is equivalent to
c(1:10, NA) %>%
mean(. , na.rm = T) #note the dot
5.2 dplyr verbs
dplyr has a set of seven fundamental verbs for data manipulation.
Manipulate rows (manipulateRows):
filter()
: select rowsdistinct
: select unique set of rowsarrange()
: sort rows
Manipulate columns (manipulateCols):
select()
: select/drop columnsmutate()
: add columns
Grouping (manipulateGroups):
group_by()
: group by a variable(s)count()
: sum grouping by a variable(s)summarise()
: run a function producing 1 output for each grouping level
Manipulate dataframes:
join()
: merge multiple dataframe together
5.3 Manipulating rows
5.3.1 Filter
Filter retrieves all rows for which the conditions are evaluated to TRUE
. For example, to retrieve all the Spanish males:
selfiesCasualties %>%
filter(country == 'Spain' & gender == 'Male')
## # A tibble: 2 x 7
## class country gender age nationality month year
## <chr> <chr> <fctr> <chr> <fctr> <chr> <int>
## 1 Electricity Spain Male 21 Spain March 2014
## 2 Animal Spain Male 32 Spain August 2015
5.3.2 Distinct
distinct
retains unique rows from the input table grouping by the variables specified. For instance, the combination of unique country-gender combinations:
selfiesCasualties %>%
filter(country == 'Spain' & gender == 'Male') %>%
distinct(country, gender)
## # A tibble: 1 x 2
## country gender
## <chr> <fctr>
## 1 Spain Male
Note that distinct
will also automatically drop all columns that are not passed into distinct
. To keep all columns in the dataset, you need to specify .keep_all = FALSE
.
5.3.3 Arrange
arrange
sort the rows by the grouping variables you pass to the function:
selfiesCasualties %>%
filter(country == 'Spain' & gender == 'Male') %>%
arrange(class)
## # A tibble: 2 x 7
## class country gender age nationality month year
## <chr> <chr> <fctr> <chr> <fctr> <chr> <int>
## 1 Animal Spain Male 32 Spain August 2015
## 2 Electricity Spain Male 21 Spain March 2014
5.4 Manipulate Columns
5.4.1 Select and mutate
select
keeps only the variables you mention (or drops them using -
). For example to select only class
and country
:
selfiesCasualties %>%
filter(country == 'Spain') %>%
select(class, country)
## # A tibble: 3 x 2
## class country
## <chr> <chr>
## 1 Electricity Spain
## 2 Height Spain
## 3 Animal Spain
To select multiple columns, you can use class:age
(from variable class to variable gender), or one of the ?select_helpers
.
mutate()
instead adds a new variable. For example, we can add the birthdate of the victims by subtracting their age from the year of the accident:
selfiesCasualties %>%
mutate(birthDate = year - as.numeric(age))
## # A tibble: 85 x 8
## class country gender age nationality month year
## <chr> <chr> <fctr> <chr> <fctr> <chr> <int>
## 1 Electricity Spain Male 21 Spain March 2014
## 2 Height Russia Female 17 Russia April 2014
## 3 Vehicles USA Female 32 USA April 2014
## 4 Vehicles USA Male 29 USA May 2014
## 5 Train India Male 15 India May 2014
## 6 Height Italy Female 16 Italy June 2014
## 7 Height Philippines Female 14 Philippines July 2014
## 8 Height Portugal Male unknown Poland August 2014
## 9 Electricity India Male 14 India August 2014
## 10 Weapons Mexico Male 21 Mexico August 2014
## # ... with 75 more rows, and 1 more variables: birthDate <dbl>
5.4.2 Grouping
group_by
converts your tibble into subgroups depending on the grouping variable(s). It is usually used to group data before summarising with summarise
. For instance, to count the number of female and male victims by country:
selfiesCasualties %>%
group_by(country, gender) %>%
summarise(tot = n())
n()
is the count operator in dplyr (it works only within a sumarise()
, mutate()
or filter()
), and counts the rows by group level without taking any argument. Note that every time you use summarise()
, your tibble drops one grouping level. For instance in the latter example, data are still grouped by country, and any further dplyr verb would apply groupwise unless we ungroup()
first.
Because counting observations is a very common task, dplyr offers the wrapper count
, which is equivalent to group_by %>% summarise( n() )
except that it ungroup()
after.
selfiesCasualties %>%
count(country, gender)
## # A tibble: 26 x 3
## country gender n
## <chr> <fctr> <int>
## 1 Australia Male 1
## 2 Chile Female 1
## 3 China Female 1
## 4 China Male 2
## 5 Hong Kong Male 1
## 6 India Female 6
## 7 India Male 38
## 8 Indonesia Male 2
## 9 Italy Female 1
## 10 Mexico Male 1
## # ... with 16 more rows
5.4.3 Join functions
Join types work similarly the logic you familiarized with when writing SQL code. Joins do not necessarily need to run on remote sources, in fact you can use them on local dataset too. For instance, you can use joins to merge datasets, such the following two:
twitter_users
## # A tibble: 6 x 2
## Year millions
## <dbl> <dbl>
## 1 2015 52.5
## 2 2016 56.8
## 3 2017 60.9
## 4 2018 64.9
## 5 2019 67.9
## 6 2020 70.7
fbMobile_users
## # A tibble: 6 x 2
## Year millions
## <dbl> <dbl>
## 1 2015 132.9
## 2 2016 143.5
## 3 2017 150.8
## 4 2018 158.0
## 5 2019 163.0
## 6 2020 167.4
twitter_users %>% inner_join(fbMobile_users, by = 'Year')
## # A tibble: 6 x 3
## Year millions.x millions.y
## <dbl> <dbl> <dbl>
## 1 2015 52.5 132.9
## 2 2016 56.8 143.5
## 3 2017 60.9 150.8
## 4 2018 64.9 158.0
## 5 2019 67.9 163.0
## 6 2020 70.7 167.4
Note that if you do not specify a key or a key-pair for the attribute by
, the default will do a natural join on all variables with common names across the two tables. This is typically useful when joining on relational databases, because it makes faster to join using primary-foreign key pairs, but not in this case.
When querying relational databases you might run into some inconsitencies between dplyr and SQL. The main difference you should be aware of, is the behavior of count()
on NA
values. In fact, count()
will retrieve the number of lines grouping for certain variables regardless of their values, thus even NA
values count as 1; instead COUNT
in SQL will count NULL
as 0.
For instance, consider the following query:
Report the total number of reviews received by each hotel. Order (descending) by total number of review
One way to write this query in SQL would be:
SELECT hotel.hotelId, hotelName, COUNT(reviewId) AS totReview
FROM hotel
LEFT JOIN review ON hotel.hotelId = review.hotelId
GROUP BY hotel.hotelId ORDER BY totReview, hotelName
hotelId | hotelName | totReview |
---|---|---|
1641833 | 108 Motel | 0 |
252870 | 346 Inn | 0 |
313947 | 4 East Madison Inn Boutique Hotel | 0 |
119950 | A Bar D Motel | 0 |
254281 | A Hospitality Company, Inc. | 0 |
95305 | A Victory Inn & Suites | 0 |
1901088 | A Victory Inn-East Dearborn | 0 |
1631657 | Aaron’s Motel | 0 |
1630782 | Abel’s Motel | 0 |
774805 | Achieve Guest House | 0 |
In dplyr
, you might be tempted to run something like below, which however would retrieve the wrong result:
tbl(con, 'hotel') %>% left_join(tbl(con, 'review') ) %>%
select(hotelId, hotelName, reviewId) %>%
count(hotelId) %>% arrange(n, hotelName)
In fact, while COUNT
in sql will consider NULL
reviewdId
as a 0, dplyr::count()
will count the number of lines for each level of hotelId
, regardless of their value. One way you can achieve a result consistent with COUNT
is to set the attribute wt
within the count()
equal to !is.na(reviewId)
, which returns TRUE
for non-missing values, and FALSE
for missing ones. This will be equivalent to sum TRUE/FALSE values, where TRUE=1 and FALSE=0.
Thus your dplyr query looks like:
tbl(con, 'hotel') %>% left_join(tbl(con, 'review') ) %>%
select(hotelId, hotelName, reviewId) %>%
count(hotelId, hotelName, wt = !is.na(reviewId)) %>% arrange(n, hotelName)
## # Source: lazy query [?? x 3]
## # Database: mysql 5.6.19-log
## # [dario@ba-isdsclass-programdev.lsu.edu:/hotelreviews]
## # Groups: hotelId
## # Ordered by: n, hotelName
## hotelId hotelName n
## <dbl> <chr> <dbl>
## 1 1641833 108 Motel 0
## 2 252870 346 Inn 0
## 3 313947 4 East Madison Inn Boutique Hotel 0
## 4 119950 A Bar D Motel 0
## 5 254281 A Hospitality Company, Inc. 0
## 6 95305 A Victory Inn & Suites 0
## 7 1901088 A Victory Inn-East Dearborn 0
## 8 1631657 Aaron's Motel 0
## 9 1630782 Abel's Motel 0
## 10 774805 Achieve Guest House 0
## # ... with more rows