5 Data analysis

dplyr is a package dedicated to data manipulation, namely for transforming, selecting, grouping and summarizing variables. dplyr offers a very intuitive syntax at the expense of some efficiency when it comes to large dataset (for which you might want to take a look at the package data.table instead). Additionally, it enable to translate R code into SQL queries when querying a remote sources (such as a MySQL database) from R.

5.1 The pipe operator

To use dplyr efficiently, we need to familiarize with a specific way of writing code that builds on the pipe operator %>%. This operator enables to chain functions rather than nesting them, which makes the code easier to process by humans. What the %>% does, is to pass the value that preceeds %>% to the first position of the subsequent function (or alternatively anywhere you place a .). For instance:

mean(c(1:10, NA), na.rm = T)

## [1] 5.5

Can be written as

c(1:10, NA) %>% 
  mean(na.rm = T)

## [1] 5.5

Which is equivalent to

c(1:10, NA) %>% 
  mean(. , na.rm = T) #note the dot

5.2 dplyr verbs

dplyr has a set of seven fundamental verbs for data manipulation.

Manipulate rows (manipulateRows):

filter(): select rows
distinct: select unique set of rows
arrange(): sort rows

Manipulate columns (manipulateCols):

select(): select/drop columns
mutate(): add columns

Grouping (manipulateGroups):

group_by(): group by a variable(s)
count(): sum grouping by a variable(s)
summarise(): run a function producing 1 output for each grouping level

Manipulate dataframes:

join(): merge multiple dataframe together

5.3 Manipulating rows

5.3.1 Filter

Filter retrieves all rows for which the conditions are evaluated to TRUE. For example, to retrieve all the Spanish males:

selfiesCasualties %>% 
  filter(country == 'Spain' & gender == 'Male')

## # A tibble: 2 x 7
##         class country gender   age nationality  month  year
##         <chr>   <chr> <fctr> <chr>      <fctr>  <chr> <int>
## 1 Electricity   Spain   Male    21       Spain  March  2014
## 2      Animal   Spain   Male    32       Spain August  2015

5.3.2 Distinct

distinct retains unique rows from the input table grouping by the variables specified. For instance, the combination of unique country-gender combinations:

selfiesCasualties %>% 
  filter(country == 'Spain' & gender == 'Male') %>% 
  distinct(country, gender)

## # A tibble: 1 x 2
##   country gender
##     <chr> <fctr>
## 1   Spain   Male

Note that distinct will also automatically drop all columns that are not passed into distinct. To keep all columns in the dataset, you need to specify .keep_all = FALSE.

5.3.3 Arrange

arrange sort the rows by the grouping variables you pass to the function:

selfiesCasualties %>% 
  filter(country == 'Spain' & gender == 'Male') %>% 
  arrange(class)

## # A tibble: 2 x 7
##         class country gender   age nationality  month  year
##         <chr>   <chr> <fctr> <chr>      <fctr>  <chr> <int>
## 1      Animal   Spain   Male    32       Spain August  2015
## 2 Electricity   Spain   Male    21       Spain  March  2014

5.4 Manipulate Columns

5.4.1 Select and mutate

select keeps only the variables you mention (or drops them using -). For example to select only class and country:

selfiesCasualties %>% 
  filter(country == 'Spain') %>% 
  select(class, country)

## # A tibble: 3 x 2
##         class country
##         <chr>   <chr>
## 1 Electricity   Spain
## 2      Height   Spain
## 3      Animal   Spain

To select multiple columns, you can use class:age (from variable class to variable gender), or one of the ?select_helpers.

mutate() instead adds a new variable. For example, we can add the birthdate of the victims by subtracting their age from the year of the accident:

selfiesCasualties %>% 
  mutate(birthDate = year - as.numeric(age))

## # A tibble: 85 x 8
##          class     country gender     age nationality  month  year
##          <chr>       <chr> <fctr>   <chr>      <fctr>  <chr> <int>
##  1 Electricity       Spain   Male      21       Spain  March  2014
##  2      Height      Russia Female      17      Russia  April  2014
##  3    Vehicles         USA Female      32         USA  April  2014
##  4    Vehicles         USA   Male      29         USA    May  2014
##  5       Train       India   Male      15       India    May  2014
##  6      Height       Italy Female      16       Italy   June  2014
##  7      Height Philippines Female      14 Philippines   July  2014
##  8      Height    Portugal   Male unknown      Poland August  2014
##  9 Electricity       India   Male      14       India August  2014
## 10     Weapons      Mexico   Male      21      Mexico August  2014
## # ... with 75 more rows, and 1 more variables: birthDate <dbl>

5.4.2 Grouping

group_by converts your tibble into subgroups depending on the grouping variable(s). It is usually used to group data before summarising with summarise. For instance, to count the number of female and male victims by country:

selfiesCasualties %>% 
  group_by(country, gender) %>% 
  summarise(tot = n())

n() is the count operator in dplyr (it works only within a sumarise(), mutate() or filter()), and counts the rows by group level without taking any argument. Note that every time you use summarise(), your tibble drops one grouping level. For instance in the latter example, data are still grouped by country, and any further dplyr verb would apply groupwise unless we ungroup() first.

Because counting observations is a very common task, dplyr offers the wrapper count, which is equivalent to group_by %>% summarise( n() ) except that it ungroup() after.

selfiesCasualties %>% 
  count(country, gender)

## # A tibble: 26 x 3
##      country gender     n
##        <chr> <fctr> <int>
##  1 Australia   Male     1
##  2     Chile Female     1
##  3     China Female     1
##  4     China   Male     2
##  5 Hong Kong   Male     1
##  6     India Female     6
##  7     India   Male    38
##  8 Indonesia   Male     2
##  9     Italy Female     1
## 10    Mexico   Male     1
## # ... with 16 more rows

5.4.3 Join functions

Join types work similarly the logic you familiarized with when writing SQL code. Joins do not necessarily need to run on remote sources, in fact you can use them on local dataset too. For instance, you can use joins to merge datasets, such the following two:

twitter_users

## # A tibble: 6 x 2
##    Year millions
##   <dbl>    <dbl>
## 1  2015     52.5
## 2  2016     56.8
## 3  2017     60.9
## 4  2018     64.9
## 5  2019     67.9
## 6  2020     70.7

fbMobile_users

## # A tibble: 6 x 2
##    Year millions
##   <dbl>    <dbl>
## 1  2015    132.9
## 2  2016    143.5
## 3  2017    150.8
## 4  2018    158.0
## 5  2019    163.0
## 6  2020    167.4

twitter_users %>% inner_join(fbMobile_users, by = 'Year')

## # A tibble: 6 x 3
##    Year millions.x millions.y
##   <dbl>      <dbl>      <dbl>
## 1  2015       52.5      132.9
## 2  2016       56.8      143.5
## 3  2017       60.9      150.8
## 4  2018       64.9      158.0
## 5  2019       67.9      163.0
## 6  2020       70.7      167.4

Note that if you do not specify a key or a key-pair for the attribute by, the default will do a natural join on all variables with common names across the two tables. This is typically useful when joining on relational databases, because it makes faster to join using primary-foreign key pairs, but not in this case.

When querying relational databases you might run into some inconsitencies between dplyr and SQL. The main difference you should be aware of, is the behavior of count() on NA values. In fact, count() will retrieve the number of lines grouping for certain variables regardless of their values, thus even NA values count as 1; instead COUNT in SQL will count NULL as 0.

For instance, consider the following query:

Report the total number of reviews received by each hotel. Order (descending) by total number of review

One way to write this query in SQL would be:

SELECT hotel.hotelId, hotelName, COUNT(reviewId) AS totReview
FROM hotel 
LEFT JOIN review ON hotel.hotelId = review.hotelId 
GROUP BY hotel.hotelId ORDER BY totReview, hotelName

Table 5.1: Displaying records 1 - 10
hotelId	hotelName	totReview
1641833	108 Motel	0
252870	346 Inn	0
313947	4 East Madison Inn Boutique Hotel	0
119950	A Bar D Motel	0
254281	A Hospitality Company, Inc.	0
95305	A Victory Inn & Suites	0
1901088	A Victory Inn-East Dearborn	0
1631657	Aaron’s Motel	0
1630782	Abel’s Motel	0
774805	Achieve Guest House	0

In dplyr, you might be tempted to run something like below, which however would retrieve the wrong result:

tbl(con, 'hotel') %>% left_join(tbl(con, 'review') ) %>% 
  select(hotelId, hotelName, reviewId) %>% 
  count(hotelId) %>% arrange(n, hotelName)

In fact, while COUNT in sql will consider NULL reviewdId as a 0, dplyr::count() will count the number of lines for each level of hotelId, regardless of their value. One way you can achieve a result consistent with COUNT is to set the attribute wt within the count() equal to !is.na(reviewId), which returns TRUE for non-missing values, and FALSE for missing ones. This will be equivalent to sum TRUE/FALSE values, where TRUE=1 and FALSE=0.

Thus your dplyr query looks like:

tbl(con, 'hotel') %>% left_join(tbl(con, 'review') ) %>% 
  select(hotelId, hotelName, reviewId) %>% 
  count(hotelId, hotelName, wt = !is.na(reviewId)) %>% arrange(n, hotelName)

## # Source:     lazy query [?? x 3]
## # Database:   mysql 5.6.19-log
## #   [dario@ba-isdsclass-programdev.lsu.edu:/hotelreviews]
## # Groups:     hotelId
## # Ordered by: n, hotelName
##    hotelId                         hotelName     n
##      <dbl>                             <chr> <dbl>
##  1 1641833                         108 Motel     0
##  2  252870                           346 Inn     0
##  3  313947 4 East Madison Inn Boutique Hotel     0
##  4  119950                     A Bar D Motel     0
##  5  254281       A Hospitality Company, Inc.     0
##  6   95305            A Victory Inn & Suites     0
##  7 1901088       A Victory Inn-East Dearborn     0
##  8 1631657                     Aaron's Motel     0
##  9 1630782                      Abel's Motel     0
## 10  774805               Achieve Guest House     0
## # ... with more rows