3 Programming concepts

So far we have been working with the selfiesCasualties dataset while knowing little about its nature and structure. However, a good understanding of R fundamentals is crucial to work effectively with R. Investing a good amount of time in learning these fundamentals arguably pays off in the long term, although new users might be prone to begin by hacking or creatively reusing code that is already available. This is a very short-sighted strategy, since you will find it extremely hard to wrap your head around error messages, or cryptic default behavior such as the wrong definition of the y-axis in Figure 4.4. A good understanding of data structures helps instead. For instance, some functions accept only certain structures (e.g., dataframes 3.2.3, vectors 3.2.1) and data types (e.g., character, numeric), or their output might differ depending on these. For this reason, the goal of this chapter is to give you a set of tools to investigate the structure of the objects you work with. R is an object-oriented language, which implies that any time you run a function, its output will not be permanently stored in memory unless you assign it to an object. The way you assign content to objects is by using the <- operator.
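As a minimal sketch of assignment (the object name avgValue is hypothetical and chosen only for illustration), compare a call whose result is printed and then lost with one whose result is stored:

```r
# Without assignment the result is printed and then discarded
mean(c(2, 4, 6))

# With <- the result is stored in an object that can be reused later
avgValue <- mean(c(2, 4, 6))
avgValue * 2
```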

3.1 Functions

A function is an R object that performs an action on other R objects. In fact, R is object oriented, meaning that every element is an object; functions are no exception, since they are objects in their own right. Objects such as geom_bar from the data visualization chapter are nothing more than containers for instructions on what to do when other objects (e.g., selfiesCasualties) are passed to their arguments. In this training we do not cover custom functions; instead we use functions from either base R or the tidyverse suite (e.g., mean(), sum(), gather()). Use ? followed by the function name to access its documentation, for instance by typing ?mean(). In particular, you want to learn about the parameters required to run a function and their format. Parameters can be passed positionally or by declaring the argument they map to. For instance, any unnamed vector passed to the function mean() will automatically be mapped to x, its first argument. Note that, apart from the minimum number of parameters required, functions might assign default values to some other parameters (e.g., mean() has na.rm = FALSE). Make sure you understand what those default values imply by reading the documentation.
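The difference between positional and named arguments, and the effect of a default value, can be sketched as follows. The vector scores is a made-up example, not part of the dataset used in this chapter:

```r
scores <- c(4, 8, NA)  # hypothetical vector with one missing value

# Positional matching: the vector is mapped to the first argument, x
byPosition <- mean(scores)

# Named matching: the same call with the argument declared explicitly
byName <- mean(x = scores)

# The default na.rm = FALSE propagates the missing value
is.na(byPosition)

# Overriding the default drops NA before averaging
noMissing <- mean(scores, na.rm = TRUE)
noMissing
```

Both calls without na.rm return NA, which is exactly the kind of default behavior worth checking in the documentation before trusting a result.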

3.2 Data structures

3.2.1 Vectors

A vector is a dimensionless collection of homogeneous elements, and the most basic building block of R data structures. To create a vector, use the function c() and separate your elements with commas.

First, the vector is R's most basic building block because R has no scalars: even a single number is a vector of length one. In fact:

is.atomic(1)

## [1] TRUE

Note that is.vector() instead tests whether the object is a vector with no attributes other than names. For instance, a factor is atomic but carries additional attributes (its levels and its class):

is.atomic(factor(c('a')))

## [1] TRUE

is.vector(factor(c('a')))

## [1] FALSE

Second, vectors are homogeneous because all elements belong to the same data type. Table 3.1 presents the four most relevant data types:

Table 3.1: The four most relevant data types
code	typeof
c('1', '2', 'text')	character
c(1L, 2L)	integer
c(1.5, 2.1)	double
c(TRUE, FALSE)	logical
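The entries of the table can be checked with typeof(). The sketch below also shows what homogeneity implies in practice: when types are mixed inside c(), R does not raise an error but silently converts every element to the most flexible type present. The object name mixed is hypothetical.

```r
typeof(c('1', '2', 'text'))  # "character"
typeof(c(1L, 2L))            # "integer"
typeof(c(1.5, 2.1))          # "double"
typeof(c(TRUE, FALSE))       # "logical"

# Mixing a number, a string and a logical yields a character vector
mixed <- c(1, 'a', TRUE)
typeof(mixed)  # "character"
mixed          # "1" "a" "TRUE"
```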

Categorical variables can be represented as a factor, namely an integer vector where each number corresponds to a level. For instance, the vector:

factor(c('Pizza', 'Lasagne', 'Maccheroni'))

## [1] Pizza      Lasagne    Maccheroni
## Levels: Lasagne Maccheroni Pizza

can be coerced to an integer vector using as.integer(), with the integers assigned by sorting the factor levels alphabetically:

as.integer(factor(c('Pizza', 'Lasagne', 'Maccheroni')))

## [1] 3 1 2
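If the alphabetical order is not the one you want, the levels argument of factor() lets you set the order explicitly. A minimal sketch, where the object name dishes is hypothetical:

```r
# Declare the levels in the desired order instead of relying on sorting
dishes <- factor(c('Pizza', 'Lasagne', 'Maccheroni'),
                 levels = c('Pizza', 'Lasagne', 'Maccheroni'))

levels(dishes)      # "Pizza" "Lasagne" "Maccheroni"
as.integer(dishes)  # 1 2 3
```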

Third, vectors are dimensionless:

dim(c(1, 10))

## NULL

To subset a vector, use [ combined with a vector of the indexes you want to subset. Remember that in R the first element has index 1 (and not 0), thus to retrieve the 1st and the 3rd elements:

letters[c(1, 3)]

## [1] "a" "c"

You can also use indexing to overwrite elements or to append new ones, in combination with the assignment operator <-. For instance:

letters[c(1, 3)] <- c('New value', 'Another new value')
letters[c(1, 3)]

## [1] "New value"         "Another new value"

When you assign a value to a position that falls out of bounds, R fills the gap up to that position with NA (missing values) without giving you any warning. For instance, letters has 26 elements, yet if we assign an element to position 100:

letters[100] <- 'position100'
length(letters)

## [1] 100
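The same behavior is easier to see on a shorter vector. In this sketch shortVector is a hypothetical three-element vector:

```r
shortVector <- c('a', 'b', 'c')

# Position 6 is out of bounds: positions 4 and 5 are filled with NA
shortVector[6] <- 'f'

shortVector              # "a" "b" "c" NA NA "f"
sum(is.na(shortVector))  # 2
```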

3.2.2 Data types

As Table 3.1 shows, the way you pass elements to c() determines the type of your vector; e.g., numbers are interpreted as numeric values, while quoted elements become text strings. However, you can convert an object from one type to another using the as.* functions. For instance, to coerce from character to double:

as.double( c('1', '3','20', '100'))

## [1]   1   3  20 100

Using the appropriate data type is crucial to avoid errors and unexpected behavior when running functions. For instance, from ?sort() we know that the argument x in sort() should be:

object with a class or a numeric, complex, character or logical vector.

Although both numeric and character vectors are acceptable, using a character vector when we really want to order numbers is very misleading. In fact, if the vector of digits is stored as a character vector, sort() orders the values alphabetically:

sort(c('1', '3','20', '100', '5', '4'))

## [1] "1"   "100" "20"  "3"   "4"   "5"
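Coercing the vector to a numeric type before sorting restores the expected order. A minimal sketch, with digits and sortedDigits as hypothetical object names:

```r
digits <- c('1', '3', '20', '100', '5', '4')

# Convert to double first, so that sort() compares numbers and not strings
sortedDigits <- sort(as.double(digits))
sortedDigits  # 1 3 4 5 20 100
```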

3.2.3 Lists and dataframes

A data frame is the most readable structure for storing data. It consists of the column-wise combination of vectors of equal length but of any type. The dataset selfiesCasualties is an example of a data frame. Use str() to inspect its structure:

str(selfiesCasualties)

## Classes 'tbl_df', 'tbl' and 'data.frame':    85 obs. of  7 variables:
##  $ class      : chr  "Electricity" "Height" "Vehicles" "Vehicles" ...
##  $ country    : chr  "Spain" "Russia" "USA" "USA" ...
##  $ gender     : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 1 1 2 2 2 ...
##  $ age        : chr  "21" "17" "32" "29" ...
##  $ nationality: Factor w/ 24 levels "Australia","Bulgaria",..: 22 17 24 24 6 9 14 15 6 11 ...
##  $ month      : chr  "March" "April" "April" "May" ...
##  $ year       : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...

The very first line of the output shows that selfiesCasualties is of class data.frame, and has 85 observations (rows) and 7 variables (columns). In fact:

nrow(selfiesCasualties)

## [1] 85

ncol(selfiesCasualties)

## [1] 7

dim(selfiesCasualties)

## [1] 85  7

To select a variable from selfiesCasualties there are different options depending on the desired structure of the output. For example:

To extract age, namely to return a vector with the values of the variable age:
- $ operator: selfiesCasualties$age (or selfiesCasualties$'age')
- [[ operator: selfiesCasualties[['age']]
To subset age, namely to return a subset of selfiesCasualties with only the variable age:
- [ operator: selfiesCasualties['age']
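The difference between extracting and subsetting can be verified on a small made-up data frame (toy and its values are assumptions for illustration, not the real dataset):

```r
toy <- data.frame(age = c('21', '17', '32'),
                  year = c(2014L, 2014L, 2015L),
                  stringsAsFactors = FALSE)

# Extracting returns the column itself, here a character vector
extracted <- toy[['age']]  # same result as toy$age
is.character(extracted)    # TRUE

# Subsetting returns a data frame that contains only that column
subsetted <- toy['age']
is.data.frame(subsetted)   # TRUE
```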

Rows and columns can also be subset using positional indexing of the row-column combinations we want to retrieve. For instance, to retrieve the 1st and the 10th records from columns 1 to 3 (use 1:10 instead of c(1, 10) to retrieve the first 10 records):

selfiesCasualties[c(1, 10), 1:3]

## # A tibble: 2 x 3
##         class country gender
##         <chr>   <chr> <fctr>
## 1 Electricity   Spain   Male
## 2     Weapons  Mexico   Male

Or equivalently, we can combine positional and name-based indexing:

selfiesCasualties[c(1, 10), c('class', 'country', 'gender')]

## # A tibble: 2 x 3
##         class country gender
##         <chr>   <chr> <fctr>
## 1 Electricity   Spain   Male
## 2     Weapons  Mexico   Male

Under the hood, a data frame is a special case of a list, where each element is a vector of equal length. In fact:

typeof(selfiesCasualties)

## [1] "list"

In a list, every element can have a different structure and dimensions.

Figure 3.1: A cutting board with objects of different types, namely the best metaphorical representation of a list I could come up with

onTheTable <- list(dishes = 'milk cup', 
                 cutlery = 'spoon', 
                 appleWeight = sample(1:20, 15, replace = TRUE), 
                 kiwiWeight = sample(1:20, 10, replace = TRUE),
                 mandarin = data.frame(sliceId = 1:10, weight = sample(1:20, 10, replace = TRUE)))
onTheTable

## $dishes
## [1] "milk cup"
## 
## $cutlery
## [1] "spoon"
## 
## $appleWeight
##  [1] 12  1 12  2  4 11  6  9 10  6  4 17 16 19  5
## 
## $kiwiWeight
##  [1]  8 14 19 13  8 15 20 10 20 11
## 
## $mandarin
##    sliceId weight
## 1        1     12
## 2        2      8
## 3        3     13
## 4        4     16
## 5        5     15
## 6        6     11
## 7        7      8
## 8        8      2
## 9        9      1
## 10      10     15

For example, the object onTheTable contains both vectors of different lengths and a data frame; in principle, each element could itself be another list containing more elements, and so forth.
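Elements of nested lists can be reached by chaining the indexing operators. The list breakfast below is a hypothetical, smaller variation on the same theme:

```r
breakfast <- list(cutlery = 'spoon',
                  fruit = list(name = 'kiwi', weights = c(8, 14, 19)))

# [[ can be chained to reach an element inside a nested list
breakfast[['fruit']][['weights']]  # 8 14 19

# $ works the same way for named elements, and [ then indexes the vector
secondWeight <- breakfast$fruit$weights[2]
secondWeight  # 14
```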

Understanding lists and how to index them matters because many functions (e.g., lm() for linear regression) return a list, which you might need to manipulate further. For instance, suppose we want to calculate the residual sum of squares from the following linear model:

fit <- lm(speed ~ dist, data = cars)
fit

## 
## Call:
## lm(formula = speed ~ dist, data = cars)
## 
## Coefficients:
## (Intercept)         dist  
##      8.2839       0.1656

The residuals are stored in the element residuals within the object fit (a list). However, to extract the vector of the residuals, it is important to understand the difference between the [ and [[ operators. In fact:

sum(fit['residuals']^2)

## Error in fit["residuals"]^2: non-numeric argument to binary operator

This fails because fit['residuals'] subsets the list, returning a list of length one, but does not extract the numeric vector inside it. In fact:

str(fit['residuals'])

## List of 1
##  $ residuals: Named num [1:50] -4.62 -5.94 -1.95 -4.93 -2.93 ...
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...

Instead, to extract an element (in this case a vector) from fit you need [[:

sum(fit[['residuals']]^2)

## [1] 478.0212
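As a cross-check, the same quantity can be obtained in other ways. The sketch below uses the $ operator, the accessor residuals(), and deviance(), which for a model fitted with lm() returns the residual sum of squares. The object names starting with rss are hypothetical.

```r
fit <- lm(speed ~ dist, data = cars)

rssDoubleBracket <- sum(fit[['residuals']]^2)  # extraction with [[
rssDollar        <- sum(fit$residuals^2)       # extraction with $
rssAccessor      <- sum(residuals(fit)^2)      # accessor function
rssDeviance      <- deviance(fit)              # residual sum of squares for lm

# All four approaches agree up to numerical tolerance
all.equal(rssDoubleBracket, rssDeviance)
```

Accessor functions such as residuals() are usually preferable in your own code, because they do not depend on how the list is organized internally.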

3.2.4 Tibbles

A tibble is a recent data structure that combines the flexibility of lists (different structures in the same object) with the readability of data frames (data kept in a rectangular format). Tibbles are easier to manipulate using the tidyverse, in particular when working with dplyr and purrr (Chapter 5). To transform a data.frame into a tibble:

# as_tibble() replaces as_data_frame(), which is deprecated in recent versions of tibble
selfiesCasualties <- as_tibble(selfiesCasualties)
selfiesCasualties

## # A tibble: 85 x 7
##          class     country gender     age nationality  month  year
##          <chr>       <chr> <fctr>   <chr>      <fctr>  <chr> <int>
##  1 Electricity       Spain   Male      21       Spain  March  2014
##  2      Height      Russia Female      17      Russia  April  2014
##  3    Vehicles         USA Female      32         USA  April  2014
##  4    Vehicles         USA   Male      29         USA    May  2014
##  5       Train       India   Male      15       India    May  2014
##  6      Height       Italy Female      16       Italy   June  2014
##  7      Height Philippines Female      14 Philippines   July  2014
##  8      Height    Portugal   Male unknown      Poland August  2014
##  9 Electricity       India   Male      14       India August  2014
## 10     Weapons      Mexico   Male      21      Mexico August  2014
## # ... with 75 more rows

Note that:

class(selfiesCasualties)

## [1] "tbl_df"     "tbl"        "data.frame"

which means that tibbles are also data frames, but they can accommodate list-columns too, instead of vector columns only. For instance:

tib <- tibble(Values = list(1:10, 1:5), 
          SequenceName = c('First sequence', 'Second sequence'))
tib

## # A tibble: 2 x 2
##       Values    SequenceName
##       <list>           <chr>
## 1 <int [10]>  First sequence
## 2  <int [5]> Second sequence

str(tib)

## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
##  $ Values      :List of 2
##   ..$ : int  1 2 3 4 5 6 7 8 9 10
##   ..$ : int  1 2 3 4 5
##  $ SequenceName: chr  "First sequence" "Second sequence"

3.3 Logical and Mathematical Operators

Logical and mathematical operators are extremely useful for conditional operations, such as indexing. Logical operators follow mathematical logic and are:

! NOT
& AND
| OR

Mathematical (comparison) operators instead:

== tests equality (note that a single = is instead an assignment operator, like <-)
!= tests inequality
>= at least
<= no bigger than
> strictly bigger
< strictly smaller

Other logical operators:

%in% matching values
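As a quick sketch of %in%, the following tests each element of a made-up vector of countries against a set of target values and uses the resulting logical vector for indexing:

```r
countries <- c('Spain', 'Russia', 'USA', 'India')  # hypothetical values

# TRUE where the element matches any of the values on the right-hand side
isTarget <- countries %in% c('USA', 'India')
isTarget              # FALSE FALSE TRUE TRUE

countries[isTarget]   # "USA" "India"
countries[!isTarget]  # "Spain" "Russia"
```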

For example, to retrieve the subset of males who died in a foreign country (namely, whose nationality differs from the country they died in):

selfiesCasualties[selfiesCasualties$gender == 'Male' &
                  selfiesCasualties$country != selfiesCasualties$nationality, ]

## # A tibble: 7 x 7
##    class   country gender     age nationality     month  year
##    <chr>     <chr> <fctr>   <chr>      <fctr>     <chr> <int>
## 1 Height  Portugal   Male unknown      Poland    August  2014
## 2  Train     India   Male      24      Israel  February  2015
## 3 Height Indonesia   Male      21   Singapore       May  2015
## 4 Height     India   Male unknown       Japan September  2015
## 5   <NA>      <NA>   <NA>    <NA>        <NA>      <NA>    NA
## 6 Height      Peru   Male      28 South Korea      June  2016
## 7 Height      Peru   Male      51     Germany      June  2016
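Row 5 of the output above consists entirely of NA. The most likely reason is that the condition evaluates to NA for a record with missing values, and indexing rows with NA returns a row of NA. The sketch below reproduces this on a made-up data frame (toy and its values are assumptions) and shows that wrapping the condition in which() keeps only the rows where it is TRUE:

```r
toy <- data.frame(gender      = c('Male', NA, 'Male', 'Female'),
                  country     = c('Peru', 'India', 'Spain', 'Italy'),
                  nationality = c('Germany', 'Japan', 'Spain', 'Japan'),
                  stringsAsFactors = FALSE)

keep <- toy$gender == 'Male' & toy$country != toy$nationality
keep                     # TRUE NA FALSE FALSE

nrow(toy[keep, ])        # 2: the matching row plus a row of NA
nrow(toy[which(keep), ]) # 1: which() drops the NA
```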

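Keep in mind that element-wise logical comparisons of vectors require a single logical operator (&, | or !), whereas the double operators (&& and ||) expect single logical values: older versions of R examined only the first element of each vector, while R 4.3.0 and later raise an error when an operand is longer than one.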
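A minimal sketch of the difference, using two made-up logical vectors a and b. The double operator is applied to single elements only, so the example behaves the same way across R versions:

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Single operator: compares the vectors element by element
elementwise <- a & b
elementwise  # TRUE FALSE FALSE

# Double operator: meant for a single logical value on each side
single <- a[1] && b[1]
single       # TRUE
```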

With operations that resolve to a logical value, dealing with NA can become tricky because of how R handles it. In fact, NA is not a zero, but a placeholder for a missing value. With logical operators, R returns NA any time the result is ambiguous, but a logical value when the output is certain regardless of the missing value.

code	output
TRUE & NA	NA
FALSE | NA	NA
NA == NA	NA
FALSE & NA	FALSE
TRUE | NA	TRUE

With mathematical operators instead, the result is always NA whenever one of the operands is NA. Thus, to test whether a value is NA, we use is.na() instead of ==.
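The following sketch contrasts the two approaches on a made-up numeric vector ages:

```r
ages <- c(21, NA, 32)  # hypothetical values with one missing entry

# Comparing with == never identifies the missing value
ages == NA   # NA NA NA

# is.na() returns a proper logical vector
missingAge <- is.na(ages)
missingAge   # FALSE TRUE FALSE

# which can then be used to keep only the observed values
mean(ages[!missingAge])  # 26.5
```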