D Data Manipulation

For questions a to e use the tibble family that you created in the first practice section.

Extract the vector of genders from family. Use the appropriate function to show that the output is in fact a vector.

pull(family, gender)
typeof(pull(family, gender))

Create a subset of family with gender only. Does selecting only one column turn your tibble in a vector?

select(family, gender)

Retrieve a vector of the ages of females in your family.

family %>% filter(gender == 'F') %>% pull(age)

Retrieve a vector of the age of any male older than 30

family %>% filter(gender == 'M' & age > 30) %>% pull(age)

Use inline r code for dynamically printing the ages of your family members.

For the next set of question, we use the cityParishSalary.csv dataset, a publicly available dataset City-Parish employees’ annual salaries. More information about the dataset are avaialable here.

Creating a tibble object:

Read the text file titled cityParishSalary.csv using read_csv() and assign it to a variable called df (make sure df is a tibble).

df <- read_csv('HWdatasets/cityParishSalary.csv')

Use glimpse() to view the variable in df and adjust variable class in needed.

Selecting variables:

The function select() allows to filter columns of a tibble object.

Anonymize the dataset by dropping any variables containing names. Use the output to overwrite df.

df <- df %>% 
        select(-first_name, -last_name, -middle_init)

Report base pay and department name

df %>% 
  select(base_pay, department_name)

## # A tibble: 1,000 x 2
##     base_pay          department_name
##        <dbl>                    <chr>
##  1  84958.11          PARISH ATTORNEY
##  2  36950.50        POLICE DEPARTMENT
##  3 102278.28          PARISH ATTORNEY
##  4  40606.79          FIRE DEPARTMENT
##  5  17201.60 LIBRARY BOARD OF CONTROL
##  6  39379.76          FIRE DEPARTMENT
##  7  75488.50               CITY COURT
##  8  72550.14          FIRE DEPARTMENT
##  9  43806.10    ANIMAL CONTROL CENTER
## 10  42303.68        POLICE DEPARTMENT
## # ... with 990 more rows

Use the appropriate wrapper to select all variables that refer to hourly rates or pay. Hint: because their naming is consistent, take a look to ?select() and how you can use select helpers to select multiple variables efficiently

df %>% 
  select(ends_with('rate'), ends_with('pay'))

## # A tibble: 1,000 x 6
##    base_hourly_rate overtime_hourly_rate total_hourly_rate  base_pay
##               <dbl>                <dbl>             <dbl>     <dbl>
##  1          40.8452              61.2678           40.8452  84958.11
##  2          17.3250              30.3144           17.3250  36950.50
##  3          46.8645              70.2968           46.8645 102278.28
##  4          13.7583              23.7282           13.7583  40606.79
##  5          44.5799              66.8699           44.5799  17201.60
##  6          13.3575              23.1269           13.3575  39379.76
##  7          34.6116              51.9174           34.6116  75488.50
##  8          24.3991              40.4619           24.9142  72550.14
##  9          21.0606              31.5909           21.0606  43806.10
## 10          19.4991              33.5757           19.4991  42303.68
## # ... with 990 more rows, and 2 more variables: gross_pay <dbl>,
## #   total_overtime_pay <dbl>

Filtering observations:

The function filter() allows to subset a tibble by row.

Filter only observations for the POLICE DEPARTMENT

df %>% 
  filter(department_name == 'POLICE DEPARTMENT')

## # A tibble: 196 x 20
##    base_hourly_rate base_pay   department_name department_num
##               <dbl>    <dbl>             <chr>          <chr>
##  1          17.3250 36950.50 POLICE DEPARTMENT             50
##  2          19.4991 42303.68 POLICE DEPARTMENT             50
##  3           7.2500  4814.00 POLICE DEPARTMENT             50
##  4          16.3303 36436.10 POLICE DEPARTMENT             50
##  5          27.0880 57658.92 POLICE DEPARTMENT             50
##  6          23.2832 50765.59 POLICE DEPARTMENT             50
##  7          20.6871 43029.22 POLICE DEPARTMENT             50
##  8          22.6049 49066.32 POLICE DEPARTMENT             50
##  9          16.3303 34937.92 POLICE DEPARTMENT             50
## 10           7.2500  3262.50 POLICE DEPARTMENT             50
## # ... with 186 more rows, and 16 more variables: division_name <chr>,
## #   division_num <chr>, employee_num <chr>, end_date <dttm>,
## #   gross_pay <dbl>, hire_date <dttm>, job_code <int>,
## #   job_code_description <chr>, longevity_percentage <dbl>,
## #   overtime_hourly_rate <dbl>, payroll_status <int>,
## #   scheduled_hours <int>, total_hourly_rate <dbl>,
## #   total_overtime_hours <dbl>, total_overtime_pay <dbl>, year <int>

From the observations from the POLICE DEPARTMENT, keep only those records where the total overtime hours worked is 0 (you can use logical operators to evaluate both condition within the same filter()). Report only the variables for division name and job description, arranging alphabetically by division name and job description.

df %>% 
    filter(department_name == 'POLICE DEPARTMENT' & total_overtime_hours == 0) %>% 
    select(division_name, job_code_description) %>% 
    arrange(division_name, job_code_description)

## # A tibble: 41 x 2
##                               division_name
##                                       <chr>
##  1         POLICE DEPARTMENT-ADMINISTRATION
##  2         POLICE DEPARTMENT-ADMINISTRATION
##  3 POLICE DEPARTMENT-CRIMINAL INVEST BUREAU
##  4 POLICE DEPARTMENT-CRIMINAL INVEST BUREAU
##  5 POLICE DEPARTMENT-CRIMINAL INVEST BUREAU
##  6  POLICE DEPARTMENT-OPERATION SERV BUREAU
##  7  POLICE DEPARTMENT-OPERATION SERV BUREAU
##  8  POLICE DEPARTMENT-OPERATION SERV BUREAU
##  9  POLICE DEPARTMENT-OPERATION SERV BUREAU
## 10  POLICE DEPARTMENT-OPERATION SERV BUREAU
## # ... with 31 more rows, and 1 more variables: job_code_description <chr>

Filter all the employees who retired or quit in 2016 (namely those who have a date for end_date). Remember that any value == NA will always return NA.

df %>% 
  filter(!is.na(end_date))

## # A tibble: 157 x 20
##    base_hourly_rate base_pay               department_name department_num
##               <dbl>    <dbl>                         <chr>          <chr>
##  1          44.5799 17201.60      LIBRARY BOARD OF CONTROL             12
##  2           8.0000   744.00 OFFICE OF THE MAYOR-PRESIDENT             40
##  3           8.0000   744.00 OFFICE OF THE MAYOR-PRESIDENT             40
##  4           7.2500  1348.51          INFORMATION SERVICES             44
##  5           8.0000   744.00 OFFICE OF THE MAYOR-PRESIDENT             40
##  6           8.0000   744.00 OFFICE OF THE MAYOR-PRESIDENT             40
##  7          18.1154 15000.39    EMERGENCY MEDICAL SERVICES             52
##  8           8.0000   744.00 OFFICE OF THE MAYOR-PRESIDENT             40
##  9           8.0000   752.00 OFFICE OF THE MAYOR-PRESIDENT             40
## 10           8.0000   732.00 OFFICE OF THE MAYOR-PRESIDENT             40
## # ... with 147 more rows, and 16 more variables: division_name <chr>,
## #   division_num <chr>, employee_num <chr>, end_date <dttm>,
## #   gross_pay <dbl>, hire_date <dttm>, job_code <int>,
## #   job_code_description <chr>, longevity_percentage <dbl>,
## #   overtime_hourly_rate <dbl>, payroll_status <int>,
## #   scheduled_hours <int>, total_hourly_rate <dbl>,
## #   total_overtime_hours <dbl>, total_overtime_pay <dbl>, year <int>

Sort the records descending by base pay. Then report department name, job description and base pay for the 5 highest base pays.

df %>% 
  arrange(desc(base_pay)) %>% 
  select(base_pay, department_name, job_code_description) %>% 
  slice(1:5)

## # A tibble: 5 x 3
##   base_pay                     department_name
##      <dbl>                               <chr>
## 1 146077.3               COUNCIL ADMINISTRATOR
## 2 144585.0 DEPT OF TRANSPORTATION AND DRAINAGE
## 3 123834.1                     HUMAN RESOURCES
## 4 123242.1                 DEPT OF DEVELOPMENT
## 5 115963.3                     PARISH ATTORNEY
## # ... with 1 more variables: job_code_description <chr>

Create new variables and summarise variables:

Use mutate() to create a new variable called weekly_hours containing the average of hours worked per week considering the amount of biweekly hours (scheduled_hours) and the total amount of overtime hours per year. Select the 5 job codes with the highest amount of hours worked per week and report job code and weekly_hours

df %>% 
  mutate(weekly_hours = (scheduled_hours*26 + total_overtime_hours )/52) %>% 
  arrange(desc(weekly_hours)) %>% 
  top_n(5) %>% 
  select(job_code_description, weekly_hours)

## # A tibble: 5 x 2
##          job_code_description weekly_hours
##                         <chr>        <dbl>
## 1    HEAVY EQUIPMENT OPERATOR     78.61538
## 2 TREATMENT PLANT OPERATOR II     72.33654
## 3                FIRE CAPTAIN     63.86538
## 4             POLICE SERGEANT     62.24038
## 5 TREATMENT PLANT OPERATOR II     62.05288

Count the total number of observations for each department. Sort descending by total number of employees.

df %>% 
  count(department_name) %>% 
  arrange(desc(n))

## # A tibble: 28 x 2
##                   department_name     n
##                             <chr> <int>
##  1              POLICE DEPARTMENT   196
##  2       LIBRARY BOARD OF CONTROL   151
##  3                FIRE DEPARTMENT    97
##  4 HUMAN DEVELOPMENT AND SERVICES    65
##  5     EMERGENCY MEDICAL SERVICES    63
##  6     DEPARTMENT OF PUBLIC WORKS    55
##  7            DEPT OF MAINTENANCE    52
##  8             FINANCE DEPARTMENT    34
##  9  OFFICE OF THE MAYOR-PRESIDENT    33
## 10                     CITY COURT    32
## # ... with 18 more rows

Create a new variable percent and calculate the percent of employees allocated to each department. To calculate percents, make sure you understand the difference between n() and sum().

df %>% 
  count(department_name) %>% 
  arrange(desc(n)) %>% 
  mutate(prop = n/sum(n) * 100)

## # A tibble: 28 x 3
##                   department_name     n  prop
##                             <chr> <int> <dbl>
##  1              POLICE DEPARTMENT   196  19.6
##  2       LIBRARY BOARD OF CONTROL   151  15.1
##  3                FIRE DEPARTMENT    97   9.7
##  4 HUMAN DEVELOPMENT AND SERVICES    65   6.5
##  5     EMERGENCY MEDICAL SERVICES    63   6.3
##  6     DEPARTMENT OF PUBLIC WORKS    55   5.5
##  7            DEPT OF MAINTENANCE    52   5.2
##  8             FINANCE DEPARTMENT    34   3.4
##  9  OFFICE OF THE MAYOR-PRESIDENT    33   3.3
## 10                     CITY COURT    32   3.2
## # ... with 18 more rows

Calculate the total cost by department as the sum of all base pays within each department. Then add two variables for the total number of employees by department and the average cost by employee. Order descending by employee’s average cost.

df %>% 
  group_by(department_name) %>% 
  summarise(tot_cost = sum(base_pay), count = n(), avgCost = tot_cost/count) %>% 
  arrange(desc(avgCost))

## # A tibble: 28 x 4
##                        department_name  tot_cost count  avgCost
##                                  <chr>     <dbl> <int>    <dbl>
##  1               COUNCIL ADMINISTRATOR  307209.2     5 61441.84
##  2                     FIRE DEPARTMENT 4990410.5    97 51447.53
##  3                      CITY CONSTABLE  354128.5     7 50589.78
##  4                     HUMAN RESOURCES  505287.8    10 50528.78
##  5       DEPT OF BUILDINGS AND GROUNDS  561186.6    12 46765.55
##  6 DEPT OF TRANSPORTATION AND DRAINAGE 1006870.4    22 45766.84
##  7                 DEPT OF DEVELOPMENT  813328.1    18 45184.89
##  8                     PARISH ATTORNEY 1243457.6    28 44409.20
##  9               ANIMAL CONTROL CENTER  335165.3     8 41895.67
## 10                   POLICE DEPARTMENT 8026539.7   196 40951.73
## # ... with 18 more rows

The gather() and spread() function:

Report the average hourly rate and base pay by department into a table where each column represents a department name, and with only two rows (avgHourlyRate and avgBasePay) for the department with the highest average hourly rate. Calculate the average base and hourly pay, then use a combination of gather() and spread() to reshape accordingly.

df %>% 
  group_by(department_name) %>% 
  summarise(avgBase = mean(base_pay), avgHour = mean(base_hourly_rate)) %>% 
  arrange(desc(avgHour)) %>% top_n(3) %>% 
  gather(salary, amount, avgBase, avgHour) %>% 
  spread(department_name, amount)

## # A tibble: 2 x 4
##    salary `COUNCIL ADMINISTRATOR` `HUMAN RESOURCES` `PARISH ATTORNEY`
## *   <chr>                   <dbl>             <dbl>             <dbl>
## 1 avgBase              61441.8420       50528.78200       44409.19929
## 2 avgHour                 30.7714          26.43851          26.90328

The separate() function.

Some records in division_name seem to be a concatenatenation of department_name and the actual division name, separated by a -. Use separate() to separate the strings in division_name on the -. Then store the values into three new column variables, naming the third division_name (thus replacing the current division_name variable). Drop the other two columns

df %>% separate(division_name, into = c('a', 'division_name', 'b'), sep = '-') %>% select(-a, -b) %>% 
  distinct(division_name)

## # A tibble: 67 x 1
##                    division_name
##                            <chr>
##  1                ADMINISTRATION
##  2         UNIFORM PATROL BUREAU
##  3 FIRE SUPPRESSION & PREVENTION
##  4                          <NA>
##  5        CRIMINAL INVEST BUREAU
##  6         SUMMER YOUTH EMP PROG
##  7            SPECIAL OPERATIONS
##  8  EBRP COMMUNICATIONS DISTRICT
##  9            OPERATIONS & MAINT
## 10         WASTEWATER COLLECTION
## # ... with 57 more rows