D Data Manipulation
For questions a to e use the tibble family
that you created in the first practice section.
- Extract the vector of genders from
family
. Use the appropriate function to show that the output is in fact a vector.
pull(family, gender)
typeof(pull(family, gender))
- Create a subset of
family
withgender
only. Does selecting only one column turn your tibble in a vector?
select(family, gender)
- Retrieve a vector of the ages of females in your family.
family %>% filter(gender == 'F') %>% pull(age)
- Retrieve a vector of the age of any male older than 30
family %>% filter(gender == 'M' & age > 30) %>% pull(age)
- Use inline r code for dynamically printing the ages of your family members.
For the next set of question, we use the cityParishSalary.csv
dataset, a publicly available dataset City-Parish employees’ annual salaries. More information about the dataset are avaialable here.
- Creating a tibble object:
Read the text file titled cityParishSalary.csv
using read_csv()
and assign it to a variable called df
(make sure df
is a tibble).
df <- read_csv('HWdatasets/cityParishSalary.csv')
Use glimpse()
to view the variable in df
and adjust variable class in needed.
- Selecting variables:
The function select()
allows to filter columns of a tibble object.
- Anonymize the dataset by dropping any variables containing names. Use the output to overwrite
df
.
df <- df %>%
select(-first_name, -last_name, -middle_init)
- Report base pay and department name
df %>%
select(base_pay, department_name)
## # A tibble: 1,000 x 2
## base_pay department_name
## <dbl> <chr>
## 1 84958.11 PARISH ATTORNEY
## 2 36950.50 POLICE DEPARTMENT
## 3 102278.28 PARISH ATTORNEY
## 4 40606.79 FIRE DEPARTMENT
## 5 17201.60 LIBRARY BOARD OF CONTROL
## 6 39379.76 FIRE DEPARTMENT
## 7 75488.50 CITY COURT
## 8 72550.14 FIRE DEPARTMENT
## 9 43806.10 ANIMAL CONTROL CENTER
## 10 42303.68 POLICE DEPARTMENT
## # ... with 990 more rows
- Use the appropriate wrapper to select all variables that refer to hourly rates or pay. Hint: because their naming is consistent, take a look to
?select()
and how you can use select helpers to select multiple variables efficiently
df %>%
select(ends_with('rate'), ends_with('pay'))
## # A tibble: 1,000 x 6
## base_hourly_rate overtime_hourly_rate total_hourly_rate base_pay
## <dbl> <dbl> <dbl> <dbl>
## 1 40.8452 61.2678 40.8452 84958.11
## 2 17.3250 30.3144 17.3250 36950.50
## 3 46.8645 70.2968 46.8645 102278.28
## 4 13.7583 23.7282 13.7583 40606.79
## 5 44.5799 66.8699 44.5799 17201.60
## 6 13.3575 23.1269 13.3575 39379.76
## 7 34.6116 51.9174 34.6116 75488.50
## 8 24.3991 40.4619 24.9142 72550.14
## 9 21.0606 31.5909 21.0606 43806.10
## 10 19.4991 33.5757 19.4991 42303.68
## # ... with 990 more rows, and 2 more variables: gross_pay <dbl>,
## # total_overtime_pay <dbl>
- Filtering observations:
The function filter()
allows to subset a tibble by row.
- Filter only observations for the POLICE DEPARTMENT
df %>%
filter(department_name == 'POLICE DEPARTMENT')
## # A tibble: 196 x 20
## base_hourly_rate base_pay department_name department_num
## <dbl> <dbl> <chr> <chr>
## 1 17.3250 36950.50 POLICE DEPARTMENT 50
## 2 19.4991 42303.68 POLICE DEPARTMENT 50
## 3 7.2500 4814.00 POLICE DEPARTMENT 50
## 4 16.3303 36436.10 POLICE DEPARTMENT 50
## 5 27.0880 57658.92 POLICE DEPARTMENT 50
## 6 23.2832 50765.59 POLICE DEPARTMENT 50
## 7 20.6871 43029.22 POLICE DEPARTMENT 50
## 8 22.6049 49066.32 POLICE DEPARTMENT 50
## 9 16.3303 34937.92 POLICE DEPARTMENT 50
## 10 7.2500 3262.50 POLICE DEPARTMENT 50
## # ... with 186 more rows, and 16 more variables: division_name <chr>,
## # division_num <chr>, employee_num <chr>, end_date <dttm>,
## # gross_pay <dbl>, hire_date <dttm>, job_code <int>,
## # job_code_description <chr>, longevity_percentage <dbl>,
## # overtime_hourly_rate <dbl>, payroll_status <int>,
## # scheduled_hours <int>, total_hourly_rate <dbl>,
## # total_overtime_hours <dbl>, total_overtime_pay <dbl>, year <int>
- From the observations from the
POLICE DEPARTMENT
, keep only those records where the total overtime hours worked is 0 (you can use logical operators to evaluate both condition within the samefilter()
). Report only the variables for division name and job description, arranging alphabetically by division name and job description.
df %>%
filter(department_name == 'POLICE DEPARTMENT' & total_overtime_hours == 0) %>%
select(division_name, job_code_description) %>%
arrange(division_name, job_code_description)
## # A tibble: 41 x 2
## division_name
## <chr>
## 1 POLICE DEPARTMENT-ADMINISTRATION
## 2 POLICE DEPARTMENT-ADMINISTRATION
## 3 POLICE DEPARTMENT-CRIMINAL INVEST BUREAU
## 4 POLICE DEPARTMENT-CRIMINAL INVEST BUREAU
## 5 POLICE DEPARTMENT-CRIMINAL INVEST BUREAU
## 6 POLICE DEPARTMENT-OPERATION SERV BUREAU
## 7 POLICE DEPARTMENT-OPERATION SERV BUREAU
## 8 POLICE DEPARTMENT-OPERATION SERV BUREAU
## 9 POLICE DEPARTMENT-OPERATION SERV BUREAU
## 10 POLICE DEPARTMENT-OPERATION SERV BUREAU
## # ... with 31 more rows, and 1 more variables: job_code_description <chr>
- Filter all the employees who retired or quit in 2016 (namely those who have a date for
end_date
). Remember that any value== NA
will always returnNA
.
df %>%
filter(!is.na(end_date))
## # A tibble: 157 x 20
## base_hourly_rate base_pay department_name department_num
## <dbl> <dbl> <chr> <chr>
## 1 44.5799 17201.60 LIBRARY BOARD OF CONTROL 12
## 2 8.0000 744.00 OFFICE OF THE MAYOR-PRESIDENT 40
## 3 8.0000 744.00 OFFICE OF THE MAYOR-PRESIDENT 40
## 4 7.2500 1348.51 INFORMATION SERVICES 44
## 5 8.0000 744.00 OFFICE OF THE MAYOR-PRESIDENT 40
## 6 8.0000 744.00 OFFICE OF THE MAYOR-PRESIDENT 40
## 7 18.1154 15000.39 EMERGENCY MEDICAL SERVICES 52
## 8 8.0000 744.00 OFFICE OF THE MAYOR-PRESIDENT 40
## 9 8.0000 752.00 OFFICE OF THE MAYOR-PRESIDENT 40
## 10 8.0000 732.00 OFFICE OF THE MAYOR-PRESIDENT 40
## # ... with 147 more rows, and 16 more variables: division_name <chr>,
## # division_num <chr>, employee_num <chr>, end_date <dttm>,
## # gross_pay <dbl>, hire_date <dttm>, job_code <int>,
## # job_code_description <chr>, longevity_percentage <dbl>,
## # overtime_hourly_rate <dbl>, payroll_status <int>,
## # scheduled_hours <int>, total_hourly_rate <dbl>,
## # total_overtime_hours <dbl>, total_overtime_pay <dbl>, year <int>
- Sort the records descending by base pay. Then report department name, job description and base pay for the 5 highest base pays.
df %>%
arrange(desc(base_pay)) %>%
select(base_pay, department_name, job_code_description) %>%
slice(1:5)
## # A tibble: 5 x 3
## base_pay department_name
## <dbl> <chr>
## 1 146077.3 COUNCIL ADMINISTRATOR
## 2 144585.0 DEPT OF TRANSPORTATION AND DRAINAGE
## 3 123834.1 HUMAN RESOURCES
## 4 123242.1 DEPT OF DEVELOPMENT
## 5 115963.3 PARISH ATTORNEY
## # ... with 1 more variables: job_code_description <chr>
- Create new variables and summarise variables:
- Use
mutate()
to create a new variable calledweekly_hours
containing the average of hours worked per week considering the amount of biweekly hours (scheduled_hours
) and the total amount of overtime hours per year. Select the 5 job codes with the highest amount of hours worked per week and report job code andweekly_hours
df %>%
mutate(weekly_hours = (scheduled_hours*26 + total_overtime_hours )/52) %>%
arrange(desc(weekly_hours)) %>%
top_n(5) %>%
select(job_code_description, weekly_hours)
## # A tibble: 5 x 2
## job_code_description weekly_hours
## <chr> <dbl>
## 1 HEAVY EQUIPMENT OPERATOR 78.61538
## 2 TREATMENT PLANT OPERATOR II 72.33654
## 3 FIRE CAPTAIN 63.86538
## 4 POLICE SERGEANT 62.24038
## 5 TREATMENT PLANT OPERATOR II 62.05288
- Count the total number of observations for each department. Sort descending by total number of employees.
df %>%
count(department_name) %>%
arrange(desc(n))
## # A tibble: 28 x 2
## department_name n
## <chr> <int>
## 1 POLICE DEPARTMENT 196
## 2 LIBRARY BOARD OF CONTROL 151
## 3 FIRE DEPARTMENT 97
## 4 HUMAN DEVELOPMENT AND SERVICES 65
## 5 EMERGENCY MEDICAL SERVICES 63
## 6 DEPARTMENT OF PUBLIC WORKS 55
## 7 DEPT OF MAINTENANCE 52
## 8 FINANCE DEPARTMENT 34
## 9 OFFICE OF THE MAYOR-PRESIDENT 33
## 10 CITY COURT 32
## # ... with 18 more rows
- Create a new variable
percent
and calculate the percent of employees allocated to each department. To calculate percents, make sure you understand the difference betweenn()
andsum()
.
df %>%
count(department_name) %>%
arrange(desc(n)) %>%
mutate(prop = n/sum(n) * 100)
## # A tibble: 28 x 3
## department_name n prop
## <chr> <int> <dbl>
## 1 POLICE DEPARTMENT 196 19.6
## 2 LIBRARY BOARD OF CONTROL 151 15.1
## 3 FIRE DEPARTMENT 97 9.7
## 4 HUMAN DEVELOPMENT AND SERVICES 65 6.5
## 5 EMERGENCY MEDICAL SERVICES 63 6.3
## 6 DEPARTMENT OF PUBLIC WORKS 55 5.5
## 7 DEPT OF MAINTENANCE 52 5.2
## 8 FINANCE DEPARTMENT 34 3.4
## 9 OFFICE OF THE MAYOR-PRESIDENT 33 3.3
## 10 CITY COURT 32 3.2
## # ... with 18 more rows
- Calculate the total cost by department as the sum of all base pays within each department. Then add two variables for the total number of employees by department and the average cost by employee. Order descending by employee’s average cost.
df %>%
group_by(department_name) %>%
summarise(tot_cost = sum(base_pay), count = n(), avgCost = tot_cost/count) %>%
arrange(desc(avgCost))
## # A tibble: 28 x 4
## department_name tot_cost count avgCost
## <chr> <dbl> <int> <dbl>
## 1 COUNCIL ADMINISTRATOR 307209.2 5 61441.84
## 2 FIRE DEPARTMENT 4990410.5 97 51447.53
## 3 CITY CONSTABLE 354128.5 7 50589.78
## 4 HUMAN RESOURCES 505287.8 10 50528.78
## 5 DEPT OF BUILDINGS AND GROUNDS 561186.6 12 46765.55
## 6 DEPT OF TRANSPORTATION AND DRAINAGE 1006870.4 22 45766.84
## 7 DEPT OF DEVELOPMENT 813328.1 18 45184.89
## 8 PARISH ATTORNEY 1243457.6 28 44409.20
## 9 ANIMAL CONTROL CENTER 335165.3 8 41895.67
## 10 POLICE DEPARTMENT 8026539.7 196 40951.73
## # ... with 18 more rows
- The
gather()
andspread()
function:
- Report the average hourly rate and base pay by department into a table where each column represents a department name, and with only two rows (avgHourlyRate and avgBasePay) for the department with the highest average hourly rate. Calculate the average base and hourly pay, then use a combination of
gather()
andspread()
to reshape accordingly.
df %>%
group_by(department_name) %>%
summarise(avgBase = mean(base_pay), avgHour = mean(base_hourly_rate)) %>%
arrange(desc(avgHour)) %>% top_n(3) %>%
gather(salary, amount, avgBase, avgHour) %>%
spread(department_name, amount)
## # A tibble: 2 x 4
## salary `COUNCIL ADMINISTRATOR` `HUMAN RESOURCES` `PARISH ATTORNEY`
## * <chr> <dbl> <dbl> <dbl>
## 1 avgBase 61441.8420 50528.78200 44409.19929
## 2 avgHour 30.7714 26.43851 26.90328
- The
separate()
function.
- Some records in
division_name
seem to be a concatenatenation ofdepartment_name
and the actual division name, separated by a-
. Useseparate()
to separate the strings indivision_name
on the-
. Then store the values into three new column variables, naming the thirddivision_name
(thus replacing the currentdivision_name
variable). Drop the other two columns
df %>% separate(division_name, into = c('a', 'division_name', 'b'), sep = '-') %>% select(-a, -b) %>%
distinct(division_name)
## # A tibble: 67 x 1
## division_name
## <chr>
## 1 ADMINISTRATION
## 2 UNIFORM PATROL BUREAU
## 3 FIRE SUPPRESSION & PREVENTION
## 4 <NA>
## 5 CRIMINAL INVEST BUREAU
## 6 SUMMER YOUTH EMP PROG
## 7 SPECIAL OPERATIONS
## 8 EBRP COMMUNICATIONS DISTRICT
## 9 OPERATIONS & MAINT
## 10 WASTEWATER COLLECTION
## # ... with 57 more rows